20
Astronomy & Astrophysics manuscript no. 32964_Arxiv c ESO 2018 June 12, 2018 Gaia Data Release 2: Using Gaia parallaxes X. Luri 1 , A.G.A. Brown 4 , L.M. Sarro 6 , F. Arenou 2 , C.A.L. Bailer-Jones 3 , A. Castro-Ginard 1 , J. de Bruijne 5 , T. Prusti 5 , C. Babusiaux 7, 2 , and H.E. Delgado 6 1 Dept. Física Quàntica i Astrofísica, Institut de Ciències del Cosmos (ICCUB), Universitat de Barcelona (IEEC-UB), Martí i Fran- quès 1, E08028 Barcelona, Spain 2 GEPI, Observatoire de Paris, Université PSL, CNRS, 5 Place Jules Janssen, 92190 Meudon, France 3 Max Planck Institute for Astronomy, Königstuhl 17, 69117 Heidelberg, Germany 4 Sterrewacht Leiden, Leiden University, P.O. Box 9513, 2300 RA, Leiden, The Netherlands 5 Science Support Oce, Directorate of Science, European Space Research and Technology Centre (ESA/ESTEC), Keplerlaan 1, 2201 AZ, Noordwijk, The Netherlands 6 Dpto. Inteligencia Artificial, UNED, Juan del Rosal, 16, 28040 Madrid 7 Univ. Grenoble Alpes, CNRS, IPAG, 38000 Grenoble, France Received date / Accepted date ABSTRACT Context. The second Gaia data release (Gaia DR2 ) provides precise five-parameter astrometric data (positions, proper motions, and parallaxes) for an unprecedented number of sources (more than 1.3 billion, mostly stars). This new wealth of data will enable the undertaking of statistical analysis of many astrophysical problems that were previously infeasible for lack of reliable astrometry, and in particular because of the lack of parallaxes. However, the use of this wealth of astrometric data comes with a specific challenge: how can the astrophysical parameters of interest be properly inferred from these data? Aims. The main focus of this paper, but not the only focus, is the issue of the estimation of distances from parallaxes, possibly combined with other information. We start with a critical review of the methods traditionally used to obtain distances from parallaxes and their shortcomings. Then we provide guidelines on how to use parallaxes more eciently to estimate distances by using Bayesian methods. In particular we also show that negative parallaxes, or parallaxes with relatively large uncertainties still contain valuable information. Finally, we provide examples that show more generally how to use astrometric data for parameter estimation, including the combination of proper motions and parallaxes and the handling of covariances in the uncertainties. Methods. The paper contains examples based on simulated Gaia data to illustrate the problems and the solutions proposed. Further- more, the developments and methods proposed in the paper are linked to a set of tutorials included in the Gaia archive documentation that provide practical examples and a good starting point for the application of the recommendations to actual problems. In all cases the source code for the analysis methods is provided. Results. Our main recommendation is to always treat the derivation of (astro-)physical parameters from astrometric data, in particular when parallaxes are involved, as an inference problem which should preferably be handled with a full Bayesian approach. Conclusions. Gaia will provide fundamental data for many fields of astronomy. Further data releases will provide more data, and more precise data. Nevertheless, to fully use the potential it will always be necessary to pay careful attention to the statistical treatment of parallaxes and proper motions. The purpose of this paper is to help astronomers find the correct approach. Key words. astrometry – parallaxes – Methods: data analysis – Methods: statistical – catalogues 1. Introduction The Gaia Data Release 2 (Gaia DR2 ) Gaia Collaboration et al. (2018b) provides precise positions, proper motions, and paral- laxes for an unprecedented number of objects (more than 1.3 billion). Like Hipparcos ESA (1997) in its day, the availability of a large amount of new astrometric data, and in particular paral- laxes, opens the way to revisit old astrophysical problems and to tackle new ones. In many cases this will involve the inference of astrophysical quantities from Gaia astrometry, a task that is less trivial than it appears, especially when parallaxes are involved. The naive use of the simple approach of inverting the paral- lax to estimate a distance can provide an acceptable estimate in a limited number of cases, in particular when a precise parallax for an individual object is used. However, one of the important con- tributions of Gaia DR2 will be the possibility of working with large samples of objects, all of them with measured parallaxes. In these cases a proper statistical treatment of the parallaxes in order to derive distances, especially (but not only) when the rel- ative uncertainties are large, is mandatory. Otherwise, the eects of the observational errors in the parallaxes can lead to poten- tially strong biases. More generally, the use of full astrometric data to derive astrophysical parameters should follow a similar approach. A proper statistical treatment of the data, its uncer- tainties, and correlations is required to take full advantage of the Gaia results. This paper is a complement for the Gaia consortium Gaia DR2 papers. We analyse the problem of the inference of distances (and other astrophysical parameters) from parallaxes. In Sect. 2 we start with a short review of the properties of the Gaia astrometric data. Then in Sect. 3 we review several of the most popular approaches to using measured parallaxes in as- tronomy and highlight their intricacies, pitfalls, and problems. In Sect. 4 we make recommendations on what we think is the appropriate way to use astrometric data. Finally, in Sect. 5 we link to some worked examples, ranging from very basic demon- Article number, page 1 of 20 arXiv:1804.09376v2 [astro-ph.IM] 8 Jun 2018

Gaia Data Release 2: Using Gaia parallaxes · The Gaia Data Release 2 (Gaia DR2 ) Gaia Collaboration et al. (2018b) provides precise positions, proper motions, and paral-laxes for

  • Upload
    others

  • View
    27

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Gaia Data Release 2: Using Gaia parallaxes · The Gaia Data Release 2 (Gaia DR2 ) Gaia Collaboration et al. (2018b) provides precise positions, proper motions, and paral-laxes for

Astronomy amp Astrophysics manuscript no 32964_Arxiv ccopyESO 2018June 12 2018

Gaia Data Release 2 Using Gaia parallaxesX Luri1 AGA Brown4 LM Sarro6 F Arenou2 CAL Bailer-Jones3 A Castro-Ginard1 J de Bruijne5 T

Prusti5 C Babusiaux7 2 and HE Delgado6

1 Dept Fiacutesica Quagraventica i Astrofiacutesica Institut de Ciegravencies del Cosmos (ICCUB) Universitat de Barcelona (IEEC-UB) Martiacute i Fran-quegraves 1 E08028 Barcelona Spain

2 GEPI Observatoire de Paris Universiteacute PSL CNRS 5 Place Jules Janssen 92190 Meudon France3 Max Planck Institute for Astronomy Koumlnigstuhl 17 69117 Heidelberg Germany4 Sterrewacht Leiden Leiden University PO Box 9513 2300 RA Leiden The Netherlands5 Science Support Office Directorate of Science European Space Research and Technology Centre (ESAESTEC) Keplerlaan 1

2201 AZ Noordwijk The Netherlands6 Dpto Inteligencia Artificial UNED Juan del Rosal 16 28040 Madrid7 Univ Grenoble Alpes CNRS IPAG 38000 Grenoble France

Received date Accepted date

ABSTRACT

Context The second Gaia data release (Gaia DR2 ) provides precise five-parameter astrometric data (positions proper motionsand parallaxes) for an unprecedented number of sources (more than 13 billion mostly stars) This new wealth of data will enable theundertaking of statistical analysis of many astrophysical problems that were previously infeasible for lack of reliable astrometry andin particular because of the lack of parallaxes However the use of this wealth of astrometric data comes with a specific challengehow can the astrophysical parameters of interest be properly inferred from these dataAims The main focus of this paper but not the only focus is the issue of the estimation of distances from parallaxes possiblycombined with other information We start with a critical review of the methods traditionally used to obtain distances from parallaxesand their shortcomings Then we provide guidelines on how to use parallaxes more efficiently to estimate distances by using Bayesianmethods In particular we also show that negative parallaxes or parallaxes with relatively large uncertainties still contain valuableinformation Finally we provide examples that show more generally how to use astrometric data for parameter estimation includingthe combination of proper motions and parallaxes and the handling of covariances in the uncertaintiesMethods The paper contains examples based on simulated Gaia data to illustrate the problems and the solutions proposed Further-more the developments and methods proposed in the paper are linked to a set of tutorials included in the Gaia archive documentationthat provide practical examples and a good starting point for the application of the recommendations to actual problems In all casesthe source code for the analysis methods is providedResults Our main recommendation is to always treat the derivation of (astro-)physical parameters from astrometric data in particularwhen parallaxes are involved as an inference problem which should preferably be handled with a full Bayesian approachConclusions Gaia will provide fundamental data for many fields of astronomy Further data releases will provide more data andmore precise data Nevertheless to fully use the potential it will always be necessary to pay careful attention to the statistical treatmentof parallaxes and proper motions The purpose of this paper is to help astronomers find the correct approach

Key words astrometry ndash parallaxes ndash Methods data analysis ndash Methods statistical ndash catalogues

1 Introduction

The Gaia Data Release 2 (Gaia DR2 ) Gaia Collaboration et al(2018b) provides precise positions proper motions and paral-laxes for an unprecedented number of objects (more than 13billion) Like Hipparcos ESA (1997) in its day the availability ofa large amount of new astrometric data and in particular paral-laxes opens the way to revisit old astrophysical problems and totackle new ones In many cases this will involve the inference ofastrophysical quantities from Gaia astrometry a task that is lesstrivial than it appears especially when parallaxes are involved

The naive use of the simple approach of inverting the paral-lax to estimate a distance can provide an acceptable estimate in alimited number of cases in particular when a precise parallax foran individual object is used However one of the important con-tributions of Gaia DR2 will be the possibility of working withlarge samples of objects all of them with measured parallaxesIn these cases a proper statistical treatment of the parallaxes in

order to derive distances especially (but not only) when the rel-ative uncertainties are large is mandatory Otherwise the effectsof the observational errors in the parallaxes can lead to poten-tially strong biases More generally the use of full astrometricdata to derive astrophysical parameters should follow a similarapproach A proper statistical treatment of the data its uncer-tainties and correlations is required to take full advantage of theGaia results

This paper is a complement for the Gaia consortiumGaia DR2 papers We analyse the problem of the inference ofdistances (and other astrophysical parameters) from parallaxesIn Sect 2 we start with a short review of the properties of theGaia astrometric data Then in Sect 3 we review several of themost popular approaches to using measured parallaxes in as-tronomy and highlight their intricacies pitfalls and problemsIn Sect 4 we make recommendations on what we think is theappropriate way to use astrometric data Finally in Sect 5 welink to some worked examples ranging from very basic demon-

Article number page 1 of 20

arX

iv1

804

0937

6v2

[as

tro-

phI

M]

8 J

un 2

018

AampA proofs manuscript no 32964_Arxiv

strations to full Bayesian analysis available as Python and Rnotebooks and source code from the tutorial section on the Gaiaarchive 1

2 Gaia astrometric data

The Gaia astrometry ie celestial coordinates trigonometricparallaxes and proper motions for more than one billion ob-jects results from the observations coming from the spacecraftinstruments and their subsequent processing by the Gaia DataProcessing and Analysis Consortium (DPAC) The astrometricprocessing is detailed in Gaia Collaboration et al (2018c) andreaders are strongly encouraged to familiarise themselves withthe contents of that paper in order to understand the strengthsand weaknesses of the published astrometry and in particular ofthe parallaxes The processed data was submitted to extensivevalidation prior to publication as detailed in Gaia Collaborationet al (2018a) This paper is also highly recommended in orderto gain a proper understanding of how to use and how not to usethe astrometric data As a simple and striking example a smallnumber of sources with unrealistic very large positive and verylarge negative parallaxes are present in the data Advice on howto filter these sources from the data analysis is provided in theGaia DR2 documentation

21 Uncertainties

The published parallaxes and more generally all astrometricparameters are measured quantities and as such have an asso-ciated measurement uncertainty These uncertainties are pub-lished source per source and depend mostly on position onthe sky as a result of the scanning law and on magnitudeFor parallaxes uncertainties are typically around 004 mas forsources brighter than sim14 mag around 01 mas for sourceswith a G magnitude around 17 and around 07 mas at the faintend around 20 mag The astrometric uncertainties provided inGaia DR2 have been derived from the formal errors computedin the astrometric processing Unlike for Gaia DR1 the par-allax uncertainties have not been calibrated externally ie theyare known as an ensemble to be underestimated by sim8ndash12for faint sources (G gtsim 16 mag) outside the Galactic plane and byup to sim30 for bright stars (G ltsim 12 mag) Based on an assess-ment of the measured parallaxes of a set of about half a millionknown quasars which can be assumed in practice to have zeroparallax the uncertainties are normally distributed with impres-sive approximation (Fig 1) However as is common when tak-ing measurements and especially in such large samples like theGaia catalogue there are small numbers of outliers even up tounrealistically high confidence levels (eg at the 100σ level)

22 Correlations

The parallaxes for each source published in Gaia DR2 havenot been derived in isolation but result from a simultaneousfive-parameter fit of an astrometric source model to the data InGaia DR2 only one astrometric source model has been usedthat of a single star This model assumes a uniform rectilinearspace motion relative to the solar system barycentre The astro-metric data in Gaia DR2 thus comprise five astrometric param-

1 httpsgithubcomagabrownastrometry-inference-tutorials

iexcl6 iexcl5 iexcl4 iexcl3 iexcl2 iexcl1 0 1 2 3 4 5 6

Normalised centred parallax ($ + 0029)=frac34$

1

10

100

1000

1e4

1e5

Num

ber

per

bin

of0

1

Fig 1 Distribution of normalised re-centred parallaxes of 556 849quasars from the AllWISE catalogue present in Gaia DR2 (blue curve)The grey curve denotes the subsample composed of 492 920 sourceswith parallax errors σ$ lt 1 mas The centring adopted in this plotreflects a global parallax zero-point shift of minus0029 mas Ideally bothcurves should follow a normal distribution with zero mean and unit vari-ance The red curve shows a Gaussian distribution with the same stan-dard deviation (1081) as the normalised centred parallaxes for the fullsample Figure from Gaia Collaboration et al (2018c)

eters2 with their associated uncertainties but also ten correlationcoefficients between the estimated parameters It is critical to usethe full (5 times 5) covariance matrix when propagating the uncer-tainties on subsets andor linear combinations of the astrometricparameters

As an example consider the transformation of the measuredproper motions microαlowast and microδ in equatorial coordinates to equivalentvalues microllowast and microb in galactic coordinates Following the notationin ESA (1997 Sections 12 and 15) we have(microllowastmicrob

)=

(c sminuss c

) (microαlowastmicroδ

) (1)

where the 2 times 2 matrix is a rotation matrix that depends on theobjectrsquos coordinates α and δ c = c(αlowast δ) and s = s(αlowast δ) Inorder to transform the proper-motion errors from the equatorialto the galactic system we have

Clb =

(σ2microllowast

ρmicrobmicrollowastσmicrollowastσmicrob

ρmicrobmicrollowastσmicrollowastσmicrob σ2

microb

)(2)

= JCαδJ prime (3)

=

(c sminuss c

) (σ2microαlowast

ρmicroδmicroαlowastσmicroαlowastσmicroδ

ρmicroδmicroαlowastσmicroαlowastσmicroδ σ2

microδ

) (c minusss c

) (4)

where the prime denotes matrix transposition J denotes the Ja-cobian matrix of the transformation (which for a rotation is therotation matrix itself) and C denotes the variance-covariancematrix It immediately follows that σmicrollowast and σmicrob depend on thegenerally non-zero correlation coefficient ρmicroδmicroαlowast between the equa-torial proper-motion measurements Neglecting this correlationterm can give seriously incorrect results Some further examplesof how error propagation should be handled can be found in

2 For a subset of the data only two parameters (right ascension α anddeclination δ) could be determined

Article number page 2 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

for instance Brown et al (1997) and Lindegren et al (2000)In addition to error propagation the covariance matrix shouldalso be taken into account when estimating model parametersfor example in chi-square fitting maximum likelihood estimatesBayesian analysis etc For more details see Volume 1 Section15 of ESA (1997)

23 Systematic errors

Both the design of the spacecraft and the design and implemen-tation of the data processing software and algorithms aim to pre-vent biases or systematic effects in the astrometry Systematicerrors at low levels nonetheless exist in Gaia DR2 (see GaiaCollaboration et al 2018ac) Systematic effects are complicatedand largely unknown functions of position on the sky magni-tude and colour Although systematic effects are not dealt within the remainder of this paper it is important for users to beaware of their presence

The parallaxes and proper motions in Gaia DR2 may beaffected by systematic errors Although the precise magnitudeand distribution of these errors is unknown they are believedto be limited on global scales to plusmn01 mas for parallaxes andplusmn01 mas yrminus1 for proper motions There is a significant aver-age parallax zero-point shift of about minus30 microas in the sense Gaiaminus external data This shift has not been corrected for andis present in the published data Significant spatial correlationsbetween stars up to 004 mas in parallax and 007 mas yrminus1 inproper motion exist on both small (ltsim1) and intermediate (ltsim20)angular scales As a result averaging parallaxes over small re-gions of the sky for instance in an open cluster in the Mag-ellanic Clouds or in the Galactic Centre will not reduce theuncertainty on the mean below the sim01 mas level

Unfortunately there is no simple recipe to account for thesystematic errors The general advice is to proceed with the anal-ysis of the Gaia DR2 data using the uncertainties reported in thecatalogue ideally while modelling systematic effects as part ofthe analysis and to keep the systematics in mind when interpret-ing the results

24 Completeness

As argued in the next sections a correct estimation requires fullknowledge of the survey selection function Conversely neglect-ing the selection function can causes severe biases Derivationof the selection function is far from trivial yet estimates havebeen made for Gaia DR1 (TGAS) by for instance Schoumlnrich ampAumer (2017) and Bovy (2017)

This paper does not intend to define the survey selectionfunction We merely limit ourselves to mentioning a number offeatures of the Gaia DR2 data that should be properly reflectedin the selection function The Gaia DR2 catalogue is essentiallycomplete between G asymp 12 and sim17 mag Although the com-pleteness at the bright end (G in the range sim3ndash7 mag) has im-proved compared to Gaia DR1 a fraction of bright stars inthis range is still missing in Gaia DR2 Most stars brighterthan sim3 mag are missing In addition about one out of everyfive high-proper-motion stars (micro gtsim 06 arcsec yrminus1) is miss-ing Although the onboard detection threshold at the faint end isequivalent to G = 207 mag onboard magnitude estimation er-rors allow Gaia to see fainter stars although not at each transitGaia DR2 hence extends well beyond G = 20 mag However indense areas on the sky (above sim400 000 stars degminus2) the effec-tive magnitude limit of the survey can be as bright as sim18 mag

The somewhat fuzzy faint-end limit depends on object density(and hence celestial position) in combination with the scan-lawcoverage underlying the 22 months of data of Gaia DR2 and thefiltering on data quality that has been applied prior to publica-tion This has resulted in some regions on the sky showing artifi-cial source-density fluctuations for instance reflecting the scan-law pattern In small selected regions gaps are present in thesource distribution These are particularly noticeable near verybright stars In terms of effective angular resolution the resolu-tion limit of Gaia DR2 is sim04 arcsec

Given the properties of Gaia DR2 summarised above the in-terpretation of the data is far from straightforward This is partic-ularly true when accounting for the incompleteness in any sam-ple drawn from the Gaia Archive We therefore strongly encour-age the users of the data to read the papers and documentationaccompanying Gaia DR2 and to carefully consider the warningsgiven therein before drawing any conclusions from the data

3 Critical review of the traditional use of parallaxes

We start this section by briefly describing how parallaxes aremeasured and how the presence of measurement noise leads tothe occurrence of zero and negative observed parallaxes In therest of the section we review several of the most popular ap-proaches to using measured parallaxes ($) to estimate distancesand other astrophysical parameters In doing so we will attemptto highlight the intricacies pitfalls and problems of these lsquotradi-tionalrsquo approaches

31 Measurement of parallaxes

In simplified form astrometric measurements (source positionsproper motions and parallaxes) are made by repeatedly deter-mining the direction to a source on the sky and modelling thechange of direction to the source as a function of time as a com-bination of its motion through space (as reflected in its propermotion and radial velocity) and the motion of the observing plat-form (earth Gaia etc) around the Sun (as reflected in the par-allax of the source) As explained in more detail in Lindegrenet al (2016) and Lindegren et al (2012) this basic model of thesource motion on the sky describes the time-dependent coordi-nate direction from the observer towards an object outside thesolar system as the unit vector

u(t) = 〈r + (tB minus tep)(pmicroαlowast + qmicroδ + rmicror) minus$bO(t)Au〉 (5)

where t is the time of observation and tep is a reference time bothin units of Barycentric Coordinate Time (TCB) p q and r areunit vectors pointing in the direction of increasing right ascen-sion increasing declination and towards the position (α δ) ofthe source respectively tB is the time of observation correctedfor the Roslashmer delay bO(t) is the barycentric position of the ob-server at the time of observation Au is the astronomical unitand 〈〉 denotes normalisation The components of proper mo-tion along p and q are respectively microαlowast = microα cos δ and microδ $is the parallax and micror = vr$Au is the lsquoradial proper motionrsquowhich accounts for the fact that the distance to the star changesas a consequence of its radial motion which in turn affects theproper motion and parallax The effect of the radial proper mo-tion is negligibly small in most cases and can be ignored in thepresent discussion

The above source model predicts the well-known helix orwave-like pattern for the apparent motion of a typical source onthe sky A fit of this model to noisy observations can lead to

Article number page 3 of 20

AampA proofs manuscript no 32964_Arxiv

negative parallaxes as illustrated in Fig 2 We note how in thesource model described in Eq (5) the parallax appears as thefactor minus$ in front of the barycentric position of the observerwhich means that for each source its parallactic motion on thesky will have a sense which reflects the sense of the motion ofthe observer around the Sun In the presence of large measure-ment noise (comparable to the size of the parallax) it is entirelypossible that the parallax value estimated for the source modelvanishes or becomes negative This case can be interpreted as themeasurement being consistent with the source going lsquothe wrongway aroundrsquo on the sky as shown in Fig 2

2 1 0 1 2 3 4 [mas]

8

6

4

2

0

2

[mas

]

True = 040 = 100 = 200Fit = 049 = 113 = 169Observations

2101234

[m

as]

2015 2016 2017 2018 2019Time [yr]

8

6

4

2

0

2

[mas

]

Fig 2 Example of a negative parallax arising from the astrometric dataprocessing Solid blue lines true path of the object red dots the in-dividual measurements of the source position on the sky dashed or-ange lines the source path according to the least-squares astrometricsolution which here features a negative parallax Left Path on the skyshowing the effect of proper motion (linear trend) and parallax (loops)Right Right ascension and declination of the source as a function oftime In the fitted solution the negative parallax effect is equivalent to ayearly motion of the star in the opposite direction of the true parallacticmotion (which gives a phase-shift of π in the sinusoidal curves in theright panels) The error bars indicate a measurement uncertainty of 07mas the uncertainties on ∆αlowast and ∆δ are assumed to be uncorrelated

This example is intended to clarify why parallaxes can havenon-positive observed values and more importantly to conveythe message that the parallax is not a direct measurement ofthe distance to a source The distance (or any other quantity de-pending on distance) has to be estimated from the observed par-allax (and other relevant information) taking into account theuncertainty in the measurement A simplified demonstration ofhow negative parallaxes arise (allowing the reader to reproduceFig 2) can be found in the online tutorials accompanying thispaper 3

32 Estimating distance by inverting the parallax

In the absence of measurement uncertainties the distance to astar can be obtained from its true parallax through r = 1$Truewith$True indicating the true value of the parallax Thus naivelywe could say that the distance to a star can be obtained by in-verting the observed parallax ρ = 1$ where now ρ is usedto indicate the distance derived from the observed value of theparallax For this discussion the observed parallax is assumed to

3 httpsgithubcomagabrownastrometry-inference-tutorialstreemasterluminosity-calibrationDemoNegativeParallaxipynb

be free of systematic measurement errors and to be distributednormally around the true parallax

p($ | $True) =1

σ$radic

2πexp

(minus

($ minus$True)2

2σ2$

) (6)

where σ$ indicates the measurement uncertainty on $ Blinduse of 1$ as an estimator of the distance will lead to unphys-ical results in case the observed parallax is non-positive Nev-ertheless we could still consider the use of the 1$ distanceestimate for positive values for instance a sample where mostor all of the observed values are positive or in the limiting casewhere there is a single positive parallax value In this case it iscrucial to be aware of the statistical properties of the estimate ρGiven a true distance r = 1$True what will be the behaviourof ρ We can obtain the probability density function (PDF) of ρfrom Eq (6) as

p(ρ | $True) = p($ = 1ρ | $True) middot∣∣∣∣∣d$dρ

∣∣∣∣∣=

1

ρ2σ$radic

2πexp

(minus

(1ρ minus$True)2

2σ2$

)(7)

In Fig 3 we depict p(ρ | $True) for two extreme cases of verylow and very high relative uncertainty The shape of p(ρ | $True)describes what we can expect when using ρ as an estimate ofthe true distance r The distribution of the figure on the left cor-responds to a case with a low fractional parallax uncertaintydefined as f = σ$$True It looks unbiased and symmetricalThus using ρ = 1$ to estimate the distance in a case like thisis relatively safe and would lead to more or less reliable resultsHowever in spite of its appearance the figure hides an intrin-sic non-Gaussianity that is made evident in the right-hand fig-ure This second plot corresponds to the case of high fractionalparallax uncertainty and the distribution shows several featuresfirst the mode (the most probable value) does not coincide withthe true distance value second the distribution is strongly asym-metric and finally it presents a long tail towards large values ofρ For more extreme values of f there is a noticeable negativetail to this distribution corresponding to the negative tail of theobserved parallax distribution

Fig 3 PDF of ρ = 1$ in two extreme cases The red vertical line indi-cates the true distance r Left Object at r = 100 pc with an uncertaintyon the observed parallax of σ$ = 03 mas Right Object at r = 2000 pcwith an uncertainty on the observed parallax of σ$ = 03 mas

Article number page 4 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

In view of Fig 3 it is tempting to apply corrections to theρ estimator based on the value of the fractional parallax uncer-tainty f Unfortunately in order to do so we would need to knowthe true value of the parallax and f Using the apparent fractionaluncertainty fapp = σ$$ is not feasible since the denominatorin f (the true parallax) can be very close to zero so its distribu-tion has very extended wings and using fapp will often result ingross errors

Furthermore reporting a ρ value should always be accompa-nied by an uncertainty estimate usually the standard deviationof the estimator but the standard deviation or the variance is de-fined in terms of an unknown quantity $True In addition thelong tail shown in the right panel of Fig 3 makes the estimatesof the variance quickly become pathological as discussed below

In order to clarify the previous assertions we recall the clas-sical concept of bias because it plays a central role in the discus-sion that develops in this section In statistics an estimator is saidto be biased if its expected value differs from the true value Inour case we aim to infer the true value of the parallax $True (oralternatively related quantities such as the true distance r ab-solute magnitude luminosity or 3D velocity components) andwe aim to infer it from the measured parallax In the Gaia casethis measured parallax will be affected by quasi-Gaussian uncer-tainties (see Sect 21) In this case the expectation value of theobserved parallax coincides with the true value

E[$] =

int$p($|$True)middotd$ =

int$N($$True σ$)middotd$ = $True

(8)

where N($$True σ$) represents the Gaussian probability dis-tribution centred at the true parallax and with a standard devia-tion σ$ Hence the observed parallax is an unbiased estimatorof the true parallax (under the strong hypothesis that there are nosystematic biases associated with the survey and that the errorsare normally distributed)

Now in order to assess the bias of ρ = 1$ as an estimatorof the true distance we need to calculate its expected value

E[ρ] = E[1$] =

int1$middotp($|$True)middotd$ =

int1$middotN($True σ$)middotd$

(9)

This bias was approximated by Smith amp Eichhorn (1996) (seeSect 341) as a function of the fractional parallax uncertainty fusing a series expansion of the term in the integral and severalapproximations for the asymptotic regimes of small and largevalues of f and it indeed shows that the distance estimator 1$is unbiased for vanishingly small values of f but it rapidly be-comes significantly biased for values of f beyond 01 But notonly is 1$ a biased estimator of the true distance it is also ahigh-variance estimator The reason for this variance explosionis related to the long tail towards large distances illustrated in theright panel of Figs 3 and 4 Relatively large fractional uncertain-ties inevitably imply noise excursions in the parallax that resultin vanishingly small observed parallaxes and disproportionatedistances (and hence an inflation of the variance)

The effects discussed above can be illustrated with the use ofsimulated data Figure 4 shows the results of a simulation of ob-jects located between 05 and 2kpc where starting from the truedistances we have simulated observed parallaxes with a Gaus-sian uncertainty of σ$ = 03 mas and then calculated for eachobject ρ = 1$

The figure on the left shows that (by construction) the errorsin the observed parallaxes are well behaved and perfectly sym-metrical (Gaussian) while in the centre figure the errors in theestimation of distances using ρ show a strong asymmetry Thecharacteristics of these residuals depend on the distribution oftrue distances and uncertainties This is more evident in the fig-ure on the right where the true distance r is plotted against ρthere is a very prominent tail of overestimated distances and thedistribution is asymmetrical around the one-to-one line the moredistant the objects the more marked the asymmetry These fea-tures are very prominent because we have simulated objects sothat the relative errors in parallax are large but they are present(albeit at a smaller scale) even when the relative errors are small

The plots in Fig 4 correspond to a simple simulation with amild uncertainty σ$ = 03 mas Figure 5 shows the same plotsfor a realistic simulation of the Gaia DR2 data set The simula-tion is described in Appendix A in this case the errors in paral-lax follow a realistic model of the Gaia DR2 errors depicted inFig A2

As a summary we have seen in previous paragraphs thatthe naive approach of inverting the observed parallax has sig-nificant drawbacks we are forced to dispose of valuable data(non-positive parallaxes) and as an estimator ρ = 1$ is biasedand has a very high variance

33 Sample truncation

In addition to the potential sources of trouble described in theprevious sections the traditional use of samples of parallaxes in-cludes a practice that tends to aggravate these effects truncationof the used samples

As discussed in Sect 31 negative parallaxes are a naturalresult of the Gaia measurement process (and of astrometry ingeneral) Since inverting negative parallaxes leads to physicallymeaningless negative distances we are tempted to just get rid ofthese values and form a lsquocleanrsquo sample This results in a biasedsample however

On the one hand removing the negative parallaxes biases thedistribution of this parameter Consider for instance the case il-lustrated in Fig 1 for the quasars from the AllWISE catalogueThese objects have a near zero true parallax and the distribu-tion of its observed values shown in the figure corresponds tothis with a mean of minus10 microas close to zero However if we re-move the negative parallaxes from this sample deeming themlsquounphysicalrsquo the mean of the observed values would be signif-icantly positive about 08 mas This is completely unrealisticfor quasars in removing the negative parallaxes we have signif-icantly biased the observed parallax set for these objects Withsamples of other types of objects with non-zero parallaxes theeffect can be smaller but it will be present

On the other hand when by removing negative parallaxesthe contents of the sample are no longer representative of thebase population from which it has been extracted since stars withlarge parallaxes are over-represented and stars with small paral-laxes are under-represented This can be clearly illustrated us-ing a simulation We have generated a sample of simulated starsmimicking the contents of the full Gaia DR2 (see Appendix A)and truncated it by removing the negative parallaxes In Fig 6we can compare the distribution of the true distances of the orig-inal (non-truncated) sample and the resulting (truncated) sam-ple it is clear that after the removal of negative parallaxes wehave favoured the stars at short distances (large parallaxes) withrespect to the stars at large distances (small parallaxes) The spa-

Article number page 5 of 20

AampA proofs manuscript no 32964_Arxiv

Fig 4 Behaviour of PDF of ρ = 1$ as estimator of the true distance Left Histogram of differences between true parallaxes and observedparallaxes Centre Histogram of differences between true distances and their estimation using ρ Right Comparison of the true distances and theirestimations using ρ The observed parallaxes $ have been simulated using an uncertainty of σ$ = 03 mas

Fig 5 Behaviour of PDF of ρ = 1$ as estimator of the true distance for a simulation of the full Gaia DR2 data set Left Histogram of differencesbetween true parallaxes and observed parallaxes Centre Histogram of differences between true distances and their estimation using ρ RightComparison of the true distances and their estimations using ρ The observed parallaxes $ have been simulated using a realistic Gaia DR2 errormodel described in the Appendix The contour lines correspond to the distribution percentiles of 35 55 90 and 98

tial distribution of the sample has thus been altered and maytherefore bias any analysis based on it

Fig 6 Effect of removing the negative and zero parallaxes from a sim-ulation of Gaia DR2 Distribution of true distances histogram of dis-tances for the complete sample (thick line) histogram of distances forthe sample truncated by removing $ le 0 (thin line)

A stronger version of truncation that has traditionally beenapplied is to remove not only negative parallaxes but also all theparallaxes with a relative error above a given threshold k select-

ing σ$$

lt k This selection tends to favour the removal of starswith small parallaxes The effect is similar to the previous casebut more accentuated as can be seen in Fig 7 Again stars atshort distances are favoured in the sample with respect to distantstars

Fig 7 Effect of removing the positive parallaxes with a relative errorabove 50 as well as negative parallaxes Thick line histogram of dis-tances for the complete sample Thin line histogram of distances forthe sample truncated by removing objects with (|σ$

$| gt 05)

Article number page 6 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Even worse as in the previous case the truncation not onlymakes the distribution of true distances unrepresentative but italso biases the distribution of observed parallaxes stars withpositive errors (making the observed parallax larger than thetrue one) tend to be less removed than stars with negative er-rors (making the observed parallax smaller than the true one)By favouring positive errors with respect to negative errors weare also biasing the overall distribution of parallaxes Figure 8depicts this effect The plots show the difference $ minus$True as afunction of $True We can see in the middle and bottom figureshow the removal of objects is non-symmetrical around the zeroline so that the overall distribution of$minus$True becomes biasedFrom an almost zero bias for the full sample (as expected fromGaia in absence of systematics) we go to significant biases oncewe introduce the truncation and the bias is dependent of the cutvalue we introduce

Furthermore in Gaia the parallax uncertainties vary withthe object magnitude being larger for faint stars (see Fig A2)Therefore a threshold on the relative error will favour brightstars over the faint ones adding to the above described biases

Another type of truncation that has been traditionally appliedis to introduce limits in the observed parallax The effects of sucha limit are closely related to the LutzndashKelker bias discussed inSect 342 Suffice it here to illustrate the effect with a specificexample on a Gaia DR2 -like sample If we take the full sampleand remove stars with $ lt 02 mas we could imagine that weare roughly removing objects further away than 5 kpc Howeverthe net result is depicted in Fig 9 where we can see that insteadof the distribution of true distances of the complete sample up to5 kpc (solid line) we get a distribution with a lack of closer starsand a long tail of stars with greater distances A larger limit inparallax (shorter limiting distance) will produce a less prominenteffect since the relative errors in the parallax will be smaller butthe bias will be nonetheless present

In conclusion our advice to readers is to avoid introducingtruncations when using Gaia data since as illustrated above theycan strongly affect the properties of the sample and thereforeaffect the data analysis If truncation is unavoidable it should beincluded in the Bayesian modelling of the overall problem (seeSect 43)

34 Corrections and transformations

In this section we review proposals in the literature for the useof parallaxes to estimate distances In general they take the formof lsquoremediesrsquo to correct one or another problem on this use Herewe explain why these remedies cannot be recommended

341 The SmithndashEichhorn correction

Smith amp Eichhorn (1996) attempt to compensate for the bias in-troduced by the naive inversion of the observed parallax (andthe associated variance problem) in two different ways The firstinvolves transforming the measured parallaxes into a pseudo-parallax $lowast according to

$lowast equiv β middot σ$

1exp(φ) + exp(minus16$

σ$)

+ φ

(10)

where φ equiv ln(1 + exp( 2$σ$

))2 and β is an adjustable constantThe qualitative effect of the transformation is to map negativeparallaxes into the positive semi-axis R+ and to increase the

Fig 8 Differences between true and observed parallax effect of remov-ing observed negative parallaxes and those above a given relative errorWe start with a representative subsample of 1 million stars (top fig-ure) and truncate it according to the apparent relative parallax precisionTop Complete sample Mean difference $ minus$True is 155 times 10minus5 masMiddle Retaining only objects with positive parallaxes and |σ$

$| lt 05

The mean difference $minus$True is 0164 mas Bottom Retaining onlyobjects with positive parallaxes and |σ$

$| lt 02 The mean difference

$ minus$True is 0026 mas

value of small parallaxes until it asymptotically converges to themeasured value for large $ For $ = 0 φ = ln(2)2 and thepseudo-parallax value $lowast = β middot σ$

(1

1+exp(ln(2)2) +ln(2)

2

)has the

undesirable property of depending on the choice of β and on theparallax uncertainty Thus even in the case of a small parallaxmeasurement $ rarr 0 with a relatively small fractional parallax

Article number page 7 of 20

AampA proofs manuscript no 32964_Arxiv

Fig 9 Effect of introducing a limit on the observed parallax Thick linehistogram of distances up to 5 kpc for the complete sample Thin linehistogram of distances for the sample truncated by removing objectswith $ lt 02 mas (distance estimated as 1$ up to 5 kpc)

uncertainty (eg f = 01) we substitute a perfectly useful andvaluable measurement by an arbitrary value of $lowast

The SmithndashEichhorn transformation is an arbitrary (andrather convolved) choice amongst many such transformationsthat can reduce the bias for certain particular situations Both theanalytical expression and the choice of constants and β are theresult of an unspecified trial-and-error procedure the applicabil-ity of which is unclear Furthermore as stated by the authorsthey introduced a new bias because the transformed parallax isalways larger than the measured parallax This new bias has nophysical interpretation because it is the result of an ad hoc choicefor the analytical expression in Eq (10) It is designed to reducethe bias but it does so by substituting perfectly reasonable directmeasurements (negative and small parallaxes) that we can inter-pret and use for inference by constructed values arising from thechoice

342 The LutzndashKelker correction

Lutz amp Kelker (1973) realised that the spatial distribution ofsources around the observer together with the unavoidable ob-servational errors and a truncation of the sample on the valueof the observed parallax result in systematic biases in the av-erage parallax of certain stellar samples The bias described byLutz amp Kelker (1973) is a manifestation of the truncation bi-ases described above and can be understood if we look at a fewvery simple examples First let us imagine a density of sourcesaround the observer such that the distribution of true parallaxesp($True) is constant in a given interval and zero outside Let usalso imagine that the observation uncertainty σ$ is constant andequal to 03 mas The left panel of Fig 10 shows the distributionof true and observed parallaxes for a simulation of such a situ-ation and 106 sources We see that the observed parallaxes arealso approximately uniform and the departures from uniformityappear near the edges For a given bin of intermediate parallaxwe have as many sources contaminating from neighbouring binsas we have sources lost to other bins due to the observationaluncertainties and the result is a negligible net flux

In the middle panel we show the same histograms exceptthat instead of a constant parallax uncertainty we use a constantvalue of the fractional parallax uncertainty f This is still an un-realistic situation because the distribution of true parallaxes is

not uniform and also because a constant f implies that largerparallaxes are characterised by larger uncertainties However ithelps us to illustrate that even in the case of uniform true par-allaxes we may have a non-zero net flux of sources betweendifferent parallax bins depending on the distribution of parallaxuncertainties In this case noise shifts the value of large paral-laxes more than that of smaller parallaxes (because of the con-stant value of f ) so larger parallaxes get more scattered thussuppressing the distribution more at large parallaxes

Finally the right panel shows the same plot for a realisticdistribution of distances from a Gaia Universe Model Snapshot(GUMS) sample (see Appendix A for a full description) We seethat the effect of a realistic non-uniform distribution of parallaxesand parallax uncertainties results in a net flux in the oppositedirection (smaller true parallaxes become more suppressed andlarger parallax bins are enhanced in both cases the bins of neg-ative parallaxes become populated) This is the root of the LutzndashKelker bias It is important to distinguish between the LutzndashKelker bias and the LutzndashKelker correction The LutzndashKelkerbias is the negative difference for any realistic sample betweenthe average true parallax and the average measured parallax (iethe average true parallax is smaller than the average measuredparallax) This bias has been known at least since the work ofTrumpler amp Weaver (1953) although it was already discussed ina context different from parallaxes as early as Eddington (1913)The LutzndashKelker correction presented in Lutz amp Kelker (1973)and discussed below is an attempt to remedy this bias based ona series of assumptions

In the case of Gaussian uncertainties such as those describedin Eq (6) it is evident that the probability of measuring a value ofthe parallax greater than the true parallax p($ gt $True|$True) =05 The same value holds for the probability that $ lt $Truebecause the Gaussian distribution is symmetrical with respectto the true value of the parallax This is also true for the jointprobability of $ and $True

p($ gt $True) =

S

p($$True) middot d$ middot d$True = 05 (11)

where S is the region of the ($$True) plane where $ gt $TrueHowever the probability distribution of the true parallax

given the observed parallax p($True|$) does not fulfil this seem-ingly desirable property of probability mass equipartition at the$ = $True point We can write the latter probability as

p($True|$) =p($$True)

p($)=

p($|$True) middot p($True)p($)

(12)

using the product rule of probability In the context of inferringthe true parallax from the observed one Eq (12) is the well-known Bayesrsquo theorem where the left-hand side is the posteriorprobability p($ | $True) is the likelihood p($True) is the priorand p($) is the evidence For most realistic prior distributionsp($True) neither the median nor the mode or the mean of theposterior in Eq (12) is at $ = $True Let us take for examplea uniform volume density of sources out to a certain distancelimit In such a distribution the number of sources in a sphericalshell of infinitesimal width at a distance r scales as r2 as doesthe probability distribution of the distances Since

p(r) middot dr = p($True) middot d$True (13)

Article number page 8 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Uniform ϖTrue and constant σϖ

Parallax (mas)

Pro

babi

lity

minus002 002 006 010

05

1015

20Uniform ϖTrue and constant fTrue

Parallax (mas)

Den

sity

minus005 000 005 010 015 020

GUMS simulation

Parallax (mas)

Den

sity

minus1 0 1 2

01

23

45

Fig 10 Histograms of true (grey) and observed (blue) parallaxes for three simulations uniform distribution of parallaxes and constant σ$ = 03mas (left) uniform distribution of parallaxes and constant f = 05 (centre) and the GUMS simulation described in Appendix A (right)

the probability distribution for the true parallax in such a trun-cated constant volume density scenario is proportional to

p($True) prop $minus4True (14)

out to the truncation radius Hence for Gaussian distributed un-certainties we can write p($True|$) as

p($True|$) prop1σ$middot exp(

minus($ minus$True)2

2σ$) middot$minus4

True (15)

The joint distribution p($$True) (ie the non-normalisedposterior plotted as a 2D function of data parallax and param-eter $True) for this particular case of truncated uniform stellarvolume densities is depicted in Fig 11 together with the con-ditional distributions for particular values of $ and $True Itshows graphically the symmetry of the probability distributionp($|$True) (with respect to $True) and the bias and asymmetryof p($True|$)

Lutz amp Kelker (1973) obtain Eq (15) in their Sect ii underthe assumption of uniform stellar volume densities and constantfractional parallax uncertainties (constant f ) They discuss sev-eral distributions for different values of the ratio σ$$True Intheir Sect iii they use the expected value of the true parallaxgiven by the distribution p($True|$) in Eq (15) to infer the ex-pected value of the difference between the true absolute magni-tude MTrue and the value obtained with the naive inversion of theobserved parallax The expected value of this absolute magni-tude error is derived and tabulated for the distribution p($True|$)as a function of the fractional parallax uncertainty f This so-called LutzndashKelker correction is often applied to stellar samplesthat do not fulfil the assumptions under which it was derived be-cause the stellar volume density is far from uniform at scaleslarger than a few tens of parsecs and the samples to which thecorrection is applied are never characterised by a unique valueof f

p(ϖ

Tru

e|ϖ

)0

12

34

5

00 02 04 06 08 10

00 02 04 06 08 10

00

05

10

15

ϖTrue

ϖ

p(ϖ|ϖTrue)

piob

s

0 1 2 3 4

00

05

10

15

minus50

minus40

minus30

minus20

minus10

0

10

Fig 11 (Lower left) Joint probability distribution (in logarithmic scaleto improve visibility) for the random variables $ and $True in the sce-nario of a truncated uniform volume density The colour code is shownto the right of the lower right panel and white marks the region wherethe probability is zero (Lower right) Conditional probability distribu-tion of the observed parallax for $True=05 (Upper left) Conditionalprobability distribution for $True given an observed parallax $ Thefractional parallax uncertainty assumed for the computation of all prob-abilities is f = 02

35 Astrometry-based luminosity

An obvious way to avoid the problems associated with the naiveinversion of observed parallaxes (see Sect 341) is to remainin the space of parallaxes (as opposed to that of distances) in-sofar as this is possible One example of this approach is the

Article number page 9 of 20

AampA proofs manuscript no 32964_Arxiv

astrometry-based luminosity (ABL) method (Arenou amp Luri1999) originating from Malmquist (1920) The ABL methodconsists in substituting the absolute magnitudes by a proxy thatis linearly dependent on the parallax The original proposal was

aV equiv 1002MV = $10mV +5

5 (16)

and has been recently used to obtain maximum likelihood esti-mates of the period-luminosity relation coefficients for Cepheidsand RR Lyrae stars (Gaia Collaboration et al 2017a) and to im-prove the Gaia parallax uncertainties using deconvolved colour-magnitude diagrams as prior (Anderson et al 2017) The newastrometry-based luminosity depends linearly on the parallaxand thus its uncertainty can be expected to have an approxi-mately Gaussian distribution if the fractional uncertainty of theapparent magnitude is negligible This is more often the casethan for fractional parallax uncertainties and is in general a goodapproximation

Unfortunately the astrometry-based luminosity can only beapplied to the study of the luminosity and can do nothing forthe analysis of spatial distributions where distances or tangentialvelocities are inevitable ingredients

4 Recommendations for using astrometric data

In this section we provide specific advice on the use of astro-metric data in astronomical data analysis Although the focus ison the use of Gaia data many of the recommendations hold forthe analysis of astrometric data in general To illustrate the rec-ommendations we also provide a small number of worked ex-amples ranging from very basic demonstrations of the issuesmentioned in Sect 3 to full Bayesian analyses Some of theseexamples are available in the Gaia archive tutorial described inSect 5

41 Using Gaia astrometric data how to proceed

The fundamental quantity sought when measuring a stellar par-allax is the distance to the star in question However as discussedin the previous sections the quantity of interest has a non-linearrelation to the measurement r = 1$True and is constrainedto be positive while the measured parallax can be zero or evennegative Starting from a measured parallax which is normallydistributed about the true parallax this leads to a probability den-sity for the simple distance estimator ρ = 1$ (see Sect 3) forwhich the moments are defined in terms of unknown quantitiesThis means we cannot calculate the variance of the estimator orthe size of a possible bias which renders the estimator uselessfrom the statistical point of view

Our first and main recommendation is thus to alwaystreat the derivation of (astro-)physical parameters from as-trometric data in particular when parallaxes are involvedas an inference problem which should preferably be handledwith a full Bayesian approach

411 Bayesian inference of distances from parallaxes

The Bayesian approach to inference involves estimating a PDFover the quantity of interest given the observables In this casewe want to estimate the distance r given the parallax $ Afuller treatment of this problem has been given in Bailer-Jones

(2015) so only a brief summary will be given here Using Bayesrsquotheorem we can write the posterior as

P(r | $) =1Z

P($ | r)P(r) (17)

Formally everything is also conditioned on the parallax uncer-tainty σ$ and any relevant constraints or assumptions but sym-bols for these are omitted for brevity The quantity P($ | r) isthe likelihood from Eq (12) The prior P(r) incorporates ourassumptions and Z is a normalisation constant

In addition to the likelihood there are two important choiceswhich must be made to estimate a distance the choice of priorand the choice of estimator We will first focus on the formerand start discussing the simplest prior the uniform unboundedprior With a uniform boundless (and thus improper) prior ondistances the posterior is proportional to the likelihood and ifwe choose the mode of the posterior as our estimator then thesolution is mathematically equivalent to maximising the likeli-hood However a boundless uniform prior permits negative dis-tances which are non-physical so we should at least truncate itto exclude these values

The more measurements we have or the more precise themeasurements are the narrower the likelihood and the lower theimpact of the prior should be on the posterior It turns out how-ever that with the unbounded uniform prior the posterior is im-proper ie it is not normalisable Consequently the mean andmedian are not defined The only point estimator is the modeie rest = 1$ (the standard deviation and the other quantilesare likewise undefined) which is rather restrictive Finally thisBayesian distance estimate defined for an unbounded uniformprior reduces to the maximum likelihood estimate and coincideswith the naive inversion discussed in Sect 32 The posterioris ill-defined for the unbounded uniform prior for parallaxesThis prior describes an unrealistic situation where the observer isplaced at the centre of a distribution of sources that is sphericallysymmetric and the volume density of which decreases sharplywith distance

The solution to these problems (non-physical distances im-proper posterior) is to use a more appropriate prior The proper-ties of various priors and estimators have been studied by Bailer-Jones (2015) and Astraatmadja amp Bailer-Jones (2016b) The lat-ter makes a detailed study using a Milky Way model for a priorand also investigates how the estimates change when the Gaiaphotometric measurements are used in addition to the parallaxOne of the least informative priors we can use is the exponen-tially decreasing space density prior

P(r) =

1

2L3 r2eminusrL if r gt 0

0 otherwise (18)

For distances r L this corresponds to a constant space densityof stars with the probability dropping exponentially at distancesmuch larger than the mode (which is at 2L) Examples of theshape of the posterior for parallaxes of different precisions areshown in Bailer-Jones (2015) and Astraatmadja amp Bailer-Jones(2016b)

The posterior obtained for the prior defined in Eq (18) is nor-malised and thus we have a choice of point estimators (meanmedian or mode) Also the distribution is asymmetric andtwo quantiles (5 and 95) rather than the standard deviationare recommended to summarise the uncertainty in the point es-timate The median as a point estimate is guaranteed to lie

Article number page 10 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

between these quantiles Astraatmadja amp Bailer-Jones (2016b)used this prior as well as a Milky Way prior to infer distancesfor the two million TGAS stars in the first Gaia data releaseThe behaviour of the estimates derived from the exponentiallydecreasing space density prior can be explored using the interac-tive tool available in the tutorial described in Sect 51

In general the introduction of reasonable prior probabilitiesaccounts for the LutzndashKelker bias although the inevitable mis-match between the true distribution of parallaxes and the priorused will result in less accurate inferences In any case the ad-vantage with respect to the methods discussed in Sect 3 is cleari) we do not need to tabulate corrections for each prior assumingconstant f ii) we do not need to dispose of non-positive par-allaxes iii) we obtain a proper full posterior distribution withwell-defined moments and credible intervals iv) even simplepriors such as the exponential decreasing volume density willimprove our estimates with respect to the unrealistic prior un-derlying the maximum likelihood solution rest = 1$ and fi-nally v) we obtain estimators that degrade gracefully as the dataquality degrades and with credible intervals that grow with theobservational uncertainties until they reach the typical scales ofthe prior when the observations are non-informative These ad-vantages come at the expense of an inference that is more com-putationally demanding in general (as it requires obtaining theposterior and its summary statistics if needed) the need for athoughtful choice of a prior distribution and the analysis of theinfluence of the prior on the inference results

Figure 12 shows the distribution of means (left) modes (cen-tre) and medians (right) of the posteriors inferred for a simu-lation of 105 sources drawn from an exponentially decreasingspace density distribution This simulation represents the un-likely case where the prior is a perfect representation of the truedistribution

From a Bayesian perspective the full posterior PDF is thefinal result of our inference if we only use parallax measure-ments to infer the distance (see below) and further analysesshould make use of it as a whole This is our recommendationin general avoid expectations and summaries of the posteriorHowever it is often useful to compute summary statistics suchas the mean (expectation) median mode quantiles etc to havean approximate idea of the distribution but we should not usethese summaries for further inference for example to estimateabsolute magnitudes Cartesian velocities etc We recommendinferring the full posterior distributions for these derived quanti-ties using the posterior of the true parallax or of the distance orusing the same Bayesian scheme as for the true parallax as ex-plained in Sect 42 In Fig 13 we show the values of the mean(left) mode (centre) and median (right) that we would obtainfrom a set of 104 simulated observations of a star at 100 parsecswith f = 02 We assume a Gaussian distribution of the obser-vations around the true parallax The posterior distribution is in-ferred using Eq (17) and two priors a uniform volume densityof sources truncated at 1 kpc (results in grey) and a uniform den-sity of sources multiplied by an exponential decay of length scale200 pc as defined in Eq (18) (in blue) The expectation valuesof the histograms are shown as dashed lines of the same colourwith the true value (100 pc) shown as a red dashed line We seein general that i) the truncation has the effect of increasing thenumber of overestimated distances ii) the three estimators arebiased towards larger distances (smaller parallaxes) but the ex-pectation of the mode is significantly closer to the true valueand iii) the abrupt truncation of the prior results in a spuriouspeak of modes at the truncation distance as already discussed inBailer-Jones (2015)

Figure 14 and Table 1 show a comparison of the absolutevalue of the empirical bias and standard deviation associatedwith some distance estimators discussed in this paper as a func-tion of the measured fractional uncertainty in the parallax Wechose the measured value even though it is a very poor and non-robust estimator because as stated in Sect 32 we never have ac-cess to the true fractional parallax uncertainty This figure showsthe results obtained for 107 sources in the Gaia DR2 simulationdescribed in Appendix A for the maximum likelihood estimateρ = 1

$with and without the SmithndashEichhorn correction and for

the mode estimates based on the posterior distribution for twopriors (a uniform distance prior UD with maximum distancerlim = 100 kpc and an exponentially decreasing space densityprior EDSD with L = 135 kpc) neither of which matches thetrue distribution of sources in the simulation Only the mode ofthe posteriors is plotted (but not the mean or the median) for thesake of clarity The conclusions described next are only valid un-der the conditions of the exercise and are provided as a demon-stration of the caveats and problems described in previous sec-tions not as a recommendation of the mode of the posterior in-ferred under the EDSD prior as an estimator At the risk of re-peating ourselves we emphasise the need to adopt priors adaptedto the inference problem at hand Also the conclusions only holdfor the used simulation (where we generate the true distances andhence can calculate the bias and standard deviation) and need notbe representative of the true performance for the real Gaia dataset They can be summarised as follows

ndash the mode of the EDSD prior shows the smallest bias andstandard deviation in practically the entire range of estimatedfractional parallax uncertainties (in particular everywherebeyond the range of fapp represented in the plot)

ndash the SmithndashEichhorn estimate shows pathological biases andstandard deviations in the vicinity of the supposedly best-quality measurements at fapp = 0 Away from this region itprovides the next less biased estimates (averaged over binsof fapp) after the mode of the EDSD posterior

412 Bayesian inference of distances from parallaxes andcomplementary information

The methodology recommended in the previous paragraphs isuseful when we only have observed parallaxes to infer distancesThere are however common situations in astronomy where theparallaxes are only one of many observables and distances arenot the final goal of the inference problem but a means to achieveit In this context we recommend an extension of the classicalBayesian inference methods described in the previous sectionThese problems are characterised by a set of observables andassociated uncertainties (that include but are not restricted toparallaxes) and a series of parameters (the values of which areunknown a priori) with complex interdependence relationshipsamongst them Some of these parameters will be the ultimategoal of the inference process Other parameters do play an im-portant role but we are not interested in their particular valuesand we call them nuisance parameters following the literatureFor example in determining the shape of a stellar associationthe individual stellar distances are not relevant by themselvesbut only insomuch as we need them to achieve our objectiveWe show below how we deal with the nuisance parameters Theinterested reader can find applications of the methodology de-scribed in this section to inferring the coefficients of period-luminosity relations in Gaia Collaboration et al (2017a) andSesar et al (2017) Also the same methodology (a hierarchical

Article number page 11 of 20

AampA proofs manuscript no 32964_Arxiv

Mean

Cou

nts

minus8000 minus4000 0 2000

0e+

002e

minus04

4eminus

046e

minus04

8eminus

04

Mode

Cou

nts

minus8000 minus4000 0 2000

Median

Cou

nts

minus8000 minus4000 0 2000

0e+

002e

minus04

4eminus

046e

minus04

8eminus

04

Fig 12 Probability distribution of the residuals of the Bayesian estimate of the true distance in parsecs for 100000 simulated stars drawn from auniform density plus exponential decay distribution and σ$ = 3 middot 10minus1 mas (orange) 3 mas (blue) and 30 mas (grey)

Mean

Cou

nts

0 200 400 600 800 1000

000

00

005

001

00

015

002

0

Mode

Cou

nts

0 200 400 600 800 1000

Median

Cou

nts

0 200 400 600 800 1000

000

00

005

001

00

015

002

0

Fig 13 Distributions of means (left) modes (centre) and medians (right) for a series of 104 posteriors calculated for a star at a true distance of100 pc and f = 02 and observed parallaxes drawn at random from the corresponding Gaussian distribution Posteriors are inferred with a uniformdensity prior truncated at 1 kpc (grey) or with a uniform density with exponential decay prior and length scale 135 kpc (blue) The red vertical linemarks the true parallax the grey and blue lines correspond to the expected value (mean) of each distribution (same colours as for the histograms)

Bayesian model) is applied in Hawkins et al (2017) where theconstraint on the distances comes not from a period-luminosityrelation but from the relatively small dispersion of the abso-lute magnitudes and colour indices of red clump stars A lastexample of this methodology can be found in Leistedt amp Hogg(2017b) where the constraint comes from modelling the colour-magnitude diagram

Just as in the previous section where we aimed at estimat-ing distances from parallaxes alone the two key elements in thiscase are the definitions of a likelihood and a prior The likeli-hood represents the probability distribution of the observablesgiven the model parameters Typically the likelihood is basedon a generative or forward model for the data Such modelspredict the data from our assumptions about the physical pro-cess that generates the true values (ie the distribution of stars in

space) and our knowledge of the measurement process (eg jus-tifying the assumption of a normal distribution of the observedparallax around its true value) Forward models can be used togenerate arbitrarily large synthetic data sets for a given set ofthe parameters In this case however where we are concernedwith several types of measurements that depend on parametersother than the distance the likelihood term will be in generalmore complex than in Sect 411 and may include probabilisticdependencies between the parameters The term hierarchical ormulti-level model is often used to refer to this kind of model

Let us illustrate the concept of hierarchical models with asimple extension of the Bayesian model described in Sect 411where instead of assuming a fixed value of the prior length scaleL in Eq (18) we make it another parameter of the model and tryto infer it Let us further assume that we have a set of N parallax

Article number page 12 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Table 1 Average bias and standard deviation in three regimes of fapp for four distance estimators discussed in this paper (from left to right) themode of the posterior based on the exponentially decreasing space density (EDSD) prior the mode of the posterior of the uniform distance (UD)distribution the maximum likelihood estimate corrected according to Smith amp Eichhorn (1996) abbreviated as SE and the maximum likelihood(ML) estimate The wider ranges of fapp exclude narrower ranges shown in previous rows of the table

Summary fapp Range EDSD UD SE ML

Bias(-11) -02 97 342 -095(-55) -03 107 -034 -12(-5050) -03 162 -04 -38

Std Deviation(-11) 04 80 6858 05(-55) 04 84 05 195(-5050) 04 106 04 171

fapp

log 1

0(|B

ias|

)

minus4

minus2

02

4

Mode EDSDMode UDSmithminusEichhorn estimateML estimate

fapp

log 1

0(σ)

minus1

12

34

5

minus5 minus3 minus1 1 3 5

Mode EDSDMode UDSmithminusEichhorn estimateML estimate

Fig 14 Bias (top) and standard deviation (bottom) averaged over binsof the estimated fractional parallax uncertainty fapp for four estimatorsof the distance the maximum likelihood (ML) estimator rest = 1

$(or-

ange) the ML estimator corrected as described in Smith amp Eichhorn(1996) (light blue) the mode of the posterior obtained with an expo-nentially decreasing space density (EDSD) prior and L = 135 kpc (darkgreen) and the mode of the posterior obtained with a uniform distance(UD) prior truncated at 100 kpc (red)

measurements $k one for each of a sample of N stars In thiscase the likelihood can be written as

p($k|rk L) = p($k | rk) middot p(rk | L)

= p($k | rk) middotNprod

k=1

p(rk | L) (19)

where rk is the true unknown distance to the kth star We note thatvery often the measured parallaxes are assumed independentand thus p($k | rk) is written as the product

prodNk=1 p($k | rk)

This is incorrect in general for Gaia parallaxes because the par-allax measurements are not independent As described in GaiaCollaboration et al (2018c) and Sect 2 of this paper there areregional correlations amongst them (see Sect 43) but for thesake of simplicity let us assume the sample of N measurementsis spread all over the celestial sphere such that the correlationsaverage out and can be neglected Hence we write

p($k|rk L) =

Nprodk=1

p($k | rk) middot p(rk | L) (20)

Under the assumption of Gaussian uncertainties the first termin the product is given by Eq (6) while the second is given byEq (18)

This likelihood can be represented by a simple directed graph(see Fig 15) that provides information about the conditional de-pendencies amongst the parameters The shaded nodes representthe observations the open circles represent model parametersand the small black circles at the origin of the arrows representmodel constants The arrows denote conditional dependence re-lations and the plate notation indicates repetition over the mea-surements k

The next key element is as in Sect 411 the prior Accord-ing to Fig 15 the only parameter that needs a prior specificationis the one without a parent node L The rest of the arcs in thegraph are defined in the likelihood term (Eq (20)) If the sampleof N stars were representative of the inner Galactic halo for ex-ample we could use a Gaussian prior centred at asymp 30 kpc (seeeg Iorio et al 2018 and references therein) Such a hierarchicalmodel can potentially shrink the individual parallax uncertaintiesby incorporating the constraint on the distribution of distances

If we are only interested in the individual distances rk wecan consider L as a nuisance parameter

p($Truek | $k) =

intp($Truek L | $k) middot dL (21)

=

intp($Truek | $k L) middot p(L | $k) middot dL

This integral (known as the marginalisation integral) allows us towrite the posterior we are interested in without having to fix thevalue of L to any particular value Depending on the objective ofthe inference we could have alternatively determined the poste-rior distribution of L by marginalising the individual distanceswith an N-dimensional integral over the rk parameters

Article number page 13 of 20

AampA proofs manuscript no 32964_Arxiv

L

rk

$ σ$

microL

σL

k = 1 2 N

Fig 15 Directed acyclic graph that represents a hierarchical Bayesianmodel of a set of N parallax measurements characterised by uncertain-ties σ$ and true distances drawn from an exponentially decreasing den-sity distribution of distances (see Eq (18)) The scale length of the ex-ponential decrease L is itself a model parameter that we can infer fromthe sample Its prior is defined in this case for the sake of simplicity asa Gaussian distribution of mean microL and standard deviation σL

In parameter spaces of dimensionality greater than 3ndash4 thecomputation of the possibly marginalised posteriors andor ev-idence requires efficient sampling schemes like those inspiredin Markov chain Monte Carlo (MCMC) methods to avoid largenumbers of calculations in regions of parameter space with neg-ligible contributions to the posterior This adds to the highercomputational burden of the Bayesian inference method men-tioned in the previous section

The previous simple example can be extended to includemore levels in the hierarchy and more importantly more pa-rameters and measurement types Section 55 and 56 developin greater detail two examples of hierarchical models of directapplicability in the Gaia context

42 Absolute magnitudes tangential velocities and otherderived quantities

The approaches described in the previous sections can be appliedto any quantity we want to estimate using the parallax For exam-ple if we want to infer the absolute magnitude MG then giventhe measured apparent magnitude G and line-of-sight extinctionAG the true parallax $True is related to MG via the conservationof flux

5 log$True = MG + AG minusG minus 5 (22)

Assuming for simplicity that G and AG are known Bayesrsquotheorem gives the posterior on MG as

P(MG | $G AG) =1Z

P($ | MG AGG)P(MG) (23)

where the likelihood is still the usual Gaussian distribution forthe parallax (Eq (6)) in which the true parallax is given byEq (22) As this expression is non-linear we again obtain anasymmetric posterior PDF over MG the exact shape of whichalso depends on the prior

The inference of other quantities can be approached in thesame way In general we need to consider a multi-dimensionallikelihood to accommodate the measurement uncertainties (andcorrelations) in all observed quantities For instance the set ofparameters θ = r v φ (distance tangential speed and direc-tion measured anticlockwise from north) can be inferred fromthe Gaia astrometric measurements o = $ microαlowast microδ (where microαlowastand microδ are the measured proper motions) using the likelihood

p(o | θ) = N(θΣ) =1

(2π)32|Σ|12exp

(minus

12

(o minus x)T Σminus1(o minus x))

(24)

where N denotes the Gaussian distribution Σ is the full (non-diagonal) covariance matrix provided as part of the Gaia DataRelease and

x =

(1r

v sin(φ)r

v cos(φ)

r

)(25)

is the vector of model parameters geometrically transformed intothe space of observables in noise-free conditions Equation (24)assumes correlated Gaussian uncertainties in the measurements

The posterior distribution can then be obtained by multiply-ing the likelihood by a suitable prior The simplest assumptionwould be a separable prior such that p(θ) = p(r) middot p(v) middot p(φ)where p(v) and p(φ) should reflect our knowledge about thedynamical properties of the population from where the sourceor sources were drawn (eg thin disk thick disk bulge halo)Again hierarchical models can be used in the analysis of sam-ples in order to infer the population properties (prior hyper-parameters) themselves

Similar procedures can be followed to infer kinematic ener-gies or full 3D velocities when the forward model is extendedwith radial velocity measurements

43 Further recommendations

In this subsection we provide some further recommendationsand guidance in the context of the Bayesian approach outlinedabove Although powerful inference with Bayesian methodsusually comes at a large computational cost Some of the rec-ommendations below can also be seen in the light of taking dataanalysis approaches that approximate the Bayesian methodologyand can be much faster

Where possible formulate the problem in the data spaceThe problems caused by the ill-defined uncertainties on quan-tities derived from parallaxes can be avoided by carrying outthe analysis in the data space where the behaviour of the un-certainties is well understood This means that the quantities tobe inferred are treated as parameters in a forward or generativemodel that is used to predict the data Some adjustment processthen leads to estimates of the parameters A very simple forwardmodelling example can be found in Schroumlder et al (2004) whostudied the luminosity calibrations of O-stars by predicting the

Article number page 14 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

expected Hipparcos parallaxes from the assumed luminosity cal-ibration and comparing those to the measured parallaxes A morecomplex example can be found in Lindegren et al (2000) whopresent a kinematic model for clusters which describes the ve-locity field of the cluster members and predicts the proper mo-tions accounting for the astrometric uncertainties and covari-ances As shown in previous sections the Bayesian approachnaturally lends itself to (and in fact requires) forward modellingof the data

Forward modelling has the added advantage that it forces usto consider the proper formulation of the questions asked fromthe astrometric data This naturally leads to the insight that oftenthe explicit knowledge of the distances to sources is not of inter-est For example in the Schroumlder et al (2004) case an assumedluminosity of the O-stars and their known apparent magnitudeis sufficient to predict the observed parallaxes In more complexanalyses the distances to sources can often be treated as nuisanceparameters which in a Bayesian setting can be marginalised outof the posterior

Use all relevant information Although the parallax has a di-rect relation to the distance of a star it is not the only measure-ment that contains distance information The apparent magni-tude and the colour of a star carry significant information on theplausible distances at which it can be located as the colour pro-vides information on plausible absolute magnitude values for thestar This is used in two papers (Leistedt amp Hogg 2017a An-derson et al 2017) in which the information contained in thephotometry of stars is combined with Gaia DR1 parallax in-formation to arrive at more precise representations of the colour-magnitude diagram than can be obtained through the parallaxesalone Similarly proper motions contain distance informationwhich can be used to good effect to derive more precise distancesor luminosities for open cluster members (eg de Bruijne et al2001 Gaia Collaboration et al 2017b) The Bayesian approachnaturally allows for the combination of different types of data inthe modelling of the problem at hand It should be emphasisedhowever that adding additional data necessitates increasing themodel complexity (eg to include the relation among apparentmagnitude absolute magnitude extinction and parallax) whichleads to the need to make more assumptions

Incorporate a proper prior Bailer-Jones (2015) discussed thesimplest case of inferring the distance to a source from a sin-gle observed parallax He showed that if the minimal prior thatr should be positive is used the resulting posterior is not nor-malisable and hence has no mean variance other momentsor percentiles This behaviour of the posterior is not limitedto inference of distance Examples of other quantities that arenon-linearly related to parallax are absolute magnitude (M prop

log10 $) tangential velocity (vT prop 1$) kinetic energy and an-gular momentum (both proportional to 1$2 when determinedrelative to the observer) In all these cases it can be shown thatthe posterior for an improper uniform prior is not normalisableand has no moments The conclusion is that a proper prior onthe parameters to be estimated should be included to ensure thatthe posterior represents a normalised probability distribution Ingeneral using unconstrained non-informative priors in Bayesiandata analysis (such as the one on r above) is bad practice In-evitably there is always a mismatch between the prior and thetrue distribution (if there were not there would be no need todo the inference) This will unavoidably lead to some biases al-though the better the data the smaller these will be We cannot

expect to do much better than our prior knowledge in case weonly have access to poor data The Bayesian approach guaran-tees a graceful transition of the posterior into the prior as thedata quality degrades

The above discussion raises the question of what priors to in-clude for a specific inference problem Simple recipes cannot begiven as the prior to be used depends on the problem at hand andthe information already available When deciding on a prior weshould keep in mind that some information is always availableEven if a parallax is only available for a single star we knowthat it cannot be located at arbitrary distances It is unlikely tobe much closer that 1 pc (although we cannot fully exclude thepresence of faint stars closer than Proxima Centauri) and it mustbe located at a finite distance (we can observe the star) Thiswould suggest a non-informative uniform prior on r with lib-eral lower and upper bounds (such that the prior is normalised)However as pointed out in Bailer-Jones (2015) a uniform distri-bution in r implies a space density of stars that falls of as 1r2

from the position of the Sun which is of course physically im-plausible Hence one should assume a reasonable distribution ofstars in space and construct the prior on r accordingly Bailer-Jones (2015) presents a prior derived from a uniform space den-sity with an exponential cut-off which was used in Astraatmadjaamp Bailer-Jones (2016b) to derive distances to stars for which par-allaxes are listed in Gaia DR1 This prior should not be usedindiscriminately at the very least we should carefully considerthe choice of the scale length L (or leave that as a parameter tobe estimated as described in Sect 412) and in most cases amore tailored prior based on our broad knowledge of the distri-bution of a given stellar population in the Milky Way would bebetter The tutorial cases introduced in the next section containsome more examples of priors on distance and other astrophysi-cal parameters

The next two items discuss simplifications to the Bayesianapproach that nevertheless need to be justified carefully

Maximum likelihood modelling We have seen that priors arethe bridges that allow us to go from the probability of the obser-vations given the unknown parameters to the desired probabil-ity of the parameters given the observations It is only based onthis probability distribution that we can make statements aboutcredible intervals (uncertainties) for the inferred quantities orselect amongst competing models (a topic that is beyond thescope of this paper) However if making prior-free inferences ispreferred then maximising the likelihood is the only alternativeThe kinematic modelling presented in Lindegren et al (2000)is a non-trivial example of this A more complex example canbe found in Palmer et al (2014) The ML approach just as theBayesian framework described above allows the combinationof different types of data and accounts for selection functionsor missing data We have seen in Sect 411 that the maximumlikelihood estimate of the distance given a single parallax mea-surement is ρ = 1

$and this is a poor estimator of the distance

except for subsets of very accurate measurements In generalthe Bayesian and the maximum likelihood estimates coincide inthe limit of very small uncertainties or infinite numbers of mea-surements In such limits the maximum likelihood estimate issimpler to obtain although its computational cost may still belarge as the ML method is often equivalent to a complex optimi-sation problem involving a multi-dimensional function of manyparameters

Article number page 15 of 20

AampA proofs manuscript no 32964_Arxiv

Selecting the lsquobestrsquo data Analyses that use parallax data areoften restricted to positive parallaxes with relative uncertaintiesbelow some limit (typically 20 ) This allows working in aregime where the uncertainties of derived quantities such as dis-tance tangential velocities luminosity etc are thought to bemanageable which allows working in the space of astrophysi-cal variables rather than the data Truncation on relative parallaxerror might be justified in an exploratory phase of the data anal-ysis however there are a number of reasons why this approachis almost never advisable Even at relative uncertainties below02 the quantities calculated from the parallax can be biased andsuffer from a large variance (see Bailer-Jones 2015) More im-portantly however the selection of lsquogoodrsquo parallaxes will biasthe sample studied to nearby andor bright representatives of anystellar population and the selection may lead to discarding avery large fraction of the potential sample Hence any inferencesbased on such data will be severely biased This is clearly illus-trated in Fig 7 where for an even less strict truncation of starswith a relative uncertainties below 50 the distribution of dis-tances of the resulting sample is clearly biased with respect theoriginal sample

Accounting for data selection and incompleteness Althoughthe Gaia survey is designed such that the only selection ofsources is that they are brighter than the survey limit at G = 207the combination of the onboard detection algorithm the Gaiascanning law and the filtering of results of insufficient qualityprior to a data release lead to a complex selection function espe-cially in the early data releases This selection function should betaken into account in any data analysis and this is most naturallydone as part of a Bayesian analysis where the selection func-tion is part of the forward model that predicts the data given themodel parameters Precise prescriptions of the selection func-tions are not foreseen to be part of the data release documenta-tion Hence selection function parameters need to be includedas part of the parameters inferred by the Bayesian analysis or ifthis is not possible the selection functions have to be borne inmind when interpreting the results

Covariances in the uncertainties All the uncertainties on theastrometric data quoted in the Gaia catalogue are presented asfull covariance matrices where the diagonal elements representthe standard uncertainties on the astrometric parameters and theoff-diagonal elements the covariances or correlations betweenthe uncertainties This amounts to treating the astrometric datavector as having been drawn from a multivariate normal distri-bution with a covariance matrix as given in the Gaia catalogueThe covariances are most easily handled in the data space as partof the likelihood (see Lindegren et al 2000 for an example inthe context of kinematic modelling) If the covariances in theastrometric uncertainties are not accounted for we can easily bemisled for example by spurious features in the inferred velocityfield of an open cluster (Brown et al 1997)

The uncertainties are also correlated from one source tothe next especially over small distances on the sky A studyof the star-to-star correlations in the parallax uncertainties inGaia DR1 was done for the Kepler field (Zinn et al 2017)where independent and precise asteroseismic distances to thestars are available enabling the authors to derive an expressionfor the correlation strength and spatial scale This expressioncan be used for studies of the Kepler field but care should betaken when extrapolating to other fields on the sky The func-tional form for the star-to-star correlations used by Zinn et al

(2017) could be introduced as part of the forward model withthe Zinn et al (2017) parameters as a good first guess

For Gaia DR1 the length scale for the star-to-star corre-lations was estimated to vary from subdegree scales to tens ofdegrees on the sky where Zinn et al (2017) derived the corre-lation function over length scales of sim 02 to sim 10 degrees ForGaia DR2 Gaia Collaboration et al (2018c) estimate that thespatial correlations extend over scales of below 1 degree to 10ndash20 degrees

Accounting for non-Gaussian andor systematic uncertain-ties Although the bulk of the sources in the Gaia catalogue havenormally distributed uncertainties there is a significant fractionfor which the uncertainties exhibit non-Gaussian behaviour (egwhen uncertainties are over- or underestimated) This can be ac-counted for in the data analysis by including the uncertainties aspart of the forward model (eg Sesar et al 2017) or by explicitlymodelling an outlier population Similarly systematic uncertain-ties can be included as part of the forward model For exampleSesar et al (2017) include a global parallax zero-point as part oftheir probabilistic model used to analyse the period-luminosityrelation of RR Lyrae stars with Gaia DR1 data An alternativeapproach to the investigation of systematics in the parallaxes (ordistance estimates obtained from photometry or spectroscopyfor example) is presented in Schoumlnrich et al (2012) and is ap-plied to Gaia DR1 in Schoumlnrich amp Aumer (2017) In this casewe can consider that for samples covering a significant fractionof the sky any systematic error in the estimated distances to thestars will show up as correlations in their 3D velocity compo-nents The presence of such correlations can be used to make in-ferences about systematic errors for example in the parallaxes

Systematic uncertainties are more difficult to handle as theymay show variations as a function of source brightness or ce-lestial position they may be correlated across neighbouringsources and they may not be well understood for a given earlyGaia data release In general the information needed to accu-rately model systematic uncertainties or uncertainty correlationsbetween sources may not be readily available This informationcan be obtained from a comparison between Gaia and otherhigh-precision data (eg Zinn et al 2017 Arenou et al 2017Gaia Collaboration et al 2018a) or by examining for exampleplots of the parallax or proper motion averaged over sky regionsfor samples where the true parallax or proper motion values canbe assumed to be known such as zero parallax and proper mo-tion for quasars (see Gaia Collaboration et al 2018c for exam-ples)

Two special cases should be mentioned when the sampleis well distributed over the sky we can safely assume that thelocal systematics vanish and that only the global parallax zero-point need to be subtracted locally we may be interested not bythe absolute value of the parallaxes but by the relative ones inwhich case the difference between parallaxes and their averageremoves part of the systematics

There is no general recipe for dealing with non-Gaussian un-certainties or correlated systematic uncertainties The main ad-vice we can give here is to proceed with the analysis of the astro-metric data as they are but to keep in mind the systematics andcorrelations discussed in Sect 2 when interpreting the resultsAlternatively the forward model can be extended to include sys-tematic and correlation terms for which parameters are also tobe estimated Such models can be guided by the studies of sys-tematic uncertainties mentioned above

Article number page 16 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Testing with simulations Finally we strongly advise that theinference problem at hand should be investigated through simu-lated data and that the simulations performed should be as closeas possible to the real data (in particular correctly modelling theuncertainties and selection effects) The simulations allow theanalysis method to be developed and tested for accuracy How-ever the performance of the analysis method should be inter-preted strictly in terms of how well the assumed model explainsthe simulated observed data That is we should not fall into thetrap of trying to tune the analysis method to get an answer that isas close to the lsquotruthrsquo as possible In real problems we can onlyjudge the adequacy of a model and its parameter values by howwell they predict the observed data (to within the observationaluncertainties it should be stressed as we should avoid lsquoover-fittingrsquo the data)

5 Using astrometric data practical examples

We introduce here a few worked examples to illustrate thepoints that were made in the previous section These ex-amples are available in full detail as online tutorials in theform of source code accompanied by much more exten-sive explanation than can be provided here The source codeand corresponding Python and R Notebooks can be foundat the following URL httpsgithubcomagabrownastrometry-inference-tutorials In all cases the readeris strongly encouraged to download the source code and exper-iment with modifications of the simulated input data andor theparameter choices in the inference methods

51 Comparison of distance and distance modulusestimators

The use of the Bayesian inference with non-informative priorsdescribed in Sect 411 is illustrated and implemented inthe following tutorial httpsgithubcomagabrownastrometry-inference-tutorialstreemastersingle-sourceGraphicalUserInterface The tutorialcompares the performance of Bayesian distance estimationmethods with the SmithndashEichhorn transformation (Smith ampEichhorn 1996) (Sect 341) and the naive parallax inversion

The tutorial contains a Graphical User Interface that easilyvisualises and compares the behaviour of all these estimators fora given parallax and uncertainty For the Bayesian inference es-timations using the mode and the median are provided togetherwith a 90 confidence interval The tutorial also provides a li-brary pyrallaxes with the implementation of all these estima-tors The library can easily be customised to implement otherpriors for the Bayesian inference

Additionally an implementation of the Bayesian distanceestimator using the Exponentially Decreasing Space Densityprior introduced in Bailer-Jones (2015) will be available inTopCat (httpwwwstarlinkacuktopcat) and Stilts(httpwwwstarlinkacukstilts) from respectivelyversions 46 and 31-3 onwards

52 Inferring the distance to a single source using just theparallax

The issues surrounding the use of a parallax to infer a distancewere explored in Bailer-Jones (2015) and applied to simulatedGaia data in Astraatmadja amp Bailer-Jones (2016a) and to TGAS(Gaia DR1 ) in Astraatmadja amp Bailer-Jones (2016b) A

tutorial exploring this is provided at httpsgithubcomagabrownastrometry-inference-tutorialstreemastersingle-sourcetutorial This can be used toinvestigate how the posterior depends on the prior and the dataIt also includes a simple example of a hierarchical model toavoid specifying the exact length scale of a distance prior

53 Inferring the distance to and size of a cluster using justthe parallaxes

In many applications we are more interested in the averagedistance to a cluster (or a group of stars) rather than to theindividual stars In this case a basic mistake to be avoidedis estimating the individual distances (whatever the method)and then averaging these individual values A more correctapproach is to average the individual parallaxes and then toobtain a distance from this average value However a muchbetter solution is to set up a model for the cluster whichwould use as parameters the distance to its centre for exampleand some measure of its size and to infer its parametersThis is explored in the tutorial at httpsgithubcomagabrownastrometry-inference-tutorialstreemastermultiple-source This introduces the overall prob-lem and derives a general solution Code is implemented forthe specific case of a model which assumes a random isotropicdistribution of the true stars from the centre of the cluster Thismodel has two parameters the cluster distance and the clustersize The solution uses a small angle approximation to makethe problem simpler although it is easily extended to the caseof clusters with a significant angular extent It is applied to thePleiades selection from the Gaia DR1 main release paper(Gaia Collaboration et al 2016) The tutorial also considers theproblem of how to accommodate correlations in the measuredparallaxes of different stars Finally it also shows the resultsfrom a classical and a naive combination of stellar parallaxesto estimate the cluster distance The combination of parallaxesand proper motions of individual stars in a cluster into a singlesolution for the mean parallax and proper motion is treated asan iterative least squares problem in Gaia Collaboration et al(2017b) (see their Appendix A for details)

54 Inferring the distance and velocity of a source using theparallax and proper motions

The velocity (speed and direction) of a source in the planeof the sky can be inferred from measurements of its parallaxand two proper motions The uncertainties in all three affectthe inferred velocity and its uncertainty Moreover as theGaia parallaxes and proper motions generally have non-zerocorrelations these must also be taken into account Thiscan be done in a straightforward manner in the Bayesianapproach as is shown in the tutorial at httpsgithubcomagabrownastrometry-inference-tutorialstreemaster3d-distance This sets up a three-parametermodel (distance speed angle) for a source Using the threemeasurements (parallax two proper motions) in a multivariateGaussian likelihood and suitable priors on the parameters wecan compute the trivariate posterior This is sampled in theposterior using an MCMC algorithm for a set of stars

Article number page 17 of 20

AampA proofs manuscript no 32964_Arxiv

55 Luminosity calibration

In this tutorial (httpsgithubcomagabrownastrometry-inference-tutorialstreemasterluminosity-calibration) the problem of inferring (orcalibrating) the mean absolute magnitude of a specific classof stars is treated The measurements at hand are the parallaxand apparent magnitude for each of the stars in the sampleand the task is to infer their mean absolute magnitude microM andthe spread σM around this mean This is very similar to theproblem that Lutz amp Kelker (1973) and Turon Lacarrieu ampCreacutezeacute (1977) treated and a Bayesian approach to solving thisproblem was presented by Brown (2012) (albeit with the use ofimproper priors which we again note is bad practice) A morecomplex version of this problem (accounting for extinction anda contaminating population of stars) and its Bayesian solutionwas presented in Hawkins et al (2017) In this tutorial threeimportant points are illustrated

ndash Often the explicit computation of the distances to stars is notof interest In this example only the mean absolute magni-tude of the stars is to be estimated and the forward mod-elling approach as part of the Bayesian inference avoids theneed to calculate or estimate distances

ndash The data for all the stars carry information on the mean ab-solute magnitude including the negative parallaxes or par-allaxes with large relative errors This information can nat-urally be incorporated in a forward modelling approach (inthis example as part of a Bayesian inference method) thusavoiding the introduction of truncation biases caused by theselection of stars with lsquogoodrsquo parallaxes

ndash If the selection function is known (in this example the surveyis magnitude limited) it can and should be included in theforward modelling This accounts for sample selection biasesthat would otherwise occur

56 Period-luminosity relation

In this tutorial (httpsgithubcomagabrownastrometry-inference-tutorialstreemasterperiod-luminosity-relation) we include a hierarchicalmodel to infer period-luminosity-metallicity relations forclassical pulsating stars The full model can be applied tofundamental mode RR Lyrae stars and the abridged version(without the metallicity dependence) is suitable for samples ofclassical Cepheids We include the data set for the RR Lyraestars described and used for inference in Gaia Collaborationet al (2017a) and Delgado et al (2018) It contains a sampleof 200 stars (including fundamental radial pulsators but alsofundamentalised first overtone pulsators) with measured pe-riods apparent magnitudes in the K-band metallicities andparallaxes from the TGAS catalogue In the tutorial we describethe hierarchical model and discuss potential biases in the dataset Finally we analyse the sensitivity of the results to differentchoices of the priors and related parameters

6 Conclusions

Gaia data releases will provide a huge increase of astrometricdata available for the scientific community More than a billionparallaxes and proper motions allow new openings into many as-tronomical topics In most cases astronomers are exploiting theGaia catalogues to obtain physical quantities such as distanceand velocity Although it is easy to extract from the Gaia data it

is well known that direct inversion of parallax will lead to biaseswhich become more and more significant the larger the relativeparallax uncertainty While Gaia will provide high-quality astro-metric measurements hundreds of millions of stars have preci-sions which require proper statistical treatment in order to avoidbiased conclusions The aim of this paper is to guide the users ofGaia data to handle astrometric data correctly

In this study we summarise methods used to avoid bi-ases when converting astrometric data into physical quantitiesStarting from simple non-recommended sample truncation tomore complex methods the biases associated with the meth-ods are presented The basic recommendation is to treat deriva-tion of physical quantities from astrometric measurements asan inference problem which should be preferably handled withBayesian approach The recommended methods are described inSect 4 with a summary in Sect 43 To aid the users furtherSect 5 contains practical examples with links to Python and Rcode

Gaia will provide fundamental data for many fields of as-tronomy Further data releases will provide more data and moreprecise data Nevertheless for full use of the potential it will al-ways be necessary to pay careful attention to the statistical treat-ment of parallaxes and proper motions The purpose of this paperis to help astronomers find the correct approachAcknowledgements This work has made use of results from the European SpaceAgency (ESA) space mission Gaia the data from which were processed by theGaia Data Processing and Analysis Consortium (DPAC) Funding for the DPAChas been provided by national institutions in particular the institutions partici-pating in the Gaia Multilateral Agreement The Gaia mission website is httpwwwcosmosesaintgaia The authors are current or past members of theESA Gaia mission team and of the Gaia DPAC This work was supported by theMINECO (Spanish Ministry of Economy) through grant ESP2016-80079-C2-1-R (MINECOFEDER UE) and ESP2014-55996-C2-1-R (MINECOFEDERUE) and MDM-2014-0369 of ICCUB (Unidad de Excelencia lsquoMariacutea de Maeztursquo)and by the DLR (German space agency) via grant 50 QG 1403

ReferencesAnderson L Hogg DW Leistedt B Price-Whelan AM Bovy J Jun 2017

ArXiv e-printsArenou F Luri X 1999 In Egret D Heck A (eds) Harmonizing Cosmic

Distance Scales in a Post-HIPPARCOS Era vol 167 of Astronomical Societyof the Pacific Conference Series 13ndash32

Arenou F Luri X Babusiaux C et al Feb 2017 AampA 599 A50Astraatmadja TL Bailer-Jones CAL Dec 2016a ApJ 832 137Astraatmadja TL Bailer-Jones CAL Dec 2016b ApJ 833 119Bailer-Jones CAL Oct 2015 PASP 127 994Bovy J Sep 2017 MNRAS 470 1360Brown AGA Nov 2012 Statistical AstrometryBrown AGA Arenou F van Leeuwen F Lindegren L Luri X Aug 1997

In Bonnet RM Hoslashg E Bernacca PL et al (eds) Hipparcos - Venice rsquo97vol 402 of ESA Special Publication 63ndash68

de Bruijne JHJ Hoogerwerf R de Zeeuw PT Feb 2001 AampA 367 111Delgado HE Sarro LM Clementini G Muraveva T Garofalo A Mar 2018

ArXiv e-printsEddington AS Mar 1913 MNRAS 73 359ESA (ed) 1997 The HIPPARCOS and TYCHO catalogues Astrometric and

photometric star catalogues derived from the ESA HIPPARCOS Space As-trometry Mission vol 1200 of ESA Special Publication (available online athttpswwwcosmosesainthipparcos)

Gaia Collaboration Brown AGA Vallenari A et al Nov 2016 AampA 595A2

Gaia Collaboration Clementini G Eyer L et al Sep 2017a AampA 605 A79Gaia Collaboration van Leeuwen F Vallenari A et al May 2017b AampA 601

A19Gaia Collaboration Arenou F Luri X et al Apr 2018a AampA (special issue

for Gaia DR2)Gaia Collaboration Brown AGA Vallenari A Prusti T et al Apr 2018b

AampA (special issue for Gaia DR2)Gaia Collaboration Lindegren L Hernaacutendez J et al Apr 2018c AampA (special

issue for Gaia DR2)

Article number page 18 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Hawkins K Leistedt B Bovy J Hogg DW Oct 2017 MNRAS 471 722Iorio G Belokurov V Erkal D et al Feb 2018 MNRAS 474 2142Leistedt B Hogg DW Dec 2017a AJ 154 222Leistedt B Hogg DW Mar 2017b ArXiv e-printsLindegren L Madsen S Dravins D Apr 2000 AampA 356 1119Lindegren L Lammers U Hobbs D et al Feb 2012 AampA 538 A78Lindegren L Lammers U Bastian U et al Nov 2016 AampA 595 A4Lutz TE Kelker DH Oct 1973 PASP 85 573Malmquist GK Mar 1920 Meddelanden fran Lunds Astronomiska Observato-

rium Serie II 22 3Palmer M Arenou F Luri X Masana E Apr 2014 AampA 564 A49Robin C Luri X Reyleacute C et al 2012 AampA 543 A100Schoumlnrich R Aumer M Dec 2017 MNRAS 472 3979Schoumlnrich R Binney J Asplund M Feb 2012 MNRAS 420 1281Schroumlder SE Kaper L Lamers HJGLM Brown AGA Dec 2004 AampA

428 149Sesar B Fouesneau M Price-Whelan AM et al Apr 2017 ApJ 838 107Smith H Jr Eichhorn H Jul 1996 MNRAS 281 211Trumpler RJ Weaver HF 1953 Statistical astronomyTuron Lacarrieu C Creacutezeacute M Apr 1977 AampA 56 273Zinn JC Huber D Pinsonneault MH Stello D Aug 2017 ApJ 844 166

Article number page 19 of 20

AampA proofs manuscript no 32964_Arxiv

Appendix A Description of the simulated samplesused in this paper

The data used in this paper is from a Gaia Universe Model Snap-shot (GUMS) simulation (Robin C et al 2012) together withthe Gaia-like uncertainties and an estimation of the observabledata The uncertainties were computed from an implementationof the recipes described in the Gaia Science performance webpage4 provided by the python PyGaia toolkit5 which then haveto be re-scaled to fit the Gaia DR2 expectations

The simulation contains around 109 sources including onlysingle stars ie stars not belonging to multiple systems up toG lt 20 magnitude distributed as shown in Fig A1

0 5 10 15 20 25G magnitude [mag]

100

101

102

103

104

105

106

107

108

109

Num

ber

of

sourc

es

[log]

G magnitude histogram

Fig A1 Histogram of the starsrsquo G magnitude The total number ofsources is 1 069 138 714 which are distributed in bins of size ∆G = 02the sample is limited to G lt 20

To compute the Gaia potential observables standard uncer-tainties must be added to the simulation Because astrometricquantities (positions parallaxes and proper motions) are relateda single formalism to derive its standard errors is needed PyGaiaimplements a simple performance model depending on the VndashICcolour term and the G magnitude to estimate the end-of-missionerrors for the parallax uncertainty

σ$[microas] =radicminus1631 + 680766 z + 32732 z2 [0986+(1 minus 0986) (V-IC)]

(A1)

with

z = max(1004(1209minus15)1004(Gminus15)) (A2)

(see the Gaia science performance web for more details) It alsotakes into account the variation of the uncertainties over the skybecause of the scanning law tabulated as a function of the eclip-tic latitude6 β

However these are end-of-mission uncertainties so theyhave to be scaled by the fraction of mission time completed in or-der to get an estimation of them for Gaia DR2 In the case of the

parallax only the factorradic

5L being L the mission time (years)

included in Gaia DR2 need to be applied this error model is

4 httpswwwcosmosesaintwebgaiascience-performance5 httpspypipythonorgpypiPyGaia6 httpswwwcosmosesaintwebgaiatable-2-with-ascii

further described in Arenou et al (2017) In our case we havealso updated the calibration floor to take into account the prop-erties of the Gaia DR2 formal errors as shown in Fig A2 Thiscalibration floor introduces a minimum formal error stemmingfrom the fact that the calibrations used in the data processing(models and parameters) are at this stage still being refined Thisfloor affects mainly bright stars while for faint stars the photonnoise dominates Fig A2 shows the model used in the simula-tion for the Gaia DR2 parallax uncertainties as a function of theG magnitude

Fig A2 Average estimated errors as function of G The dash-dottedline represents the uncertainty at Gaia DR2 while the dashed linerepresents the end-of-mission uncertainties The solid line representsGaia DR2 errors including systematics

Article number page 20 of 20

  • 1 Introduction
  • 2 Gaia astrometric data
    • 21 Uncertainties
    • 22 Correlations
    • 23 Systematic errors
    • 24 Completeness
      • 3 Critical review of the traditional use of parallaxes
        • 31 Measurement of parallaxes
        • 32 Estimating distance by inverting the parallax
        • 33 Sample truncation
        • 34 Corrections and transformations
          • 341 The SmithndashEichhorn correction
          • 342 The LutzndashKelker correction
            • 35 Astrometry-based luminosity
              • 4 Recommendations for using astrometric data
                • 41 Using Gaia astrometric data how to proceed
                  • 411 Bayesian inference of distances from parallaxes
                  • 412 Bayesian inference of distances from parallaxes and complementary information
                    • 42 Absolute magnitudes tangential velocities and other derived quantities
                    • 43 Further recommendations
                      • 5 Using astrometric data practical examples
                        • 51 Comparison of distance and distance modulus estimators
                        • 52 Inferring the distance to a single source using just the parallax
                        • 53 Inferring the distance to and size of a cluster using just the parallaxes
                        • 54 Inferring the distance and velocity of a source using the parallax and proper motions
                        • 55 Luminosity calibration
                        • 56 Period-luminosity relation
                          • 6 Conclusions
                          • A Description of the simulated samples used in this paper
Page 2: Gaia Data Release 2: Using Gaia parallaxes · The Gaia Data Release 2 (Gaia DR2 ) Gaia Collaboration et al. (2018b) provides precise positions, proper motions, and paral-laxes for

AampA proofs manuscript no 32964_Arxiv

strations to full Bayesian analysis available as Python and Rnotebooks and source code from the tutorial section on the Gaiaarchive 1

2 Gaia astrometric data

The Gaia astrometry ie celestial coordinates trigonometricparallaxes and proper motions for more than one billion ob-jects results from the observations coming from the spacecraftinstruments and their subsequent processing by the Gaia DataProcessing and Analysis Consortium (DPAC) The astrometricprocessing is detailed in Gaia Collaboration et al (2018c) andreaders are strongly encouraged to familiarise themselves withthe contents of that paper in order to understand the strengthsand weaknesses of the published astrometry and in particular ofthe parallaxes The processed data was submitted to extensivevalidation prior to publication as detailed in Gaia Collaborationet al (2018a) This paper is also highly recommended in orderto gain a proper understanding of how to use and how not to usethe astrometric data As a simple and striking example a smallnumber of sources with unrealistic very large positive and verylarge negative parallaxes are present in the data Advice on howto filter these sources from the data analysis is provided in theGaia DR2 documentation

21 Uncertainties

The published parallaxes and more generally all astrometricparameters are measured quantities and as such have an asso-ciated measurement uncertainty These uncertainties are pub-lished source per source and depend mostly on position onthe sky as a result of the scanning law and on magnitudeFor parallaxes uncertainties are typically around 004 mas forsources brighter than sim14 mag around 01 mas for sourceswith a G magnitude around 17 and around 07 mas at the faintend around 20 mag The astrometric uncertainties provided inGaia DR2 have been derived from the formal errors computedin the astrometric processing Unlike for Gaia DR1 the par-allax uncertainties have not been calibrated externally ie theyare known as an ensemble to be underestimated by sim8ndash12for faint sources (G gtsim 16 mag) outside the Galactic plane and byup to sim30 for bright stars (G ltsim 12 mag) Based on an assess-ment of the measured parallaxes of a set of about half a millionknown quasars which can be assumed in practice to have zeroparallax the uncertainties are normally distributed with impres-sive approximation (Fig 1) However as is common when tak-ing measurements and especially in such large samples like theGaia catalogue there are small numbers of outliers even up tounrealistically high confidence levels (eg at the 100σ level)

22 Correlations

The parallaxes for each source published in Gaia DR2 havenot been derived in isolation but result from a simultaneousfive-parameter fit of an astrometric source model to the data InGaia DR2 only one astrometric source model has been usedthat of a single star This model assumes a uniform rectilinearspace motion relative to the solar system barycentre The astro-metric data in Gaia DR2 thus comprise five astrometric param-

1 httpsgithubcomagabrownastrometry-inference-tutorials

iexcl6 iexcl5 iexcl4 iexcl3 iexcl2 iexcl1 0 1 2 3 4 5 6

Normalised centred parallax ($ + 0029)=frac34$

1

10

100

1000

1e4

1e5

Num

ber

per

bin

of0

1

Fig 1 Distribution of normalised re-centred parallaxes of 556 849quasars from the AllWISE catalogue present in Gaia DR2 (blue curve)The grey curve denotes the subsample composed of 492 920 sourceswith parallax errors σ$ lt 1 mas The centring adopted in this plotreflects a global parallax zero-point shift of minus0029 mas Ideally bothcurves should follow a normal distribution with zero mean and unit vari-ance The red curve shows a Gaussian distribution with the same stan-dard deviation (1081) as the normalised centred parallaxes for the fullsample Figure from Gaia Collaboration et al (2018c)

eters2 with their associated uncertainties but also ten correlationcoefficients between the estimated parameters It is critical to usethe full (5 times 5) covariance matrix when propagating the uncer-tainties on subsets andor linear combinations of the astrometricparameters

As an example consider the transformation of the measuredproper motions microαlowast and microδ in equatorial coordinates to equivalentvalues microllowast and microb in galactic coordinates Following the notationin ESA (1997 Sections 12 and 15) we have(microllowastmicrob

)=

(c sminuss c

) (microαlowastmicroδ

) (1)

where the 2 times 2 matrix is a rotation matrix that depends on theobjectrsquos coordinates α and δ c = c(αlowast δ) and s = s(αlowast δ) Inorder to transform the proper-motion errors from the equatorialto the galactic system we have

Clb =

(σ2microllowast

ρmicrobmicrollowastσmicrollowastσmicrob

ρmicrobmicrollowastσmicrollowastσmicrob σ2

microb

)(2)

= JCαδJ prime (3)

=

(c sminuss c

) (σ2microαlowast

ρmicroδmicroαlowastσmicroαlowastσmicroδ

ρmicroδmicroαlowastσmicroαlowastσmicroδ σ2

microδ

) (c minusss c

) (4)

where the prime denotes matrix transposition J denotes the Ja-cobian matrix of the transformation (which for a rotation is therotation matrix itself) and C denotes the variance-covariancematrix It immediately follows that σmicrollowast and σmicrob depend on thegenerally non-zero correlation coefficient ρmicroδmicroαlowast between the equa-torial proper-motion measurements Neglecting this correlationterm can give seriously incorrect results Some further examplesof how error propagation should be handled can be found in

2 For a subset of the data only two parameters (right ascension α anddeclination δ) could be determined

Article number page 2 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

for instance Brown et al (1997) and Lindegren et al (2000)In addition to error propagation the covariance matrix shouldalso be taken into account when estimating model parametersfor example in chi-square fitting maximum likelihood estimatesBayesian analysis etc For more details see Volume 1 Section15 of ESA (1997)

23 Systematic errors

Both the design of the spacecraft and the design and implemen-tation of the data processing software and algorithms aim to pre-vent biases or systematic effects in the astrometry Systematicerrors at low levels nonetheless exist in Gaia DR2 (see GaiaCollaboration et al 2018ac) Systematic effects are complicatedand largely unknown functions of position on the sky magni-tude and colour Although systematic effects are not dealt within the remainder of this paper it is important for users to beaware of their presence

The parallaxes and proper motions in Gaia DR2 may beaffected by systematic errors Although the precise magnitudeand distribution of these errors is unknown they are believedto be limited on global scales to plusmn01 mas for parallaxes andplusmn01 mas yrminus1 for proper motions There is a significant aver-age parallax zero-point shift of about minus30 microas in the sense Gaiaminus external data This shift has not been corrected for andis present in the published data Significant spatial correlationsbetween stars up to 004 mas in parallax and 007 mas yrminus1 inproper motion exist on both small (ltsim1) and intermediate (ltsim20)angular scales As a result averaging parallaxes over small re-gions of the sky for instance in an open cluster in the Mag-ellanic Clouds or in the Galactic Centre will not reduce theuncertainty on the mean below the sim01 mas level

Unfortunately there is no simple recipe to account for thesystematic errors The general advice is to proceed with the anal-ysis of the Gaia DR2 data using the uncertainties reported in thecatalogue ideally while modelling systematic effects as part ofthe analysis and to keep the systematics in mind when interpret-ing the results

24 Completeness

As argued in the next sections a correct estimation requires fullknowledge of the survey selection function Conversely neglect-ing the selection function can causes severe biases Derivationof the selection function is far from trivial yet estimates havebeen made for Gaia DR1 (TGAS) by for instance Schoumlnrich ampAumer (2017) and Bovy (2017)

This paper does not intend to define the survey selectionfunction We merely limit ourselves to mentioning a number offeatures of the Gaia DR2 data that should be properly reflectedin the selection function The Gaia DR2 catalogue is essentiallycomplete between G asymp 12 and sim17 mag Although the com-pleteness at the bright end (G in the range sim3ndash7 mag) has im-proved compared to Gaia DR1 a fraction of bright stars inthis range is still missing in Gaia DR2 Most stars brighterthan sim3 mag are missing In addition about one out of everyfive high-proper-motion stars (micro gtsim 06 arcsec yrminus1) is miss-ing Although the onboard detection threshold at the faint end isequivalent to G = 207 mag onboard magnitude estimation er-rors allow Gaia to see fainter stars although not at each transitGaia DR2 hence extends well beyond G = 20 mag However indense areas on the sky (above sim400 000 stars degminus2) the effec-tive magnitude limit of the survey can be as bright as sim18 mag

The somewhat fuzzy faint-end limit depends on object density(and hence celestial position) in combination with the scan-lawcoverage underlying the 22 months of data of Gaia DR2 and thefiltering on data quality that has been applied prior to publica-tion This has resulted in some regions on the sky showing artifi-cial source-density fluctuations for instance reflecting the scan-law pattern In small selected regions gaps are present in thesource distribution These are particularly noticeable near verybright stars In terms of effective angular resolution the resolu-tion limit of Gaia DR2 is sim04 arcsec

Given the properties of Gaia DR2 summarised above the in-terpretation of the data is far from straightforward This is partic-ularly true when accounting for the incompleteness in any sam-ple drawn from the Gaia Archive We therefore strongly encour-age the users of the data to read the papers and documentationaccompanying Gaia DR2 and to carefully consider the warningsgiven therein before drawing any conclusions from the data

3 Critical review of the traditional use of parallaxes

We start this section by briefly describing how parallaxes aremeasured and how the presence of measurement noise leads tothe occurrence of zero and negative observed parallaxes In therest of the section we review several of the most popular ap-proaches to using measured parallaxes ($) to estimate distancesand other astrophysical parameters In doing so we will attemptto highlight the intricacies pitfalls and problems of these lsquotradi-tionalrsquo approaches

31 Measurement of parallaxes

In simplified form astrometric measurements (source positionsproper motions and parallaxes) are made by repeatedly deter-mining the direction to a source on the sky and modelling thechange of direction to the source as a function of time as a com-bination of its motion through space (as reflected in its propermotion and radial velocity) and the motion of the observing plat-form (earth Gaia etc) around the Sun (as reflected in the par-allax of the source) As explained in more detail in Lindegrenet al (2016) and Lindegren et al (2012) this basic model of thesource motion on the sky describes the time-dependent coordi-nate direction from the observer towards an object outside thesolar system as the unit vector

u(t) = 〈r + (tB minus tep)(pmicroαlowast + qmicroδ + rmicror) minus$bO(t)Au〉 (5)

where t is the time of observation and tep is a reference time bothin units of Barycentric Coordinate Time (TCB) p q and r areunit vectors pointing in the direction of increasing right ascen-sion increasing declination and towards the position (α δ) ofthe source respectively tB is the time of observation correctedfor the Roslashmer delay bO(t) is the barycentric position of the ob-server at the time of observation Au is the astronomical unitand 〈〉 denotes normalisation The components of proper mo-tion along p and q are respectively microαlowast = microα cos δ and microδ $is the parallax and micror = vr$Au is the lsquoradial proper motionrsquowhich accounts for the fact that the distance to the star changesas a consequence of its radial motion which in turn affects theproper motion and parallax The effect of the radial proper mo-tion is negligibly small in most cases and can be ignored in thepresent discussion

The above source model predicts the well-known helix orwave-like pattern for the apparent motion of a typical source onthe sky A fit of this model to noisy observations can lead to

Article number page 3 of 20

AampA proofs manuscript no 32964_Arxiv

negative parallaxes as illustrated in Fig 2 We note how in thesource model described in Eq (5) the parallax appears as thefactor minus$ in front of the barycentric position of the observerwhich means that for each source its parallactic motion on thesky will have a sense which reflects the sense of the motion ofthe observer around the Sun In the presence of large measure-ment noise (comparable to the size of the parallax) it is entirelypossible that the parallax value estimated for the source modelvanishes or becomes negative This case can be interpreted as themeasurement being consistent with the source going lsquothe wrongway aroundrsquo on the sky as shown in Fig 2

2 1 0 1 2 3 4 [mas]

8

6

4

2

0

2

[mas

]

True = 040 = 100 = 200Fit = 049 = 113 = 169Observations

2101234

[m

as]

2015 2016 2017 2018 2019Time [yr]

8

6

4

2

0

2

[mas

]

Fig 2 Example of a negative parallax arising from the astrometric dataprocessing Solid blue lines true path of the object red dots the in-dividual measurements of the source position on the sky dashed or-ange lines the source path according to the least-squares astrometricsolution which here features a negative parallax Left Path on the skyshowing the effect of proper motion (linear trend) and parallax (loops)Right Right ascension and declination of the source as a function oftime In the fitted solution the negative parallax effect is equivalent to ayearly motion of the star in the opposite direction of the true parallacticmotion (which gives a phase-shift of π in the sinusoidal curves in theright panels) The error bars indicate a measurement uncertainty of 07mas the uncertainties on ∆αlowast and ∆δ are assumed to be uncorrelated

This example is intended to clarify why parallaxes can havenon-positive observed values and more importantly to conveythe message that the parallax is not a direct measurement ofthe distance to a source The distance (or any other quantity de-pending on distance) has to be estimated from the observed par-allax (and other relevant information) taking into account theuncertainty in the measurement A simplified demonstration ofhow negative parallaxes arise (allowing the reader to reproduceFig 2) can be found in the online tutorials accompanying thispaper 3

32 Estimating distance by inverting the parallax

In the absence of measurement uncertainties the distance to astar can be obtained from its true parallax through r = 1$Truewith$True indicating the true value of the parallax Thus naivelywe could say that the distance to a star can be obtained by in-verting the observed parallax ρ = 1$ where now ρ is usedto indicate the distance derived from the observed value of theparallax For this discussion the observed parallax is assumed to

3 httpsgithubcomagabrownastrometry-inference-tutorialstreemasterluminosity-calibrationDemoNegativeParallaxipynb

be free of systematic measurement errors and to be distributednormally around the true parallax

p($ | $True) =1

σ$radic

2πexp

(minus

($ minus$True)2

2σ2$

) (6)

where σ$ indicates the measurement uncertainty on $ Blinduse of 1$ as an estimator of the distance will lead to unphys-ical results in case the observed parallax is non-positive Nev-ertheless we could still consider the use of the 1$ distanceestimate for positive values for instance a sample where mostor all of the observed values are positive or in the limiting casewhere there is a single positive parallax value In this case it iscrucial to be aware of the statistical properties of the estimate ρGiven a true distance r = 1$True what will be the behaviourof ρ We can obtain the probability density function (PDF) of ρfrom Eq (6) as

p(ρ | $True) = p($ = 1ρ | $True) middot∣∣∣∣∣d$dρ

∣∣∣∣∣=

1

ρ2σ$radic

2πexp

(minus

(1ρ minus$True)2

2σ2$

)(7)

In Fig 3 we depict p(ρ | $True) for two extreme cases of verylow and very high relative uncertainty The shape of p(ρ | $True)describes what we can expect when using ρ as an estimate ofthe true distance r The distribution of the figure on the left cor-responds to a case with a low fractional parallax uncertaintydefined as f = σ$$True It looks unbiased and symmetricalThus using ρ = 1$ to estimate the distance in a case like thisis relatively safe and would lead to more or less reliable resultsHowever in spite of its appearance the figure hides an intrin-sic non-Gaussianity that is made evident in the right-hand fig-ure This second plot corresponds to the case of high fractionalparallax uncertainty and the distribution shows several featuresfirst the mode (the most probable value) does not coincide withthe true distance value second the distribution is strongly asym-metric and finally it presents a long tail towards large values ofρ For more extreme values of f there is a noticeable negativetail to this distribution corresponding to the negative tail of theobserved parallax distribution

Fig 3 PDF of ρ = 1$ in two extreme cases The red vertical line indi-cates the true distance r Left Object at r = 100 pc with an uncertaintyon the observed parallax of σ$ = 03 mas Right Object at r = 2000 pcwith an uncertainty on the observed parallax of σ$ = 03 mas

Article number page 4 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

In view of Fig 3 it is tempting to apply corrections to theρ estimator based on the value of the fractional parallax uncer-tainty f Unfortunately in order to do so we would need to knowthe true value of the parallax and f Using the apparent fractionaluncertainty fapp = σ$$ is not feasible since the denominatorin f (the true parallax) can be very close to zero so its distribu-tion has very extended wings and using fapp will often result ingross errors

Furthermore reporting a ρ value should always be accompa-nied by an uncertainty estimate usually the standard deviationof the estimator but the standard deviation or the variance is de-fined in terms of an unknown quantity $True In addition thelong tail shown in the right panel of Fig 3 makes the estimatesof the variance quickly become pathological as discussed below

In order to clarify the previous assertions we recall the clas-sical concept of bias because it plays a central role in the discus-sion that develops in this section In statistics an estimator is saidto be biased if its expected value differs from the true value Inour case we aim to infer the true value of the parallax $True (oralternatively related quantities such as the true distance r ab-solute magnitude luminosity or 3D velocity components) andwe aim to infer it from the measured parallax In the Gaia casethis measured parallax will be affected by quasi-Gaussian uncer-tainties (see Sect 21) In this case the expectation value of theobserved parallax coincides with the true value

E[$] =

int$p($|$True)middotd$ =

int$N($$True σ$)middotd$ = $True

(8)

where N($$True σ$) represents the Gaussian probability dis-tribution centred at the true parallax and with a standard devia-tion σ$ Hence the observed parallax is an unbiased estimatorof the true parallax (under the strong hypothesis that there are nosystematic biases associated with the survey and that the errorsare normally distributed)

Now in order to assess the bias of ρ = 1$ as an estimatorof the true distance we need to calculate its expected value

E[ρ] = E[1$] =

int1$middotp($|$True)middotd$ =

int1$middotN($True σ$)middotd$

(9)

This bias was approximated by Smith amp Eichhorn (1996) (seeSect 341) as a function of the fractional parallax uncertainty fusing a series expansion of the term in the integral and severalapproximations for the asymptotic regimes of small and largevalues of f and it indeed shows that the distance estimator 1$is unbiased for vanishingly small values of f but it rapidly be-comes significantly biased for values of f beyond 01 But notonly is 1$ a biased estimator of the true distance it is also ahigh-variance estimator The reason for this variance explosionis related to the long tail towards large distances illustrated in theright panel of Figs 3 and 4 Relatively large fractional uncertain-ties inevitably imply noise excursions in the parallax that resultin vanishingly small observed parallaxes and disproportionatedistances (and hence an inflation of the variance)

The effects discussed above can be illustrated with the use ofsimulated data Figure 4 shows the results of a simulation of ob-jects located between 05 and 2kpc where starting from the truedistances we have simulated observed parallaxes with a Gaus-sian uncertainty of σ$ = 03 mas and then calculated for eachobject ρ = 1$

The figure on the left shows that (by construction) the errorsin the observed parallaxes are well behaved and perfectly sym-metrical (Gaussian) while in the centre figure the errors in theestimation of distances using ρ show a strong asymmetry Thecharacteristics of these residuals depend on the distribution oftrue distances and uncertainties This is more evident in the fig-ure on the right where the true distance r is plotted against ρthere is a very prominent tail of overestimated distances and thedistribution is asymmetrical around the one-to-one line the moredistant the objects the more marked the asymmetry These fea-tures are very prominent because we have simulated objects sothat the relative errors in parallax are large but they are present(albeit at a smaller scale) even when the relative errors are small

The plots in Fig 4 correspond to a simple simulation with amild uncertainty σ$ = 03 mas Figure 5 shows the same plotsfor a realistic simulation of the Gaia DR2 data set The simula-tion is described in Appendix A in this case the errors in paral-lax follow a realistic model of the Gaia DR2 errors depicted inFig A2

As a summary we have seen in previous paragraphs thatthe naive approach of inverting the observed parallax has sig-nificant drawbacks we are forced to dispose of valuable data(non-positive parallaxes) and as an estimator ρ = 1$ is biasedand has a very high variance

33 Sample truncation

In addition to the potential sources of trouble described in theprevious sections the traditional use of samples of parallaxes in-cludes a practice that tends to aggravate these effects truncationof the used samples

As discussed in Sect 31 negative parallaxes are a naturalresult of the Gaia measurement process (and of astrometry ingeneral) Since inverting negative parallaxes leads to physicallymeaningless negative distances we are tempted to just get rid ofthese values and form a lsquocleanrsquo sample This results in a biasedsample however

On the one hand removing the negative parallaxes biases thedistribution of this parameter Consider for instance the case il-lustrated in Fig 1 for the quasars from the AllWISE catalogueThese objects have a near zero true parallax and the distribu-tion of its observed values shown in the figure corresponds tothis with a mean of minus10 microas close to zero However if we re-move the negative parallaxes from this sample deeming themlsquounphysicalrsquo the mean of the observed values would be signif-icantly positive about 08 mas This is completely unrealisticfor quasars in removing the negative parallaxes we have signif-icantly biased the observed parallax set for these objects Withsamples of other types of objects with non-zero parallaxes theeffect can be smaller but it will be present

On the other hand when by removing negative parallaxesthe contents of the sample are no longer representative of thebase population from which it has been extracted since stars withlarge parallaxes are over-represented and stars with small paral-laxes are under-represented This can be clearly illustrated us-ing a simulation We have generated a sample of simulated starsmimicking the contents of the full Gaia DR2 (see Appendix A)and truncated it by removing the negative parallaxes In Fig 6we can compare the distribution of the true distances of the orig-inal (non-truncated) sample and the resulting (truncated) sam-ple it is clear that after the removal of negative parallaxes wehave favoured the stars at short distances (large parallaxes) withrespect to the stars at large distances (small parallaxes) The spa-

Article number page 5 of 20

AampA proofs manuscript no 32964_Arxiv

Fig 4 Behaviour of PDF of ρ = 1$ as estimator of the true distance Left Histogram of differences between true parallaxes and observedparallaxes Centre Histogram of differences between true distances and their estimation using ρ Right Comparison of the true distances and theirestimations using ρ The observed parallaxes $ have been simulated using an uncertainty of σ$ = 03 mas

Fig 5 Behaviour of PDF of ρ = 1$ as estimator of the true distance for a simulation of the full Gaia DR2 data set Left Histogram of differencesbetween true parallaxes and observed parallaxes Centre Histogram of differences between true distances and their estimation using ρ RightComparison of the true distances and their estimations using ρ The observed parallaxes $ have been simulated using a realistic Gaia DR2 errormodel described in the Appendix The contour lines correspond to the distribution percentiles of 35 55 90 and 98

tial distribution of the sample has thus been altered and maytherefore bias any analysis based on it

Fig 6 Effect of removing the negative and zero parallaxes from a sim-ulation of Gaia DR2 Distribution of true distances histogram of dis-tances for the complete sample (thick line) histogram of distances forthe sample truncated by removing $ le 0 (thin line)

A stronger version of truncation that has traditionally beenapplied is to remove not only negative parallaxes but also all theparallaxes with a relative error above a given threshold k select-

ing σ$$

lt k This selection tends to favour the removal of starswith small parallaxes The effect is similar to the previous casebut more accentuated as can be seen in Fig 7 Again stars atshort distances are favoured in the sample with respect to distantstars

Fig 7 Effect of removing the positive parallaxes with a relative errorabove 50 as well as negative parallaxes Thick line histogram of dis-tances for the complete sample Thin line histogram of distances forthe sample truncated by removing objects with (|σ$

$| gt 05)

Article number page 6 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Even worse as in the previous case the truncation not onlymakes the distribution of true distances unrepresentative but italso biases the distribution of observed parallaxes stars withpositive errors (making the observed parallax larger than thetrue one) tend to be less removed than stars with negative er-rors (making the observed parallax smaller than the true one)By favouring positive errors with respect to negative errors weare also biasing the overall distribution of parallaxes Figure 8depicts this effect The plots show the difference $ minus$True as afunction of $True We can see in the middle and bottom figureshow the removal of objects is non-symmetrical around the zeroline so that the overall distribution of$minus$True becomes biasedFrom an almost zero bias for the full sample (as expected fromGaia in absence of systematics) we go to significant biases oncewe introduce the truncation and the bias is dependent of the cutvalue we introduce

Furthermore in Gaia the parallax uncertainties vary withthe object magnitude being larger for faint stars (see Fig A2)Therefore a threshold on the relative error will favour brightstars over the faint ones adding to the above described biases

Another type of truncation that has been traditionally appliedis to introduce limits in the observed parallax The effects of sucha limit are closely related to the LutzndashKelker bias discussed inSect 342 Suffice it here to illustrate the effect with a specificexample on a Gaia DR2 -like sample If we take the full sampleand remove stars with $ lt 02 mas we could imagine that weare roughly removing objects further away than 5 kpc Howeverthe net result is depicted in Fig 9 where we can see that insteadof the distribution of true distances of the complete sample up to5 kpc (solid line) we get a distribution with a lack of closer starsand a long tail of stars with greater distances A larger limit inparallax (shorter limiting distance) will produce a less prominenteffect since the relative errors in the parallax will be smaller butthe bias will be nonetheless present

In conclusion our advice to readers is to avoid introducingtruncations when using Gaia data since as illustrated above theycan strongly affect the properties of the sample and thereforeaffect the data analysis If truncation is unavoidable it should beincluded in the Bayesian modelling of the overall problem (seeSect 43)

34 Corrections and transformations

In this section we review proposals in the literature for the useof parallaxes to estimate distances In general they take the formof lsquoremediesrsquo to correct one or another problem on this use Herewe explain why these remedies cannot be recommended

341 The SmithndashEichhorn correction

Smith amp Eichhorn (1996) attempt to compensate for the bias in-troduced by the naive inversion of the observed parallax (andthe associated variance problem) in two different ways The firstinvolves transforming the measured parallaxes into a pseudo-parallax $lowast according to

$lowast equiv β middot σ$

1exp(φ) + exp(minus16$

σ$)

+ φ

(10)

where φ equiv ln(1 + exp( 2$σ$

))2 and β is an adjustable constantThe qualitative effect of the transformation is to map negativeparallaxes into the positive semi-axis R+ and to increase the

Fig 8 Differences between true and observed parallax effect of remov-ing observed negative parallaxes and those above a given relative errorWe start with a representative subsample of 1 million stars (top fig-ure) and truncate it according to the apparent relative parallax precisionTop Complete sample Mean difference $ minus$True is 155 times 10minus5 masMiddle Retaining only objects with positive parallaxes and |σ$

$| lt 05

The mean difference $minus$True is 0164 mas Bottom Retaining onlyobjects with positive parallaxes and |σ$

$| lt 02 The mean difference

$ minus$True is 0026 mas

value of small parallaxes until it asymptotically converges to themeasured value for large $ For $ = 0 φ = ln(2)2 and thepseudo-parallax value $lowast = β middot σ$

(1

1+exp(ln(2)2) +ln(2)

2

)has the

undesirable property of depending on the choice of β and on theparallax uncertainty Thus even in the case of a small parallaxmeasurement $ rarr 0 with a relatively small fractional parallax

Article number page 7 of 20

AampA proofs manuscript no 32964_Arxiv

Fig 9 Effect of introducing a limit on the observed parallax Thick linehistogram of distances up to 5 kpc for the complete sample Thin linehistogram of distances for the sample truncated by removing objectswith $ lt 02 mas (distance estimated as 1$ up to 5 kpc)

uncertainty (eg f = 01) we substitute a perfectly useful andvaluable measurement by an arbitrary value of $lowast

The SmithndashEichhorn transformation is an arbitrary (andrather convolved) choice amongst many such transformationsthat can reduce the bias for certain particular situations Both theanalytical expression and the choice of constants and β are theresult of an unspecified trial-and-error procedure the applicabil-ity of which is unclear Furthermore as stated by the authorsthey introduced a new bias because the transformed parallax isalways larger than the measured parallax This new bias has nophysical interpretation because it is the result of an ad hoc choicefor the analytical expression in Eq (10) It is designed to reducethe bias but it does so by substituting perfectly reasonable directmeasurements (negative and small parallaxes) that we can inter-pret and use for inference by constructed values arising from thechoice

342 The LutzndashKelker correction

Lutz amp Kelker (1973) realised that the spatial distribution ofsources around the observer together with the unavoidable ob-servational errors and a truncation of the sample on the valueof the observed parallax result in systematic biases in the av-erage parallax of certain stellar samples The bias described byLutz amp Kelker (1973) is a manifestation of the truncation bi-ases described above and can be understood if we look at a fewvery simple examples First let us imagine a density of sourcesaround the observer such that the distribution of true parallaxesp($True) is constant in a given interval and zero outside Let usalso imagine that the observation uncertainty σ$ is constant andequal to 03 mas The left panel of Fig 10 shows the distributionof true and observed parallaxes for a simulation of such a situ-ation and 106 sources We see that the observed parallaxes arealso approximately uniform and the departures from uniformityappear near the edges For a given bin of intermediate parallaxwe have as many sources contaminating from neighbouring binsas we have sources lost to other bins due to the observationaluncertainties and the result is a negligible net flux

In the middle panel we show the same histograms exceptthat instead of a constant parallax uncertainty we use a constantvalue of the fractional parallax uncertainty f This is still an un-realistic situation because the distribution of true parallaxes is

not uniform and also because a constant f implies that largerparallaxes are characterised by larger uncertainties However ithelps us to illustrate that even in the case of uniform true par-allaxes we may have a non-zero net flux of sources betweendifferent parallax bins depending on the distribution of parallaxuncertainties In this case noise shifts the value of large paral-laxes more than that of smaller parallaxes (because of the con-stant value of f ) so larger parallaxes get more scattered thussuppressing the distribution more at large parallaxes

Finally the right panel shows the same plot for a realisticdistribution of distances from a Gaia Universe Model Snapshot(GUMS) sample (see Appendix A for a full description) We seethat the effect of a realistic non-uniform distribution of parallaxesand parallax uncertainties results in a net flux in the oppositedirection (smaller true parallaxes become more suppressed andlarger parallax bins are enhanced in both cases the bins of neg-ative parallaxes become populated) This is the root of the LutzndashKelker bias It is important to distinguish between the LutzndashKelker bias and the LutzndashKelker correction The LutzndashKelkerbias is the negative difference for any realistic sample betweenthe average true parallax and the average measured parallax (iethe average true parallax is smaller than the average measuredparallax) This bias has been known at least since the work ofTrumpler amp Weaver (1953) although it was already discussed ina context different from parallaxes as early as Eddington (1913)The LutzndashKelker correction presented in Lutz amp Kelker (1973)and discussed below is an attempt to remedy this bias based ona series of assumptions

In the case of Gaussian uncertainties such as those describedin Eq (6) it is evident that the probability of measuring a value ofthe parallax greater than the true parallax p($ gt $True|$True) =05 The same value holds for the probability that $ lt $Truebecause the Gaussian distribution is symmetrical with respectto the true value of the parallax This is also true for the jointprobability of $ and $True

p($ gt $True) =

S

p($$True) middot d$ middot d$True = 05 (11)

where S is the region of the ($$True) plane where $ gt $TrueHowever the probability distribution of the true parallax

given the observed parallax p($True|$) does not fulfil this seem-ingly desirable property of probability mass equipartition at the$ = $True point We can write the latter probability as

p($True|$) =p($$True)

p($)=

p($|$True) middot p($True)p($)

(12)

using the product rule of probability In the context of inferringthe true parallax from the observed one Eq (12) is the well-known Bayesrsquo theorem where the left-hand side is the posteriorprobability p($ | $True) is the likelihood p($True) is the priorand p($) is the evidence For most realistic prior distributionsp($True) neither the median nor the mode or the mean of theposterior in Eq (12) is at $ = $True Let us take for examplea uniform volume density of sources out to a certain distancelimit In such a distribution the number of sources in a sphericalshell of infinitesimal width at a distance r scales as r2 as doesthe probability distribution of the distances Since

p(r) middot dr = p($True) middot d$True (13)

Article number page 8 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Uniform ϖTrue and constant σϖ

Parallax (mas)

Pro

babi

lity

minus002 002 006 010

05

1015

20Uniform ϖTrue and constant fTrue

Parallax (mas)

Den

sity

minus005 000 005 010 015 020

GUMS simulation

Parallax (mas)

Den

sity

minus1 0 1 2

01

23

45

Fig 10 Histograms of true (grey) and observed (blue) parallaxes for three simulations uniform distribution of parallaxes and constant σ$ = 03mas (left) uniform distribution of parallaxes and constant f = 05 (centre) and the GUMS simulation described in Appendix A (right)

the probability distribution for the true parallax in such a trun-cated constant volume density scenario is proportional to

p($True) prop $minus4True (14)

out to the truncation radius Hence for Gaussian distributed un-certainties we can write p($True|$) as

p($True|$) prop1σ$middot exp(

minus($ minus$True)2

2σ$) middot$minus4

True (15)

The joint distribution p($$True) (ie the non-normalisedposterior plotted as a 2D function of data parallax and param-eter $True) for this particular case of truncated uniform stellarvolume densities is depicted in Fig 11 together with the con-ditional distributions for particular values of $ and $True Itshows graphically the symmetry of the probability distributionp($|$True) (with respect to $True) and the bias and asymmetryof p($True|$)

Lutz amp Kelker (1973) obtain Eq (15) in their Sect ii underthe assumption of uniform stellar volume densities and constantfractional parallax uncertainties (constant f ) They discuss sev-eral distributions for different values of the ratio σ$$True Intheir Sect iii they use the expected value of the true parallaxgiven by the distribution p($True|$) in Eq (15) to infer the ex-pected value of the difference between the true absolute magni-tude MTrue and the value obtained with the naive inversion of theobserved parallax The expected value of this absolute magni-tude error is derived and tabulated for the distribution p($True|$)as a function of the fractional parallax uncertainty f This so-called LutzndashKelker correction is often applied to stellar samplesthat do not fulfil the assumptions under which it was derived be-cause the stellar volume density is far from uniform at scaleslarger than a few tens of parsecs and the samples to which thecorrection is applied are never characterised by a unique valueof f

p(ϖ

Tru

e|ϖ

)0

12

34

5

00 02 04 06 08 10

00 02 04 06 08 10

00

05

10

15

ϖTrue

ϖ

p(ϖ|ϖTrue)

piob

s

0 1 2 3 4

00

05

10

15

minus50

minus40

minus30

minus20

minus10

0

10

Fig 11 (Lower left) Joint probability distribution (in logarithmic scaleto improve visibility) for the random variables $ and $True in the sce-nario of a truncated uniform volume density The colour code is shownto the right of the lower right panel and white marks the region wherethe probability is zero (Lower right) Conditional probability distribu-tion of the observed parallax for $True=05 (Upper left) Conditionalprobability distribution for $True given an observed parallax $ Thefractional parallax uncertainty assumed for the computation of all prob-abilities is f = 02

35 Astrometry-based luminosity

An obvious way to avoid the problems associated with the naiveinversion of observed parallaxes (see Sect 341) is to remainin the space of parallaxes (as opposed to that of distances) in-sofar as this is possible One example of this approach is the

Article number page 9 of 20

AampA proofs manuscript no 32964_Arxiv

astrometry-based luminosity (ABL) method (Arenou amp Luri1999) originating from Malmquist (1920) The ABL methodconsists in substituting the absolute magnitudes by a proxy thatis linearly dependent on the parallax The original proposal was

aV equiv 1002MV = $10mV +5

5 (16)

and has been recently used to obtain maximum likelihood esti-mates of the period-luminosity relation coefficients for Cepheidsand RR Lyrae stars (Gaia Collaboration et al 2017a) and to im-prove the Gaia parallax uncertainties using deconvolved colour-magnitude diagrams as prior (Anderson et al 2017) The newastrometry-based luminosity depends linearly on the parallaxand thus its uncertainty can be expected to have an approxi-mately Gaussian distribution if the fractional uncertainty of theapparent magnitude is negligible This is more often the casethan for fractional parallax uncertainties and is in general a goodapproximation

Unfortunately the astrometry-based luminosity can only beapplied to the study of the luminosity and can do nothing forthe analysis of spatial distributions where distances or tangentialvelocities are inevitable ingredients

4 Recommendations for using astrometric data

In this section we provide specific advice on the use of astro-metric data in astronomical data analysis Although the focus ison the use of Gaia data many of the recommendations hold forthe analysis of astrometric data in general To illustrate the rec-ommendations we also provide a small number of worked ex-amples ranging from very basic demonstrations of the issuesmentioned in Sect 3 to full Bayesian analyses Some of theseexamples are available in the Gaia archive tutorial described inSect 5

41 Using Gaia astrometric data how to proceed

The fundamental quantity sought when measuring a stellar par-allax is the distance to the star in question However as discussedin the previous sections the quantity of interest has a non-linearrelation to the measurement r = 1$True and is constrainedto be positive while the measured parallax can be zero or evennegative Starting from a measured parallax which is normallydistributed about the true parallax this leads to a probability den-sity for the simple distance estimator ρ = 1$ (see Sect 3) forwhich the moments are defined in terms of unknown quantitiesThis means we cannot calculate the variance of the estimator orthe size of a possible bias which renders the estimator uselessfrom the statistical point of view

Our first and main recommendation is thus to alwaystreat the derivation of (astro-)physical parameters from as-trometric data in particular when parallaxes are involvedas an inference problem which should preferably be handledwith a full Bayesian approach

411 Bayesian inference of distances from parallaxes

The Bayesian approach to inference involves estimating a PDFover the quantity of interest given the observables In this casewe want to estimate the distance r given the parallax $ Afuller treatment of this problem has been given in Bailer-Jones

(2015) so only a brief summary will be given here Using Bayesrsquotheorem we can write the posterior as

P(r | $) =1Z

P($ | r)P(r) (17)

Formally everything is also conditioned on the parallax uncer-tainty σ$ and any relevant constraints or assumptions but sym-bols for these are omitted for brevity The quantity P($ | r) isthe likelihood from Eq (12) The prior P(r) incorporates ourassumptions and Z is a normalisation constant

In addition to the likelihood there are two important choiceswhich must be made to estimate a distance the choice of priorand the choice of estimator We will first focus on the formerand start discussing the simplest prior the uniform unboundedprior With a uniform boundless (and thus improper) prior ondistances the posterior is proportional to the likelihood and ifwe choose the mode of the posterior as our estimator then thesolution is mathematically equivalent to maximising the likeli-hood However a boundless uniform prior permits negative dis-tances which are non-physical so we should at least truncate itto exclude these values

The more measurements we have or the more precise themeasurements are the narrower the likelihood and the lower theimpact of the prior should be on the posterior It turns out how-ever that with the unbounded uniform prior the posterior is im-proper ie it is not normalisable Consequently the mean andmedian are not defined The only point estimator is the modeie rest = 1$ (the standard deviation and the other quantilesare likewise undefined) which is rather restrictive Finally thisBayesian distance estimate defined for an unbounded uniformprior reduces to the maximum likelihood estimate and coincideswith the naive inversion discussed in Sect 32 The posterioris ill-defined for the unbounded uniform prior for parallaxesThis prior describes an unrealistic situation where the observer isplaced at the centre of a distribution of sources that is sphericallysymmetric and the volume density of which decreases sharplywith distance

The solution to these problems (non-physical distances im-proper posterior) is to use a more appropriate prior The proper-ties of various priors and estimators have been studied by Bailer-Jones (2015) and Astraatmadja amp Bailer-Jones (2016b) The lat-ter makes a detailed study using a Milky Way model for a priorand also investigates how the estimates change when the Gaiaphotometric measurements are used in addition to the parallaxOne of the least informative priors we can use is the exponen-tially decreasing space density prior

P(r) =

1

2L3 r2eminusrL if r gt 0

0 otherwise (18)

For distances r L this corresponds to a constant space densityof stars with the probability dropping exponentially at distancesmuch larger than the mode (which is at 2L) Examples of theshape of the posterior for parallaxes of different precisions areshown in Bailer-Jones (2015) and Astraatmadja amp Bailer-Jones(2016b)

The posterior obtained for the prior defined in Eq (18) is nor-malised and thus we have a choice of point estimators (meanmedian or mode) Also the distribution is asymmetric andtwo quantiles (5 and 95) rather than the standard deviationare recommended to summarise the uncertainty in the point es-timate The median as a point estimate is guaranteed to lie

Article number page 10 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

between these quantiles Astraatmadja amp Bailer-Jones (2016b)used this prior as well as a Milky Way prior to infer distancesfor the two million TGAS stars in the first Gaia data releaseThe behaviour of the estimates derived from the exponentiallydecreasing space density prior can be explored using the interac-tive tool available in the tutorial described in Sect 51

In general the introduction of reasonable prior probabilitiesaccounts for the LutzndashKelker bias although the inevitable mis-match between the true distribution of parallaxes and the priorused will result in less accurate inferences In any case the ad-vantage with respect to the methods discussed in Sect 3 is cleari) we do not need to tabulate corrections for each prior assumingconstant f ii) we do not need to dispose of non-positive par-allaxes iii) we obtain a proper full posterior distribution withwell-defined moments and credible intervals iv) even simplepriors such as the exponential decreasing volume density willimprove our estimates with respect to the unrealistic prior un-derlying the maximum likelihood solution rest = 1$ and fi-nally v) we obtain estimators that degrade gracefully as the dataquality degrades and with credible intervals that grow with theobservational uncertainties until they reach the typical scales ofthe prior when the observations are non-informative These ad-vantages come at the expense of an inference that is more com-putationally demanding in general (as it requires obtaining theposterior and its summary statistics if needed) the need for athoughtful choice of a prior distribution and the analysis of theinfluence of the prior on the inference results

Figure 12 shows the distribution of means (left) modes (cen-tre) and medians (right) of the posteriors inferred for a simu-lation of 105 sources drawn from an exponentially decreasingspace density distribution This simulation represents the un-likely case where the prior is a perfect representation of the truedistribution

From a Bayesian perspective the full posterior PDF is thefinal result of our inference if we only use parallax measure-ments to infer the distance (see below) and further analysesshould make use of it as a whole This is our recommendationin general avoid expectations and summaries of the posteriorHowever it is often useful to compute summary statistics suchas the mean (expectation) median mode quantiles etc to havean approximate idea of the distribution but we should not usethese summaries for further inference for example to estimateabsolute magnitudes Cartesian velocities etc We recommendinferring the full posterior distributions for these derived quanti-ties using the posterior of the true parallax or of the distance orusing the same Bayesian scheme as for the true parallax as ex-plained in Sect 42 In Fig 13 we show the values of the mean(left) mode (centre) and median (right) that we would obtainfrom a set of 104 simulated observations of a star at 100 parsecswith f = 02 We assume a Gaussian distribution of the obser-vations around the true parallax The posterior distribution is in-ferred using Eq (17) and two priors a uniform volume densityof sources truncated at 1 kpc (results in grey) and a uniform den-sity of sources multiplied by an exponential decay of length scale200 pc as defined in Eq (18) (in blue) The expectation valuesof the histograms are shown as dashed lines of the same colourwith the true value (100 pc) shown as a red dashed line We seein general that i) the truncation has the effect of increasing thenumber of overestimated distances ii) the three estimators arebiased towards larger distances (smaller parallaxes) but the ex-pectation of the mode is significantly closer to the true valueand iii) the abrupt truncation of the prior results in a spuriouspeak of modes at the truncation distance as already discussed inBailer-Jones (2015)

Figure 14 and Table 1 show a comparison of the absolutevalue of the empirical bias and standard deviation associatedwith some distance estimators discussed in this paper as a func-tion of the measured fractional uncertainty in the parallax Wechose the measured value even though it is a very poor and non-robust estimator because as stated in Sect 32 we never have ac-cess to the true fractional parallax uncertainty This figure showsthe results obtained for 107 sources in the Gaia DR2 simulationdescribed in Appendix A for the maximum likelihood estimateρ = 1

$with and without the SmithndashEichhorn correction and for

the mode estimates based on the posterior distribution for twopriors (a uniform distance prior UD with maximum distancerlim = 100 kpc and an exponentially decreasing space densityprior EDSD with L = 135 kpc) neither of which matches thetrue distribution of sources in the simulation Only the mode ofthe posteriors is plotted (but not the mean or the median) for thesake of clarity The conclusions described next are only valid un-der the conditions of the exercise and are provided as a demon-stration of the caveats and problems described in previous sec-tions not as a recommendation of the mode of the posterior in-ferred under the EDSD prior as an estimator At the risk of re-peating ourselves we emphasise the need to adopt priors adaptedto the inference problem at hand Also the conclusions only holdfor the used simulation (where we generate the true distances andhence can calculate the bias and standard deviation) and need notbe representative of the true performance for the real Gaia dataset They can be summarised as follows

ndash the mode of the EDSD prior shows the smallest bias andstandard deviation in practically the entire range of estimatedfractional parallax uncertainties (in particular everywherebeyond the range of fapp represented in the plot)

ndash the SmithndashEichhorn estimate shows pathological biases andstandard deviations in the vicinity of the supposedly best-quality measurements at fapp = 0 Away from this region itprovides the next less biased estimates (averaged over binsof fapp) after the mode of the EDSD posterior

412 Bayesian inference of distances from parallaxes andcomplementary information

The methodology recommended in the previous paragraphs isuseful when we only have observed parallaxes to infer distancesThere are however common situations in astronomy where theparallaxes are only one of many observables and distances arenot the final goal of the inference problem but a means to achieveit In this context we recommend an extension of the classicalBayesian inference methods described in the previous sectionThese problems are characterised by a set of observables andassociated uncertainties (that include but are not restricted toparallaxes) and a series of parameters (the values of which areunknown a priori) with complex interdependence relationshipsamongst them Some of these parameters will be the ultimategoal of the inference process Other parameters do play an im-portant role but we are not interested in their particular valuesand we call them nuisance parameters following the literatureFor example in determining the shape of a stellar associationthe individual stellar distances are not relevant by themselvesbut only insomuch as we need them to achieve our objectiveWe show below how we deal with the nuisance parameters Theinterested reader can find applications of the methodology de-scribed in this section to inferring the coefficients of period-luminosity relations in Gaia Collaboration et al (2017a) andSesar et al (2017) Also the same methodology (a hierarchical

Article number page 11 of 20

AampA proofs manuscript no 32964_Arxiv

Mean

Cou

nts

minus8000 minus4000 0 2000

0e+

002e

minus04

4eminus

046e

minus04

8eminus

04

Mode

Cou

nts

minus8000 minus4000 0 2000

Median

Cou

nts

minus8000 minus4000 0 2000

0e+

002e

minus04

4eminus

046e

minus04

8eminus

04

Fig 12 Probability distribution of the residuals of the Bayesian estimate of the true distance in parsecs for 100000 simulated stars drawn from auniform density plus exponential decay distribution and σ$ = 3 middot 10minus1 mas (orange) 3 mas (blue) and 30 mas (grey)

Mean

Cou

nts

0 200 400 600 800 1000

000

00

005

001

00

015

002

0

Mode

Cou

nts

0 200 400 600 800 1000

Median

Cou

nts

0 200 400 600 800 1000

000

00

005

001

00

015

002

0

Fig 13 Distributions of means (left) modes (centre) and medians (right) for a series of 104 posteriors calculated for a star at a true distance of100 pc and f = 02 and observed parallaxes drawn at random from the corresponding Gaussian distribution Posteriors are inferred with a uniformdensity prior truncated at 1 kpc (grey) or with a uniform density with exponential decay prior and length scale 135 kpc (blue) The red vertical linemarks the true parallax the grey and blue lines correspond to the expected value (mean) of each distribution (same colours as for the histograms)

Bayesian model) is applied in Hawkins et al (2017) where theconstraint on the distances comes not from a period-luminosityrelation but from the relatively small dispersion of the abso-lute magnitudes and colour indices of red clump stars A lastexample of this methodology can be found in Leistedt amp Hogg(2017b) where the constraint comes from modelling the colour-magnitude diagram

Just as in the previous section where we aimed at estimat-ing distances from parallaxes alone the two key elements in thiscase are the definitions of a likelihood and a prior The likeli-hood represents the probability distribution of the observablesgiven the model parameters Typically the likelihood is basedon a generative or forward model for the data Such modelspredict the data from our assumptions about the physical pro-cess that generates the true values (ie the distribution of stars in

space) and our knowledge of the measurement process (eg jus-tifying the assumption of a normal distribution of the observedparallax around its true value) Forward models can be used togenerate arbitrarily large synthetic data sets for a given set ofthe parameters In this case however where we are concernedwith several types of measurements that depend on parametersother than the distance the likelihood term will be in generalmore complex than in Sect 411 and may include probabilisticdependencies between the parameters The term hierarchical ormulti-level model is often used to refer to this kind of model

Let us illustrate the concept of hierarchical models with asimple extension of the Bayesian model described in Sect 411where instead of assuming a fixed value of the prior length scaleL in Eq (18) we make it another parameter of the model and tryto infer it Let us further assume that we have a set of N parallax

Article number page 12 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Table 1 Average bias and standard deviation in three regimes of fapp for four distance estimators discussed in this paper (from left to right) themode of the posterior based on the exponentially decreasing space density (EDSD) prior the mode of the posterior of the uniform distance (UD)distribution the maximum likelihood estimate corrected according to Smith amp Eichhorn (1996) abbreviated as SE and the maximum likelihood(ML) estimate The wider ranges of fapp exclude narrower ranges shown in previous rows of the table

Summary fapp Range EDSD UD SE ML

Bias(-11) -02 97 342 -095(-55) -03 107 -034 -12(-5050) -03 162 -04 -38

Std Deviation(-11) 04 80 6858 05(-55) 04 84 05 195(-5050) 04 106 04 171

fapp

log 1

0(|B

ias|

)

minus4

minus2

02

4

Mode EDSDMode UDSmithminusEichhorn estimateML estimate

fapp

log 1

0(σ)

minus1

12

34

5

minus5 minus3 minus1 1 3 5

Mode EDSDMode UDSmithminusEichhorn estimateML estimate

Fig 14 Bias (top) and standard deviation (bottom) averaged over binsof the estimated fractional parallax uncertainty fapp for four estimatorsof the distance the maximum likelihood (ML) estimator rest = 1

$(or-

ange) the ML estimator corrected as described in Smith amp Eichhorn(1996) (light blue) the mode of the posterior obtained with an expo-nentially decreasing space density (EDSD) prior and L = 135 kpc (darkgreen) and the mode of the posterior obtained with a uniform distance(UD) prior truncated at 100 kpc (red)

measurements $k one for each of a sample of N stars In thiscase the likelihood can be written as

p($k|rk L) = p($k | rk) middot p(rk | L)

= p($k | rk) middotNprod

k=1

p(rk | L) (19)

where rk is the true unknown distance to the kth star We note thatvery often the measured parallaxes are assumed independentand thus p($k | rk) is written as the product

prodNk=1 p($k | rk)

This is incorrect in general for Gaia parallaxes because the par-allax measurements are not independent As described in GaiaCollaboration et al (2018c) and Sect 2 of this paper there areregional correlations amongst them (see Sect 43) but for thesake of simplicity let us assume the sample of N measurementsis spread all over the celestial sphere such that the correlationsaverage out and can be neglected Hence we write

p($k|rk L) =

Nprodk=1

p($k | rk) middot p(rk | L) (20)

Under the assumption of Gaussian uncertainties the first termin the product is given by Eq (6) while the second is given byEq (18)

This likelihood can be represented by a simple directed graph(see Fig 15) that provides information about the conditional de-pendencies amongst the parameters The shaded nodes representthe observations the open circles represent model parametersand the small black circles at the origin of the arrows representmodel constants The arrows denote conditional dependence re-lations and the plate notation indicates repetition over the mea-surements k

The next key element is as in Sect 411 the prior Accord-ing to Fig 15 the only parameter that needs a prior specificationis the one without a parent node L The rest of the arcs in thegraph are defined in the likelihood term (Eq (20)) If the sampleof N stars were representative of the inner Galactic halo for ex-ample we could use a Gaussian prior centred at asymp 30 kpc (seeeg Iorio et al 2018 and references therein) Such a hierarchicalmodel can potentially shrink the individual parallax uncertaintiesby incorporating the constraint on the distribution of distances

If we are only interested in the individual distances rk wecan consider L as a nuisance parameter

p($Truek | $k) =

intp($Truek L | $k) middot dL (21)

=

intp($Truek | $k L) middot p(L | $k) middot dL

This integral (known as the marginalisation integral) allows us towrite the posterior we are interested in without having to fix thevalue of L to any particular value Depending on the objective ofthe inference we could have alternatively determined the poste-rior distribution of L by marginalising the individual distanceswith an N-dimensional integral over the rk parameters

Article number page 13 of 20

AampA proofs manuscript no 32964_Arxiv

L

rk

$ σ$

microL

σL

k = 1 2 N

Fig 15 Directed acyclic graph that represents a hierarchical Bayesianmodel of a set of N parallax measurements characterised by uncertain-ties σ$ and true distances drawn from an exponentially decreasing den-sity distribution of distances (see Eq (18)) The scale length of the ex-ponential decrease L is itself a model parameter that we can infer fromthe sample Its prior is defined in this case for the sake of simplicity asa Gaussian distribution of mean microL and standard deviation σL

In parameter spaces of dimensionality greater than 3ndash4 thecomputation of the possibly marginalised posteriors andor ev-idence requires efficient sampling schemes like those inspiredin Markov chain Monte Carlo (MCMC) methods to avoid largenumbers of calculations in regions of parameter space with neg-ligible contributions to the posterior This adds to the highercomputational burden of the Bayesian inference method men-tioned in the previous section

The previous simple example can be extended to includemore levels in the hierarchy and more importantly more pa-rameters and measurement types Section 55 and 56 developin greater detail two examples of hierarchical models of directapplicability in the Gaia context

42 Absolute magnitudes tangential velocities and otherderived quantities

The approaches described in the previous sections can be appliedto any quantity we want to estimate using the parallax For exam-ple if we want to infer the absolute magnitude MG then giventhe measured apparent magnitude G and line-of-sight extinctionAG the true parallax $True is related to MG via the conservationof flux

5 log$True = MG + AG minusG minus 5 (22)

Assuming for simplicity that G and AG are known Bayesrsquotheorem gives the posterior on MG as

P(MG | $G AG) =1Z

P($ | MG AGG)P(MG) (23)

where the likelihood is still the usual Gaussian distribution forthe parallax (Eq (6)) in which the true parallax is given byEq (22) As this expression is non-linear we again obtain anasymmetric posterior PDF over MG the exact shape of whichalso depends on the prior

The inference of other quantities can be approached in thesame way In general we need to consider a multi-dimensionallikelihood to accommodate the measurement uncertainties (andcorrelations) in all observed quantities For instance the set ofparameters θ = r v φ (distance tangential speed and direc-tion measured anticlockwise from north) can be inferred fromthe Gaia astrometric measurements o = $ microαlowast microδ (where microαlowastand microδ are the measured proper motions) using the likelihood

p(o | θ) = N(θΣ) =1

(2π)32|Σ|12exp

(minus

12

(o minus x)T Σminus1(o minus x))

(24)

where N denotes the Gaussian distribution Σ is the full (non-diagonal) covariance matrix provided as part of the Gaia DataRelease and

x =

(1r

v sin(φ)r

v cos(φ)

r

)(25)

is the vector of model parameters geometrically transformed intothe space of observables in noise-free conditions Equation (24)assumes correlated Gaussian uncertainties in the measurements

The posterior distribution can then be obtained by multiply-ing the likelihood by a suitable prior The simplest assumptionwould be a separable prior such that p(θ) = p(r) middot p(v) middot p(φ)where p(v) and p(φ) should reflect our knowledge about thedynamical properties of the population from where the sourceor sources were drawn (eg thin disk thick disk bulge halo)Again hierarchical models can be used in the analysis of sam-ples in order to infer the population properties (prior hyper-parameters) themselves

Similar procedures can be followed to infer kinematic ener-gies or full 3D velocities when the forward model is extendedwith radial velocity measurements

43 Further recommendations

In this subsection we provide some further recommendationsand guidance in the context of the Bayesian approach outlinedabove Although powerful inference with Bayesian methodsusually comes at a large computational cost Some of the rec-ommendations below can also be seen in the light of taking dataanalysis approaches that approximate the Bayesian methodologyand can be much faster

Where possible formulate the problem in the data spaceThe problems caused by the ill-defined uncertainties on quan-tities derived from parallaxes can be avoided by carrying outthe analysis in the data space where the behaviour of the un-certainties is well understood This means that the quantities tobe inferred are treated as parameters in a forward or generativemodel that is used to predict the data Some adjustment processthen leads to estimates of the parameters A very simple forwardmodelling example can be found in Schroumlder et al (2004) whostudied the luminosity calibrations of O-stars by predicting the

Article number page 14 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

expected Hipparcos parallaxes from the assumed luminosity cal-ibration and comparing those to the measured parallaxes A morecomplex example can be found in Lindegren et al (2000) whopresent a kinematic model for clusters which describes the ve-locity field of the cluster members and predicts the proper mo-tions accounting for the astrometric uncertainties and covari-ances As shown in previous sections the Bayesian approachnaturally lends itself to (and in fact requires) forward modellingof the data

Forward modelling has the added advantage that it forces usto consider the proper formulation of the questions asked fromthe astrometric data This naturally leads to the insight that oftenthe explicit knowledge of the distances to sources is not of inter-est For example in the Schroumlder et al (2004) case an assumedluminosity of the O-stars and their known apparent magnitudeis sufficient to predict the observed parallaxes In more complexanalyses the distances to sources can often be treated as nuisanceparameters which in a Bayesian setting can be marginalised outof the posterior

Use all relevant information Although the parallax has a di-rect relation to the distance of a star it is not the only measure-ment that contains distance information The apparent magni-tude and the colour of a star carry significant information on theplausible distances at which it can be located as the colour pro-vides information on plausible absolute magnitude values for thestar This is used in two papers (Leistedt amp Hogg 2017a An-derson et al 2017) in which the information contained in thephotometry of stars is combined with Gaia DR1 parallax in-formation to arrive at more precise representations of the colour-magnitude diagram than can be obtained through the parallaxesalone Similarly proper motions contain distance informationwhich can be used to good effect to derive more precise distancesor luminosities for open cluster members (eg de Bruijne et al2001 Gaia Collaboration et al 2017b) The Bayesian approachnaturally allows for the combination of different types of data inthe modelling of the problem at hand It should be emphasisedhowever that adding additional data necessitates increasing themodel complexity (eg to include the relation among apparentmagnitude absolute magnitude extinction and parallax) whichleads to the need to make more assumptions

Incorporate a proper prior Bailer-Jones (2015) discussed thesimplest case of inferring the distance to a source from a sin-gle observed parallax He showed that if the minimal prior thatr should be positive is used the resulting posterior is not nor-malisable and hence has no mean variance other momentsor percentiles This behaviour of the posterior is not limitedto inference of distance Examples of other quantities that arenon-linearly related to parallax are absolute magnitude (M prop

log10 $) tangential velocity (vT prop 1$) kinetic energy and an-gular momentum (both proportional to 1$2 when determinedrelative to the observer) In all these cases it can be shown thatthe posterior for an improper uniform prior is not normalisableand has no moments The conclusion is that a proper prior onthe parameters to be estimated should be included to ensure thatthe posterior represents a normalised probability distribution Ingeneral using unconstrained non-informative priors in Bayesiandata analysis (such as the one on r above) is bad practice In-evitably there is always a mismatch between the prior and thetrue distribution (if there were not there would be no need todo the inference) This will unavoidably lead to some biases al-though the better the data the smaller these will be We cannot

expect to do much better than our prior knowledge in case weonly have access to poor data The Bayesian approach guaran-tees a graceful transition of the posterior into the prior as thedata quality degrades

The above discussion raises the question of what priors to in-clude for a specific inference problem Simple recipes cannot begiven as the prior to be used depends on the problem at hand andthe information already available When deciding on a prior weshould keep in mind that some information is always availableEven if a parallax is only available for a single star we knowthat it cannot be located at arbitrary distances It is unlikely tobe much closer that 1 pc (although we cannot fully exclude thepresence of faint stars closer than Proxima Centauri) and it mustbe located at a finite distance (we can observe the star) Thiswould suggest a non-informative uniform prior on r with lib-eral lower and upper bounds (such that the prior is normalised)However as pointed out in Bailer-Jones (2015) a uniform distri-bution in r implies a space density of stars that falls of as 1r2

from the position of the Sun which is of course physically im-plausible Hence one should assume a reasonable distribution ofstars in space and construct the prior on r accordingly Bailer-Jones (2015) presents a prior derived from a uniform space den-sity with an exponential cut-off which was used in Astraatmadjaamp Bailer-Jones (2016b) to derive distances to stars for which par-allaxes are listed in Gaia DR1 This prior should not be usedindiscriminately at the very least we should carefully considerthe choice of the scale length L (or leave that as a parameter tobe estimated as described in Sect 412) and in most cases amore tailored prior based on our broad knowledge of the distri-bution of a given stellar population in the Milky Way would bebetter The tutorial cases introduced in the next section containsome more examples of priors on distance and other astrophysi-cal parameters

The next two items discuss simplifications to the Bayesianapproach that nevertheless need to be justified carefully

Maximum likelihood modelling We have seen that priors arethe bridges that allow us to go from the probability of the obser-vations given the unknown parameters to the desired probabil-ity of the parameters given the observations It is only based onthis probability distribution that we can make statements aboutcredible intervals (uncertainties) for the inferred quantities orselect amongst competing models (a topic that is beyond thescope of this paper) However if making prior-free inferences ispreferred then maximising the likelihood is the only alternativeThe kinematic modelling presented in Lindegren et al (2000)is a non-trivial example of this A more complex example canbe found in Palmer et al (2014) The ML approach just as theBayesian framework described above allows the combinationof different types of data and accounts for selection functionsor missing data We have seen in Sect 411 that the maximumlikelihood estimate of the distance given a single parallax mea-surement is ρ = 1

$and this is a poor estimator of the distance

except for subsets of very accurate measurements In generalthe Bayesian and the maximum likelihood estimates coincide inthe limit of very small uncertainties or infinite numbers of mea-surements In such limits the maximum likelihood estimate issimpler to obtain although its computational cost may still belarge as the ML method is often equivalent to a complex optimi-sation problem involving a multi-dimensional function of manyparameters

Article number page 15 of 20

AampA proofs manuscript no 32964_Arxiv

Selecting the lsquobestrsquo data Analyses that use parallax data areoften restricted to positive parallaxes with relative uncertaintiesbelow some limit (typically 20 ) This allows working in aregime where the uncertainties of derived quantities such as dis-tance tangential velocities luminosity etc are thought to bemanageable which allows working in the space of astrophysi-cal variables rather than the data Truncation on relative parallaxerror might be justified in an exploratory phase of the data anal-ysis however there are a number of reasons why this approachis almost never advisable Even at relative uncertainties below02 the quantities calculated from the parallax can be biased andsuffer from a large variance (see Bailer-Jones 2015) More im-portantly however the selection of lsquogoodrsquo parallaxes will biasthe sample studied to nearby andor bright representatives of anystellar population and the selection may lead to discarding avery large fraction of the potential sample Hence any inferencesbased on such data will be severely biased This is clearly illus-trated in Fig 7 where for an even less strict truncation of starswith a relative uncertainties below 50 the distribution of dis-tances of the resulting sample is clearly biased with respect theoriginal sample

Accounting for data selection and incompleteness Althoughthe Gaia survey is designed such that the only selection ofsources is that they are brighter than the survey limit at G = 207the combination of the onboard detection algorithm the Gaiascanning law and the filtering of results of insufficient qualityprior to a data release lead to a complex selection function espe-cially in the early data releases This selection function should betaken into account in any data analysis and this is most naturallydone as part of a Bayesian analysis where the selection func-tion is part of the forward model that predicts the data given themodel parameters Precise prescriptions of the selection func-tions are not foreseen to be part of the data release documenta-tion Hence selection function parameters need to be includedas part of the parameters inferred by the Bayesian analysis or ifthis is not possible the selection functions have to be borne inmind when interpreting the results

Covariances in the uncertainties All the uncertainties on theastrometric data quoted in the Gaia catalogue are presented asfull covariance matrices where the diagonal elements representthe standard uncertainties on the astrometric parameters and theoff-diagonal elements the covariances or correlations betweenthe uncertainties This amounts to treating the astrometric datavector as having been drawn from a multivariate normal distri-bution with a covariance matrix as given in the Gaia catalogueThe covariances are most easily handled in the data space as partof the likelihood (see Lindegren et al 2000 for an example inthe context of kinematic modelling) If the covariances in theastrometric uncertainties are not accounted for we can easily bemisled for example by spurious features in the inferred velocityfield of an open cluster (Brown et al 1997)

The uncertainties are also correlated from one source tothe next especially over small distances on the sky A studyof the star-to-star correlations in the parallax uncertainties inGaia DR1 was done for the Kepler field (Zinn et al 2017)where independent and precise asteroseismic distances to thestars are available enabling the authors to derive an expressionfor the correlation strength and spatial scale This expressioncan be used for studies of the Kepler field but care should betaken when extrapolating to other fields on the sky The func-tional form for the star-to-star correlations used by Zinn et al

(2017) could be introduced as part of the forward model withthe Zinn et al (2017) parameters as a good first guess

For Gaia DR1 the length scale for the star-to-star corre-lations was estimated to vary from subdegree scales to tens ofdegrees on the sky where Zinn et al (2017) derived the corre-lation function over length scales of sim 02 to sim 10 degrees ForGaia DR2 Gaia Collaboration et al (2018c) estimate that thespatial correlations extend over scales of below 1 degree to 10ndash20 degrees

Accounting for non-Gaussian andor systematic uncertain-ties Although the bulk of the sources in the Gaia catalogue havenormally distributed uncertainties there is a significant fractionfor which the uncertainties exhibit non-Gaussian behaviour (egwhen uncertainties are over- or underestimated) This can be ac-counted for in the data analysis by including the uncertainties aspart of the forward model (eg Sesar et al 2017) or by explicitlymodelling an outlier population Similarly systematic uncertain-ties can be included as part of the forward model For exampleSesar et al (2017) include a global parallax zero-point as part oftheir probabilistic model used to analyse the period-luminosityrelation of RR Lyrae stars with Gaia DR1 data An alternativeapproach to the investigation of systematics in the parallaxes (ordistance estimates obtained from photometry or spectroscopyfor example) is presented in Schoumlnrich et al (2012) and is ap-plied to Gaia DR1 in Schoumlnrich amp Aumer (2017) In this casewe can consider that for samples covering a significant fractionof the sky any systematic error in the estimated distances to thestars will show up as correlations in their 3D velocity compo-nents The presence of such correlations can be used to make in-ferences about systematic errors for example in the parallaxes

Systematic uncertainties are more difficult to handle as theymay show variations as a function of source brightness or ce-lestial position they may be correlated across neighbouringsources and they may not be well understood for a given earlyGaia data release In general the information needed to accu-rately model systematic uncertainties or uncertainty correlationsbetween sources may not be readily available This informationcan be obtained from a comparison between Gaia and otherhigh-precision data (eg Zinn et al 2017 Arenou et al 2017Gaia Collaboration et al 2018a) or by examining for exampleplots of the parallax or proper motion averaged over sky regionsfor samples where the true parallax or proper motion values canbe assumed to be known such as zero parallax and proper mo-tion for quasars (see Gaia Collaboration et al 2018c for exam-ples)

Two special cases should be mentioned when the sampleis well distributed over the sky we can safely assume that thelocal systematics vanish and that only the global parallax zero-point need to be subtracted locally we may be interested not bythe absolute value of the parallaxes but by the relative ones inwhich case the difference between parallaxes and their averageremoves part of the systematics

There is no general recipe for dealing with non-Gaussian un-certainties or correlated systematic uncertainties The main ad-vice we can give here is to proceed with the analysis of the astro-metric data as they are but to keep in mind the systematics andcorrelations discussed in Sect 2 when interpreting the resultsAlternatively the forward model can be extended to include sys-tematic and correlation terms for which parameters are also tobe estimated Such models can be guided by the studies of sys-tematic uncertainties mentioned above

Article number page 16 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Testing with simulations Finally we strongly advise that theinference problem at hand should be investigated through simu-lated data and that the simulations performed should be as closeas possible to the real data (in particular correctly modelling theuncertainties and selection effects) The simulations allow theanalysis method to be developed and tested for accuracy How-ever the performance of the analysis method should be inter-preted strictly in terms of how well the assumed model explainsthe simulated observed data That is we should not fall into thetrap of trying to tune the analysis method to get an answer that isas close to the lsquotruthrsquo as possible In real problems we can onlyjudge the adequacy of a model and its parameter values by howwell they predict the observed data (to within the observationaluncertainties it should be stressed as we should avoid lsquoover-fittingrsquo the data)

5 Using astrometric data practical examples

We introduce here a few worked examples to illustrate thepoints that were made in the previous section These ex-amples are available in full detail as online tutorials in theform of source code accompanied by much more exten-sive explanation than can be provided here The source codeand corresponding Python and R Notebooks can be foundat the following URL httpsgithubcomagabrownastrometry-inference-tutorials In all cases the readeris strongly encouraged to download the source code and exper-iment with modifications of the simulated input data andor theparameter choices in the inference methods

51 Comparison of distance and distance modulusestimators

The use of the Bayesian inference with non-informative priorsdescribed in Sect 411 is illustrated and implemented inthe following tutorial httpsgithubcomagabrownastrometry-inference-tutorialstreemastersingle-sourceGraphicalUserInterface The tutorialcompares the performance of Bayesian distance estimationmethods with the SmithndashEichhorn transformation (Smith ampEichhorn 1996) (Sect 341) and the naive parallax inversion

The tutorial contains a Graphical User Interface that easilyvisualises and compares the behaviour of all these estimators fora given parallax and uncertainty For the Bayesian inference es-timations using the mode and the median are provided togetherwith a 90 confidence interval The tutorial also provides a li-brary pyrallaxes with the implementation of all these estima-tors The library can easily be customised to implement otherpriors for the Bayesian inference

Additionally an implementation of the Bayesian distanceestimator using the Exponentially Decreasing Space Densityprior introduced in Bailer-Jones (2015) will be available inTopCat (httpwwwstarlinkacuktopcat) and Stilts(httpwwwstarlinkacukstilts) from respectivelyversions 46 and 31-3 onwards

52 Inferring the distance to a single source using just theparallax

The issues surrounding the use of a parallax to infer a distancewere explored in Bailer-Jones (2015) and applied to simulatedGaia data in Astraatmadja amp Bailer-Jones (2016a) and to TGAS(Gaia DR1 ) in Astraatmadja amp Bailer-Jones (2016b) A

tutorial exploring this is provided at httpsgithubcomagabrownastrometry-inference-tutorialstreemastersingle-sourcetutorial This can be used toinvestigate how the posterior depends on the prior and the dataIt also includes a simple example of a hierarchical model toavoid specifying the exact length scale of a distance prior

53 Inferring the distance to and size of a cluster using justthe parallaxes

In many applications we are more interested in the averagedistance to a cluster (or a group of stars) rather than to theindividual stars In this case a basic mistake to be avoidedis estimating the individual distances (whatever the method)and then averaging these individual values A more correctapproach is to average the individual parallaxes and then toobtain a distance from this average value However a muchbetter solution is to set up a model for the cluster whichwould use as parameters the distance to its centre for exampleand some measure of its size and to infer its parametersThis is explored in the tutorial at httpsgithubcomagabrownastrometry-inference-tutorialstreemastermultiple-source This introduces the overall prob-lem and derives a general solution Code is implemented forthe specific case of a model which assumes a random isotropicdistribution of the true stars from the centre of the cluster Thismodel has two parameters the cluster distance and the clustersize The solution uses a small angle approximation to makethe problem simpler although it is easily extended to the caseof clusters with a significant angular extent It is applied to thePleiades selection from the Gaia DR1 main release paper(Gaia Collaboration et al 2016) The tutorial also considers theproblem of how to accommodate correlations in the measuredparallaxes of different stars Finally it also shows the resultsfrom a classical and a naive combination of stellar parallaxesto estimate the cluster distance The combination of parallaxesand proper motions of individual stars in a cluster into a singlesolution for the mean parallax and proper motion is treated asan iterative least squares problem in Gaia Collaboration et al(2017b) (see their Appendix A for details)

54 Inferring the distance and velocity of a source using theparallax and proper motions

The velocity (speed and direction) of a source in the planeof the sky can be inferred from measurements of its parallaxand two proper motions The uncertainties in all three affectthe inferred velocity and its uncertainty Moreover as theGaia parallaxes and proper motions generally have non-zerocorrelations these must also be taken into account Thiscan be done in a straightforward manner in the Bayesianapproach as is shown in the tutorial at httpsgithubcomagabrownastrometry-inference-tutorialstreemaster3d-distance This sets up a three-parametermodel (distance speed angle) for a source Using the threemeasurements (parallax two proper motions) in a multivariateGaussian likelihood and suitable priors on the parameters wecan compute the trivariate posterior This is sampled in theposterior using an MCMC algorithm for a set of stars

Article number page 17 of 20

AampA proofs manuscript no 32964_Arxiv

55 Luminosity calibration

In this tutorial (httpsgithubcomagabrownastrometry-inference-tutorialstreemasterluminosity-calibration) the problem of inferring (orcalibrating) the mean absolute magnitude of a specific classof stars is treated The measurements at hand are the parallaxand apparent magnitude for each of the stars in the sampleand the task is to infer their mean absolute magnitude microM andthe spread σM around this mean This is very similar to theproblem that Lutz amp Kelker (1973) and Turon Lacarrieu ampCreacutezeacute (1977) treated and a Bayesian approach to solving thisproblem was presented by Brown (2012) (albeit with the use ofimproper priors which we again note is bad practice) A morecomplex version of this problem (accounting for extinction anda contaminating population of stars) and its Bayesian solutionwas presented in Hawkins et al (2017) In this tutorial threeimportant points are illustrated

ndash Often the explicit computation of the distances to stars is notof interest In this example only the mean absolute magni-tude of the stars is to be estimated and the forward mod-elling approach as part of the Bayesian inference avoids theneed to calculate or estimate distances

ndash The data for all the stars carry information on the mean ab-solute magnitude including the negative parallaxes or par-allaxes with large relative errors This information can nat-urally be incorporated in a forward modelling approach (inthis example as part of a Bayesian inference method) thusavoiding the introduction of truncation biases caused by theselection of stars with lsquogoodrsquo parallaxes

ndash If the selection function is known (in this example the surveyis magnitude limited) it can and should be included in theforward modelling This accounts for sample selection biasesthat would otherwise occur

56 Period-luminosity relation

In this tutorial (httpsgithubcomagabrownastrometry-inference-tutorialstreemasterperiod-luminosity-relation) we include a hierarchicalmodel to infer period-luminosity-metallicity relations forclassical pulsating stars The full model can be applied tofundamental mode RR Lyrae stars and the abridged version(without the metallicity dependence) is suitable for samples ofclassical Cepheids We include the data set for the RR Lyraestars described and used for inference in Gaia Collaborationet al (2017a) and Delgado et al (2018) It contains a sampleof 200 stars (including fundamental radial pulsators but alsofundamentalised first overtone pulsators) with measured pe-riods apparent magnitudes in the K-band metallicities andparallaxes from the TGAS catalogue In the tutorial we describethe hierarchical model and discuss potential biases in the dataset Finally we analyse the sensitivity of the results to differentchoices of the priors and related parameters

6 Conclusions

Gaia data releases will provide a huge increase of astrometricdata available for the scientific community More than a billionparallaxes and proper motions allow new openings into many as-tronomical topics In most cases astronomers are exploiting theGaia catalogues to obtain physical quantities such as distanceand velocity Although it is easy to extract from the Gaia data it

is well known that direct inversion of parallax will lead to biaseswhich become more and more significant the larger the relativeparallax uncertainty While Gaia will provide high-quality astro-metric measurements hundreds of millions of stars have preci-sions which require proper statistical treatment in order to avoidbiased conclusions The aim of this paper is to guide the users ofGaia data to handle astrometric data correctly

In this study we summarise methods used to avoid bi-ases when converting astrometric data into physical quantitiesStarting from simple non-recommended sample truncation tomore complex methods the biases associated with the meth-ods are presented The basic recommendation is to treat deriva-tion of physical quantities from astrometric measurements asan inference problem which should be preferably handled withBayesian approach The recommended methods are described inSect 4 with a summary in Sect 43 To aid the users furtherSect 5 contains practical examples with links to Python and Rcode

Gaia will provide fundamental data for many fields of as-tronomy Further data releases will provide more data and moreprecise data Nevertheless for full use of the potential it will al-ways be necessary to pay careful attention to the statistical treat-ment of parallaxes and proper motions The purpose of this paperis to help astronomers find the correct approachAcknowledgements This work has made use of results from the European SpaceAgency (ESA) space mission Gaia the data from which were processed by theGaia Data Processing and Analysis Consortium (DPAC) Funding for the DPAChas been provided by national institutions in particular the institutions partici-pating in the Gaia Multilateral Agreement The Gaia mission website is httpwwwcosmosesaintgaia The authors are current or past members of theESA Gaia mission team and of the Gaia DPAC This work was supported by theMINECO (Spanish Ministry of Economy) through grant ESP2016-80079-C2-1-R (MINECOFEDER UE) and ESP2014-55996-C2-1-R (MINECOFEDERUE) and MDM-2014-0369 of ICCUB (Unidad de Excelencia lsquoMariacutea de Maeztursquo)and by the DLR (German space agency) via grant 50 QG 1403

ReferencesAnderson L Hogg DW Leistedt B Price-Whelan AM Bovy J Jun 2017

ArXiv e-printsArenou F Luri X 1999 In Egret D Heck A (eds) Harmonizing Cosmic

Distance Scales in a Post-HIPPARCOS Era vol 167 of Astronomical Societyof the Pacific Conference Series 13ndash32

Arenou F Luri X Babusiaux C et al Feb 2017 AampA 599 A50Astraatmadja TL Bailer-Jones CAL Dec 2016a ApJ 832 137Astraatmadja TL Bailer-Jones CAL Dec 2016b ApJ 833 119Bailer-Jones CAL Oct 2015 PASP 127 994Bovy J Sep 2017 MNRAS 470 1360Brown AGA Nov 2012 Statistical AstrometryBrown AGA Arenou F van Leeuwen F Lindegren L Luri X Aug 1997

In Bonnet RM Hoslashg E Bernacca PL et al (eds) Hipparcos - Venice rsquo97vol 402 of ESA Special Publication 63ndash68

de Bruijne JHJ Hoogerwerf R de Zeeuw PT Feb 2001 AampA 367 111Delgado HE Sarro LM Clementini G Muraveva T Garofalo A Mar 2018

ArXiv e-printsEddington AS Mar 1913 MNRAS 73 359ESA (ed) 1997 The HIPPARCOS and TYCHO catalogues Astrometric and

photometric star catalogues derived from the ESA HIPPARCOS Space As-trometry Mission vol 1200 of ESA Special Publication (available online athttpswwwcosmosesainthipparcos)

Gaia Collaboration Brown AGA Vallenari A et al Nov 2016 AampA 595A2

Gaia Collaboration Clementini G Eyer L et al Sep 2017a AampA 605 A79Gaia Collaboration van Leeuwen F Vallenari A et al May 2017b AampA 601

A19Gaia Collaboration Arenou F Luri X et al Apr 2018a AampA (special issue

for Gaia DR2)Gaia Collaboration Brown AGA Vallenari A Prusti T et al Apr 2018b

AampA (special issue for Gaia DR2)Gaia Collaboration Lindegren L Hernaacutendez J et al Apr 2018c AampA (special

issue for Gaia DR2)

Article number page 18 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Hawkins K Leistedt B Bovy J Hogg DW Oct 2017 MNRAS 471 722Iorio G Belokurov V Erkal D et al Feb 2018 MNRAS 474 2142Leistedt B Hogg DW Dec 2017a AJ 154 222Leistedt B Hogg DW Mar 2017b ArXiv e-printsLindegren L Madsen S Dravins D Apr 2000 AampA 356 1119Lindegren L Lammers U Hobbs D et al Feb 2012 AampA 538 A78Lindegren L Lammers U Bastian U et al Nov 2016 AampA 595 A4Lutz TE Kelker DH Oct 1973 PASP 85 573Malmquist GK Mar 1920 Meddelanden fran Lunds Astronomiska Observato-

rium Serie II 22 3Palmer M Arenou F Luri X Masana E Apr 2014 AampA 564 A49Robin C Luri X Reyleacute C et al 2012 AampA 543 A100Schoumlnrich R Aumer M Dec 2017 MNRAS 472 3979Schoumlnrich R Binney J Asplund M Feb 2012 MNRAS 420 1281Schroumlder SE Kaper L Lamers HJGLM Brown AGA Dec 2004 AampA

428 149Sesar B Fouesneau M Price-Whelan AM et al Apr 2017 ApJ 838 107Smith H Jr Eichhorn H Jul 1996 MNRAS 281 211Trumpler RJ Weaver HF 1953 Statistical astronomyTuron Lacarrieu C Creacutezeacute M Apr 1977 AampA 56 273Zinn JC Huber D Pinsonneault MH Stello D Aug 2017 ApJ 844 166

Article number page 19 of 20

AampA proofs manuscript no 32964_Arxiv

Appendix A Description of the simulated samplesused in this paper

The data used in this paper is from a Gaia Universe Model Snap-shot (GUMS) simulation (Robin C et al 2012) together withthe Gaia-like uncertainties and an estimation of the observabledata The uncertainties were computed from an implementationof the recipes described in the Gaia Science performance webpage4 provided by the python PyGaia toolkit5 which then haveto be re-scaled to fit the Gaia DR2 expectations

The simulation contains around 109 sources including onlysingle stars ie stars not belonging to multiple systems up toG lt 20 magnitude distributed as shown in Fig A1

0 5 10 15 20 25G magnitude [mag]

100

101

102

103

104

105

106

107

108

109

Num

ber

of

sourc

es

[log]

G magnitude histogram

Fig A1 Histogram of the starsrsquo G magnitude The total number ofsources is 1 069 138 714 which are distributed in bins of size ∆G = 02the sample is limited to G lt 20

To compute the Gaia potential observables standard uncer-tainties must be added to the simulation Because astrometricquantities (positions parallaxes and proper motions) are relateda single formalism to derive its standard errors is needed PyGaiaimplements a simple performance model depending on the VndashICcolour term and the G magnitude to estimate the end-of-missionerrors for the parallax uncertainty

σ$[microas] =radicminus1631 + 680766 z + 32732 z2 [0986+(1 minus 0986) (V-IC)]

(A1)

with

z = max(1004(1209minus15)1004(Gminus15)) (A2)

(see the Gaia science performance web for more details) It alsotakes into account the variation of the uncertainties over the skybecause of the scanning law tabulated as a function of the eclip-tic latitude6 β

However these are end-of-mission uncertainties so theyhave to be scaled by the fraction of mission time completed in or-der to get an estimation of them for Gaia DR2 In the case of the

parallax only the factorradic

5L being L the mission time (years)

included in Gaia DR2 need to be applied this error model is

4 httpswwwcosmosesaintwebgaiascience-performance5 httpspypipythonorgpypiPyGaia6 httpswwwcosmosesaintwebgaiatable-2-with-ascii

further described in Arenou et al (2017) In our case we havealso updated the calibration floor to take into account the prop-erties of the Gaia DR2 formal errors as shown in Fig A2 Thiscalibration floor introduces a minimum formal error stemmingfrom the fact that the calibrations used in the data processing(models and parameters) are at this stage still being refined Thisfloor affects mainly bright stars while for faint stars the photonnoise dominates Fig A2 shows the model used in the simula-tion for the Gaia DR2 parallax uncertainties as a function of theG magnitude

Fig A2 Average estimated errors as function of G The dash-dottedline represents the uncertainty at Gaia DR2 while the dashed linerepresents the end-of-mission uncertainties The solid line representsGaia DR2 errors including systematics

Article number page 20 of 20

  • 1 Introduction
  • 2 Gaia astrometric data
    • 21 Uncertainties
    • 22 Correlations
    • 23 Systematic errors
    • 24 Completeness
      • 3 Critical review of the traditional use of parallaxes
        • 31 Measurement of parallaxes
        • 32 Estimating distance by inverting the parallax
        • 33 Sample truncation
        • 34 Corrections and transformations
          • 341 The SmithndashEichhorn correction
          • 342 The LutzndashKelker correction
            • 35 Astrometry-based luminosity
              • 4 Recommendations for using astrometric data
                • 41 Using Gaia astrometric data how to proceed
                  • 411 Bayesian inference of distances from parallaxes
                  • 412 Bayesian inference of distances from parallaxes and complementary information
                    • 42 Absolute magnitudes tangential velocities and other derived quantities
                    • 43 Further recommendations
                      • 5 Using astrometric data practical examples
                        • 51 Comparison of distance and distance modulus estimators
                        • 52 Inferring the distance to a single source using just the parallax
                        • 53 Inferring the distance to and size of a cluster using just the parallaxes
                        • 54 Inferring the distance and velocity of a source using the parallax and proper motions
                        • 55 Luminosity calibration
                        • 56 Period-luminosity relation
                          • 6 Conclusions
                          • A Description of the simulated samples used in this paper
Page 3: Gaia Data Release 2: Using Gaia parallaxes · The Gaia Data Release 2 (Gaia DR2 ) Gaia Collaboration et al. (2018b) provides precise positions, proper motions, and paral-laxes for

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

for instance Brown et al (1997) and Lindegren et al (2000)In addition to error propagation the covariance matrix shouldalso be taken into account when estimating model parametersfor example in chi-square fitting maximum likelihood estimatesBayesian analysis etc For more details see Volume 1 Section15 of ESA (1997)

23 Systematic errors

Both the design of the spacecraft and the design and implemen-tation of the data processing software and algorithms aim to pre-vent biases or systematic effects in the astrometry Systematicerrors at low levels nonetheless exist in Gaia DR2 (see GaiaCollaboration et al 2018ac) Systematic effects are complicatedand largely unknown functions of position on the sky magni-tude and colour Although systematic effects are not dealt within the remainder of this paper it is important for users to beaware of their presence

The parallaxes and proper motions in Gaia DR2 may beaffected by systematic errors Although the precise magnitudeand distribution of these errors is unknown they are believedto be limited on global scales to plusmn01 mas for parallaxes andplusmn01 mas yrminus1 for proper motions There is a significant aver-age parallax zero-point shift of about minus30 microas in the sense Gaiaminus external data This shift has not been corrected for andis present in the published data Significant spatial correlationsbetween stars up to 004 mas in parallax and 007 mas yrminus1 inproper motion exist on both small (ltsim1) and intermediate (ltsim20)angular scales As a result averaging parallaxes over small re-gions of the sky for instance in an open cluster in the Mag-ellanic Clouds or in the Galactic Centre will not reduce theuncertainty on the mean below the sim01 mas level

Unfortunately there is no simple recipe to account for thesystematic errors The general advice is to proceed with the anal-ysis of the Gaia DR2 data using the uncertainties reported in thecatalogue ideally while modelling systematic effects as part ofthe analysis and to keep the systematics in mind when interpret-ing the results

24 Completeness

As argued in the next sections a correct estimation requires fullknowledge of the survey selection function Conversely neglect-ing the selection function can causes severe biases Derivationof the selection function is far from trivial yet estimates havebeen made for Gaia DR1 (TGAS) by for instance Schoumlnrich ampAumer (2017) and Bovy (2017)

This paper does not intend to define the survey selectionfunction We merely limit ourselves to mentioning a number offeatures of the Gaia DR2 data that should be properly reflectedin the selection function The Gaia DR2 catalogue is essentiallycomplete between G asymp 12 and sim17 mag Although the com-pleteness at the bright end (G in the range sim3ndash7 mag) has im-proved compared to Gaia DR1 a fraction of bright stars inthis range is still missing in Gaia DR2 Most stars brighterthan sim3 mag are missing In addition about one out of everyfive high-proper-motion stars (micro gtsim 06 arcsec yrminus1) is miss-ing Although the onboard detection threshold at the faint end isequivalent to G = 207 mag onboard magnitude estimation er-rors allow Gaia to see fainter stars although not at each transitGaia DR2 hence extends well beyond G = 20 mag However indense areas on the sky (above sim400 000 stars degminus2) the effec-tive magnitude limit of the survey can be as bright as sim18 mag

The somewhat fuzzy faint-end limit depends on object density(and hence celestial position) in combination with the scan-lawcoverage underlying the 22 months of data of Gaia DR2 and thefiltering on data quality that has been applied prior to publica-tion This has resulted in some regions on the sky showing artifi-cial source-density fluctuations for instance reflecting the scan-law pattern In small selected regions gaps are present in thesource distribution These are particularly noticeable near verybright stars In terms of effective angular resolution the resolu-tion limit of Gaia DR2 is sim04 arcsec

Given the properties of Gaia DR2 summarised above the in-terpretation of the data is far from straightforward This is partic-ularly true when accounting for the incompleteness in any sam-ple drawn from the Gaia Archive We therefore strongly encour-age the users of the data to read the papers and documentationaccompanying Gaia DR2 and to carefully consider the warningsgiven therein before drawing any conclusions from the data

3 Critical review of the traditional use of parallaxes

We start this section by briefly describing how parallaxes aremeasured and how the presence of measurement noise leads tothe occurrence of zero and negative observed parallaxes In therest of the section we review several of the most popular ap-proaches to using measured parallaxes ($) to estimate distancesand other astrophysical parameters In doing so we will attemptto highlight the intricacies pitfalls and problems of these lsquotradi-tionalrsquo approaches

31 Measurement of parallaxes

In simplified form astrometric measurements (source positionsproper motions and parallaxes) are made by repeatedly deter-mining the direction to a source on the sky and modelling thechange of direction to the source as a function of time as a com-bination of its motion through space (as reflected in its propermotion and radial velocity) and the motion of the observing plat-form (earth Gaia etc) around the Sun (as reflected in the par-allax of the source) As explained in more detail in Lindegrenet al (2016) and Lindegren et al (2012) this basic model of thesource motion on the sky describes the time-dependent coordi-nate direction from the observer towards an object outside thesolar system as the unit vector

u(t) = 〈r + (tB minus tep)(pmicroαlowast + qmicroδ + rmicror) minus$bO(t)Au〉 (5)

where t is the time of observation and tep is a reference time bothin units of Barycentric Coordinate Time (TCB) p q and r areunit vectors pointing in the direction of increasing right ascen-sion increasing declination and towards the position (α δ) ofthe source respectively tB is the time of observation correctedfor the Roslashmer delay bO(t) is the barycentric position of the ob-server at the time of observation Au is the astronomical unitand 〈〉 denotes normalisation The components of proper mo-tion along p and q are respectively microαlowast = microα cos δ and microδ $is the parallax and micror = vr$Au is the lsquoradial proper motionrsquowhich accounts for the fact that the distance to the star changesas a consequence of its radial motion which in turn affects theproper motion and parallax The effect of the radial proper mo-tion is negligibly small in most cases and can be ignored in thepresent discussion

The above source model predicts the well-known helix orwave-like pattern for the apparent motion of a typical source onthe sky A fit of this model to noisy observations can lead to

Article number page 3 of 20

AampA proofs manuscript no 32964_Arxiv

negative parallaxes as illustrated in Fig 2 We note how in thesource model described in Eq (5) the parallax appears as thefactor minus$ in front of the barycentric position of the observerwhich means that for each source its parallactic motion on thesky will have a sense which reflects the sense of the motion ofthe observer around the Sun In the presence of large measure-ment noise (comparable to the size of the parallax) it is entirelypossible that the parallax value estimated for the source modelvanishes or becomes negative This case can be interpreted as themeasurement being consistent with the source going lsquothe wrongway aroundrsquo on the sky as shown in Fig 2

2 1 0 1 2 3 4 [mas]

8

6

4

2

0

2

[mas

]

True = 040 = 100 = 200Fit = 049 = 113 = 169Observations

2101234

[m

as]

2015 2016 2017 2018 2019Time [yr]

8

6

4

2

0

2

[mas

]

Fig 2 Example of a negative parallax arising from the astrometric dataprocessing Solid blue lines true path of the object red dots the in-dividual measurements of the source position on the sky dashed or-ange lines the source path according to the least-squares astrometricsolution which here features a negative parallax Left Path on the skyshowing the effect of proper motion (linear trend) and parallax (loops)Right Right ascension and declination of the source as a function oftime In the fitted solution the negative parallax effect is equivalent to ayearly motion of the star in the opposite direction of the true parallacticmotion (which gives a phase-shift of π in the sinusoidal curves in theright panels) The error bars indicate a measurement uncertainty of 07mas the uncertainties on ∆αlowast and ∆δ are assumed to be uncorrelated

This example is intended to clarify why parallaxes can havenon-positive observed values and more importantly to conveythe message that the parallax is not a direct measurement ofthe distance to a source The distance (or any other quantity de-pending on distance) has to be estimated from the observed par-allax (and other relevant information) taking into account theuncertainty in the measurement A simplified demonstration ofhow negative parallaxes arise (allowing the reader to reproduceFig 2) can be found in the online tutorials accompanying thispaper 3

32 Estimating distance by inverting the parallax

In the absence of measurement uncertainties the distance to astar can be obtained from its true parallax through r = 1$Truewith$True indicating the true value of the parallax Thus naivelywe could say that the distance to a star can be obtained by in-verting the observed parallax ρ = 1$ where now ρ is usedto indicate the distance derived from the observed value of theparallax For this discussion the observed parallax is assumed to

3 httpsgithubcomagabrownastrometry-inference-tutorialstreemasterluminosity-calibrationDemoNegativeParallaxipynb

be free of systematic measurement errors and to be distributednormally around the true parallax

p($ | $True) =1

σ$radic

2πexp

(minus

($ minus$True)2

2σ2$

) (6)

where σ$ indicates the measurement uncertainty on $ Blinduse of 1$ as an estimator of the distance will lead to unphys-ical results in case the observed parallax is non-positive Nev-ertheless we could still consider the use of the 1$ distanceestimate for positive values for instance a sample where mostor all of the observed values are positive or in the limiting casewhere there is a single positive parallax value In this case it iscrucial to be aware of the statistical properties of the estimate ρGiven a true distance r = 1$True what will be the behaviourof ρ We can obtain the probability density function (PDF) of ρfrom Eq (6) as

p(ρ | $True) = p($ = 1ρ | $True) middot∣∣∣∣∣d$dρ

∣∣∣∣∣=

1

ρ2σ$radic

2πexp

(minus

(1ρ minus$True)2

2σ2$

)(7)

In Fig 3 we depict p(ρ | $True) for two extreme cases of verylow and very high relative uncertainty The shape of p(ρ | $True)describes what we can expect when using ρ as an estimate ofthe true distance r The distribution of the figure on the left cor-responds to a case with a low fractional parallax uncertaintydefined as f = σ$$True It looks unbiased and symmetricalThus using ρ = 1$ to estimate the distance in a case like thisis relatively safe and would lead to more or less reliable resultsHowever in spite of its appearance the figure hides an intrin-sic non-Gaussianity that is made evident in the right-hand fig-ure This second plot corresponds to the case of high fractionalparallax uncertainty and the distribution shows several featuresfirst the mode (the most probable value) does not coincide withthe true distance value second the distribution is strongly asym-metric and finally it presents a long tail towards large values ofρ For more extreme values of f there is a noticeable negativetail to this distribution corresponding to the negative tail of theobserved parallax distribution

Fig 3 PDF of ρ = 1$ in two extreme cases The red vertical line indi-cates the true distance r Left Object at r = 100 pc with an uncertaintyon the observed parallax of σ$ = 03 mas Right Object at r = 2000 pcwith an uncertainty on the observed parallax of σ$ = 03 mas

Article number page 4 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

In view of Fig 3 it is tempting to apply corrections to theρ estimator based on the value of the fractional parallax uncer-tainty f Unfortunately in order to do so we would need to knowthe true value of the parallax and f Using the apparent fractionaluncertainty fapp = σ$$ is not feasible since the denominatorin f (the true parallax) can be very close to zero so its distribu-tion has very extended wings and using fapp will often result ingross errors

Furthermore reporting a ρ value should always be accompa-nied by an uncertainty estimate usually the standard deviationof the estimator but the standard deviation or the variance is de-fined in terms of an unknown quantity $True In addition thelong tail shown in the right panel of Fig 3 makes the estimatesof the variance quickly become pathological as discussed below

In order to clarify the previous assertions we recall the clas-sical concept of bias because it plays a central role in the discus-sion that develops in this section In statistics an estimator is saidto be biased if its expected value differs from the true value Inour case we aim to infer the true value of the parallax $True (oralternatively related quantities such as the true distance r ab-solute magnitude luminosity or 3D velocity components) andwe aim to infer it from the measured parallax In the Gaia casethis measured parallax will be affected by quasi-Gaussian uncer-tainties (see Sect 21) In this case the expectation value of theobserved parallax coincides with the true value

E[$] =

int$p($|$True)middotd$ =

int$N($$True σ$)middotd$ = $True

(8)

where N($$True σ$) represents the Gaussian probability dis-tribution centred at the true parallax and with a standard devia-tion σ$ Hence the observed parallax is an unbiased estimatorof the true parallax (under the strong hypothesis that there are nosystematic biases associated with the survey and that the errorsare normally distributed)

Now in order to assess the bias of ρ = 1$ as an estimatorof the true distance we need to calculate its expected value

E[ρ] = E[1$] =

int1$middotp($|$True)middotd$ =

int1$middotN($True σ$)middotd$

(9)

This bias was approximated by Smith amp Eichhorn (1996) (seeSect 341) as a function of the fractional parallax uncertainty fusing a series expansion of the term in the integral and severalapproximations for the asymptotic regimes of small and largevalues of f and it indeed shows that the distance estimator 1$is unbiased for vanishingly small values of f but it rapidly be-comes significantly biased for values of f beyond 01 But notonly is 1$ a biased estimator of the true distance it is also ahigh-variance estimator The reason for this variance explosionis related to the long tail towards large distances illustrated in theright panel of Figs 3 and 4 Relatively large fractional uncertain-ties inevitably imply noise excursions in the parallax that resultin vanishingly small observed parallaxes and disproportionatedistances (and hence an inflation of the variance)

The effects discussed above can be illustrated with the use ofsimulated data Figure 4 shows the results of a simulation of ob-jects located between 05 and 2kpc where starting from the truedistances we have simulated observed parallaxes with a Gaus-sian uncertainty of σ$ = 03 mas and then calculated for eachobject ρ = 1$

The figure on the left shows that (by construction) the errorsin the observed parallaxes are well behaved and perfectly sym-metrical (Gaussian) while in the centre figure the errors in theestimation of distances using ρ show a strong asymmetry Thecharacteristics of these residuals depend on the distribution oftrue distances and uncertainties This is more evident in the fig-ure on the right where the true distance r is plotted against ρthere is a very prominent tail of overestimated distances and thedistribution is asymmetrical around the one-to-one line the moredistant the objects the more marked the asymmetry These fea-tures are very prominent because we have simulated objects sothat the relative errors in parallax are large but they are present(albeit at a smaller scale) even when the relative errors are small

The plots in Fig 4 correspond to a simple simulation with amild uncertainty σ$ = 03 mas Figure 5 shows the same plotsfor a realistic simulation of the Gaia DR2 data set The simula-tion is described in Appendix A in this case the errors in paral-lax follow a realistic model of the Gaia DR2 errors depicted inFig A2

As a summary we have seen in previous paragraphs thatthe naive approach of inverting the observed parallax has sig-nificant drawbacks we are forced to dispose of valuable data(non-positive parallaxes) and as an estimator ρ = 1$ is biasedand has a very high variance

33 Sample truncation

In addition to the potential sources of trouble described in theprevious sections the traditional use of samples of parallaxes in-cludes a practice that tends to aggravate these effects truncationof the used samples

As discussed in Sect 31 negative parallaxes are a naturalresult of the Gaia measurement process (and of astrometry ingeneral) Since inverting negative parallaxes leads to physicallymeaningless negative distances we are tempted to just get rid ofthese values and form a lsquocleanrsquo sample This results in a biasedsample however

On the one hand removing the negative parallaxes biases thedistribution of this parameter Consider for instance the case il-lustrated in Fig 1 for the quasars from the AllWISE catalogueThese objects have a near zero true parallax and the distribu-tion of its observed values shown in the figure corresponds tothis with a mean of minus10 microas close to zero However if we re-move the negative parallaxes from this sample deeming themlsquounphysicalrsquo the mean of the observed values would be signif-icantly positive about 08 mas This is completely unrealisticfor quasars in removing the negative parallaxes we have signif-icantly biased the observed parallax set for these objects Withsamples of other types of objects with non-zero parallaxes theeffect can be smaller but it will be present

On the other hand when by removing negative parallaxesthe contents of the sample are no longer representative of thebase population from which it has been extracted since stars withlarge parallaxes are over-represented and stars with small paral-laxes are under-represented This can be clearly illustrated us-ing a simulation We have generated a sample of simulated starsmimicking the contents of the full Gaia DR2 (see Appendix A)and truncated it by removing the negative parallaxes In Fig 6we can compare the distribution of the true distances of the orig-inal (non-truncated) sample and the resulting (truncated) sam-ple it is clear that after the removal of negative parallaxes wehave favoured the stars at short distances (large parallaxes) withrespect to the stars at large distances (small parallaxes) The spa-

Article number page 5 of 20

AampA proofs manuscript no 32964_Arxiv

Fig 4 Behaviour of PDF of ρ = 1$ as estimator of the true distance Left Histogram of differences between true parallaxes and observedparallaxes Centre Histogram of differences between true distances and their estimation using ρ Right Comparison of the true distances and theirestimations using ρ The observed parallaxes $ have been simulated using an uncertainty of σ$ = 03 mas

Fig 5 Behaviour of PDF of ρ = 1$ as estimator of the true distance for a simulation of the full Gaia DR2 data set Left Histogram of differencesbetween true parallaxes and observed parallaxes Centre Histogram of differences between true distances and their estimation using ρ RightComparison of the true distances and their estimations using ρ The observed parallaxes $ have been simulated using a realistic Gaia DR2 errormodel described in the Appendix The contour lines correspond to the distribution percentiles of 35 55 90 and 98

tial distribution of the sample has thus been altered and maytherefore bias any analysis based on it

Fig 6 Effect of removing the negative and zero parallaxes from a sim-ulation of Gaia DR2 Distribution of true distances histogram of dis-tances for the complete sample (thick line) histogram of distances forthe sample truncated by removing $ le 0 (thin line)

A stronger version of truncation that has traditionally beenapplied is to remove not only negative parallaxes but also all theparallaxes with a relative error above a given threshold k select-

ing σ$$

lt k This selection tends to favour the removal of starswith small parallaxes The effect is similar to the previous casebut more accentuated as can be seen in Fig 7 Again stars atshort distances are favoured in the sample with respect to distantstars

Fig 7 Effect of removing the positive parallaxes with a relative errorabove 50 as well as negative parallaxes Thick line histogram of dis-tances for the complete sample Thin line histogram of distances forthe sample truncated by removing objects with (|σ$

$| gt 05)

Article number page 6 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Even worse as in the previous case the truncation not onlymakes the distribution of true distances unrepresentative but italso biases the distribution of observed parallaxes stars withpositive errors (making the observed parallax larger than thetrue one) tend to be less removed than stars with negative er-rors (making the observed parallax smaller than the true one)By favouring positive errors with respect to negative errors weare also biasing the overall distribution of parallaxes Figure 8depicts this effect The plots show the difference $ minus$True as afunction of $True We can see in the middle and bottom figureshow the removal of objects is non-symmetrical around the zeroline so that the overall distribution of$minus$True becomes biasedFrom an almost zero bias for the full sample (as expected fromGaia in absence of systematics) we go to significant biases oncewe introduce the truncation and the bias is dependent of the cutvalue we introduce

Furthermore in Gaia the parallax uncertainties vary withthe object magnitude being larger for faint stars (see Fig A2)Therefore a threshold on the relative error will favour brightstars over the faint ones adding to the above described biases

Another type of truncation that has been traditionally appliedis to introduce limits in the observed parallax The effects of sucha limit are closely related to the LutzndashKelker bias discussed inSect 342 Suffice it here to illustrate the effect with a specificexample on a Gaia DR2 -like sample If we take the full sampleand remove stars with $ lt 02 mas we could imagine that weare roughly removing objects further away than 5 kpc Howeverthe net result is depicted in Fig 9 where we can see that insteadof the distribution of true distances of the complete sample up to5 kpc (solid line) we get a distribution with a lack of closer starsand a long tail of stars with greater distances A larger limit inparallax (shorter limiting distance) will produce a less prominenteffect since the relative errors in the parallax will be smaller butthe bias will be nonetheless present

In conclusion our advice to readers is to avoid introducingtruncations when using Gaia data since as illustrated above theycan strongly affect the properties of the sample and thereforeaffect the data analysis If truncation is unavoidable it should beincluded in the Bayesian modelling of the overall problem (seeSect 43)

34 Corrections and transformations

In this section we review proposals in the literature for the useof parallaxes to estimate distances In general they take the formof lsquoremediesrsquo to correct one or another problem on this use Herewe explain why these remedies cannot be recommended

341 The SmithndashEichhorn correction

Smith amp Eichhorn (1996) attempt to compensate for the bias in-troduced by the naive inversion of the observed parallax (andthe associated variance problem) in two different ways The firstinvolves transforming the measured parallaxes into a pseudo-parallax $lowast according to

$lowast equiv β middot σ$

1exp(φ) + exp(minus16$

σ$)

+ φ

(10)

where φ equiv ln(1 + exp( 2$σ$

))2 and β is an adjustable constantThe qualitative effect of the transformation is to map negativeparallaxes into the positive semi-axis R+ and to increase the

Fig 8 Differences between true and observed parallax effect of remov-ing observed negative parallaxes and those above a given relative errorWe start with a representative subsample of 1 million stars (top fig-ure) and truncate it according to the apparent relative parallax precisionTop Complete sample Mean difference $ minus$True is 155 times 10minus5 masMiddle Retaining only objects with positive parallaxes and |σ$

$| lt 05

The mean difference $minus$True is 0164 mas Bottom Retaining onlyobjects with positive parallaxes and |σ$

$| lt 02 The mean difference

$ minus$True is 0026 mas

value of small parallaxes until it asymptotically converges to themeasured value for large $ For $ = 0 φ = ln(2)2 and thepseudo-parallax value $lowast = β middot σ$

(1

1+exp(ln(2)2) +ln(2)

2

)has the

undesirable property of depending on the choice of β and on theparallax uncertainty Thus even in the case of a small parallaxmeasurement $ rarr 0 with a relatively small fractional parallax

Article number page 7 of 20

AampA proofs manuscript no 32964_Arxiv

Fig 9 Effect of introducing a limit on the observed parallax Thick linehistogram of distances up to 5 kpc for the complete sample Thin linehistogram of distances for the sample truncated by removing objectswith $ lt 02 mas (distance estimated as 1$ up to 5 kpc)

uncertainty (eg f = 01) we substitute a perfectly useful andvaluable measurement by an arbitrary value of $lowast

The SmithndashEichhorn transformation is an arbitrary (andrather convolved) choice amongst many such transformationsthat can reduce the bias for certain particular situations Both theanalytical expression and the choice of constants and β are theresult of an unspecified trial-and-error procedure the applicabil-ity of which is unclear Furthermore as stated by the authorsthey introduced a new bias because the transformed parallax isalways larger than the measured parallax This new bias has nophysical interpretation because it is the result of an ad hoc choicefor the analytical expression in Eq (10) It is designed to reducethe bias but it does so by substituting perfectly reasonable directmeasurements (negative and small parallaxes) that we can inter-pret and use for inference by constructed values arising from thechoice

342 The LutzndashKelker correction

Lutz amp Kelker (1973) realised that the spatial distribution ofsources around the observer together with the unavoidable ob-servational errors and a truncation of the sample on the valueof the observed parallax result in systematic biases in the av-erage parallax of certain stellar samples The bias described byLutz amp Kelker (1973) is a manifestation of the truncation bi-ases described above and can be understood if we look at a fewvery simple examples First let us imagine a density of sourcesaround the observer such that the distribution of true parallaxesp($True) is constant in a given interval and zero outside Let usalso imagine that the observation uncertainty σ$ is constant andequal to 03 mas The left panel of Fig 10 shows the distributionof true and observed parallaxes for a simulation of such a situ-ation and 106 sources We see that the observed parallaxes arealso approximately uniform and the departures from uniformityappear near the edges For a given bin of intermediate parallaxwe have as many sources contaminating from neighbouring binsas we have sources lost to other bins due to the observationaluncertainties and the result is a negligible net flux

In the middle panel we show the same histograms exceptthat instead of a constant parallax uncertainty we use a constantvalue of the fractional parallax uncertainty f This is still an un-realistic situation because the distribution of true parallaxes is

not uniform and also because a constant f implies that largerparallaxes are characterised by larger uncertainties However ithelps us to illustrate that even in the case of uniform true par-allaxes we may have a non-zero net flux of sources betweendifferent parallax bins depending on the distribution of parallaxuncertainties In this case noise shifts the value of large paral-laxes more than that of smaller parallaxes (because of the con-stant value of f ) so larger parallaxes get more scattered thussuppressing the distribution more at large parallaxes

Finally the right panel shows the same plot for a realisticdistribution of distances from a Gaia Universe Model Snapshot(GUMS) sample (see Appendix A for a full description) We seethat the effect of a realistic non-uniform distribution of parallaxesand parallax uncertainties results in a net flux in the oppositedirection (smaller true parallaxes become more suppressed andlarger parallax bins are enhanced in both cases the bins of neg-ative parallaxes become populated) This is the root of the LutzndashKelker bias It is important to distinguish between the LutzndashKelker bias and the LutzndashKelker correction The LutzndashKelkerbias is the negative difference for any realistic sample betweenthe average true parallax and the average measured parallax (iethe average true parallax is smaller than the average measuredparallax) This bias has been known at least since the work ofTrumpler amp Weaver (1953) although it was already discussed ina context different from parallaxes as early as Eddington (1913)The LutzndashKelker correction presented in Lutz amp Kelker (1973)and discussed below is an attempt to remedy this bias based ona series of assumptions

In the case of Gaussian uncertainties such as those describedin Eq (6) it is evident that the probability of measuring a value ofthe parallax greater than the true parallax p($ gt $True|$True) =05 The same value holds for the probability that $ lt $Truebecause the Gaussian distribution is symmetrical with respectto the true value of the parallax This is also true for the jointprobability of $ and $True

p($ gt $True) =

S

p($$True) middot d$ middot d$True = 05 (11)

where S is the region of the ($$True) plane where $ gt $TrueHowever the probability distribution of the true parallax

given the observed parallax p($True|$) does not fulfil this seem-ingly desirable property of probability mass equipartition at the$ = $True point We can write the latter probability as

p($True|$) =p($$True)

p($)=

p($|$True) middot p($True)p($)

(12)

using the product rule of probability In the context of inferringthe true parallax from the observed one Eq (12) is the well-known Bayesrsquo theorem where the left-hand side is the posteriorprobability p($ | $True) is the likelihood p($True) is the priorand p($) is the evidence For most realistic prior distributionsp($True) neither the median nor the mode or the mean of theposterior in Eq (12) is at $ = $True Let us take for examplea uniform volume density of sources out to a certain distancelimit In such a distribution the number of sources in a sphericalshell of infinitesimal width at a distance r scales as r2 as doesthe probability distribution of the distances Since

p(r) middot dr = p($True) middot d$True (13)

Article number page 8 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Uniform ϖTrue and constant σϖ

Parallax (mas)

Pro

babi

lity

minus002 002 006 010

05

1015

20Uniform ϖTrue and constant fTrue

Parallax (mas)

Den

sity

minus005 000 005 010 015 020

GUMS simulation

Parallax (mas)

Den

sity

minus1 0 1 2

01

23

45

Fig 10 Histograms of true (grey) and observed (blue) parallaxes for three simulations uniform distribution of parallaxes and constant σ$ = 03mas (left) uniform distribution of parallaxes and constant f = 05 (centre) and the GUMS simulation described in Appendix A (right)

the probability distribution for the true parallax in such a trun-cated constant volume density scenario is proportional to

p($True) prop $minus4True (14)

out to the truncation radius Hence for Gaussian distributed un-certainties we can write p($True|$) as

p($True|$) prop1σ$middot exp(

minus($ minus$True)2

2σ$) middot$minus4

True (15)

The joint distribution p($$True) (ie the non-normalisedposterior plotted as a 2D function of data parallax and param-eter $True) for this particular case of truncated uniform stellarvolume densities is depicted in Fig 11 together with the con-ditional distributions for particular values of $ and $True Itshows graphically the symmetry of the probability distributionp($|$True) (with respect to $True) and the bias and asymmetryof p($True|$)

Lutz amp Kelker (1973) obtain Eq (15) in their Sect ii underthe assumption of uniform stellar volume densities and constantfractional parallax uncertainties (constant f ) They discuss sev-eral distributions for different values of the ratio σ$$True Intheir Sect iii they use the expected value of the true parallaxgiven by the distribution p($True|$) in Eq (15) to infer the ex-pected value of the difference between the true absolute magni-tude MTrue and the value obtained with the naive inversion of theobserved parallax The expected value of this absolute magni-tude error is derived and tabulated for the distribution p($True|$)as a function of the fractional parallax uncertainty f This so-called LutzndashKelker correction is often applied to stellar samplesthat do not fulfil the assumptions under which it was derived be-cause the stellar volume density is far from uniform at scaleslarger than a few tens of parsecs and the samples to which thecorrection is applied are never characterised by a unique valueof f

p(ϖ

Tru

e|ϖ

)0

12

34

5

00 02 04 06 08 10

00 02 04 06 08 10

00

05

10

15

ϖTrue

ϖ

p(ϖ|ϖTrue)

piob

s

0 1 2 3 4

00

05

10

15

minus50

minus40

minus30

minus20

minus10

0

10

Fig 11 (Lower left) Joint probability distribution (in logarithmic scaleto improve visibility) for the random variables $ and $True in the sce-nario of a truncated uniform volume density The colour code is shownto the right of the lower right panel and white marks the region wherethe probability is zero (Lower right) Conditional probability distribu-tion of the observed parallax for $True=05 (Upper left) Conditionalprobability distribution for $True given an observed parallax $ Thefractional parallax uncertainty assumed for the computation of all prob-abilities is f = 02

35 Astrometry-based luminosity

An obvious way to avoid the problems associated with the naiveinversion of observed parallaxes (see Sect 341) is to remainin the space of parallaxes (as opposed to that of distances) in-sofar as this is possible One example of this approach is the

Article number page 9 of 20

AampA proofs manuscript no 32964_Arxiv

astrometry-based luminosity (ABL) method (Arenou amp Luri1999) originating from Malmquist (1920) The ABL methodconsists in substituting the absolute magnitudes by a proxy thatis linearly dependent on the parallax The original proposal was

aV equiv 1002MV = $10mV +5

5 (16)

and has been recently used to obtain maximum likelihood esti-mates of the period-luminosity relation coefficients for Cepheidsand RR Lyrae stars (Gaia Collaboration et al 2017a) and to im-prove the Gaia parallax uncertainties using deconvolved colour-magnitude diagrams as prior (Anderson et al 2017) The newastrometry-based luminosity depends linearly on the parallaxand thus its uncertainty can be expected to have an approxi-mately Gaussian distribution if the fractional uncertainty of theapparent magnitude is negligible This is more often the casethan for fractional parallax uncertainties and is in general a goodapproximation

Unfortunately the astrometry-based luminosity can only beapplied to the study of the luminosity and can do nothing forthe analysis of spatial distributions where distances or tangentialvelocities are inevitable ingredients

4 Recommendations for using astrometric data

In this section we provide specific advice on the use of astro-metric data in astronomical data analysis Although the focus ison the use of Gaia data many of the recommendations hold forthe analysis of astrometric data in general To illustrate the rec-ommendations we also provide a small number of worked ex-amples ranging from very basic demonstrations of the issuesmentioned in Sect 3 to full Bayesian analyses Some of theseexamples are available in the Gaia archive tutorial described inSect 5

41 Using Gaia astrometric data how to proceed

The fundamental quantity sought when measuring a stellar par-allax is the distance to the star in question However as discussedin the previous sections the quantity of interest has a non-linearrelation to the measurement r = 1$True and is constrainedto be positive while the measured parallax can be zero or evennegative Starting from a measured parallax which is normallydistributed about the true parallax this leads to a probability den-sity for the simple distance estimator ρ = 1$ (see Sect 3) forwhich the moments are defined in terms of unknown quantitiesThis means we cannot calculate the variance of the estimator orthe size of a possible bias which renders the estimator uselessfrom the statistical point of view

Our first and main recommendation is thus to alwaystreat the derivation of (astro-)physical parameters from as-trometric data in particular when parallaxes are involvedas an inference problem which should preferably be handledwith a full Bayesian approach

411 Bayesian inference of distances from parallaxes

The Bayesian approach to inference involves estimating a PDFover the quantity of interest given the observables In this casewe want to estimate the distance r given the parallax $ Afuller treatment of this problem has been given in Bailer-Jones

(2015) so only a brief summary will be given here Using Bayesrsquotheorem we can write the posterior as

P(r | $) =1Z

P($ | r)P(r) (17)

Formally everything is also conditioned on the parallax uncer-tainty σ$ and any relevant constraints or assumptions but sym-bols for these are omitted for brevity The quantity P($ | r) isthe likelihood from Eq (12) The prior P(r) incorporates ourassumptions and Z is a normalisation constant

In addition to the likelihood there are two important choiceswhich must be made to estimate a distance the choice of priorand the choice of estimator We will first focus on the formerand start discussing the simplest prior the uniform unboundedprior With a uniform boundless (and thus improper) prior ondistances the posterior is proportional to the likelihood and ifwe choose the mode of the posterior as our estimator then thesolution is mathematically equivalent to maximising the likeli-hood However a boundless uniform prior permits negative dis-tances which are non-physical so we should at least truncate itto exclude these values

The more measurements we have or the more precise themeasurements are the narrower the likelihood and the lower theimpact of the prior should be on the posterior It turns out how-ever that with the unbounded uniform prior the posterior is im-proper ie it is not normalisable Consequently the mean andmedian are not defined The only point estimator is the modeie rest = 1$ (the standard deviation and the other quantilesare likewise undefined) which is rather restrictive Finally thisBayesian distance estimate defined for an unbounded uniformprior reduces to the maximum likelihood estimate and coincideswith the naive inversion discussed in Sect 32 The posterioris ill-defined for the unbounded uniform prior for parallaxesThis prior describes an unrealistic situation where the observer isplaced at the centre of a distribution of sources that is sphericallysymmetric and the volume density of which decreases sharplywith distance

The solution to these problems (non-physical distances im-proper posterior) is to use a more appropriate prior The proper-ties of various priors and estimators have been studied by Bailer-Jones (2015) and Astraatmadja amp Bailer-Jones (2016b) The lat-ter makes a detailed study using a Milky Way model for a priorand also investigates how the estimates change when the Gaiaphotometric measurements are used in addition to the parallaxOne of the least informative priors we can use is the exponen-tially decreasing space density prior

P(r) =

1

2L3 r2eminusrL if r gt 0

0 otherwise (18)

For distances r L this corresponds to a constant space densityof stars with the probability dropping exponentially at distancesmuch larger than the mode (which is at 2L) Examples of theshape of the posterior for parallaxes of different precisions areshown in Bailer-Jones (2015) and Astraatmadja amp Bailer-Jones(2016b)

The posterior obtained for the prior defined in Eq (18) is nor-malised and thus we have a choice of point estimators (meanmedian or mode) Also the distribution is asymmetric andtwo quantiles (5 and 95) rather than the standard deviationare recommended to summarise the uncertainty in the point es-timate The median as a point estimate is guaranteed to lie

Article number page 10 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

between these quantiles Astraatmadja amp Bailer-Jones (2016b)used this prior as well as a Milky Way prior to infer distancesfor the two million TGAS stars in the first Gaia data releaseThe behaviour of the estimates derived from the exponentiallydecreasing space density prior can be explored using the interac-tive tool available in the tutorial described in Sect 51

In general the introduction of reasonable prior probabilitiesaccounts for the LutzndashKelker bias although the inevitable mis-match between the true distribution of parallaxes and the priorused will result in less accurate inferences In any case the ad-vantage with respect to the methods discussed in Sect 3 is cleari) we do not need to tabulate corrections for each prior assumingconstant f ii) we do not need to dispose of non-positive par-allaxes iii) we obtain a proper full posterior distribution withwell-defined moments and credible intervals iv) even simplepriors such as the exponential decreasing volume density willimprove our estimates with respect to the unrealistic prior un-derlying the maximum likelihood solution rest = 1$ and fi-nally v) we obtain estimators that degrade gracefully as the dataquality degrades and with credible intervals that grow with theobservational uncertainties until they reach the typical scales ofthe prior when the observations are non-informative These ad-vantages come at the expense of an inference that is more com-putationally demanding in general (as it requires obtaining theposterior and its summary statistics if needed) the need for athoughtful choice of a prior distribution and the analysis of theinfluence of the prior on the inference results

Figure 12 shows the distribution of means (left) modes (cen-tre) and medians (right) of the posteriors inferred for a simu-lation of 105 sources drawn from an exponentially decreasingspace density distribution This simulation represents the un-likely case where the prior is a perfect representation of the truedistribution

From a Bayesian perspective the full posterior PDF is thefinal result of our inference if we only use parallax measure-ments to infer the distance (see below) and further analysesshould make use of it as a whole This is our recommendationin general avoid expectations and summaries of the posteriorHowever it is often useful to compute summary statistics suchas the mean (expectation) median mode quantiles etc to havean approximate idea of the distribution but we should not usethese summaries for further inference for example to estimateabsolute magnitudes Cartesian velocities etc We recommendinferring the full posterior distributions for these derived quanti-ties using the posterior of the true parallax or of the distance orusing the same Bayesian scheme as for the true parallax as ex-plained in Sect 42 In Fig 13 we show the values of the mean(left) mode (centre) and median (right) that we would obtainfrom a set of 104 simulated observations of a star at 100 parsecswith f = 02 We assume a Gaussian distribution of the obser-vations around the true parallax The posterior distribution is in-ferred using Eq (17) and two priors a uniform volume densityof sources truncated at 1 kpc (results in grey) and a uniform den-sity of sources multiplied by an exponential decay of length scale200 pc as defined in Eq (18) (in blue) The expectation valuesof the histograms are shown as dashed lines of the same colourwith the true value (100 pc) shown as a red dashed line We seein general that i) the truncation has the effect of increasing thenumber of overestimated distances ii) the three estimators arebiased towards larger distances (smaller parallaxes) but the ex-pectation of the mode is significantly closer to the true valueand iii) the abrupt truncation of the prior results in a spuriouspeak of modes at the truncation distance as already discussed inBailer-Jones (2015)

Figure 14 and Table 1 show a comparison of the absolutevalue of the empirical bias and standard deviation associatedwith some distance estimators discussed in this paper as a func-tion of the measured fractional uncertainty in the parallax Wechose the measured value even though it is a very poor and non-robust estimator because as stated in Sect 32 we never have ac-cess to the true fractional parallax uncertainty This figure showsthe results obtained for 107 sources in the Gaia DR2 simulationdescribed in Appendix A for the maximum likelihood estimateρ = 1

$with and without the SmithndashEichhorn correction and for

the mode estimates based on the posterior distribution for twopriors (a uniform distance prior UD with maximum distancerlim = 100 kpc and an exponentially decreasing space densityprior EDSD with L = 135 kpc) neither of which matches thetrue distribution of sources in the simulation Only the mode ofthe posteriors is plotted (but not the mean or the median) for thesake of clarity The conclusions described next are only valid un-der the conditions of the exercise and are provided as a demon-stration of the caveats and problems described in previous sec-tions not as a recommendation of the mode of the posterior in-ferred under the EDSD prior as an estimator At the risk of re-peating ourselves we emphasise the need to adopt priors adaptedto the inference problem at hand Also the conclusions only holdfor the used simulation (where we generate the true distances andhence can calculate the bias and standard deviation) and need notbe representative of the true performance for the real Gaia dataset They can be summarised as follows

ndash the mode of the EDSD prior shows the smallest bias andstandard deviation in practically the entire range of estimatedfractional parallax uncertainties (in particular everywherebeyond the range of fapp represented in the plot)

ndash the SmithndashEichhorn estimate shows pathological biases andstandard deviations in the vicinity of the supposedly best-quality measurements at fapp = 0 Away from this region itprovides the next less biased estimates (averaged over binsof fapp) after the mode of the EDSD posterior

412 Bayesian inference of distances from parallaxes andcomplementary information

The methodology recommended in the previous paragraphs isuseful when we only have observed parallaxes to infer distancesThere are however common situations in astronomy where theparallaxes are only one of many observables and distances arenot the final goal of the inference problem but a means to achieveit In this context we recommend an extension of the classicalBayesian inference methods described in the previous sectionThese problems are characterised by a set of observables andassociated uncertainties (that include but are not restricted toparallaxes) and a series of parameters (the values of which areunknown a priori) with complex interdependence relationshipsamongst them Some of these parameters will be the ultimategoal of the inference process Other parameters do play an im-portant role but we are not interested in their particular valuesand we call them nuisance parameters following the literatureFor example in determining the shape of a stellar associationthe individual stellar distances are not relevant by themselvesbut only insomuch as we need them to achieve our objectiveWe show below how we deal with the nuisance parameters Theinterested reader can find applications of the methodology de-scribed in this section to inferring the coefficients of period-luminosity relations in Gaia Collaboration et al (2017a) andSesar et al (2017) Also the same methodology (a hierarchical

Article number page 11 of 20

AampA proofs manuscript no 32964_Arxiv

Mean

Cou

nts

minus8000 minus4000 0 2000

0e+

002e

minus04

4eminus

046e

minus04

8eminus

04

Mode

Cou

nts

minus8000 minus4000 0 2000

Median

Cou

nts

minus8000 minus4000 0 2000

0e+

002e

minus04

4eminus

046e

minus04

8eminus

04

Fig 12 Probability distribution of the residuals of the Bayesian estimate of the true distance in parsecs for 100000 simulated stars drawn from auniform density plus exponential decay distribution and σ$ = 3 middot 10minus1 mas (orange) 3 mas (blue) and 30 mas (grey)

Mean

Cou

nts

0 200 400 600 800 1000

000

00

005

001

00

015

002

0

Mode

Cou

nts

0 200 400 600 800 1000

Median

Cou

nts

0 200 400 600 800 1000

000

00

005

001

00

015

002

0

Fig 13 Distributions of means (left) modes (centre) and medians (right) for a series of 104 posteriors calculated for a star at a true distance of100 pc and f = 02 and observed parallaxes drawn at random from the corresponding Gaussian distribution Posteriors are inferred with a uniformdensity prior truncated at 1 kpc (grey) or with a uniform density with exponential decay prior and length scale 135 kpc (blue) The red vertical linemarks the true parallax the grey and blue lines correspond to the expected value (mean) of each distribution (same colours as for the histograms)

Bayesian model) is applied in Hawkins et al (2017) where theconstraint on the distances comes not from a period-luminosityrelation but from the relatively small dispersion of the abso-lute magnitudes and colour indices of red clump stars A lastexample of this methodology can be found in Leistedt amp Hogg(2017b) where the constraint comes from modelling the colour-magnitude diagram

Just as in the previous section where we aimed at estimat-ing distances from parallaxes alone the two key elements in thiscase are the definitions of a likelihood and a prior The likeli-hood represents the probability distribution of the observablesgiven the model parameters Typically the likelihood is basedon a generative or forward model for the data Such modelspredict the data from our assumptions about the physical pro-cess that generates the true values (ie the distribution of stars in

space) and our knowledge of the measurement process (eg jus-tifying the assumption of a normal distribution of the observedparallax around its true value) Forward models can be used togenerate arbitrarily large synthetic data sets for a given set ofthe parameters In this case however where we are concernedwith several types of measurements that depend on parametersother than the distance the likelihood term will be in generalmore complex than in Sect 411 and may include probabilisticdependencies between the parameters The term hierarchical ormulti-level model is often used to refer to this kind of model

Let us illustrate the concept of hierarchical models with asimple extension of the Bayesian model described in Sect 411where instead of assuming a fixed value of the prior length scaleL in Eq (18) we make it another parameter of the model and tryto infer it Let us further assume that we have a set of N parallax

Article number page 12 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Table 1 Average bias and standard deviation in three regimes of fapp for four distance estimators discussed in this paper (from left to right) themode of the posterior based on the exponentially decreasing space density (EDSD) prior the mode of the posterior of the uniform distance (UD)distribution the maximum likelihood estimate corrected according to Smith amp Eichhorn (1996) abbreviated as SE and the maximum likelihood(ML) estimate The wider ranges of fapp exclude narrower ranges shown in previous rows of the table

Summary fapp Range EDSD UD SE ML

Bias(-11) -02 97 342 -095(-55) -03 107 -034 -12(-5050) -03 162 -04 -38

Std Deviation(-11) 04 80 6858 05(-55) 04 84 05 195(-5050) 04 106 04 171

fapp

log 1

0(|B

ias|

)

minus4

minus2

02

4

Mode EDSDMode UDSmithminusEichhorn estimateML estimate

fapp

log 1

0(σ)

minus1

12

34

5

minus5 minus3 minus1 1 3 5

Mode EDSDMode UDSmithminusEichhorn estimateML estimate

Fig 14 Bias (top) and standard deviation (bottom) averaged over binsof the estimated fractional parallax uncertainty fapp for four estimatorsof the distance the maximum likelihood (ML) estimator rest = 1

$(or-

ange) the ML estimator corrected as described in Smith amp Eichhorn(1996) (light blue) the mode of the posterior obtained with an expo-nentially decreasing space density (EDSD) prior and L = 135 kpc (darkgreen) and the mode of the posterior obtained with a uniform distance(UD) prior truncated at 100 kpc (red)

measurements $k one for each of a sample of N stars In thiscase the likelihood can be written as

p($k|rk L) = p($k | rk) middot p(rk | L)

= p($k | rk) middotNprod

k=1

p(rk | L) (19)

where rk is the true unknown distance to the kth star We note thatvery often the measured parallaxes are assumed independentand thus p($k | rk) is written as the product

prodNk=1 p($k | rk)

This is incorrect in general for Gaia parallaxes because the par-allax measurements are not independent As described in GaiaCollaboration et al (2018c) and Sect 2 of this paper there areregional correlations amongst them (see Sect 43) but for thesake of simplicity let us assume the sample of N measurementsis spread all over the celestial sphere such that the correlationsaverage out and can be neglected Hence we write

p($k|rk L) =

Nprodk=1

p($k | rk) middot p(rk | L) (20)

Under the assumption of Gaussian uncertainties the first termin the product is given by Eq (6) while the second is given byEq (18)

This likelihood can be represented by a simple directed graph(see Fig 15) that provides information about the conditional de-pendencies amongst the parameters The shaded nodes representthe observations the open circles represent model parametersand the small black circles at the origin of the arrows representmodel constants The arrows denote conditional dependence re-lations and the plate notation indicates repetition over the mea-surements k

The next key element is as in Sect 411 the prior Accord-ing to Fig 15 the only parameter that needs a prior specificationis the one without a parent node L The rest of the arcs in thegraph are defined in the likelihood term (Eq (20)) If the sampleof N stars were representative of the inner Galactic halo for ex-ample we could use a Gaussian prior centred at asymp 30 kpc (seeeg Iorio et al 2018 and references therein) Such a hierarchicalmodel can potentially shrink the individual parallax uncertaintiesby incorporating the constraint on the distribution of distances

If we are only interested in the individual distances rk wecan consider L as a nuisance parameter

p($Truek | $k) =

intp($Truek L | $k) middot dL (21)

=

intp($Truek | $k L) middot p(L | $k) middot dL

This integral (known as the marginalisation integral) allows us towrite the posterior we are interested in without having to fix thevalue of L to any particular value Depending on the objective ofthe inference we could have alternatively determined the poste-rior distribution of L by marginalising the individual distanceswith an N-dimensional integral over the rk parameters

Article number page 13 of 20

AampA proofs manuscript no 32964_Arxiv

L

rk

$ σ$

microL

σL

k = 1 2 N

Fig 15 Directed acyclic graph that represents a hierarchical Bayesianmodel of a set of N parallax measurements characterised by uncertain-ties σ$ and true distances drawn from an exponentially decreasing den-sity distribution of distances (see Eq (18)) The scale length of the ex-ponential decrease L is itself a model parameter that we can infer fromthe sample Its prior is defined in this case for the sake of simplicity asa Gaussian distribution of mean microL and standard deviation σL

In parameter spaces of dimensionality greater than 3ndash4 thecomputation of the possibly marginalised posteriors andor ev-idence requires efficient sampling schemes like those inspiredin Markov chain Monte Carlo (MCMC) methods to avoid largenumbers of calculations in regions of parameter space with neg-ligible contributions to the posterior This adds to the highercomputational burden of the Bayesian inference method men-tioned in the previous section

The previous simple example can be extended to includemore levels in the hierarchy and more importantly more pa-rameters and measurement types Section 55 and 56 developin greater detail two examples of hierarchical models of directapplicability in the Gaia context

42 Absolute magnitudes tangential velocities and otherderived quantities

The approaches described in the previous sections can be appliedto any quantity we want to estimate using the parallax For exam-ple if we want to infer the absolute magnitude MG then giventhe measured apparent magnitude G and line-of-sight extinctionAG the true parallax $True is related to MG via the conservationof flux

5 log$True = MG + AG minusG minus 5 (22)

Assuming for simplicity that G and AG are known Bayesrsquotheorem gives the posterior on MG as

P(MG | $G AG) =1Z

P($ | MG AGG)P(MG) (23)

where the likelihood is still the usual Gaussian distribution forthe parallax (Eq (6)) in which the true parallax is given byEq (22) As this expression is non-linear we again obtain anasymmetric posterior PDF over MG the exact shape of whichalso depends on the prior

The inference of other quantities can be approached in thesame way In general we need to consider a multi-dimensionallikelihood to accommodate the measurement uncertainties (andcorrelations) in all observed quantities For instance the set ofparameters θ = r v φ (distance tangential speed and direc-tion measured anticlockwise from north) can be inferred fromthe Gaia astrometric measurements o = $ microαlowast microδ (where microαlowastand microδ are the measured proper motions) using the likelihood

p(o | θ) = N(θΣ) =1

(2π)32|Σ|12exp

(minus

12

(o minus x)T Σminus1(o minus x))

(24)

where N denotes the Gaussian distribution Σ is the full (non-diagonal) covariance matrix provided as part of the Gaia DataRelease and

x =

(1r

v sin(φ)r

v cos(φ)

r

)(25)

is the vector of model parameters geometrically transformed intothe space of observables in noise-free conditions Equation (24)assumes correlated Gaussian uncertainties in the measurements

The posterior distribution can then be obtained by multiply-ing the likelihood by a suitable prior The simplest assumptionwould be a separable prior such that p(θ) = p(r) middot p(v) middot p(φ)where p(v) and p(φ) should reflect our knowledge about thedynamical properties of the population from where the sourceor sources were drawn (eg thin disk thick disk bulge halo)Again hierarchical models can be used in the analysis of sam-ples in order to infer the population properties (prior hyper-parameters) themselves

Similar procedures can be followed to infer kinematic ener-gies or full 3D velocities when the forward model is extendedwith radial velocity measurements

43 Further recommendations

In this subsection we provide some further recommendationsand guidance in the context of the Bayesian approach outlinedabove Although powerful inference with Bayesian methodsusually comes at a large computational cost Some of the rec-ommendations below can also be seen in the light of taking dataanalysis approaches that approximate the Bayesian methodologyand can be much faster

Where possible formulate the problem in the data spaceThe problems caused by the ill-defined uncertainties on quan-tities derived from parallaxes can be avoided by carrying outthe analysis in the data space where the behaviour of the un-certainties is well understood This means that the quantities tobe inferred are treated as parameters in a forward or generativemodel that is used to predict the data Some adjustment processthen leads to estimates of the parameters A very simple forwardmodelling example can be found in Schroumlder et al (2004) whostudied the luminosity calibrations of O-stars by predicting the

Article number page 14 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

expected Hipparcos parallaxes from the assumed luminosity cal-ibration and comparing those to the measured parallaxes A morecomplex example can be found in Lindegren et al (2000) whopresent a kinematic model for clusters which describes the ve-locity field of the cluster members and predicts the proper mo-tions accounting for the astrometric uncertainties and covari-ances As shown in previous sections the Bayesian approachnaturally lends itself to (and in fact requires) forward modellingof the data

Forward modelling has the added advantage that it forces usto consider the proper formulation of the questions asked fromthe astrometric data This naturally leads to the insight that oftenthe explicit knowledge of the distances to sources is not of inter-est For example in the Schroumlder et al (2004) case an assumedluminosity of the O-stars and their known apparent magnitudeis sufficient to predict the observed parallaxes In more complexanalyses the distances to sources can often be treated as nuisanceparameters which in a Bayesian setting can be marginalised outof the posterior

Use all relevant information Although the parallax has a di-rect relation to the distance of a star it is not the only measure-ment that contains distance information The apparent magni-tude and the colour of a star carry significant information on theplausible distances at which it can be located as the colour pro-vides information on plausible absolute magnitude values for thestar This is used in two papers (Leistedt amp Hogg 2017a An-derson et al 2017) in which the information contained in thephotometry of stars is combined with Gaia DR1 parallax in-formation to arrive at more precise representations of the colour-magnitude diagram than can be obtained through the parallaxesalone Similarly proper motions contain distance informationwhich can be used to good effect to derive more precise distancesor luminosities for open cluster members (eg de Bruijne et al2001 Gaia Collaboration et al 2017b) The Bayesian approachnaturally allows for the combination of different types of data inthe modelling of the problem at hand It should be emphasisedhowever that adding additional data necessitates increasing themodel complexity (eg to include the relation among apparentmagnitude absolute magnitude extinction and parallax) whichleads to the need to make more assumptions

Incorporate a proper prior Bailer-Jones (2015) discussed thesimplest case of inferring the distance to a source from a sin-gle observed parallax He showed that if the minimal prior thatr should be positive is used the resulting posterior is not nor-malisable and hence has no mean variance other momentsor percentiles This behaviour of the posterior is not limitedto inference of distance Examples of other quantities that arenon-linearly related to parallax are absolute magnitude (M prop

log10 $) tangential velocity (vT prop 1$) kinetic energy and an-gular momentum (both proportional to 1$2 when determinedrelative to the observer) In all these cases it can be shown thatthe posterior for an improper uniform prior is not normalisableand has no moments The conclusion is that a proper prior onthe parameters to be estimated should be included to ensure thatthe posterior represents a normalised probability distribution Ingeneral using unconstrained non-informative priors in Bayesiandata analysis (such as the one on r above) is bad practice In-evitably there is always a mismatch between the prior and thetrue distribution (if there were not there would be no need todo the inference) This will unavoidably lead to some biases al-though the better the data the smaller these will be We cannot

expect to do much better than our prior knowledge in case weonly have access to poor data The Bayesian approach guaran-tees a graceful transition of the posterior into the prior as thedata quality degrades

The above discussion raises the question of what priors to in-clude for a specific inference problem Simple recipes cannot begiven as the prior to be used depends on the problem at hand andthe information already available When deciding on a prior weshould keep in mind that some information is always availableEven if a parallax is only available for a single star we knowthat it cannot be located at arbitrary distances It is unlikely tobe much closer that 1 pc (although we cannot fully exclude thepresence of faint stars closer than Proxima Centauri) and it mustbe located at a finite distance (we can observe the star) Thiswould suggest a non-informative uniform prior on r with lib-eral lower and upper bounds (such that the prior is normalised)However as pointed out in Bailer-Jones (2015) a uniform distri-bution in r implies a space density of stars that falls of as 1r2

from the position of the Sun which is of course physically im-plausible Hence one should assume a reasonable distribution ofstars in space and construct the prior on r accordingly Bailer-Jones (2015) presents a prior derived from a uniform space den-sity with an exponential cut-off which was used in Astraatmadjaamp Bailer-Jones (2016b) to derive distances to stars for which par-allaxes are listed in Gaia DR1 This prior should not be usedindiscriminately at the very least we should carefully considerthe choice of the scale length L (or leave that as a parameter tobe estimated as described in Sect 412) and in most cases amore tailored prior based on our broad knowledge of the distri-bution of a given stellar population in the Milky Way would bebetter The tutorial cases introduced in the next section containsome more examples of priors on distance and other astrophysi-cal parameters

The next two items discuss simplifications to the Bayesianapproach that nevertheless need to be justified carefully

Maximum likelihood modelling We have seen that priors arethe bridges that allow us to go from the probability of the obser-vations given the unknown parameters to the desired probabil-ity of the parameters given the observations It is only based onthis probability distribution that we can make statements aboutcredible intervals (uncertainties) for the inferred quantities orselect amongst competing models (a topic that is beyond thescope of this paper) However if making prior-free inferences ispreferred then maximising the likelihood is the only alternativeThe kinematic modelling presented in Lindegren et al (2000)is a non-trivial example of this A more complex example canbe found in Palmer et al (2014) The ML approach just as theBayesian framework described above allows the combinationof different types of data and accounts for selection functionsor missing data We have seen in Sect 411 that the maximumlikelihood estimate of the distance given a single parallax mea-surement is ρ = 1

$and this is a poor estimator of the distance

except for subsets of very accurate measurements In generalthe Bayesian and the maximum likelihood estimates coincide inthe limit of very small uncertainties or infinite numbers of mea-surements In such limits the maximum likelihood estimate issimpler to obtain although its computational cost may still belarge as the ML method is often equivalent to a complex optimi-sation problem involving a multi-dimensional function of manyparameters

Article number page 15 of 20

AampA proofs manuscript no 32964_Arxiv

Selecting the lsquobestrsquo data Analyses that use parallax data areoften restricted to positive parallaxes with relative uncertaintiesbelow some limit (typically 20 ) This allows working in aregime where the uncertainties of derived quantities such as dis-tance tangential velocities luminosity etc are thought to bemanageable which allows working in the space of astrophysi-cal variables rather than the data Truncation on relative parallaxerror might be justified in an exploratory phase of the data anal-ysis however there are a number of reasons why this approachis almost never advisable Even at relative uncertainties below02 the quantities calculated from the parallax can be biased andsuffer from a large variance (see Bailer-Jones 2015) More im-portantly however the selection of lsquogoodrsquo parallaxes will biasthe sample studied to nearby andor bright representatives of anystellar population and the selection may lead to discarding avery large fraction of the potential sample Hence any inferencesbased on such data will be severely biased This is clearly illus-trated in Fig 7 where for an even less strict truncation of starswith a relative uncertainties below 50 the distribution of dis-tances of the resulting sample is clearly biased with respect theoriginal sample

Accounting for data selection and incompleteness Althoughthe Gaia survey is designed such that the only selection ofsources is that they are brighter than the survey limit at G = 207the combination of the onboard detection algorithm the Gaiascanning law and the filtering of results of insufficient qualityprior to a data release lead to a complex selection function espe-cially in the early data releases This selection function should betaken into account in any data analysis and this is most naturallydone as part of a Bayesian analysis where the selection func-tion is part of the forward model that predicts the data given themodel parameters Precise prescriptions of the selection func-tions are not foreseen to be part of the data release documenta-tion Hence selection function parameters need to be includedas part of the parameters inferred by the Bayesian analysis or ifthis is not possible the selection functions have to be borne inmind when interpreting the results

Covariances in the uncertainties All the uncertainties on theastrometric data quoted in the Gaia catalogue are presented asfull covariance matrices where the diagonal elements representthe standard uncertainties on the astrometric parameters and theoff-diagonal elements the covariances or correlations betweenthe uncertainties This amounts to treating the astrometric datavector as having been drawn from a multivariate normal distri-bution with a covariance matrix as given in the Gaia catalogueThe covariances are most easily handled in the data space as partof the likelihood (see Lindegren et al 2000 for an example inthe context of kinematic modelling) If the covariances in theastrometric uncertainties are not accounted for we can easily bemisled for example by spurious features in the inferred velocityfield of an open cluster (Brown et al 1997)

The uncertainties are also correlated from one source tothe next especially over small distances on the sky A studyof the star-to-star correlations in the parallax uncertainties inGaia DR1 was done for the Kepler field (Zinn et al 2017)where independent and precise asteroseismic distances to thestars are available enabling the authors to derive an expressionfor the correlation strength and spatial scale This expressioncan be used for studies of the Kepler field but care should betaken when extrapolating to other fields on the sky The func-tional form for the star-to-star correlations used by Zinn et al

(2017) could be introduced as part of the forward model withthe Zinn et al (2017) parameters as a good first guess

For Gaia DR1 the length scale for the star-to-star corre-lations was estimated to vary from subdegree scales to tens ofdegrees on the sky where Zinn et al (2017) derived the corre-lation function over length scales of sim 02 to sim 10 degrees ForGaia DR2 Gaia Collaboration et al (2018c) estimate that thespatial correlations extend over scales of below 1 degree to 10ndash20 degrees

Accounting for non-Gaussian andor systematic uncertain-ties Although the bulk of the sources in the Gaia catalogue havenormally distributed uncertainties there is a significant fractionfor which the uncertainties exhibit non-Gaussian behaviour (egwhen uncertainties are over- or underestimated) This can be ac-counted for in the data analysis by including the uncertainties aspart of the forward model (eg Sesar et al 2017) or by explicitlymodelling an outlier population Similarly systematic uncertain-ties can be included as part of the forward model For exampleSesar et al (2017) include a global parallax zero-point as part oftheir probabilistic model used to analyse the period-luminosityrelation of RR Lyrae stars with Gaia DR1 data An alternativeapproach to the investigation of systematics in the parallaxes (ordistance estimates obtained from photometry or spectroscopyfor example) is presented in Schoumlnrich et al (2012) and is ap-plied to Gaia DR1 in Schoumlnrich amp Aumer (2017) In this casewe can consider that for samples covering a significant fractionof the sky any systematic error in the estimated distances to thestars will show up as correlations in their 3D velocity compo-nents The presence of such correlations can be used to make in-ferences about systematic errors for example in the parallaxes

Systematic uncertainties are more difficult to handle as theymay show variations as a function of source brightness or ce-lestial position they may be correlated across neighbouringsources and they may not be well understood for a given earlyGaia data release In general the information needed to accu-rately model systematic uncertainties or uncertainty correlationsbetween sources may not be readily available This informationcan be obtained from a comparison between Gaia and otherhigh-precision data (eg Zinn et al 2017 Arenou et al 2017Gaia Collaboration et al 2018a) or by examining for exampleplots of the parallax or proper motion averaged over sky regionsfor samples where the true parallax or proper motion values canbe assumed to be known such as zero parallax and proper mo-tion for quasars (see Gaia Collaboration et al 2018c for exam-ples)

Two special cases should be mentioned when the sampleis well distributed over the sky we can safely assume that thelocal systematics vanish and that only the global parallax zero-point need to be subtracted locally we may be interested not bythe absolute value of the parallaxes but by the relative ones inwhich case the difference between parallaxes and their averageremoves part of the systematics

There is no general recipe for dealing with non-Gaussian un-certainties or correlated systematic uncertainties The main ad-vice we can give here is to proceed with the analysis of the astro-metric data as they are but to keep in mind the systematics andcorrelations discussed in Sect 2 when interpreting the resultsAlternatively the forward model can be extended to include sys-tematic and correlation terms for which parameters are also tobe estimated Such models can be guided by the studies of sys-tematic uncertainties mentioned above

Article number page 16 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Testing with simulations Finally we strongly advise that theinference problem at hand should be investigated through simu-lated data and that the simulations performed should be as closeas possible to the real data (in particular correctly modelling theuncertainties and selection effects) The simulations allow theanalysis method to be developed and tested for accuracy How-ever the performance of the analysis method should be inter-preted strictly in terms of how well the assumed model explainsthe simulated observed data That is we should not fall into thetrap of trying to tune the analysis method to get an answer that isas close to the lsquotruthrsquo as possible In real problems we can onlyjudge the adequacy of a model and its parameter values by howwell they predict the observed data (to within the observationaluncertainties it should be stressed as we should avoid lsquoover-fittingrsquo the data)

5 Using astrometric data practical examples

We introduce here a few worked examples to illustrate thepoints that were made in the previous section These ex-amples are available in full detail as online tutorials in theform of source code accompanied by much more exten-sive explanation than can be provided here The source codeand corresponding Python and R Notebooks can be foundat the following URL httpsgithubcomagabrownastrometry-inference-tutorials In all cases the readeris strongly encouraged to download the source code and exper-iment with modifications of the simulated input data andor theparameter choices in the inference methods

51 Comparison of distance and distance modulusestimators

The use of the Bayesian inference with non-informative priorsdescribed in Sect 411 is illustrated and implemented inthe following tutorial httpsgithubcomagabrownastrometry-inference-tutorialstreemastersingle-sourceGraphicalUserInterface The tutorialcompares the performance of Bayesian distance estimationmethods with the SmithndashEichhorn transformation (Smith ampEichhorn 1996) (Sect 341) and the naive parallax inversion

The tutorial contains a Graphical User Interface that easilyvisualises and compares the behaviour of all these estimators fora given parallax and uncertainty For the Bayesian inference es-timations using the mode and the median are provided togetherwith a 90 confidence interval The tutorial also provides a li-brary pyrallaxes with the implementation of all these estima-tors The library can easily be customised to implement otherpriors for the Bayesian inference

Additionally an implementation of the Bayesian distanceestimator using the Exponentially Decreasing Space Densityprior introduced in Bailer-Jones (2015) will be available inTopCat (httpwwwstarlinkacuktopcat) and Stilts(httpwwwstarlinkacukstilts) from respectivelyversions 46 and 31-3 onwards

52 Inferring the distance to a single source using just theparallax

The issues surrounding the use of a parallax to infer a distancewere explored in Bailer-Jones (2015) and applied to simulatedGaia data in Astraatmadja amp Bailer-Jones (2016a) and to TGAS(Gaia DR1 ) in Astraatmadja amp Bailer-Jones (2016b) A

tutorial exploring this is provided at httpsgithubcomagabrownastrometry-inference-tutorialstreemastersingle-sourcetutorial This can be used toinvestigate how the posterior depends on the prior and the dataIt also includes a simple example of a hierarchical model toavoid specifying the exact length scale of a distance prior

53 Inferring the distance to and size of a cluster using justthe parallaxes

In many applications we are more interested in the averagedistance to a cluster (or a group of stars) rather than to theindividual stars In this case a basic mistake to be avoidedis estimating the individual distances (whatever the method)and then averaging these individual values A more correctapproach is to average the individual parallaxes and then toobtain a distance from this average value However a muchbetter solution is to set up a model for the cluster whichwould use as parameters the distance to its centre for exampleand some measure of its size and to infer its parametersThis is explored in the tutorial at httpsgithubcomagabrownastrometry-inference-tutorialstreemastermultiple-source This introduces the overall prob-lem and derives a general solution Code is implemented forthe specific case of a model which assumes a random isotropicdistribution of the true stars from the centre of the cluster Thismodel has two parameters the cluster distance and the clustersize The solution uses a small angle approximation to makethe problem simpler although it is easily extended to the caseof clusters with a significant angular extent It is applied to thePleiades selection from the Gaia DR1 main release paper(Gaia Collaboration et al 2016) The tutorial also considers theproblem of how to accommodate correlations in the measuredparallaxes of different stars Finally it also shows the resultsfrom a classical and a naive combination of stellar parallaxesto estimate the cluster distance The combination of parallaxesand proper motions of individual stars in a cluster into a singlesolution for the mean parallax and proper motion is treated asan iterative least squares problem in Gaia Collaboration et al(2017b) (see their Appendix A for details)

54 Inferring the distance and velocity of a source using theparallax and proper motions

The velocity (speed and direction) of a source in the planeof the sky can be inferred from measurements of its parallaxand two proper motions The uncertainties in all three affectthe inferred velocity and its uncertainty Moreover as theGaia parallaxes and proper motions generally have non-zerocorrelations these must also be taken into account Thiscan be done in a straightforward manner in the Bayesianapproach as is shown in the tutorial at httpsgithubcomagabrownastrometry-inference-tutorialstreemaster3d-distance This sets up a three-parametermodel (distance speed angle) for a source Using the threemeasurements (parallax two proper motions) in a multivariateGaussian likelihood and suitable priors on the parameters wecan compute the trivariate posterior This is sampled in theposterior using an MCMC algorithm for a set of stars

Article number page 17 of 20

AampA proofs manuscript no 32964_Arxiv

55 Luminosity calibration

In this tutorial (httpsgithubcomagabrownastrometry-inference-tutorialstreemasterluminosity-calibration) the problem of inferring (orcalibrating) the mean absolute magnitude of a specific classof stars is treated The measurements at hand are the parallaxand apparent magnitude for each of the stars in the sampleand the task is to infer their mean absolute magnitude microM andthe spread σM around this mean This is very similar to theproblem that Lutz amp Kelker (1973) and Turon Lacarrieu ampCreacutezeacute (1977) treated and a Bayesian approach to solving thisproblem was presented by Brown (2012) (albeit with the use ofimproper priors which we again note is bad practice) A morecomplex version of this problem (accounting for extinction anda contaminating population of stars) and its Bayesian solutionwas presented in Hawkins et al (2017) In this tutorial threeimportant points are illustrated

ndash Often the explicit computation of the distances to stars is notof interest In this example only the mean absolute magni-tude of the stars is to be estimated and the forward mod-elling approach as part of the Bayesian inference avoids theneed to calculate or estimate distances

ndash The data for all the stars carry information on the mean ab-solute magnitude including the negative parallaxes or par-allaxes with large relative errors This information can nat-urally be incorporated in a forward modelling approach (inthis example as part of a Bayesian inference method) thusavoiding the introduction of truncation biases caused by theselection of stars with lsquogoodrsquo parallaxes

ndash If the selection function is known (in this example the surveyis magnitude limited) it can and should be included in theforward modelling This accounts for sample selection biasesthat would otherwise occur

56 Period-luminosity relation

In this tutorial (httpsgithubcomagabrownastrometry-inference-tutorialstreemasterperiod-luminosity-relation) we include a hierarchicalmodel to infer period-luminosity-metallicity relations forclassical pulsating stars The full model can be applied tofundamental mode RR Lyrae stars and the abridged version(without the metallicity dependence) is suitable for samples ofclassical Cepheids We include the data set for the RR Lyraestars described and used for inference in Gaia Collaborationet al (2017a) and Delgado et al (2018) It contains a sampleof 200 stars (including fundamental radial pulsators but alsofundamentalised first overtone pulsators) with measured pe-riods apparent magnitudes in the K-band metallicities andparallaxes from the TGAS catalogue In the tutorial we describethe hierarchical model and discuss potential biases in the dataset Finally we analyse the sensitivity of the results to differentchoices of the priors and related parameters

6 Conclusions

Gaia data releases will provide a huge increase of astrometricdata available for the scientific community More than a billionparallaxes and proper motions allow new openings into many as-tronomical topics In most cases astronomers are exploiting theGaia catalogues to obtain physical quantities such as distanceand velocity Although it is easy to extract from the Gaia data it

is well known that direct inversion of parallax will lead to biaseswhich become more and more significant the larger the relativeparallax uncertainty While Gaia will provide high-quality astro-metric measurements hundreds of millions of stars have preci-sions which require proper statistical treatment in order to avoidbiased conclusions The aim of this paper is to guide the users ofGaia data to handle astrometric data correctly

In this study we summarise methods used to avoid bi-ases when converting astrometric data into physical quantitiesStarting from simple non-recommended sample truncation tomore complex methods the biases associated with the meth-ods are presented The basic recommendation is to treat deriva-tion of physical quantities from astrometric measurements asan inference problem which should be preferably handled withBayesian approach The recommended methods are described inSect 4 with a summary in Sect 43 To aid the users furtherSect 5 contains practical examples with links to Python and Rcode

Gaia will provide fundamental data for many fields of as-tronomy Further data releases will provide more data and moreprecise data Nevertheless for full use of the potential it will al-ways be necessary to pay careful attention to the statistical treat-ment of parallaxes and proper motions The purpose of this paperis to help astronomers find the correct approachAcknowledgements This work has made use of results from the European SpaceAgency (ESA) space mission Gaia the data from which were processed by theGaia Data Processing and Analysis Consortium (DPAC) Funding for the DPAChas been provided by national institutions in particular the institutions partici-pating in the Gaia Multilateral Agreement The Gaia mission website is httpwwwcosmosesaintgaia The authors are current or past members of theESA Gaia mission team and of the Gaia DPAC This work was supported by theMINECO (Spanish Ministry of Economy) through grant ESP2016-80079-C2-1-R (MINECOFEDER UE) and ESP2014-55996-C2-1-R (MINECOFEDERUE) and MDM-2014-0369 of ICCUB (Unidad de Excelencia lsquoMariacutea de Maeztursquo)and by the DLR (German space agency) via grant 50 QG 1403

ReferencesAnderson L Hogg DW Leistedt B Price-Whelan AM Bovy J Jun 2017

ArXiv e-printsArenou F Luri X 1999 In Egret D Heck A (eds) Harmonizing Cosmic

Distance Scales in a Post-HIPPARCOS Era vol 167 of Astronomical Societyof the Pacific Conference Series 13ndash32

Arenou F Luri X Babusiaux C et al Feb 2017 AampA 599 A50Astraatmadja TL Bailer-Jones CAL Dec 2016a ApJ 832 137Astraatmadja TL Bailer-Jones CAL Dec 2016b ApJ 833 119Bailer-Jones CAL Oct 2015 PASP 127 994Bovy J Sep 2017 MNRAS 470 1360Brown AGA Nov 2012 Statistical AstrometryBrown AGA Arenou F van Leeuwen F Lindegren L Luri X Aug 1997

In Bonnet RM Hoslashg E Bernacca PL et al (eds) Hipparcos - Venice rsquo97vol 402 of ESA Special Publication 63ndash68

de Bruijne JHJ Hoogerwerf R de Zeeuw PT Feb 2001 AampA 367 111Delgado HE Sarro LM Clementini G Muraveva T Garofalo A Mar 2018

ArXiv e-printsEddington AS Mar 1913 MNRAS 73 359ESA (ed) 1997 The HIPPARCOS and TYCHO catalogues Astrometric and

photometric star catalogues derived from the ESA HIPPARCOS Space As-trometry Mission vol 1200 of ESA Special Publication (available online athttpswwwcosmosesainthipparcos)

Gaia Collaboration Brown AGA Vallenari A et al Nov 2016 AampA 595A2

Gaia Collaboration Clementini G Eyer L et al Sep 2017a AampA 605 A79Gaia Collaboration van Leeuwen F Vallenari A et al May 2017b AampA 601

A19Gaia Collaboration Arenou F Luri X et al Apr 2018a AampA (special issue

for Gaia DR2)Gaia Collaboration Brown AGA Vallenari A Prusti T et al Apr 2018b

AampA (special issue for Gaia DR2)Gaia Collaboration Lindegren L Hernaacutendez J et al Apr 2018c AampA (special

issue for Gaia DR2)

Article number page 18 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Hawkins K Leistedt B Bovy J Hogg DW Oct 2017 MNRAS 471 722Iorio G Belokurov V Erkal D et al Feb 2018 MNRAS 474 2142Leistedt B Hogg DW Dec 2017a AJ 154 222Leistedt B Hogg DW Mar 2017b ArXiv e-printsLindegren L Madsen S Dravins D Apr 2000 AampA 356 1119Lindegren L Lammers U Hobbs D et al Feb 2012 AampA 538 A78Lindegren L Lammers U Bastian U et al Nov 2016 AampA 595 A4Lutz TE Kelker DH Oct 1973 PASP 85 573Malmquist GK Mar 1920 Meddelanden fran Lunds Astronomiska Observato-

rium Serie II 22 3Palmer M Arenou F Luri X Masana E Apr 2014 AampA 564 A49Robin C Luri X Reyleacute C et al 2012 AampA 543 A100Schoumlnrich R Aumer M Dec 2017 MNRAS 472 3979Schoumlnrich R Binney J Asplund M Feb 2012 MNRAS 420 1281Schroumlder SE Kaper L Lamers HJGLM Brown AGA Dec 2004 AampA

428 149Sesar B Fouesneau M Price-Whelan AM et al Apr 2017 ApJ 838 107Smith H Jr Eichhorn H Jul 1996 MNRAS 281 211Trumpler RJ Weaver HF 1953 Statistical astronomyTuron Lacarrieu C Creacutezeacute M Apr 1977 AampA 56 273Zinn JC Huber D Pinsonneault MH Stello D Aug 2017 ApJ 844 166

Article number page 19 of 20

AampA proofs manuscript no 32964_Arxiv

Appendix A Description of the simulated samplesused in this paper

The data used in this paper is from a Gaia Universe Model Snap-shot (GUMS) simulation (Robin C et al 2012) together withthe Gaia-like uncertainties and an estimation of the observabledata The uncertainties were computed from an implementationof the recipes described in the Gaia Science performance webpage4 provided by the python PyGaia toolkit5 which then haveto be re-scaled to fit the Gaia DR2 expectations

The simulation contains around 109 sources including onlysingle stars ie stars not belonging to multiple systems up toG lt 20 magnitude distributed as shown in Fig A1

0 5 10 15 20 25G magnitude [mag]

100

101

102

103

104

105

106

107

108

109

Num

ber

of

sourc

es

[log]

G magnitude histogram

Fig A1 Histogram of the starsrsquo G magnitude The total number ofsources is 1 069 138 714 which are distributed in bins of size ∆G = 02the sample is limited to G lt 20

To compute the Gaia potential observables standard uncer-tainties must be added to the simulation Because astrometricquantities (positions parallaxes and proper motions) are relateda single formalism to derive its standard errors is needed PyGaiaimplements a simple performance model depending on the VndashICcolour term and the G magnitude to estimate the end-of-missionerrors for the parallax uncertainty

σ$[microas] =radicminus1631 + 680766 z + 32732 z2 [0986+(1 minus 0986) (V-IC)]

(A1)

with

z = max(1004(1209minus15)1004(Gminus15)) (A2)

(see the Gaia science performance web for more details) It alsotakes into account the variation of the uncertainties over the skybecause of the scanning law tabulated as a function of the eclip-tic latitude6 β

However these are end-of-mission uncertainties so theyhave to be scaled by the fraction of mission time completed in or-der to get an estimation of them for Gaia DR2 In the case of the

parallax only the factorradic

5L being L the mission time (years)

included in Gaia DR2 need to be applied this error model is

4 httpswwwcosmosesaintwebgaiascience-performance5 httpspypipythonorgpypiPyGaia6 httpswwwcosmosesaintwebgaiatable-2-with-ascii

further described in Arenou et al (2017) In our case we havealso updated the calibration floor to take into account the prop-erties of the Gaia DR2 formal errors as shown in Fig A2 Thiscalibration floor introduces a minimum formal error stemmingfrom the fact that the calibrations used in the data processing(models and parameters) are at this stage still being refined Thisfloor affects mainly bright stars while for faint stars the photonnoise dominates Fig A2 shows the model used in the simula-tion for the Gaia DR2 parallax uncertainties as a function of theG magnitude

Fig A2 Average estimated errors as function of G The dash-dottedline represents the uncertainty at Gaia DR2 while the dashed linerepresents the end-of-mission uncertainties The solid line representsGaia DR2 errors including systematics

Article number page 20 of 20

  • 1 Introduction
  • 2 Gaia astrometric data
    • 21 Uncertainties
    • 22 Correlations
    • 23 Systematic errors
    • 24 Completeness
      • 3 Critical review of the traditional use of parallaxes
        • 31 Measurement of parallaxes
        • 32 Estimating distance by inverting the parallax
        • 33 Sample truncation
        • 34 Corrections and transformations
          • 341 The SmithndashEichhorn correction
          • 342 The LutzndashKelker correction
            • 35 Astrometry-based luminosity
              • 4 Recommendations for using astrometric data
                • 41 Using Gaia astrometric data how to proceed
                  • 411 Bayesian inference of distances from parallaxes
                  • 412 Bayesian inference of distances from parallaxes and complementary information
                    • 42 Absolute magnitudes tangential velocities and other derived quantities
                    • 43 Further recommendations
                      • 5 Using astrometric data practical examples
                        • 51 Comparison of distance and distance modulus estimators
                        • 52 Inferring the distance to a single source using just the parallax
                        • 53 Inferring the distance to and size of a cluster using just the parallaxes
                        • 54 Inferring the distance and velocity of a source using the parallax and proper motions
                        • 55 Luminosity calibration
                        • 56 Period-luminosity relation
                          • 6 Conclusions
                          • A Description of the simulated samples used in this paper
Page 4: Gaia Data Release 2: Using Gaia parallaxes · The Gaia Data Release 2 (Gaia DR2 ) Gaia Collaboration et al. (2018b) provides precise positions, proper motions, and paral-laxes for

AampA proofs manuscript no 32964_Arxiv

negative parallaxes as illustrated in Fig 2 We note how in thesource model described in Eq (5) the parallax appears as thefactor minus$ in front of the barycentric position of the observerwhich means that for each source its parallactic motion on thesky will have a sense which reflects the sense of the motion ofthe observer around the Sun In the presence of large measure-ment noise (comparable to the size of the parallax) it is entirelypossible that the parallax value estimated for the source modelvanishes or becomes negative This case can be interpreted as themeasurement being consistent with the source going lsquothe wrongway aroundrsquo on the sky as shown in Fig 2

2 1 0 1 2 3 4 [mas]

8

6

4

2

0

2

[mas

]

True = 040 = 100 = 200Fit = 049 = 113 = 169Observations

2101234

[m

as]

2015 2016 2017 2018 2019Time [yr]

8

6

4

2

0

2

[mas

]

Fig 2 Example of a negative parallax arising from the astrometric dataprocessing Solid blue lines true path of the object red dots the in-dividual measurements of the source position on the sky dashed or-ange lines the source path according to the least-squares astrometricsolution which here features a negative parallax Left Path on the skyshowing the effect of proper motion (linear trend) and parallax (loops)Right Right ascension and declination of the source as a function oftime In the fitted solution the negative parallax effect is equivalent to ayearly motion of the star in the opposite direction of the true parallacticmotion (which gives a phase-shift of π in the sinusoidal curves in theright panels) The error bars indicate a measurement uncertainty of 07mas the uncertainties on ∆αlowast and ∆δ are assumed to be uncorrelated

This example is intended to clarify why parallaxes can havenon-positive observed values and more importantly to conveythe message that the parallax is not a direct measurement ofthe distance to a source The distance (or any other quantity de-pending on distance) has to be estimated from the observed par-allax (and other relevant information) taking into account theuncertainty in the measurement A simplified demonstration ofhow negative parallaxes arise (allowing the reader to reproduceFig 2) can be found in the online tutorials accompanying thispaper 3

32 Estimating distance by inverting the parallax

In the absence of measurement uncertainties the distance to astar can be obtained from its true parallax through r = 1$Truewith$True indicating the true value of the parallax Thus naivelywe could say that the distance to a star can be obtained by in-verting the observed parallax ρ = 1$ where now ρ is usedto indicate the distance derived from the observed value of theparallax For this discussion the observed parallax is assumed to

3 httpsgithubcomagabrownastrometry-inference-tutorialstreemasterluminosity-calibrationDemoNegativeParallaxipynb

be free of systematic measurement errors and to be distributednormally around the true parallax

p($ | $True) =1

σ$radic

2πexp

(minus

($ minus$True)2

2σ2$

) (6)

where σ$ indicates the measurement uncertainty on $ Blinduse of 1$ as an estimator of the distance will lead to unphys-ical results in case the observed parallax is non-positive Nev-ertheless we could still consider the use of the 1$ distanceestimate for positive values for instance a sample where mostor all of the observed values are positive or in the limiting casewhere there is a single positive parallax value In this case it iscrucial to be aware of the statistical properties of the estimate ρGiven a true distance r = 1$True what will be the behaviourof ρ We can obtain the probability density function (PDF) of ρfrom Eq (6) as

p(ρ | $True) = p($ = 1ρ | $True) middot∣∣∣∣∣d$dρ

∣∣∣∣∣=

1

ρ2σ$radic

2πexp

(minus

(1ρ minus$True)2

2σ2$

)(7)

In Fig 3 we depict p(ρ | $True) for two extreme cases of verylow and very high relative uncertainty The shape of p(ρ | $True)describes what we can expect when using ρ as an estimate ofthe true distance r The distribution of the figure on the left cor-responds to a case with a low fractional parallax uncertaintydefined as f = σ$$True It looks unbiased and symmetricalThus using ρ = 1$ to estimate the distance in a case like thisis relatively safe and would lead to more or less reliable resultsHowever in spite of its appearance the figure hides an intrin-sic non-Gaussianity that is made evident in the right-hand fig-ure This second plot corresponds to the case of high fractionalparallax uncertainty and the distribution shows several featuresfirst the mode (the most probable value) does not coincide withthe true distance value second the distribution is strongly asym-metric and finally it presents a long tail towards large values ofρ For more extreme values of f there is a noticeable negativetail to this distribution corresponding to the negative tail of theobserved parallax distribution

Fig 3 PDF of ρ = 1$ in two extreme cases The red vertical line indi-cates the true distance r Left Object at r = 100 pc with an uncertaintyon the observed parallax of σ$ = 03 mas Right Object at r = 2000 pcwith an uncertainty on the observed parallax of σ$ = 03 mas

Article number page 4 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

In view of Fig 3 it is tempting to apply corrections to theρ estimator based on the value of the fractional parallax uncer-tainty f Unfortunately in order to do so we would need to knowthe true value of the parallax and f Using the apparent fractionaluncertainty fapp = σ$$ is not feasible since the denominatorin f (the true parallax) can be very close to zero so its distribu-tion has very extended wings and using fapp will often result ingross errors

Furthermore reporting a ρ value should always be accompa-nied by an uncertainty estimate usually the standard deviationof the estimator but the standard deviation or the variance is de-fined in terms of an unknown quantity $True In addition thelong tail shown in the right panel of Fig 3 makes the estimatesof the variance quickly become pathological as discussed below

In order to clarify the previous assertions we recall the clas-sical concept of bias because it plays a central role in the discus-sion that develops in this section In statistics an estimator is saidto be biased if its expected value differs from the true value Inour case we aim to infer the true value of the parallax $True (oralternatively related quantities such as the true distance r ab-solute magnitude luminosity or 3D velocity components) andwe aim to infer it from the measured parallax In the Gaia casethis measured parallax will be affected by quasi-Gaussian uncer-tainties (see Sect 21) In this case the expectation value of theobserved parallax coincides with the true value

E[$] =

int$p($|$True)middotd$ =

int$N($$True σ$)middotd$ = $True

(8)

where N($$True σ$) represents the Gaussian probability dis-tribution centred at the true parallax and with a standard devia-tion σ$ Hence the observed parallax is an unbiased estimatorof the true parallax (under the strong hypothesis that there are nosystematic biases associated with the survey and that the errorsare normally distributed)

Now in order to assess the bias of ρ = 1$ as an estimatorof the true distance we need to calculate its expected value

E[ρ] = E[1$] =

int1$middotp($|$True)middotd$ =

int1$middotN($True σ$)middotd$

(9)

This bias was approximated by Smith amp Eichhorn (1996) (seeSect 341) as a function of the fractional parallax uncertainty fusing a series expansion of the term in the integral and severalapproximations for the asymptotic regimes of small and largevalues of f and it indeed shows that the distance estimator 1$is unbiased for vanishingly small values of f but it rapidly be-comes significantly biased for values of f beyond 01 But notonly is 1$ a biased estimator of the true distance it is also ahigh-variance estimator The reason for this variance explosionis related to the long tail towards large distances illustrated in theright panel of Figs 3 and 4 Relatively large fractional uncertain-ties inevitably imply noise excursions in the parallax that resultin vanishingly small observed parallaxes and disproportionatedistances (and hence an inflation of the variance)

The effects discussed above can be illustrated with the use ofsimulated data Figure 4 shows the results of a simulation of ob-jects located between 05 and 2kpc where starting from the truedistances we have simulated observed parallaxes with a Gaus-sian uncertainty of σ$ = 03 mas and then calculated for eachobject ρ = 1$

The figure on the left shows that (by construction) the errorsin the observed parallaxes are well behaved and perfectly sym-metrical (Gaussian) while in the centre figure the errors in theestimation of distances using ρ show a strong asymmetry Thecharacteristics of these residuals depend on the distribution oftrue distances and uncertainties This is more evident in the fig-ure on the right where the true distance r is plotted against ρthere is a very prominent tail of overestimated distances and thedistribution is asymmetrical around the one-to-one line the moredistant the objects the more marked the asymmetry These fea-tures are very prominent because we have simulated objects sothat the relative errors in parallax are large but they are present(albeit at a smaller scale) even when the relative errors are small

The plots in Fig 4 correspond to a simple simulation with amild uncertainty σ$ = 03 mas Figure 5 shows the same plotsfor a realistic simulation of the Gaia DR2 data set The simula-tion is described in Appendix A in this case the errors in paral-lax follow a realistic model of the Gaia DR2 errors depicted inFig A2

As a summary we have seen in previous paragraphs thatthe naive approach of inverting the observed parallax has sig-nificant drawbacks we are forced to dispose of valuable data(non-positive parallaxes) and as an estimator ρ = 1$ is biasedand has a very high variance

33 Sample truncation

In addition to the potential sources of trouble described in theprevious sections the traditional use of samples of parallaxes in-cludes a practice that tends to aggravate these effects truncationof the used samples

As discussed in Sect 31 negative parallaxes are a naturalresult of the Gaia measurement process (and of astrometry ingeneral) Since inverting negative parallaxes leads to physicallymeaningless negative distances we are tempted to just get rid ofthese values and form a lsquocleanrsquo sample This results in a biasedsample however

On the one hand removing the negative parallaxes biases thedistribution of this parameter Consider for instance the case il-lustrated in Fig 1 for the quasars from the AllWISE catalogueThese objects have a near zero true parallax and the distribu-tion of its observed values shown in the figure corresponds tothis with a mean of minus10 microas close to zero However if we re-move the negative parallaxes from this sample deeming themlsquounphysicalrsquo the mean of the observed values would be signif-icantly positive about 08 mas This is completely unrealisticfor quasars in removing the negative parallaxes we have signif-icantly biased the observed parallax set for these objects Withsamples of other types of objects with non-zero parallaxes theeffect can be smaller but it will be present

On the other hand when by removing negative parallaxesthe contents of the sample are no longer representative of thebase population from which it has been extracted since stars withlarge parallaxes are over-represented and stars with small paral-laxes are under-represented This can be clearly illustrated us-ing a simulation We have generated a sample of simulated starsmimicking the contents of the full Gaia DR2 (see Appendix A)and truncated it by removing the negative parallaxes In Fig 6we can compare the distribution of the true distances of the orig-inal (non-truncated) sample and the resulting (truncated) sam-ple it is clear that after the removal of negative parallaxes wehave favoured the stars at short distances (large parallaxes) withrespect to the stars at large distances (small parallaxes) The spa-

Article number page 5 of 20

AampA proofs manuscript no 32964_Arxiv

Fig 4 Behaviour of PDF of ρ = 1$ as estimator of the true distance Left Histogram of differences between true parallaxes and observedparallaxes Centre Histogram of differences between true distances and their estimation using ρ Right Comparison of the true distances and theirestimations using ρ The observed parallaxes $ have been simulated using an uncertainty of σ$ = 03 mas

Fig 5 Behaviour of PDF of ρ = 1$ as estimator of the true distance for a simulation of the full Gaia DR2 data set Left Histogram of differencesbetween true parallaxes and observed parallaxes Centre Histogram of differences between true distances and their estimation using ρ RightComparison of the true distances and their estimations using ρ The observed parallaxes $ have been simulated using a realistic Gaia DR2 errormodel described in the Appendix The contour lines correspond to the distribution percentiles of 35 55 90 and 98

tial distribution of the sample has thus been altered and maytherefore bias any analysis based on it

Fig 6 Effect of removing the negative and zero parallaxes from a sim-ulation of Gaia DR2 Distribution of true distances histogram of dis-tances for the complete sample (thick line) histogram of distances forthe sample truncated by removing $ le 0 (thin line)

A stronger version of truncation that has traditionally beenapplied is to remove not only negative parallaxes but also all theparallaxes with a relative error above a given threshold k select-

ing σ$$

lt k This selection tends to favour the removal of starswith small parallaxes The effect is similar to the previous casebut more accentuated as can be seen in Fig 7 Again stars atshort distances are favoured in the sample with respect to distantstars

Fig 7 Effect of removing the positive parallaxes with a relative errorabove 50 as well as negative parallaxes Thick line histogram of dis-tances for the complete sample Thin line histogram of distances forthe sample truncated by removing objects with (|σ$

$| gt 05)

Article number page 6 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Even worse as in the previous case the truncation not onlymakes the distribution of true distances unrepresentative but italso biases the distribution of observed parallaxes stars withpositive errors (making the observed parallax larger than thetrue one) tend to be less removed than stars with negative er-rors (making the observed parallax smaller than the true one)By favouring positive errors with respect to negative errors weare also biasing the overall distribution of parallaxes Figure 8depicts this effect The plots show the difference $ minus$True as afunction of $True We can see in the middle and bottom figureshow the removal of objects is non-symmetrical around the zeroline so that the overall distribution of$minus$True becomes biasedFrom an almost zero bias for the full sample (as expected fromGaia in absence of systematics) we go to significant biases oncewe introduce the truncation and the bias is dependent of the cutvalue we introduce

Furthermore in Gaia the parallax uncertainties vary withthe object magnitude being larger for faint stars (see Fig A2)Therefore a threshold on the relative error will favour brightstars over the faint ones adding to the above described biases

Another type of truncation that has been traditionally appliedis to introduce limits in the observed parallax The effects of sucha limit are closely related to the LutzndashKelker bias discussed inSect 342 Suffice it here to illustrate the effect with a specificexample on a Gaia DR2 -like sample If we take the full sampleand remove stars with $ lt 02 mas we could imagine that weare roughly removing objects further away than 5 kpc Howeverthe net result is depicted in Fig 9 where we can see that insteadof the distribution of true distances of the complete sample up to5 kpc (solid line) we get a distribution with a lack of closer starsand a long tail of stars with greater distances A larger limit inparallax (shorter limiting distance) will produce a less prominenteffect since the relative errors in the parallax will be smaller butthe bias will be nonetheless present

In conclusion our advice to readers is to avoid introducingtruncations when using Gaia data since as illustrated above theycan strongly affect the properties of the sample and thereforeaffect the data analysis If truncation is unavoidable it should beincluded in the Bayesian modelling of the overall problem (seeSect 43)

34 Corrections and transformations

In this section we review proposals in the literature for the useof parallaxes to estimate distances In general they take the formof lsquoremediesrsquo to correct one or another problem on this use Herewe explain why these remedies cannot be recommended

341 The SmithndashEichhorn correction

Smith amp Eichhorn (1996) attempt to compensate for the bias in-troduced by the naive inversion of the observed parallax (andthe associated variance problem) in two different ways The firstinvolves transforming the measured parallaxes into a pseudo-parallax $lowast according to

$lowast equiv β middot σ$

1exp(φ) + exp(minus16$

σ$)

+ φ

(10)

where φ equiv ln(1 + exp( 2$σ$

))2 and β is an adjustable constantThe qualitative effect of the transformation is to map negativeparallaxes into the positive semi-axis R+ and to increase the

Fig 8 Differences between true and observed parallax effect of remov-ing observed negative parallaxes and those above a given relative errorWe start with a representative subsample of 1 million stars (top fig-ure) and truncate it according to the apparent relative parallax precisionTop Complete sample Mean difference $ minus$True is 155 times 10minus5 masMiddle Retaining only objects with positive parallaxes and |σ$

$| lt 05

The mean difference $minus$True is 0164 mas Bottom Retaining onlyobjects with positive parallaxes and |σ$

$| lt 02 The mean difference

$ minus$True is 0026 mas

value of small parallaxes until it asymptotically converges to themeasured value for large $ For $ = 0 φ = ln(2)2 and thepseudo-parallax value $lowast = β middot σ$

(1

1+exp(ln(2)2) +ln(2)

2

)has the

undesirable property of depending on the choice of β and on theparallax uncertainty Thus even in the case of a small parallaxmeasurement $ rarr 0 with a relatively small fractional parallax

Article number page 7 of 20

AampA proofs manuscript no 32964_Arxiv

Fig 9 Effect of introducing a limit on the observed parallax Thick linehistogram of distances up to 5 kpc for the complete sample Thin linehistogram of distances for the sample truncated by removing objectswith $ lt 02 mas (distance estimated as 1$ up to 5 kpc)

uncertainty (eg f = 01) we substitute a perfectly useful andvaluable measurement by an arbitrary value of $lowast

The SmithndashEichhorn transformation is an arbitrary (andrather convolved) choice amongst many such transformationsthat can reduce the bias for certain particular situations Both theanalytical expression and the choice of constants and β are theresult of an unspecified trial-and-error procedure the applicabil-ity of which is unclear Furthermore as stated by the authorsthey introduced a new bias because the transformed parallax isalways larger than the measured parallax This new bias has nophysical interpretation because it is the result of an ad hoc choicefor the analytical expression in Eq (10) It is designed to reducethe bias but it does so by substituting perfectly reasonable directmeasurements (negative and small parallaxes) that we can inter-pret and use for inference by constructed values arising from thechoice

342 The LutzndashKelker correction

Lutz amp Kelker (1973) realised that the spatial distribution ofsources around the observer together with the unavoidable ob-servational errors and a truncation of the sample on the valueof the observed parallax result in systematic biases in the av-erage parallax of certain stellar samples The bias described byLutz amp Kelker (1973) is a manifestation of the truncation bi-ases described above and can be understood if we look at a fewvery simple examples First let us imagine a density of sourcesaround the observer such that the distribution of true parallaxesp($True) is constant in a given interval and zero outside Let usalso imagine that the observation uncertainty σ$ is constant andequal to 03 mas The left panel of Fig 10 shows the distributionof true and observed parallaxes for a simulation of such a situ-ation and 106 sources We see that the observed parallaxes arealso approximately uniform and the departures from uniformityappear near the edges For a given bin of intermediate parallaxwe have as many sources contaminating from neighbouring binsas we have sources lost to other bins due to the observationaluncertainties and the result is a negligible net flux

In the middle panel we show the same histograms exceptthat instead of a constant parallax uncertainty we use a constantvalue of the fractional parallax uncertainty f This is still an un-realistic situation because the distribution of true parallaxes is

not uniform and also because a constant f implies that largerparallaxes are characterised by larger uncertainties However ithelps us to illustrate that even in the case of uniform true par-allaxes we may have a non-zero net flux of sources betweendifferent parallax bins depending on the distribution of parallaxuncertainties In this case noise shifts the value of large paral-laxes more than that of smaller parallaxes (because of the con-stant value of f ) so larger parallaxes get more scattered thussuppressing the distribution more at large parallaxes

Finally the right panel shows the same plot for a realisticdistribution of distances from a Gaia Universe Model Snapshot(GUMS) sample (see Appendix A for a full description) We seethat the effect of a realistic non-uniform distribution of parallaxesand parallax uncertainties results in a net flux in the oppositedirection (smaller true parallaxes become more suppressed andlarger parallax bins are enhanced in both cases the bins of neg-ative parallaxes become populated) This is the root of the LutzndashKelker bias It is important to distinguish between the LutzndashKelker bias and the LutzndashKelker correction The LutzndashKelkerbias is the negative difference for any realistic sample betweenthe average true parallax and the average measured parallax (iethe average true parallax is smaller than the average measuredparallax) This bias has been known at least since the work ofTrumpler amp Weaver (1953) although it was already discussed ina context different from parallaxes as early as Eddington (1913)The LutzndashKelker correction presented in Lutz amp Kelker (1973)and discussed below is an attempt to remedy this bias based ona series of assumptions

In the case of Gaussian uncertainties such as those describedin Eq (6) it is evident that the probability of measuring a value ofthe parallax greater than the true parallax p($ gt $True|$True) =05 The same value holds for the probability that $ lt $Truebecause the Gaussian distribution is symmetrical with respectto the true value of the parallax This is also true for the jointprobability of $ and $True

p($ gt $True) =

S

p($$True) middot d$ middot d$True = 05 (11)

where S is the region of the ($$True) plane where $ gt $TrueHowever the probability distribution of the true parallax

given the observed parallax p($True|$) does not fulfil this seem-ingly desirable property of probability mass equipartition at the$ = $True point We can write the latter probability as

p($True|$) =p($$True)

p($)=

p($|$True) middot p($True)p($)

(12)

using the product rule of probability In the context of inferringthe true parallax from the observed one Eq (12) is the well-known Bayesrsquo theorem where the left-hand side is the posteriorprobability p($ | $True) is the likelihood p($True) is the priorand p($) is the evidence For most realistic prior distributionsp($True) neither the median nor the mode or the mean of theposterior in Eq (12) is at $ = $True Let us take for examplea uniform volume density of sources out to a certain distancelimit In such a distribution the number of sources in a sphericalshell of infinitesimal width at a distance r scales as r2 as doesthe probability distribution of the distances Since

p(r) middot dr = p($True) middot d$True (13)

Article number page 8 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Uniform ϖTrue and constant σϖ

Parallax (mas)

Pro

babi

lity

minus002 002 006 010

05

1015

20Uniform ϖTrue and constant fTrue

Parallax (mas)

Den

sity

minus005 000 005 010 015 020

GUMS simulation

Parallax (mas)

Den

sity

minus1 0 1 2

01

23

45

Fig 10 Histograms of true (grey) and observed (blue) parallaxes for three simulations uniform distribution of parallaxes and constant σ$ = 03mas (left) uniform distribution of parallaxes and constant f = 05 (centre) and the GUMS simulation described in Appendix A (right)

the probability distribution for the true parallax in such a trun-cated constant volume density scenario is proportional to

p($True) prop $minus4True (14)

out to the truncation radius Hence for Gaussian distributed un-certainties we can write p($True|$) as

p($True|$) prop1σ$middot exp(

minus($ minus$True)2

2σ$) middot$minus4

True (15)

The joint distribution p($$True) (ie the non-normalisedposterior plotted as a 2D function of data parallax and param-eter $True) for this particular case of truncated uniform stellarvolume densities is depicted in Fig 11 together with the con-ditional distributions for particular values of $ and $True Itshows graphically the symmetry of the probability distributionp($|$True) (with respect to $True) and the bias and asymmetryof p($True|$)

Lutz amp Kelker (1973) obtain Eq (15) in their Sect ii underthe assumption of uniform stellar volume densities and constantfractional parallax uncertainties (constant f ) They discuss sev-eral distributions for different values of the ratio σ$$True Intheir Sect iii they use the expected value of the true parallaxgiven by the distribution p($True|$) in Eq (15) to infer the ex-pected value of the difference between the true absolute magni-tude MTrue and the value obtained with the naive inversion of theobserved parallax The expected value of this absolute magni-tude error is derived and tabulated for the distribution p($True|$)as a function of the fractional parallax uncertainty f This so-called LutzndashKelker correction is often applied to stellar samplesthat do not fulfil the assumptions under which it was derived be-cause the stellar volume density is far from uniform at scaleslarger than a few tens of parsecs and the samples to which thecorrection is applied are never characterised by a unique valueof f

p(ϖ

Tru

e|ϖ

)0

12

34

5

00 02 04 06 08 10

00 02 04 06 08 10

00

05

10

15

ϖTrue

ϖ

p(ϖ|ϖTrue)

piob

s

0 1 2 3 4

00

05

10

15

minus50

minus40

minus30

minus20

minus10

0

10

Fig 11 (Lower left) Joint probability distribution (in logarithmic scaleto improve visibility) for the random variables $ and $True in the sce-nario of a truncated uniform volume density The colour code is shownto the right of the lower right panel and white marks the region wherethe probability is zero (Lower right) Conditional probability distribu-tion of the observed parallax for $True=05 (Upper left) Conditionalprobability distribution for $True given an observed parallax $ Thefractional parallax uncertainty assumed for the computation of all prob-abilities is f = 02

35 Astrometry-based luminosity

An obvious way to avoid the problems associated with the naiveinversion of observed parallaxes (see Sect 341) is to remainin the space of parallaxes (as opposed to that of distances) in-sofar as this is possible One example of this approach is the

Article number page 9 of 20

AampA proofs manuscript no 32964_Arxiv

astrometry-based luminosity (ABL) method (Arenou amp Luri1999) originating from Malmquist (1920) The ABL methodconsists in substituting the absolute magnitudes by a proxy thatis linearly dependent on the parallax The original proposal was

aV equiv 1002MV = $10mV +5

5 (16)

and has been recently used to obtain maximum likelihood esti-mates of the period-luminosity relation coefficients for Cepheidsand RR Lyrae stars (Gaia Collaboration et al 2017a) and to im-prove the Gaia parallax uncertainties using deconvolved colour-magnitude diagrams as prior (Anderson et al 2017) The newastrometry-based luminosity depends linearly on the parallaxand thus its uncertainty can be expected to have an approxi-mately Gaussian distribution if the fractional uncertainty of theapparent magnitude is negligible This is more often the casethan for fractional parallax uncertainties and is in general a goodapproximation

Unfortunately the astrometry-based luminosity can only beapplied to the study of the luminosity and can do nothing forthe analysis of spatial distributions where distances or tangentialvelocities are inevitable ingredients

4 Recommendations for using astrometric data

In this section we provide specific advice on the use of astro-metric data in astronomical data analysis Although the focus ison the use of Gaia data many of the recommendations hold forthe analysis of astrometric data in general To illustrate the rec-ommendations we also provide a small number of worked ex-amples ranging from very basic demonstrations of the issuesmentioned in Sect 3 to full Bayesian analyses Some of theseexamples are available in the Gaia archive tutorial described inSect 5

41 Using Gaia astrometric data how to proceed

The fundamental quantity sought when measuring a stellar par-allax is the distance to the star in question However as discussedin the previous sections the quantity of interest has a non-linearrelation to the measurement r = 1$True and is constrainedto be positive while the measured parallax can be zero or evennegative Starting from a measured parallax which is normallydistributed about the true parallax this leads to a probability den-sity for the simple distance estimator ρ = 1$ (see Sect 3) forwhich the moments are defined in terms of unknown quantitiesThis means we cannot calculate the variance of the estimator orthe size of a possible bias which renders the estimator uselessfrom the statistical point of view

Our first and main recommendation is thus to alwaystreat the derivation of (astro-)physical parameters from as-trometric data in particular when parallaxes are involvedas an inference problem which should preferably be handledwith a full Bayesian approach

411 Bayesian inference of distances from parallaxes

The Bayesian approach to inference involves estimating a PDFover the quantity of interest given the observables In this casewe want to estimate the distance r given the parallax $ Afuller treatment of this problem has been given in Bailer-Jones

(2015) so only a brief summary will be given here Using Bayesrsquotheorem we can write the posterior as

P(r | $) =1Z

P($ | r)P(r) (17)

Formally everything is also conditioned on the parallax uncer-tainty σ$ and any relevant constraints or assumptions but sym-bols for these are omitted for brevity The quantity P($ | r) isthe likelihood from Eq (12) The prior P(r) incorporates ourassumptions and Z is a normalisation constant

In addition to the likelihood there are two important choiceswhich must be made to estimate a distance the choice of priorand the choice of estimator We will first focus on the formerand start discussing the simplest prior the uniform unboundedprior With a uniform boundless (and thus improper) prior ondistances the posterior is proportional to the likelihood and ifwe choose the mode of the posterior as our estimator then thesolution is mathematically equivalent to maximising the likeli-hood However a boundless uniform prior permits negative dis-tances which are non-physical so we should at least truncate itto exclude these values

The more measurements we have or the more precise themeasurements are the narrower the likelihood and the lower theimpact of the prior should be on the posterior It turns out how-ever that with the unbounded uniform prior the posterior is im-proper ie it is not normalisable Consequently the mean andmedian are not defined The only point estimator is the modeie rest = 1$ (the standard deviation and the other quantilesare likewise undefined) which is rather restrictive Finally thisBayesian distance estimate defined for an unbounded uniformprior reduces to the maximum likelihood estimate and coincideswith the naive inversion discussed in Sect 32 The posterioris ill-defined for the unbounded uniform prior for parallaxesThis prior describes an unrealistic situation where the observer isplaced at the centre of a distribution of sources that is sphericallysymmetric and the volume density of which decreases sharplywith distance

The solution to these problems (non-physical distances im-proper posterior) is to use a more appropriate prior The proper-ties of various priors and estimators have been studied by Bailer-Jones (2015) and Astraatmadja amp Bailer-Jones (2016b) The lat-ter makes a detailed study using a Milky Way model for a priorand also investigates how the estimates change when the Gaiaphotometric measurements are used in addition to the parallaxOne of the least informative priors we can use is the exponen-tially decreasing space density prior

P(r) =

1

2L3 r2eminusrL if r gt 0

0 otherwise (18)

For distances r L this corresponds to a constant space densityof stars with the probability dropping exponentially at distancesmuch larger than the mode (which is at 2L) Examples of theshape of the posterior for parallaxes of different precisions areshown in Bailer-Jones (2015) and Astraatmadja amp Bailer-Jones(2016b)

The posterior obtained for the prior defined in Eq (18) is nor-malised and thus we have a choice of point estimators (meanmedian or mode) Also the distribution is asymmetric andtwo quantiles (5 and 95) rather than the standard deviationare recommended to summarise the uncertainty in the point es-timate The median as a point estimate is guaranteed to lie

Article number page 10 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

between these quantiles Astraatmadja amp Bailer-Jones (2016b)used this prior as well as a Milky Way prior to infer distancesfor the two million TGAS stars in the first Gaia data releaseThe behaviour of the estimates derived from the exponentiallydecreasing space density prior can be explored using the interac-tive tool available in the tutorial described in Sect 51

In general the introduction of reasonable prior probabilitiesaccounts for the LutzndashKelker bias although the inevitable mis-match between the true distribution of parallaxes and the priorused will result in less accurate inferences In any case the ad-vantage with respect to the methods discussed in Sect 3 is cleari) we do not need to tabulate corrections for each prior assumingconstant f ii) we do not need to dispose of non-positive par-allaxes iii) we obtain a proper full posterior distribution withwell-defined moments and credible intervals iv) even simplepriors such as the exponential decreasing volume density willimprove our estimates with respect to the unrealistic prior un-derlying the maximum likelihood solution rest = 1$ and fi-nally v) we obtain estimators that degrade gracefully as the dataquality degrades and with credible intervals that grow with theobservational uncertainties until they reach the typical scales ofthe prior when the observations are non-informative These ad-vantages come at the expense of an inference that is more com-putationally demanding in general (as it requires obtaining theposterior and its summary statistics if needed) the need for athoughtful choice of a prior distribution and the analysis of theinfluence of the prior on the inference results

Figure 12 shows the distribution of means (left) modes (cen-tre) and medians (right) of the posteriors inferred for a simu-lation of 105 sources drawn from an exponentially decreasingspace density distribution This simulation represents the un-likely case where the prior is a perfect representation of the truedistribution

From a Bayesian perspective the full posterior PDF is thefinal result of our inference if we only use parallax measure-ments to infer the distance (see below) and further analysesshould make use of it as a whole This is our recommendationin general avoid expectations and summaries of the posteriorHowever it is often useful to compute summary statistics suchas the mean (expectation) median mode quantiles etc to havean approximate idea of the distribution but we should not usethese summaries for further inference for example to estimateabsolute magnitudes Cartesian velocities etc We recommendinferring the full posterior distributions for these derived quanti-ties using the posterior of the true parallax or of the distance orusing the same Bayesian scheme as for the true parallax as ex-plained in Sect 42 In Fig 13 we show the values of the mean(left) mode (centre) and median (right) that we would obtainfrom a set of 104 simulated observations of a star at 100 parsecswith f = 02 We assume a Gaussian distribution of the obser-vations around the true parallax The posterior distribution is in-ferred using Eq (17) and two priors a uniform volume densityof sources truncated at 1 kpc (results in grey) and a uniform den-sity of sources multiplied by an exponential decay of length scale200 pc as defined in Eq (18) (in blue) The expectation valuesof the histograms are shown as dashed lines of the same colourwith the true value (100 pc) shown as a red dashed line We seein general that i) the truncation has the effect of increasing thenumber of overestimated distances ii) the three estimators arebiased towards larger distances (smaller parallaxes) but the ex-pectation of the mode is significantly closer to the true valueand iii) the abrupt truncation of the prior results in a spuriouspeak of modes at the truncation distance as already discussed inBailer-Jones (2015)

Figure 14 and Table 1 show a comparison of the absolutevalue of the empirical bias and standard deviation associatedwith some distance estimators discussed in this paper as a func-tion of the measured fractional uncertainty in the parallax Wechose the measured value even though it is a very poor and non-robust estimator because as stated in Sect 32 we never have ac-cess to the true fractional parallax uncertainty This figure showsthe results obtained for 107 sources in the Gaia DR2 simulationdescribed in Appendix A for the maximum likelihood estimateρ = 1

$with and without the SmithndashEichhorn correction and for

the mode estimates based on the posterior distribution for twopriors (a uniform distance prior UD with maximum distancerlim = 100 kpc and an exponentially decreasing space densityprior EDSD with L = 135 kpc) neither of which matches thetrue distribution of sources in the simulation Only the mode ofthe posteriors is plotted (but not the mean or the median) for thesake of clarity The conclusions described next are only valid un-der the conditions of the exercise and are provided as a demon-stration of the caveats and problems described in previous sec-tions not as a recommendation of the mode of the posterior in-ferred under the EDSD prior as an estimator At the risk of re-peating ourselves we emphasise the need to adopt priors adaptedto the inference problem at hand Also the conclusions only holdfor the used simulation (where we generate the true distances andhence can calculate the bias and standard deviation) and need notbe representative of the true performance for the real Gaia dataset They can be summarised as follows

ndash the mode of the EDSD prior shows the smallest bias andstandard deviation in practically the entire range of estimatedfractional parallax uncertainties (in particular everywherebeyond the range of fapp represented in the plot)

ndash the SmithndashEichhorn estimate shows pathological biases andstandard deviations in the vicinity of the supposedly best-quality measurements at fapp = 0 Away from this region itprovides the next less biased estimates (averaged over binsof fapp) after the mode of the EDSD posterior

412 Bayesian inference of distances from parallaxes andcomplementary information

The methodology recommended in the previous paragraphs isuseful when we only have observed parallaxes to infer distancesThere are however common situations in astronomy where theparallaxes are only one of many observables and distances arenot the final goal of the inference problem but a means to achieveit In this context we recommend an extension of the classicalBayesian inference methods described in the previous sectionThese problems are characterised by a set of observables andassociated uncertainties (that include but are not restricted toparallaxes) and a series of parameters (the values of which areunknown a priori) with complex interdependence relationshipsamongst them Some of these parameters will be the ultimategoal of the inference process Other parameters do play an im-portant role but we are not interested in their particular valuesand we call them nuisance parameters following the literatureFor example in determining the shape of a stellar associationthe individual stellar distances are not relevant by themselvesbut only insomuch as we need them to achieve our objectiveWe show below how we deal with the nuisance parameters Theinterested reader can find applications of the methodology de-scribed in this section to inferring the coefficients of period-luminosity relations in Gaia Collaboration et al (2017a) andSesar et al (2017) Also the same methodology (a hierarchical

Article number page 11 of 20

AampA proofs manuscript no 32964_Arxiv

Mean

Cou

nts

minus8000 minus4000 0 2000

0e+

002e

minus04

4eminus

046e

minus04

8eminus

04

Mode

Cou

nts

minus8000 minus4000 0 2000

Median

Cou

nts

minus8000 minus4000 0 2000

0e+

002e

minus04

4eminus

046e

minus04

8eminus

04

Fig 12 Probability distribution of the residuals of the Bayesian estimate of the true distance in parsecs for 100000 simulated stars drawn from auniform density plus exponential decay distribution and σ$ = 3 middot 10minus1 mas (orange) 3 mas (blue) and 30 mas (grey)

Mean

Cou

nts

0 200 400 600 800 1000

000

00

005

001

00

015

002

0

Mode

Cou

nts

0 200 400 600 800 1000

Median

Cou

nts

0 200 400 600 800 1000

000

00

005

001

00

015

002

0

Fig 13 Distributions of means (left) modes (centre) and medians (right) for a series of 104 posteriors calculated for a star at a true distance of100 pc and f = 02 and observed parallaxes drawn at random from the corresponding Gaussian distribution Posteriors are inferred with a uniformdensity prior truncated at 1 kpc (grey) or with a uniform density with exponential decay prior and length scale 135 kpc (blue) The red vertical linemarks the true parallax the grey and blue lines correspond to the expected value (mean) of each distribution (same colours as for the histograms)

Bayesian model) is applied in Hawkins et al (2017) where theconstraint on the distances comes not from a period-luminosityrelation but from the relatively small dispersion of the abso-lute magnitudes and colour indices of red clump stars A lastexample of this methodology can be found in Leistedt amp Hogg(2017b) where the constraint comes from modelling the colour-magnitude diagram

Just as in the previous section where we aimed at estimat-ing distances from parallaxes alone the two key elements in thiscase are the definitions of a likelihood and a prior The likeli-hood represents the probability distribution of the observablesgiven the model parameters Typically the likelihood is basedon a generative or forward model for the data Such modelspredict the data from our assumptions about the physical pro-cess that generates the true values (ie the distribution of stars in

space) and our knowledge of the measurement process (eg jus-tifying the assumption of a normal distribution of the observedparallax around its true value) Forward models can be used togenerate arbitrarily large synthetic data sets for a given set ofthe parameters In this case however where we are concernedwith several types of measurements that depend on parametersother than the distance the likelihood term will be in generalmore complex than in Sect 411 and may include probabilisticdependencies between the parameters The term hierarchical ormulti-level model is often used to refer to this kind of model

Let us illustrate the concept of hierarchical models with asimple extension of the Bayesian model described in Sect 411where instead of assuming a fixed value of the prior length scaleL in Eq (18) we make it another parameter of the model and tryto infer it Let us further assume that we have a set of N parallax

Article number page 12 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Table 1 Average bias and standard deviation in three regimes of fapp for four distance estimators discussed in this paper (from left to right) themode of the posterior based on the exponentially decreasing space density (EDSD) prior the mode of the posterior of the uniform distance (UD)distribution the maximum likelihood estimate corrected according to Smith amp Eichhorn (1996) abbreviated as SE and the maximum likelihood(ML) estimate The wider ranges of fapp exclude narrower ranges shown in previous rows of the table

Summary fapp Range EDSD UD SE ML

Bias(-11) -02 97 342 -095(-55) -03 107 -034 -12(-5050) -03 162 -04 -38

Std Deviation(-11) 04 80 6858 05(-55) 04 84 05 195(-5050) 04 106 04 171

fapp

log 1

0(|B

ias|

)

minus4

minus2

02

4

Mode EDSDMode UDSmithminusEichhorn estimateML estimate

fapp

log 1

0(σ)

minus1

12

34

5

minus5 minus3 minus1 1 3 5

Mode EDSDMode UDSmithminusEichhorn estimateML estimate

Fig 14 Bias (top) and standard deviation (bottom) averaged over binsof the estimated fractional parallax uncertainty fapp for four estimatorsof the distance the maximum likelihood (ML) estimator rest = 1

$(or-

ange) the ML estimator corrected as described in Smith amp Eichhorn(1996) (light blue) the mode of the posterior obtained with an expo-nentially decreasing space density (EDSD) prior and L = 135 kpc (darkgreen) and the mode of the posterior obtained with a uniform distance(UD) prior truncated at 100 kpc (red)

measurements $k one for each of a sample of N stars In thiscase the likelihood can be written as

p($k|rk L) = p($k | rk) middot p(rk | L)

= p($k | rk) middotNprod

k=1

p(rk | L) (19)

where rk is the true unknown distance to the kth star We note thatvery often the measured parallaxes are assumed independentand thus p($k | rk) is written as the product

prodNk=1 p($k | rk)

This is incorrect in general for Gaia parallaxes because the par-allax measurements are not independent As described in GaiaCollaboration et al (2018c) and Sect 2 of this paper there areregional correlations amongst them (see Sect 43) but for thesake of simplicity let us assume the sample of N measurementsis spread all over the celestial sphere such that the correlationsaverage out and can be neglected Hence we write

p($k|rk L) =

Nprodk=1

p($k | rk) middot p(rk | L) (20)

Under the assumption of Gaussian uncertainties the first termin the product is given by Eq (6) while the second is given byEq (18)

This likelihood can be represented by a simple directed graph(see Fig 15) that provides information about the conditional de-pendencies amongst the parameters The shaded nodes representthe observations the open circles represent model parametersand the small black circles at the origin of the arrows representmodel constants The arrows denote conditional dependence re-lations and the plate notation indicates repetition over the mea-surements k

The next key element is as in Sect 411 the prior Accord-ing to Fig 15 the only parameter that needs a prior specificationis the one without a parent node L The rest of the arcs in thegraph are defined in the likelihood term (Eq (20)) If the sampleof N stars were representative of the inner Galactic halo for ex-ample we could use a Gaussian prior centred at asymp 30 kpc (seeeg Iorio et al 2018 and references therein) Such a hierarchicalmodel can potentially shrink the individual parallax uncertaintiesby incorporating the constraint on the distribution of distances

If we are only interested in the individual distances rk wecan consider L as a nuisance parameter

p($Truek | $k) =

intp($Truek L | $k) middot dL (21)

=

intp($Truek | $k L) middot p(L | $k) middot dL

This integral (known as the marginalisation integral) allows us towrite the posterior we are interested in without having to fix thevalue of L to any particular value Depending on the objective ofthe inference we could have alternatively determined the poste-rior distribution of L by marginalising the individual distanceswith an N-dimensional integral over the rk parameters

Article number page 13 of 20

AampA proofs manuscript no 32964_Arxiv

L

rk

$ σ$

microL

σL

k = 1 2 N

Fig 15 Directed acyclic graph that represents a hierarchical Bayesianmodel of a set of N parallax measurements characterised by uncertain-ties σ$ and true distances drawn from an exponentially decreasing den-sity distribution of distances (see Eq (18)) The scale length of the ex-ponential decrease L is itself a model parameter that we can infer fromthe sample Its prior is defined in this case for the sake of simplicity asa Gaussian distribution of mean microL and standard deviation σL

In parameter spaces of dimensionality greater than 3ndash4 thecomputation of the possibly marginalised posteriors andor ev-idence requires efficient sampling schemes like those inspiredin Markov chain Monte Carlo (MCMC) methods to avoid largenumbers of calculations in regions of parameter space with neg-ligible contributions to the posterior This adds to the highercomputational burden of the Bayesian inference method men-tioned in the previous section

The previous simple example can be extended to includemore levels in the hierarchy and more importantly more pa-rameters and measurement types Section 55 and 56 developin greater detail two examples of hierarchical models of directapplicability in the Gaia context

42 Absolute magnitudes tangential velocities and otherderived quantities

The approaches described in the previous sections can be appliedto any quantity we want to estimate using the parallax For exam-ple if we want to infer the absolute magnitude MG then giventhe measured apparent magnitude G and line-of-sight extinctionAG the true parallax $True is related to MG via the conservationof flux

5 log$True = MG + AG minusG minus 5 (22)

Assuming for simplicity that G and AG are known Bayesrsquotheorem gives the posterior on MG as

P(MG | $G AG) =1Z

P($ | MG AGG)P(MG) (23)

where the likelihood is still the usual Gaussian distribution forthe parallax (Eq (6)) in which the true parallax is given byEq (22) As this expression is non-linear we again obtain anasymmetric posterior PDF over MG the exact shape of whichalso depends on the prior

The inference of other quantities can be approached in thesame way In general we need to consider a multi-dimensionallikelihood to accommodate the measurement uncertainties (andcorrelations) in all observed quantities For instance the set ofparameters θ = r v φ (distance tangential speed and direc-tion measured anticlockwise from north) can be inferred fromthe Gaia astrometric measurements o = $ microαlowast microδ (where microαlowastand microδ are the measured proper motions) using the likelihood

p(o | θ) = N(θΣ) =1

(2π)32|Σ|12exp

(minus

12

(o minus x)T Σminus1(o minus x))

(24)

where N denotes the Gaussian distribution Σ is the full (non-diagonal) covariance matrix provided as part of the Gaia DataRelease and

x =

(1r

v sin(φ)r

v cos(φ)

r

)(25)

is the vector of model parameters geometrically transformed intothe space of observables in noise-free conditions Equation (24)assumes correlated Gaussian uncertainties in the measurements

The posterior distribution can then be obtained by multiply-ing the likelihood by a suitable prior The simplest assumptionwould be a separable prior such that p(θ) = p(r) middot p(v) middot p(φ)where p(v) and p(φ) should reflect our knowledge about thedynamical properties of the population from where the sourceor sources were drawn (eg thin disk thick disk bulge halo)Again hierarchical models can be used in the analysis of sam-ples in order to infer the population properties (prior hyper-parameters) themselves

Similar procedures can be followed to infer kinematic ener-gies or full 3D velocities when the forward model is extendedwith radial velocity measurements

43 Further recommendations

In this subsection we provide some further recommendationsand guidance in the context of the Bayesian approach outlinedabove Although powerful inference with Bayesian methodsusually comes at a large computational cost Some of the rec-ommendations below can also be seen in the light of taking dataanalysis approaches that approximate the Bayesian methodologyand can be much faster

Where possible formulate the problem in the data spaceThe problems caused by the ill-defined uncertainties on quan-tities derived from parallaxes can be avoided by carrying outthe analysis in the data space where the behaviour of the un-certainties is well understood This means that the quantities tobe inferred are treated as parameters in a forward or generativemodel that is used to predict the data Some adjustment processthen leads to estimates of the parameters A very simple forwardmodelling example can be found in Schroumlder et al (2004) whostudied the luminosity calibrations of O-stars by predicting the

Article number page 14 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

expected Hipparcos parallaxes from the assumed luminosity cal-ibration and comparing those to the measured parallaxes A morecomplex example can be found in Lindegren et al (2000) whopresent a kinematic model for clusters which describes the ve-locity field of the cluster members and predicts the proper mo-tions accounting for the astrometric uncertainties and covari-ances As shown in previous sections the Bayesian approachnaturally lends itself to (and in fact requires) forward modellingof the data

Forward modelling has the added advantage that it forces usto consider the proper formulation of the questions asked fromthe astrometric data This naturally leads to the insight that oftenthe explicit knowledge of the distances to sources is not of inter-est For example in the Schroumlder et al (2004) case an assumedluminosity of the O-stars and their known apparent magnitudeis sufficient to predict the observed parallaxes In more complexanalyses the distances to sources can often be treated as nuisanceparameters which in a Bayesian setting can be marginalised outof the posterior

Use all relevant information Although the parallax has a di-rect relation to the distance of a star it is not the only measure-ment that contains distance information The apparent magni-tude and the colour of a star carry significant information on theplausible distances at which it can be located as the colour pro-vides information on plausible absolute magnitude values for thestar This is used in two papers (Leistedt amp Hogg 2017a An-derson et al 2017) in which the information contained in thephotometry of stars is combined with Gaia DR1 parallax in-formation to arrive at more precise representations of the colour-magnitude diagram than can be obtained through the parallaxesalone Similarly proper motions contain distance informationwhich can be used to good effect to derive more precise distancesor luminosities for open cluster members (eg de Bruijne et al2001 Gaia Collaboration et al 2017b) The Bayesian approachnaturally allows for the combination of different types of data inthe modelling of the problem at hand It should be emphasisedhowever that adding additional data necessitates increasing themodel complexity (eg to include the relation among apparentmagnitude absolute magnitude extinction and parallax) whichleads to the need to make more assumptions

Incorporate a proper prior Bailer-Jones (2015) discussed thesimplest case of inferring the distance to a source from a sin-gle observed parallax He showed that if the minimal prior thatr should be positive is used the resulting posterior is not nor-malisable and hence has no mean variance other momentsor percentiles This behaviour of the posterior is not limitedto inference of distance Examples of other quantities that arenon-linearly related to parallax are absolute magnitude (M prop

log10 $) tangential velocity (vT prop 1$) kinetic energy and an-gular momentum (both proportional to 1$2 when determinedrelative to the observer) In all these cases it can be shown thatthe posterior for an improper uniform prior is not normalisableand has no moments The conclusion is that a proper prior onthe parameters to be estimated should be included to ensure thatthe posterior represents a normalised probability distribution Ingeneral using unconstrained non-informative priors in Bayesiandata analysis (such as the one on r above) is bad practice In-evitably there is always a mismatch between the prior and thetrue distribution (if there were not there would be no need todo the inference) This will unavoidably lead to some biases al-though the better the data the smaller these will be We cannot

expect to do much better than our prior knowledge in case weonly have access to poor data The Bayesian approach guaran-tees a graceful transition of the posterior into the prior as thedata quality degrades

The above discussion raises the question of what priors to in-clude for a specific inference problem Simple recipes cannot begiven as the prior to be used depends on the problem at hand andthe information already available When deciding on a prior weshould keep in mind that some information is always availableEven if a parallax is only available for a single star we knowthat it cannot be located at arbitrary distances It is unlikely tobe much closer that 1 pc (although we cannot fully exclude thepresence of faint stars closer than Proxima Centauri) and it mustbe located at a finite distance (we can observe the star) Thiswould suggest a non-informative uniform prior on r with lib-eral lower and upper bounds (such that the prior is normalised)However as pointed out in Bailer-Jones (2015) a uniform distri-bution in r implies a space density of stars that falls of as 1r2

from the position of the Sun which is of course physically im-plausible Hence one should assume a reasonable distribution ofstars in space and construct the prior on r accordingly Bailer-Jones (2015) presents a prior derived from a uniform space den-sity with an exponential cut-off which was used in Astraatmadjaamp Bailer-Jones (2016b) to derive distances to stars for which par-allaxes are listed in Gaia DR1 This prior should not be usedindiscriminately at the very least we should carefully considerthe choice of the scale length L (or leave that as a parameter tobe estimated as described in Sect 412) and in most cases amore tailored prior based on our broad knowledge of the distri-bution of a given stellar population in the Milky Way would bebetter The tutorial cases introduced in the next section containsome more examples of priors on distance and other astrophysi-cal parameters

The next two items discuss simplifications to the Bayesianapproach that nevertheless need to be justified carefully

Maximum likelihood modelling We have seen that priors arethe bridges that allow us to go from the probability of the obser-vations given the unknown parameters to the desired probabil-ity of the parameters given the observations It is only based onthis probability distribution that we can make statements aboutcredible intervals (uncertainties) for the inferred quantities orselect amongst competing models (a topic that is beyond thescope of this paper) However if making prior-free inferences ispreferred then maximising the likelihood is the only alternativeThe kinematic modelling presented in Lindegren et al (2000)is a non-trivial example of this A more complex example canbe found in Palmer et al (2014) The ML approach just as theBayesian framework described above allows the combinationof different types of data and accounts for selection functionsor missing data We have seen in Sect 411 that the maximumlikelihood estimate of the distance given a single parallax mea-surement is ρ = 1

$and this is a poor estimator of the distance

except for subsets of very accurate measurements In generalthe Bayesian and the maximum likelihood estimates coincide inthe limit of very small uncertainties or infinite numbers of mea-surements In such limits the maximum likelihood estimate issimpler to obtain although its computational cost may still belarge as the ML method is often equivalent to a complex optimi-sation problem involving a multi-dimensional function of manyparameters

Article number page 15 of 20

AampA proofs manuscript no 32964_Arxiv

Selecting the lsquobestrsquo data Analyses that use parallax data areoften restricted to positive parallaxes with relative uncertaintiesbelow some limit (typically 20 ) This allows working in aregime where the uncertainties of derived quantities such as dis-tance tangential velocities luminosity etc are thought to bemanageable which allows working in the space of astrophysi-cal variables rather than the data Truncation on relative parallaxerror might be justified in an exploratory phase of the data anal-ysis however there are a number of reasons why this approachis almost never advisable Even at relative uncertainties below02 the quantities calculated from the parallax can be biased andsuffer from a large variance (see Bailer-Jones 2015) More im-portantly however the selection of lsquogoodrsquo parallaxes will biasthe sample studied to nearby andor bright representatives of anystellar population and the selection may lead to discarding avery large fraction of the potential sample Hence any inferencesbased on such data will be severely biased This is clearly illus-trated in Fig 7 where for an even less strict truncation of starswith a relative uncertainties below 50 the distribution of dis-tances of the resulting sample is clearly biased with respect theoriginal sample

Accounting for data selection and incompleteness Althoughthe Gaia survey is designed such that the only selection ofsources is that they are brighter than the survey limit at G = 207the combination of the onboard detection algorithm the Gaiascanning law and the filtering of results of insufficient qualityprior to a data release lead to a complex selection function espe-cially in the early data releases This selection function should betaken into account in any data analysis and this is most naturallydone as part of a Bayesian analysis where the selection func-tion is part of the forward model that predicts the data given themodel parameters Precise prescriptions of the selection func-tions are not foreseen to be part of the data release documenta-tion Hence selection function parameters need to be includedas part of the parameters inferred by the Bayesian analysis or ifthis is not possible the selection functions have to be borne inmind when interpreting the results

Covariances in the uncertainties All the uncertainties on theastrometric data quoted in the Gaia catalogue are presented asfull covariance matrices where the diagonal elements representthe standard uncertainties on the astrometric parameters and theoff-diagonal elements the covariances or correlations betweenthe uncertainties This amounts to treating the astrometric datavector as having been drawn from a multivariate normal distri-bution with a covariance matrix as given in the Gaia catalogueThe covariances are most easily handled in the data space as partof the likelihood (see Lindegren et al 2000 for an example inthe context of kinematic modelling) If the covariances in theastrometric uncertainties are not accounted for we can easily bemisled for example by spurious features in the inferred velocityfield of an open cluster (Brown et al 1997)

The uncertainties are also correlated from one source tothe next especially over small distances on the sky A studyof the star-to-star correlations in the parallax uncertainties inGaia DR1 was done for the Kepler field (Zinn et al 2017)where independent and precise asteroseismic distances to thestars are available enabling the authors to derive an expressionfor the correlation strength and spatial scale This expressioncan be used for studies of the Kepler field but care should betaken when extrapolating to other fields on the sky The func-tional form for the star-to-star correlations used by Zinn et al

(2017) could be introduced as part of the forward model withthe Zinn et al (2017) parameters as a good first guess

For Gaia DR1 the length scale for the star-to-star corre-lations was estimated to vary from subdegree scales to tens ofdegrees on the sky where Zinn et al (2017) derived the corre-lation function over length scales of sim 02 to sim 10 degrees ForGaia DR2 Gaia Collaboration et al (2018c) estimate that thespatial correlations extend over scales of below 1 degree to 10ndash20 degrees

Accounting for non-Gaussian andor systematic uncertain-ties Although the bulk of the sources in the Gaia catalogue havenormally distributed uncertainties there is a significant fractionfor which the uncertainties exhibit non-Gaussian behaviour (egwhen uncertainties are over- or underestimated) This can be ac-counted for in the data analysis by including the uncertainties aspart of the forward model (eg Sesar et al 2017) or by explicitlymodelling an outlier population Similarly systematic uncertain-ties can be included as part of the forward model For exampleSesar et al (2017) include a global parallax zero-point as part oftheir probabilistic model used to analyse the period-luminosityrelation of RR Lyrae stars with Gaia DR1 data An alternativeapproach to the investigation of systematics in the parallaxes (ordistance estimates obtained from photometry or spectroscopyfor example) is presented in Schoumlnrich et al (2012) and is ap-plied to Gaia DR1 in Schoumlnrich amp Aumer (2017) In this casewe can consider that for samples covering a significant fractionof the sky any systematic error in the estimated distances to thestars will show up as correlations in their 3D velocity compo-nents The presence of such correlations can be used to make in-ferences about systematic errors for example in the parallaxes

Systematic uncertainties are more difficult to handle as theymay show variations as a function of source brightness or ce-lestial position they may be correlated across neighbouringsources and they may not be well understood for a given earlyGaia data release In general the information needed to accu-rately model systematic uncertainties or uncertainty correlationsbetween sources may not be readily available This informationcan be obtained from a comparison between Gaia and otherhigh-precision data (eg Zinn et al 2017 Arenou et al 2017Gaia Collaboration et al 2018a) or by examining for exampleplots of the parallax or proper motion averaged over sky regionsfor samples where the true parallax or proper motion values canbe assumed to be known such as zero parallax and proper mo-tion for quasars (see Gaia Collaboration et al 2018c for exam-ples)

Two special cases should be mentioned when the sampleis well distributed over the sky we can safely assume that thelocal systematics vanish and that only the global parallax zero-point need to be subtracted locally we may be interested not bythe absolute value of the parallaxes but by the relative ones inwhich case the difference between parallaxes and their averageremoves part of the systematics

There is no general recipe for dealing with non-Gaussian un-certainties or correlated systematic uncertainties The main ad-vice we can give here is to proceed with the analysis of the astro-metric data as they are but to keep in mind the systematics andcorrelations discussed in Sect 2 when interpreting the resultsAlternatively the forward model can be extended to include sys-tematic and correlation terms for which parameters are also tobe estimated Such models can be guided by the studies of sys-tematic uncertainties mentioned above

Article number page 16 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Testing with simulations Finally we strongly advise that theinference problem at hand should be investigated through simu-lated data and that the simulations performed should be as closeas possible to the real data (in particular correctly modelling theuncertainties and selection effects) The simulations allow theanalysis method to be developed and tested for accuracy How-ever the performance of the analysis method should be inter-preted strictly in terms of how well the assumed model explainsthe simulated observed data That is we should not fall into thetrap of trying to tune the analysis method to get an answer that isas close to the lsquotruthrsquo as possible In real problems we can onlyjudge the adequacy of a model and its parameter values by howwell they predict the observed data (to within the observationaluncertainties it should be stressed as we should avoid lsquoover-fittingrsquo the data)

5 Using astrometric data practical examples

We introduce here a few worked examples to illustrate thepoints that were made in the previous section These ex-amples are available in full detail as online tutorials in theform of source code accompanied by much more exten-sive explanation than can be provided here The source codeand corresponding Python and R Notebooks can be foundat the following URL httpsgithubcomagabrownastrometry-inference-tutorials In all cases the readeris strongly encouraged to download the source code and exper-iment with modifications of the simulated input data andor theparameter choices in the inference methods

51 Comparison of distance and distance modulusestimators

The use of the Bayesian inference with non-informative priorsdescribed in Sect 411 is illustrated and implemented inthe following tutorial httpsgithubcomagabrownastrometry-inference-tutorialstreemastersingle-sourceGraphicalUserInterface The tutorialcompares the performance of Bayesian distance estimationmethods with the SmithndashEichhorn transformation (Smith ampEichhorn 1996) (Sect 341) and the naive parallax inversion

The tutorial contains a Graphical User Interface that easilyvisualises and compares the behaviour of all these estimators fora given parallax and uncertainty For the Bayesian inference es-timations using the mode and the median are provided togetherwith a 90 confidence interval The tutorial also provides a li-brary pyrallaxes with the implementation of all these estima-tors The library can easily be customised to implement otherpriors for the Bayesian inference

Additionally an implementation of the Bayesian distanceestimator using the Exponentially Decreasing Space Densityprior introduced in Bailer-Jones (2015) will be available inTopCat (httpwwwstarlinkacuktopcat) and Stilts(httpwwwstarlinkacukstilts) from respectivelyversions 46 and 31-3 onwards

52 Inferring the distance to a single source using just theparallax

The issues surrounding the use of a parallax to infer a distancewere explored in Bailer-Jones (2015) and applied to simulatedGaia data in Astraatmadja amp Bailer-Jones (2016a) and to TGAS(Gaia DR1 ) in Astraatmadja amp Bailer-Jones (2016b) A

tutorial exploring this is provided at httpsgithubcomagabrownastrometry-inference-tutorialstreemastersingle-sourcetutorial This can be used toinvestigate how the posterior depends on the prior and the dataIt also includes a simple example of a hierarchical model toavoid specifying the exact length scale of a distance prior

53 Inferring the distance to and size of a cluster using justthe parallaxes

In many applications we are more interested in the averagedistance to a cluster (or a group of stars) rather than to theindividual stars In this case a basic mistake to be avoidedis estimating the individual distances (whatever the method)and then averaging these individual values A more correctapproach is to average the individual parallaxes and then toobtain a distance from this average value However a muchbetter solution is to set up a model for the cluster whichwould use as parameters the distance to its centre for exampleand some measure of its size and to infer its parametersThis is explored in the tutorial at httpsgithubcomagabrownastrometry-inference-tutorialstreemastermultiple-source This introduces the overall prob-lem and derives a general solution Code is implemented forthe specific case of a model which assumes a random isotropicdistribution of the true stars from the centre of the cluster Thismodel has two parameters the cluster distance and the clustersize The solution uses a small angle approximation to makethe problem simpler although it is easily extended to the caseof clusters with a significant angular extent It is applied to thePleiades selection from the Gaia DR1 main release paper(Gaia Collaboration et al 2016) The tutorial also considers theproblem of how to accommodate correlations in the measuredparallaxes of different stars Finally it also shows the resultsfrom a classical and a naive combination of stellar parallaxesto estimate the cluster distance The combination of parallaxesand proper motions of individual stars in a cluster into a singlesolution for the mean parallax and proper motion is treated asan iterative least squares problem in Gaia Collaboration et al(2017b) (see their Appendix A for details)

54 Inferring the distance and velocity of a source using theparallax and proper motions

The velocity (speed and direction) of a source in the planeof the sky can be inferred from measurements of its parallaxand two proper motions The uncertainties in all three affectthe inferred velocity and its uncertainty Moreover as theGaia parallaxes and proper motions generally have non-zerocorrelations these must also be taken into account Thiscan be done in a straightforward manner in the Bayesianapproach as is shown in the tutorial at httpsgithubcomagabrownastrometry-inference-tutorialstreemaster3d-distance This sets up a three-parametermodel (distance speed angle) for a source Using the threemeasurements (parallax two proper motions) in a multivariateGaussian likelihood and suitable priors on the parameters wecan compute the trivariate posterior This is sampled in theposterior using an MCMC algorithm for a set of stars

Article number page 17 of 20

AampA proofs manuscript no 32964_Arxiv

55 Luminosity calibration

In this tutorial (httpsgithubcomagabrownastrometry-inference-tutorialstreemasterluminosity-calibration) the problem of inferring (orcalibrating) the mean absolute magnitude of a specific classof stars is treated The measurements at hand are the parallaxand apparent magnitude for each of the stars in the sampleand the task is to infer their mean absolute magnitude microM andthe spread σM around this mean This is very similar to theproblem that Lutz amp Kelker (1973) and Turon Lacarrieu ampCreacutezeacute (1977) treated and a Bayesian approach to solving thisproblem was presented by Brown (2012) (albeit with the use ofimproper priors which we again note is bad practice) A morecomplex version of this problem (accounting for extinction anda contaminating population of stars) and its Bayesian solutionwas presented in Hawkins et al (2017) In this tutorial threeimportant points are illustrated

ndash Often the explicit computation of the distances to stars is notof interest In this example only the mean absolute magni-tude of the stars is to be estimated and the forward mod-elling approach as part of the Bayesian inference avoids theneed to calculate or estimate distances

ndash The data for all the stars carry information on the mean ab-solute magnitude including the negative parallaxes or par-allaxes with large relative errors This information can nat-urally be incorporated in a forward modelling approach (inthis example as part of a Bayesian inference method) thusavoiding the introduction of truncation biases caused by theselection of stars with lsquogoodrsquo parallaxes

ndash If the selection function is known (in this example the surveyis magnitude limited) it can and should be included in theforward modelling This accounts for sample selection biasesthat would otherwise occur

56 Period-luminosity relation

In this tutorial (httpsgithubcomagabrownastrometry-inference-tutorialstreemasterperiod-luminosity-relation) we include a hierarchicalmodel to infer period-luminosity-metallicity relations forclassical pulsating stars The full model can be applied tofundamental mode RR Lyrae stars and the abridged version(without the metallicity dependence) is suitable for samples ofclassical Cepheids We include the data set for the RR Lyraestars described and used for inference in Gaia Collaborationet al (2017a) and Delgado et al (2018) It contains a sampleof 200 stars (including fundamental radial pulsators but alsofundamentalised first overtone pulsators) with measured pe-riods apparent magnitudes in the K-band metallicities andparallaxes from the TGAS catalogue In the tutorial we describethe hierarchical model and discuss potential biases in the dataset Finally we analyse the sensitivity of the results to differentchoices of the priors and related parameters

6 Conclusions

Gaia data releases will provide a huge increase of astrometricdata available for the scientific community More than a billionparallaxes and proper motions allow new openings into many as-tronomical topics In most cases astronomers are exploiting theGaia catalogues to obtain physical quantities such as distanceand velocity Although it is easy to extract from the Gaia data it

is well known that direct inversion of parallax will lead to biaseswhich become more and more significant the larger the relativeparallax uncertainty While Gaia will provide high-quality astro-metric measurements hundreds of millions of stars have preci-sions which require proper statistical treatment in order to avoidbiased conclusions The aim of this paper is to guide the users ofGaia data to handle astrometric data correctly

In this study we summarise methods used to avoid bi-ases when converting astrometric data into physical quantitiesStarting from simple non-recommended sample truncation tomore complex methods the biases associated with the meth-ods are presented The basic recommendation is to treat deriva-tion of physical quantities from astrometric measurements asan inference problem which should be preferably handled withBayesian approach The recommended methods are described inSect 4 with a summary in Sect 43 To aid the users furtherSect 5 contains practical examples with links to Python and Rcode

Gaia will provide fundamental data for many fields of as-tronomy Further data releases will provide more data and moreprecise data Nevertheless for full use of the potential it will al-ways be necessary to pay careful attention to the statistical treat-ment of parallaxes and proper motions The purpose of this paperis to help astronomers find the correct approachAcknowledgements This work has made use of results from the European SpaceAgency (ESA) space mission Gaia the data from which were processed by theGaia Data Processing and Analysis Consortium (DPAC) Funding for the DPAChas been provided by national institutions in particular the institutions partici-pating in the Gaia Multilateral Agreement The Gaia mission website is httpwwwcosmosesaintgaia The authors are current or past members of theESA Gaia mission team and of the Gaia DPAC This work was supported by theMINECO (Spanish Ministry of Economy) through grant ESP2016-80079-C2-1-R (MINECOFEDER UE) and ESP2014-55996-C2-1-R (MINECOFEDERUE) and MDM-2014-0369 of ICCUB (Unidad de Excelencia lsquoMariacutea de Maeztursquo)and by the DLR (German space agency) via grant 50 QG 1403

ReferencesAnderson L Hogg DW Leistedt B Price-Whelan AM Bovy J Jun 2017

ArXiv e-printsArenou F Luri X 1999 In Egret D Heck A (eds) Harmonizing Cosmic

Distance Scales in a Post-HIPPARCOS Era vol 167 of Astronomical Societyof the Pacific Conference Series 13ndash32

Arenou F Luri X Babusiaux C et al Feb 2017 AampA 599 A50Astraatmadja TL Bailer-Jones CAL Dec 2016a ApJ 832 137Astraatmadja TL Bailer-Jones CAL Dec 2016b ApJ 833 119Bailer-Jones CAL Oct 2015 PASP 127 994Bovy J Sep 2017 MNRAS 470 1360Brown AGA Nov 2012 Statistical AstrometryBrown AGA Arenou F van Leeuwen F Lindegren L Luri X Aug 1997

In Bonnet RM Hoslashg E Bernacca PL et al (eds) Hipparcos - Venice rsquo97vol 402 of ESA Special Publication 63ndash68

de Bruijne JHJ Hoogerwerf R de Zeeuw PT Feb 2001 AampA 367 111Delgado HE Sarro LM Clementini G Muraveva T Garofalo A Mar 2018

ArXiv e-printsEddington AS Mar 1913 MNRAS 73 359ESA (ed) 1997 The HIPPARCOS and TYCHO catalogues Astrometric and

photometric star catalogues derived from the ESA HIPPARCOS Space As-trometry Mission vol 1200 of ESA Special Publication (available online athttpswwwcosmosesainthipparcos)

Gaia Collaboration Brown AGA Vallenari A et al Nov 2016 AampA 595A2

Gaia Collaboration Clementini G Eyer L et al Sep 2017a AampA 605 A79Gaia Collaboration van Leeuwen F Vallenari A et al May 2017b AampA 601

A19Gaia Collaboration Arenou F Luri X et al Apr 2018a AampA (special issue

for Gaia DR2)Gaia Collaboration Brown AGA Vallenari A Prusti T et al Apr 2018b

AampA (special issue for Gaia DR2)Gaia Collaboration Lindegren L Hernaacutendez J et al Apr 2018c AampA (special

issue for Gaia DR2)

Article number page 18 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Hawkins K Leistedt B Bovy J Hogg DW Oct 2017 MNRAS 471 722Iorio G Belokurov V Erkal D et al Feb 2018 MNRAS 474 2142Leistedt B Hogg DW Dec 2017a AJ 154 222Leistedt B Hogg DW Mar 2017b ArXiv e-printsLindegren L Madsen S Dravins D Apr 2000 AampA 356 1119Lindegren L Lammers U Hobbs D et al Feb 2012 AampA 538 A78Lindegren L Lammers U Bastian U et al Nov 2016 AampA 595 A4Lutz TE Kelker DH Oct 1973 PASP 85 573Malmquist GK Mar 1920 Meddelanden fran Lunds Astronomiska Observato-

rium Serie II 22 3Palmer M Arenou F Luri X Masana E Apr 2014 AampA 564 A49Robin C Luri X Reyleacute C et al 2012 AampA 543 A100Schoumlnrich R Aumer M Dec 2017 MNRAS 472 3979Schoumlnrich R Binney J Asplund M Feb 2012 MNRAS 420 1281Schroumlder SE Kaper L Lamers HJGLM Brown AGA Dec 2004 AampA

428 149Sesar B Fouesneau M Price-Whelan AM et al Apr 2017 ApJ 838 107Smith H Jr Eichhorn H Jul 1996 MNRAS 281 211Trumpler RJ Weaver HF 1953 Statistical astronomyTuron Lacarrieu C Creacutezeacute M Apr 1977 AampA 56 273Zinn JC Huber D Pinsonneault MH Stello D Aug 2017 ApJ 844 166

Article number page 19 of 20

AampA proofs manuscript no 32964_Arxiv

Appendix A Description of the simulated samplesused in this paper

The data used in this paper is from a Gaia Universe Model Snap-shot (GUMS) simulation (Robin C et al 2012) together withthe Gaia-like uncertainties and an estimation of the observabledata The uncertainties were computed from an implementationof the recipes described in the Gaia Science performance webpage4 provided by the python PyGaia toolkit5 which then haveto be re-scaled to fit the Gaia DR2 expectations

The simulation contains around 109 sources including onlysingle stars ie stars not belonging to multiple systems up toG lt 20 magnitude distributed as shown in Fig A1

0 5 10 15 20 25G magnitude [mag]

100

101

102

103

104

105

106

107

108

109

Num

ber

of

sourc

es

[log]

G magnitude histogram

Fig A1 Histogram of the starsrsquo G magnitude The total number ofsources is 1 069 138 714 which are distributed in bins of size ∆G = 02the sample is limited to G lt 20

To compute the Gaia potential observables standard uncer-tainties must be added to the simulation Because astrometricquantities (positions parallaxes and proper motions) are relateda single formalism to derive its standard errors is needed PyGaiaimplements a simple performance model depending on the VndashICcolour term and the G magnitude to estimate the end-of-missionerrors for the parallax uncertainty

σ$[microas] =radicminus1631 + 680766 z + 32732 z2 [0986+(1 minus 0986) (V-IC)]

(A1)

with

z = max(1004(1209minus15)1004(Gminus15)) (A2)

(see the Gaia science performance web for more details) It alsotakes into account the variation of the uncertainties over the skybecause of the scanning law tabulated as a function of the eclip-tic latitude6 β

However these are end-of-mission uncertainties so theyhave to be scaled by the fraction of mission time completed in or-der to get an estimation of them for Gaia DR2 In the case of the

parallax only the factorradic

5L being L the mission time (years)

included in Gaia DR2 need to be applied this error model is

4 httpswwwcosmosesaintwebgaiascience-performance5 httpspypipythonorgpypiPyGaia6 httpswwwcosmosesaintwebgaiatable-2-with-ascii

further described in Arenou et al (2017) In our case we havealso updated the calibration floor to take into account the prop-erties of the Gaia DR2 formal errors as shown in Fig A2 Thiscalibration floor introduces a minimum formal error stemmingfrom the fact that the calibrations used in the data processing(models and parameters) are at this stage still being refined Thisfloor affects mainly bright stars while for faint stars the photonnoise dominates Fig A2 shows the model used in the simula-tion for the Gaia DR2 parallax uncertainties as a function of theG magnitude

Fig A2 Average estimated errors as function of G The dash-dottedline represents the uncertainty at Gaia DR2 while the dashed linerepresents the end-of-mission uncertainties The solid line representsGaia DR2 errors including systematics

Article number page 20 of 20

  • 1 Introduction
  • 2 Gaia astrometric data
    • 21 Uncertainties
    • 22 Correlations
    • 23 Systematic errors
    • 24 Completeness
      • 3 Critical review of the traditional use of parallaxes
        • 31 Measurement of parallaxes
        • 32 Estimating distance by inverting the parallax
        • 33 Sample truncation
        • 34 Corrections and transformations
          • 341 The SmithndashEichhorn correction
          • 342 The LutzndashKelker correction
            • 35 Astrometry-based luminosity
              • 4 Recommendations for using astrometric data
                • 41 Using Gaia astrometric data how to proceed
                  • 411 Bayesian inference of distances from parallaxes
                  • 412 Bayesian inference of distances from parallaxes and complementary information
                    • 42 Absolute magnitudes tangential velocities and other derived quantities
                    • 43 Further recommendations
                      • 5 Using astrometric data practical examples
                        • 51 Comparison of distance and distance modulus estimators
                        • 52 Inferring the distance to a single source using just the parallax
                        • 53 Inferring the distance to and size of a cluster using just the parallaxes
                        • 54 Inferring the distance and velocity of a source using the parallax and proper motions
                        • 55 Luminosity calibration
                        • 56 Period-luminosity relation
                          • 6 Conclusions
                          • A Description of the simulated samples used in this paper
Page 5: Gaia Data Release 2: Using Gaia parallaxes · The Gaia Data Release 2 (Gaia DR2 ) Gaia Collaboration et al. (2018b) provides precise positions, proper motions, and paral-laxes for

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

In view of Fig 3 it is tempting to apply corrections to theρ estimator based on the value of the fractional parallax uncer-tainty f Unfortunately in order to do so we would need to knowthe true value of the parallax and f Using the apparent fractionaluncertainty fapp = σ$$ is not feasible since the denominatorin f (the true parallax) can be very close to zero so its distribu-tion has very extended wings and using fapp will often result ingross errors

Furthermore reporting a ρ value should always be accompa-nied by an uncertainty estimate usually the standard deviationof the estimator but the standard deviation or the variance is de-fined in terms of an unknown quantity $True In addition thelong tail shown in the right panel of Fig 3 makes the estimatesof the variance quickly become pathological as discussed below

In order to clarify the previous assertions we recall the clas-sical concept of bias because it plays a central role in the discus-sion that develops in this section In statistics an estimator is saidto be biased if its expected value differs from the true value Inour case we aim to infer the true value of the parallax $True (oralternatively related quantities such as the true distance r ab-solute magnitude luminosity or 3D velocity components) andwe aim to infer it from the measured parallax In the Gaia casethis measured parallax will be affected by quasi-Gaussian uncer-tainties (see Sect 21) In this case the expectation value of theobserved parallax coincides with the true value

E[$] =

int$p($|$True)middotd$ =

int$N($$True σ$)middotd$ = $True

(8)

where N($$True σ$) represents the Gaussian probability dis-tribution centred at the true parallax and with a standard devia-tion σ$ Hence the observed parallax is an unbiased estimatorof the true parallax (under the strong hypothesis that there are nosystematic biases associated with the survey and that the errorsare normally distributed)

Now in order to assess the bias of ρ = 1$ as an estimatorof the true distance we need to calculate its expected value

E[ρ] = E[1$] =

int1$middotp($|$True)middotd$ =

int1$middotN($True σ$)middotd$

(9)

This bias was approximated by Smith amp Eichhorn (1996) (seeSect 341) as a function of the fractional parallax uncertainty fusing a series expansion of the term in the integral and severalapproximations for the asymptotic regimes of small and largevalues of f and it indeed shows that the distance estimator 1$is unbiased for vanishingly small values of f but it rapidly be-comes significantly biased for values of f beyond 01 But notonly is 1$ a biased estimator of the true distance it is also ahigh-variance estimator The reason for this variance explosionis related to the long tail towards large distances illustrated in theright panel of Figs 3 and 4 Relatively large fractional uncertain-ties inevitably imply noise excursions in the parallax that resultin vanishingly small observed parallaxes and disproportionatedistances (and hence an inflation of the variance)

The effects discussed above can be illustrated with the use ofsimulated data Figure 4 shows the results of a simulation of ob-jects located between 05 and 2kpc where starting from the truedistances we have simulated observed parallaxes with a Gaus-sian uncertainty of σ$ = 03 mas and then calculated for eachobject ρ = 1$

The figure on the left shows that (by construction) the errorsin the observed parallaxes are well behaved and perfectly sym-metrical (Gaussian) while in the centre figure the errors in theestimation of distances using ρ show a strong asymmetry Thecharacteristics of these residuals depend on the distribution oftrue distances and uncertainties This is more evident in the fig-ure on the right where the true distance r is plotted against ρthere is a very prominent tail of overestimated distances and thedistribution is asymmetrical around the one-to-one line the moredistant the objects the more marked the asymmetry These fea-tures are very prominent because we have simulated objects sothat the relative errors in parallax are large but they are present(albeit at a smaller scale) even when the relative errors are small

The plots in Fig 4 correspond to a simple simulation with amild uncertainty σ$ = 03 mas Figure 5 shows the same plotsfor a realistic simulation of the Gaia DR2 data set The simula-tion is described in Appendix A in this case the errors in paral-lax follow a realistic model of the Gaia DR2 errors depicted inFig A2

As a summary we have seen in previous paragraphs thatthe naive approach of inverting the observed parallax has sig-nificant drawbacks we are forced to dispose of valuable data(non-positive parallaxes) and as an estimator ρ = 1$ is biasedand has a very high variance

33 Sample truncation

In addition to the potential sources of trouble described in theprevious sections the traditional use of samples of parallaxes in-cludes a practice that tends to aggravate these effects truncationof the used samples

As discussed in Sect 31 negative parallaxes are a naturalresult of the Gaia measurement process (and of astrometry ingeneral) Since inverting negative parallaxes leads to physicallymeaningless negative distances we are tempted to just get rid ofthese values and form a lsquocleanrsquo sample This results in a biasedsample however

On the one hand removing the negative parallaxes biases thedistribution of this parameter Consider for instance the case il-lustrated in Fig 1 for the quasars from the AllWISE catalogueThese objects have a near zero true parallax and the distribu-tion of its observed values shown in the figure corresponds tothis with a mean of minus10 microas close to zero However if we re-move the negative parallaxes from this sample deeming themlsquounphysicalrsquo the mean of the observed values would be signif-icantly positive about 08 mas This is completely unrealisticfor quasars in removing the negative parallaxes we have signif-icantly biased the observed parallax set for these objects Withsamples of other types of objects with non-zero parallaxes theeffect can be smaller but it will be present

On the other hand when by removing negative parallaxesthe contents of the sample are no longer representative of thebase population from which it has been extracted since stars withlarge parallaxes are over-represented and stars with small paral-laxes are under-represented This can be clearly illustrated us-ing a simulation We have generated a sample of simulated starsmimicking the contents of the full Gaia DR2 (see Appendix A)and truncated it by removing the negative parallaxes In Fig 6we can compare the distribution of the true distances of the orig-inal (non-truncated) sample and the resulting (truncated) sam-ple it is clear that after the removal of negative parallaxes wehave favoured the stars at short distances (large parallaxes) withrespect to the stars at large distances (small parallaxes) The spa-

Article number page 5 of 20

AampA proofs manuscript no 32964_Arxiv

Fig 4 Behaviour of PDF of ρ = 1$ as estimator of the true distance Left Histogram of differences between true parallaxes and observedparallaxes Centre Histogram of differences between true distances and their estimation using ρ Right Comparison of the true distances and theirestimations using ρ The observed parallaxes $ have been simulated using an uncertainty of σ$ = 03 mas

Fig 5 Behaviour of PDF of ρ = 1$ as estimator of the true distance for a simulation of the full Gaia DR2 data set Left Histogram of differencesbetween true parallaxes and observed parallaxes Centre Histogram of differences between true distances and their estimation using ρ RightComparison of the true distances and their estimations using ρ The observed parallaxes $ have been simulated using a realistic Gaia DR2 errormodel described in the Appendix The contour lines correspond to the distribution percentiles of 35 55 90 and 98

tial distribution of the sample has thus been altered and maytherefore bias any analysis based on it

Fig 6 Effect of removing the negative and zero parallaxes from a sim-ulation of Gaia DR2 Distribution of true distances histogram of dis-tances for the complete sample (thick line) histogram of distances forthe sample truncated by removing $ le 0 (thin line)

A stronger version of truncation that has traditionally beenapplied is to remove not only negative parallaxes but also all theparallaxes with a relative error above a given threshold k select-

ing σ$$

lt k This selection tends to favour the removal of starswith small parallaxes The effect is similar to the previous casebut more accentuated as can be seen in Fig 7 Again stars atshort distances are favoured in the sample with respect to distantstars

Fig 7 Effect of removing the positive parallaxes with a relative errorabove 50 as well as negative parallaxes Thick line histogram of dis-tances for the complete sample Thin line histogram of distances forthe sample truncated by removing objects with (|σ$

$| gt 05)

Article number page 6 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Even worse as in the previous case the truncation not onlymakes the distribution of true distances unrepresentative but italso biases the distribution of observed parallaxes stars withpositive errors (making the observed parallax larger than thetrue one) tend to be less removed than stars with negative er-rors (making the observed parallax smaller than the true one)By favouring positive errors with respect to negative errors weare also biasing the overall distribution of parallaxes Figure 8depicts this effect The plots show the difference $ minus$True as afunction of $True We can see in the middle and bottom figureshow the removal of objects is non-symmetrical around the zeroline so that the overall distribution of$minus$True becomes biasedFrom an almost zero bias for the full sample (as expected fromGaia in absence of systematics) we go to significant biases oncewe introduce the truncation and the bias is dependent of the cutvalue we introduce

Furthermore in Gaia the parallax uncertainties vary withthe object magnitude being larger for faint stars (see Fig A2)Therefore a threshold on the relative error will favour brightstars over the faint ones adding to the above described biases

Another type of truncation that has been traditionally appliedis to introduce limits in the observed parallax The effects of sucha limit are closely related to the LutzndashKelker bias discussed inSect 342 Suffice it here to illustrate the effect with a specificexample on a Gaia DR2 -like sample If we take the full sampleand remove stars with $ lt 02 mas we could imagine that weare roughly removing objects further away than 5 kpc Howeverthe net result is depicted in Fig 9 where we can see that insteadof the distribution of true distances of the complete sample up to5 kpc (solid line) we get a distribution with a lack of closer starsand a long tail of stars with greater distances A larger limit inparallax (shorter limiting distance) will produce a less prominenteffect since the relative errors in the parallax will be smaller butthe bias will be nonetheless present

In conclusion our advice to readers is to avoid introducingtruncations when using Gaia data since as illustrated above theycan strongly affect the properties of the sample and thereforeaffect the data analysis If truncation is unavoidable it should beincluded in the Bayesian modelling of the overall problem (seeSect 43)

34 Corrections and transformations

In this section we review proposals in the literature for the useof parallaxes to estimate distances In general they take the formof lsquoremediesrsquo to correct one or another problem on this use Herewe explain why these remedies cannot be recommended

341 The SmithndashEichhorn correction

Smith amp Eichhorn (1996) attempt to compensate for the bias in-troduced by the naive inversion of the observed parallax (andthe associated variance problem) in two different ways The firstinvolves transforming the measured parallaxes into a pseudo-parallax $lowast according to

$lowast equiv β middot σ$

1exp(φ) + exp(minus16$

σ$)

+ φ

(10)

where φ equiv ln(1 + exp( 2$σ$

))2 and β is an adjustable constantThe qualitative effect of the transformation is to map negativeparallaxes into the positive semi-axis R+ and to increase the

Fig 8 Differences between true and observed parallax effect of remov-ing observed negative parallaxes and those above a given relative errorWe start with a representative subsample of 1 million stars (top fig-ure) and truncate it according to the apparent relative parallax precisionTop Complete sample Mean difference $ minus$True is 155 times 10minus5 masMiddle Retaining only objects with positive parallaxes and |σ$

$| lt 05

The mean difference $minus$True is 0164 mas Bottom Retaining onlyobjects with positive parallaxes and |σ$

$| lt 02 The mean difference

$ minus$True is 0026 mas

value of small parallaxes until it asymptotically converges to themeasured value for large $ For $ = 0 φ = ln(2)2 and thepseudo-parallax value $lowast = β middot σ$

(1

1+exp(ln(2)2) +ln(2)

2

)has the

undesirable property of depending on the choice of β and on theparallax uncertainty Thus even in the case of a small parallaxmeasurement $ rarr 0 with a relatively small fractional parallax

Article number page 7 of 20

AampA proofs manuscript no 32964_Arxiv

Fig 9 Effect of introducing a limit on the observed parallax Thick linehistogram of distances up to 5 kpc for the complete sample Thin linehistogram of distances for the sample truncated by removing objectswith $ lt 02 mas (distance estimated as 1$ up to 5 kpc)

uncertainty (eg f = 01) we substitute a perfectly useful andvaluable measurement by an arbitrary value of $lowast

The SmithndashEichhorn transformation is an arbitrary (andrather convolved) choice amongst many such transformationsthat can reduce the bias for certain particular situations Both theanalytical expression and the choice of constants and β are theresult of an unspecified trial-and-error procedure the applicabil-ity of which is unclear Furthermore as stated by the authorsthey introduced a new bias because the transformed parallax isalways larger than the measured parallax This new bias has nophysical interpretation because it is the result of an ad hoc choicefor the analytical expression in Eq (10) It is designed to reducethe bias but it does so by substituting perfectly reasonable directmeasurements (negative and small parallaxes) that we can inter-pret and use for inference by constructed values arising from thechoice

342 The LutzndashKelker correction

Lutz amp Kelker (1973) realised that the spatial distribution ofsources around the observer together with the unavoidable ob-servational errors and a truncation of the sample on the valueof the observed parallax result in systematic biases in the av-erage parallax of certain stellar samples The bias described byLutz amp Kelker (1973) is a manifestation of the truncation bi-ases described above and can be understood if we look at a fewvery simple examples First let us imagine a density of sourcesaround the observer such that the distribution of true parallaxesp($True) is constant in a given interval and zero outside Let usalso imagine that the observation uncertainty σ$ is constant andequal to 03 mas The left panel of Fig 10 shows the distributionof true and observed parallaxes for a simulation of such a situ-ation and 106 sources We see that the observed parallaxes arealso approximately uniform and the departures from uniformityappear near the edges For a given bin of intermediate parallaxwe have as many sources contaminating from neighbouring binsas we have sources lost to other bins due to the observationaluncertainties and the result is a negligible net flux

In the middle panel we show the same histograms exceptthat instead of a constant parallax uncertainty we use a constantvalue of the fractional parallax uncertainty f This is still an un-realistic situation because the distribution of true parallaxes is

not uniform and also because a constant f implies that largerparallaxes are characterised by larger uncertainties However ithelps us to illustrate that even in the case of uniform true par-allaxes we may have a non-zero net flux of sources betweendifferent parallax bins depending on the distribution of parallaxuncertainties In this case noise shifts the value of large paral-laxes more than that of smaller parallaxes (because of the con-stant value of f ) so larger parallaxes get more scattered thussuppressing the distribution more at large parallaxes

Finally the right panel shows the same plot for a realisticdistribution of distances from a Gaia Universe Model Snapshot(GUMS) sample (see Appendix A for a full description) We seethat the effect of a realistic non-uniform distribution of parallaxesand parallax uncertainties results in a net flux in the oppositedirection (smaller true parallaxes become more suppressed andlarger parallax bins are enhanced in both cases the bins of neg-ative parallaxes become populated) This is the root of the LutzndashKelker bias It is important to distinguish between the LutzndashKelker bias and the LutzndashKelker correction The LutzndashKelkerbias is the negative difference for any realistic sample betweenthe average true parallax and the average measured parallax (iethe average true parallax is smaller than the average measuredparallax) This bias has been known at least since the work ofTrumpler amp Weaver (1953) although it was already discussed ina context different from parallaxes as early as Eddington (1913)The LutzndashKelker correction presented in Lutz amp Kelker (1973)and discussed below is an attempt to remedy this bias based ona series of assumptions

In the case of Gaussian uncertainties such as those describedin Eq (6) it is evident that the probability of measuring a value ofthe parallax greater than the true parallax p($ gt $True|$True) =05 The same value holds for the probability that $ lt $Truebecause the Gaussian distribution is symmetrical with respectto the true value of the parallax This is also true for the jointprobability of $ and $True

p($ gt $True) =

S

p($$True) middot d$ middot d$True = 05 (11)

where S is the region of the ($$True) plane where $ gt $TrueHowever the probability distribution of the true parallax

given the observed parallax p($True|$) does not fulfil this seem-ingly desirable property of probability mass equipartition at the$ = $True point We can write the latter probability as

p($True|$) =p($$True)

p($)=

p($|$True) middot p($True)p($)

(12)

using the product rule of probability In the context of inferringthe true parallax from the observed one Eq (12) is the well-known Bayesrsquo theorem where the left-hand side is the posteriorprobability p($ | $True) is the likelihood p($True) is the priorand p($) is the evidence For most realistic prior distributionsp($True) neither the median nor the mode or the mean of theposterior in Eq (12) is at $ = $True Let us take for examplea uniform volume density of sources out to a certain distancelimit In such a distribution the number of sources in a sphericalshell of infinitesimal width at a distance r scales as r2 as doesthe probability distribution of the distances Since

p(r) middot dr = p($True) middot d$True (13)

Article number page 8 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Uniform ϖTrue and constant σϖ

Parallax (mas)

Pro

babi

lity

minus002 002 006 010

05

1015

20Uniform ϖTrue and constant fTrue

Parallax (mas)

Den

sity

minus005 000 005 010 015 020

GUMS simulation

Parallax (mas)

Den

sity

minus1 0 1 2

01

23

45

Fig 10 Histograms of true (grey) and observed (blue) parallaxes for three simulations uniform distribution of parallaxes and constant σ$ = 03mas (left) uniform distribution of parallaxes and constant f = 05 (centre) and the GUMS simulation described in Appendix A (right)

the probability distribution for the true parallax in such a trun-cated constant volume density scenario is proportional to

p($True) prop $minus4True (14)

out to the truncation radius Hence for Gaussian distributed un-certainties we can write p($True|$) as

p($True|$) prop1σ$middot exp(

minus($ minus$True)2

2σ$) middot$minus4

True (15)

The joint distribution p($$True) (ie the non-normalisedposterior plotted as a 2D function of data parallax and param-eter $True) for this particular case of truncated uniform stellarvolume densities is depicted in Fig 11 together with the con-ditional distributions for particular values of $ and $True Itshows graphically the symmetry of the probability distributionp($|$True) (with respect to $True) and the bias and asymmetryof p($True|$)

Lutz amp Kelker (1973) obtain Eq (15) in their Sect ii underthe assumption of uniform stellar volume densities and constantfractional parallax uncertainties (constant f ) They discuss sev-eral distributions for different values of the ratio σ$$True Intheir Sect iii they use the expected value of the true parallaxgiven by the distribution p($True|$) in Eq (15) to infer the ex-pected value of the difference between the true absolute magni-tude MTrue and the value obtained with the naive inversion of theobserved parallax The expected value of this absolute magni-tude error is derived and tabulated for the distribution p($True|$)as a function of the fractional parallax uncertainty f This so-called LutzndashKelker correction is often applied to stellar samplesthat do not fulfil the assumptions under which it was derived be-cause the stellar volume density is far from uniform at scaleslarger than a few tens of parsecs and the samples to which thecorrection is applied are never characterised by a unique valueof f

p(ϖ

Tru

e|ϖ

)0

12

34

5

00 02 04 06 08 10

00 02 04 06 08 10

00

05

10

15

ϖTrue

ϖ

p(ϖ|ϖTrue)

piob

s

0 1 2 3 4

00

05

10

15

minus50

minus40

minus30

minus20

minus10

0

10

Fig 11 (Lower left) Joint probability distribution (in logarithmic scaleto improve visibility) for the random variables $ and $True in the sce-nario of a truncated uniform volume density The colour code is shownto the right of the lower right panel and white marks the region wherethe probability is zero (Lower right) Conditional probability distribu-tion of the observed parallax for $True=05 (Upper left) Conditionalprobability distribution for $True given an observed parallax $ Thefractional parallax uncertainty assumed for the computation of all prob-abilities is f = 02

35 Astrometry-based luminosity

An obvious way to avoid the problems associated with the naiveinversion of observed parallaxes (see Sect 341) is to remainin the space of parallaxes (as opposed to that of distances) in-sofar as this is possible One example of this approach is the

Article number page 9 of 20

AampA proofs manuscript no 32964_Arxiv

astrometry-based luminosity (ABL) method (Arenou amp Luri1999) originating from Malmquist (1920) The ABL methodconsists in substituting the absolute magnitudes by a proxy thatis linearly dependent on the parallax The original proposal was

aV equiv 1002MV = $10mV +5

5 (16)

and has been recently used to obtain maximum likelihood esti-mates of the period-luminosity relation coefficients for Cepheidsand RR Lyrae stars (Gaia Collaboration et al 2017a) and to im-prove the Gaia parallax uncertainties using deconvolved colour-magnitude diagrams as prior (Anderson et al 2017) The newastrometry-based luminosity depends linearly on the parallaxand thus its uncertainty can be expected to have an approxi-mately Gaussian distribution if the fractional uncertainty of theapparent magnitude is negligible This is more often the casethan for fractional parallax uncertainties and is in general a goodapproximation

Unfortunately the astrometry-based luminosity can only beapplied to the study of the luminosity and can do nothing forthe analysis of spatial distributions where distances or tangentialvelocities are inevitable ingredients

4 Recommendations for using astrometric data

In this section we provide specific advice on the use of astro-metric data in astronomical data analysis Although the focus ison the use of Gaia data many of the recommendations hold forthe analysis of astrometric data in general To illustrate the rec-ommendations we also provide a small number of worked ex-amples ranging from very basic demonstrations of the issuesmentioned in Sect 3 to full Bayesian analyses Some of theseexamples are available in the Gaia archive tutorial described inSect 5

41 Using Gaia astrometric data how to proceed

The fundamental quantity sought when measuring a stellar par-allax is the distance to the star in question However as discussedin the previous sections the quantity of interest has a non-linearrelation to the measurement r = 1$True and is constrainedto be positive while the measured parallax can be zero or evennegative Starting from a measured parallax which is normallydistributed about the true parallax this leads to a probability den-sity for the simple distance estimator ρ = 1$ (see Sect 3) forwhich the moments are defined in terms of unknown quantitiesThis means we cannot calculate the variance of the estimator orthe size of a possible bias which renders the estimator uselessfrom the statistical point of view

Our first and main recommendation is thus to alwaystreat the derivation of (astro-)physical parameters from as-trometric data in particular when parallaxes are involvedas an inference problem which should preferably be handledwith a full Bayesian approach

411 Bayesian inference of distances from parallaxes

The Bayesian approach to inference involves estimating a PDFover the quantity of interest given the observables In this casewe want to estimate the distance r given the parallax $ Afuller treatment of this problem has been given in Bailer-Jones

(2015) so only a brief summary will be given here Using Bayesrsquotheorem we can write the posterior as

P(r | $) =1Z

P($ | r)P(r) (17)

Formally everything is also conditioned on the parallax uncer-tainty σ$ and any relevant constraints or assumptions but sym-bols for these are omitted for brevity The quantity P($ | r) isthe likelihood from Eq (12) The prior P(r) incorporates ourassumptions and Z is a normalisation constant

In addition to the likelihood there are two important choiceswhich must be made to estimate a distance the choice of priorand the choice of estimator We will first focus on the formerand start discussing the simplest prior the uniform unboundedprior With a uniform boundless (and thus improper) prior ondistances the posterior is proportional to the likelihood and ifwe choose the mode of the posterior as our estimator then thesolution is mathematically equivalent to maximising the likeli-hood However a boundless uniform prior permits negative dis-tances which are non-physical so we should at least truncate itto exclude these values

The more measurements we have or the more precise themeasurements are the narrower the likelihood and the lower theimpact of the prior should be on the posterior It turns out how-ever that with the unbounded uniform prior the posterior is im-proper ie it is not normalisable Consequently the mean andmedian are not defined The only point estimator is the modeie rest = 1$ (the standard deviation and the other quantilesare likewise undefined) which is rather restrictive Finally thisBayesian distance estimate defined for an unbounded uniformprior reduces to the maximum likelihood estimate and coincideswith the naive inversion discussed in Sect 32 The posterioris ill-defined for the unbounded uniform prior for parallaxesThis prior describes an unrealistic situation where the observer isplaced at the centre of a distribution of sources that is sphericallysymmetric and the volume density of which decreases sharplywith distance

The solution to these problems (non-physical distances im-proper posterior) is to use a more appropriate prior The proper-ties of various priors and estimators have been studied by Bailer-Jones (2015) and Astraatmadja amp Bailer-Jones (2016b) The lat-ter makes a detailed study using a Milky Way model for a priorand also investigates how the estimates change when the Gaiaphotometric measurements are used in addition to the parallaxOne of the least informative priors we can use is the exponen-tially decreasing space density prior

P(r) =

1

2L3 r2eminusrL if r gt 0

0 otherwise (18)

For distances r L this corresponds to a constant space densityof stars with the probability dropping exponentially at distancesmuch larger than the mode (which is at 2L) Examples of theshape of the posterior for parallaxes of different precisions areshown in Bailer-Jones (2015) and Astraatmadja amp Bailer-Jones(2016b)

The posterior obtained for the prior defined in Eq (18) is nor-malised and thus we have a choice of point estimators (meanmedian or mode) Also the distribution is asymmetric andtwo quantiles (5 and 95) rather than the standard deviationare recommended to summarise the uncertainty in the point es-timate The median as a point estimate is guaranteed to lie

Article number page 10 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

between these quantiles Astraatmadja amp Bailer-Jones (2016b)used this prior as well as a Milky Way prior to infer distancesfor the two million TGAS stars in the first Gaia data releaseThe behaviour of the estimates derived from the exponentiallydecreasing space density prior can be explored using the interac-tive tool available in the tutorial described in Sect 51

In general the introduction of reasonable prior probabilitiesaccounts for the LutzndashKelker bias although the inevitable mis-match between the true distribution of parallaxes and the priorused will result in less accurate inferences In any case the ad-vantage with respect to the methods discussed in Sect 3 is cleari) we do not need to tabulate corrections for each prior assumingconstant f ii) we do not need to dispose of non-positive par-allaxes iii) we obtain a proper full posterior distribution withwell-defined moments and credible intervals iv) even simplepriors such as the exponential decreasing volume density willimprove our estimates with respect to the unrealistic prior un-derlying the maximum likelihood solution rest = 1$ and fi-nally v) we obtain estimators that degrade gracefully as the dataquality degrades and with credible intervals that grow with theobservational uncertainties until they reach the typical scales ofthe prior when the observations are non-informative These ad-vantages come at the expense of an inference that is more com-putationally demanding in general (as it requires obtaining theposterior and its summary statistics if needed) the need for athoughtful choice of a prior distribution and the analysis of theinfluence of the prior on the inference results

Figure 12 shows the distribution of means (left) modes (cen-tre) and medians (right) of the posteriors inferred for a simu-lation of 105 sources drawn from an exponentially decreasingspace density distribution This simulation represents the un-likely case where the prior is a perfect representation of the truedistribution

From a Bayesian perspective the full posterior PDF is thefinal result of our inference if we only use parallax measure-ments to infer the distance (see below) and further analysesshould make use of it as a whole This is our recommendationin general avoid expectations and summaries of the posteriorHowever it is often useful to compute summary statistics suchas the mean (expectation) median mode quantiles etc to havean approximate idea of the distribution but we should not usethese summaries for further inference for example to estimateabsolute magnitudes Cartesian velocities etc We recommendinferring the full posterior distributions for these derived quanti-ties using the posterior of the true parallax or of the distance orusing the same Bayesian scheme as for the true parallax as ex-plained in Sect 42 In Fig 13 we show the values of the mean(left) mode (centre) and median (right) that we would obtainfrom a set of 104 simulated observations of a star at 100 parsecswith f = 02 We assume a Gaussian distribution of the obser-vations around the true parallax The posterior distribution is in-ferred using Eq (17) and two priors a uniform volume densityof sources truncated at 1 kpc (results in grey) and a uniform den-sity of sources multiplied by an exponential decay of length scale200 pc as defined in Eq (18) (in blue) The expectation valuesof the histograms are shown as dashed lines of the same colourwith the true value (100 pc) shown as a red dashed line We seein general that i) the truncation has the effect of increasing thenumber of overestimated distances ii) the three estimators arebiased towards larger distances (smaller parallaxes) but the ex-pectation of the mode is significantly closer to the true valueand iii) the abrupt truncation of the prior results in a spuriouspeak of modes at the truncation distance as already discussed inBailer-Jones (2015)

Figure 14 and Table 1 show a comparison of the absolutevalue of the empirical bias and standard deviation associatedwith some distance estimators discussed in this paper as a func-tion of the measured fractional uncertainty in the parallax Wechose the measured value even though it is a very poor and non-robust estimator because as stated in Sect 32 we never have ac-cess to the true fractional parallax uncertainty This figure showsthe results obtained for 107 sources in the Gaia DR2 simulationdescribed in Appendix A for the maximum likelihood estimateρ = 1

$with and without the SmithndashEichhorn correction and for

the mode estimates based on the posterior distribution for twopriors (a uniform distance prior UD with maximum distancerlim = 100 kpc and an exponentially decreasing space densityprior EDSD with L = 135 kpc) neither of which matches thetrue distribution of sources in the simulation Only the mode ofthe posteriors is plotted (but not the mean or the median) for thesake of clarity The conclusions described next are only valid un-der the conditions of the exercise and are provided as a demon-stration of the caveats and problems described in previous sec-tions not as a recommendation of the mode of the posterior in-ferred under the EDSD prior as an estimator At the risk of re-peating ourselves we emphasise the need to adopt priors adaptedto the inference problem at hand Also the conclusions only holdfor the used simulation (where we generate the true distances andhence can calculate the bias and standard deviation) and need notbe representative of the true performance for the real Gaia dataset They can be summarised as follows

ndash the mode of the EDSD prior shows the smallest bias andstandard deviation in practically the entire range of estimatedfractional parallax uncertainties (in particular everywherebeyond the range of fapp represented in the plot)

ndash the SmithndashEichhorn estimate shows pathological biases andstandard deviations in the vicinity of the supposedly best-quality measurements at fapp = 0 Away from this region itprovides the next less biased estimates (averaged over binsof fapp) after the mode of the EDSD posterior

412 Bayesian inference of distances from parallaxes andcomplementary information

The methodology recommended in the previous paragraphs isuseful when we only have observed parallaxes to infer distancesThere are however common situations in astronomy where theparallaxes are only one of many observables and distances arenot the final goal of the inference problem but a means to achieveit In this context we recommend an extension of the classicalBayesian inference methods described in the previous sectionThese problems are characterised by a set of observables andassociated uncertainties (that include but are not restricted toparallaxes) and a series of parameters (the values of which areunknown a priori) with complex interdependence relationshipsamongst them Some of these parameters will be the ultimategoal of the inference process Other parameters do play an im-portant role but we are not interested in their particular valuesand we call them nuisance parameters following the literatureFor example in determining the shape of a stellar associationthe individual stellar distances are not relevant by themselvesbut only insomuch as we need them to achieve our objectiveWe show below how we deal with the nuisance parameters Theinterested reader can find applications of the methodology de-scribed in this section to inferring the coefficients of period-luminosity relations in Gaia Collaboration et al (2017a) andSesar et al (2017) Also the same methodology (a hierarchical

Article number page 11 of 20

AampA proofs manuscript no 32964_Arxiv

Mean

Cou

nts

minus8000 minus4000 0 2000

0e+

002e

minus04

4eminus

046e

minus04

8eminus

04

Mode

Cou

nts

minus8000 minus4000 0 2000

Median

Cou

nts

minus8000 minus4000 0 2000

0e+

002e

minus04

4eminus

046e

minus04

8eminus

04

Fig 12 Probability distribution of the residuals of the Bayesian estimate of the true distance in parsecs for 100000 simulated stars drawn from auniform density plus exponential decay distribution and σ$ = 3 middot 10minus1 mas (orange) 3 mas (blue) and 30 mas (grey)

Mean

Cou

nts

0 200 400 600 800 1000

000

00

005

001

00

015

002

0

Mode

Cou

nts

0 200 400 600 800 1000

Median

Cou

nts

0 200 400 600 800 1000

000

00

005

001

00

015

002

0

Fig 13 Distributions of means (left) modes (centre) and medians (right) for a series of 104 posteriors calculated for a star at a true distance of100 pc and f = 02 and observed parallaxes drawn at random from the corresponding Gaussian distribution Posteriors are inferred with a uniformdensity prior truncated at 1 kpc (grey) or with a uniform density with exponential decay prior and length scale 135 kpc (blue) The red vertical linemarks the true parallax the grey and blue lines correspond to the expected value (mean) of each distribution (same colours as for the histograms)

Bayesian model) is applied in Hawkins et al (2017) where theconstraint on the distances comes not from a period-luminosityrelation but from the relatively small dispersion of the abso-lute magnitudes and colour indices of red clump stars A lastexample of this methodology can be found in Leistedt amp Hogg(2017b) where the constraint comes from modelling the colour-magnitude diagram

Just as in the previous section where we aimed at estimat-ing distances from parallaxes alone the two key elements in thiscase are the definitions of a likelihood and a prior The likeli-hood represents the probability distribution of the observablesgiven the model parameters Typically the likelihood is basedon a generative or forward model for the data Such modelspredict the data from our assumptions about the physical pro-cess that generates the true values (ie the distribution of stars in

space) and our knowledge of the measurement process (eg jus-tifying the assumption of a normal distribution of the observedparallax around its true value) Forward models can be used togenerate arbitrarily large synthetic data sets for a given set ofthe parameters In this case however where we are concernedwith several types of measurements that depend on parametersother than the distance the likelihood term will be in generalmore complex than in Sect 411 and may include probabilisticdependencies between the parameters The term hierarchical ormulti-level model is often used to refer to this kind of model

Let us illustrate the concept of hierarchical models with asimple extension of the Bayesian model described in Sect 411where instead of assuming a fixed value of the prior length scaleL in Eq (18) we make it another parameter of the model and tryto infer it Let us further assume that we have a set of N parallax

Article number page 12 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Table 1 Average bias and standard deviation in three regimes of fapp for four distance estimators discussed in this paper (from left to right) themode of the posterior based on the exponentially decreasing space density (EDSD) prior the mode of the posterior of the uniform distance (UD)distribution the maximum likelihood estimate corrected according to Smith amp Eichhorn (1996) abbreviated as SE and the maximum likelihood(ML) estimate The wider ranges of fapp exclude narrower ranges shown in previous rows of the table

Summary fapp Range EDSD UD SE ML

Bias(-11) -02 97 342 -095(-55) -03 107 -034 -12(-5050) -03 162 -04 -38

Std Deviation(-11) 04 80 6858 05(-55) 04 84 05 195(-5050) 04 106 04 171

fapp

log 1

0(|B

ias|

)

minus4

minus2

02

4

Mode EDSDMode UDSmithminusEichhorn estimateML estimate

fapp

log 1

0(σ)

minus1

12

34

5

minus5 minus3 minus1 1 3 5

Mode EDSDMode UDSmithminusEichhorn estimateML estimate

Fig 14 Bias (top) and standard deviation (bottom) averaged over binsof the estimated fractional parallax uncertainty fapp for four estimatorsof the distance the maximum likelihood (ML) estimator rest = 1

$(or-

ange) the ML estimator corrected as described in Smith amp Eichhorn(1996) (light blue) the mode of the posterior obtained with an expo-nentially decreasing space density (EDSD) prior and L = 135 kpc (darkgreen) and the mode of the posterior obtained with a uniform distance(UD) prior truncated at 100 kpc (red)

measurements $k one for each of a sample of N stars In thiscase the likelihood can be written as

p($k|rk L) = p($k | rk) middot p(rk | L)

= p($k | rk) middotNprod

k=1

p(rk | L) (19)

where rk is the true unknown distance to the kth star We note thatvery often the measured parallaxes are assumed independentand thus p($k | rk) is written as the product

prodNk=1 p($k | rk)

This is incorrect in general for Gaia parallaxes because the par-allax measurements are not independent As described in GaiaCollaboration et al (2018c) and Sect 2 of this paper there areregional correlations amongst them (see Sect 43) but for thesake of simplicity let us assume the sample of N measurementsis spread all over the celestial sphere such that the correlationsaverage out and can be neglected Hence we write

p($k|rk L) =

Nprodk=1

p($k | rk) middot p(rk | L) (20)

Under the assumption of Gaussian uncertainties the first termin the product is given by Eq (6) while the second is given byEq (18)

This likelihood can be represented by a simple directed graph(see Fig 15) that provides information about the conditional de-pendencies amongst the parameters The shaded nodes representthe observations the open circles represent model parametersand the small black circles at the origin of the arrows representmodel constants The arrows denote conditional dependence re-lations and the plate notation indicates repetition over the mea-surements k

The next key element is as in Sect 411 the prior Accord-ing to Fig 15 the only parameter that needs a prior specificationis the one without a parent node L The rest of the arcs in thegraph are defined in the likelihood term (Eq (20)) If the sampleof N stars were representative of the inner Galactic halo for ex-ample we could use a Gaussian prior centred at asymp 30 kpc (seeeg Iorio et al 2018 and references therein) Such a hierarchicalmodel can potentially shrink the individual parallax uncertaintiesby incorporating the constraint on the distribution of distances

If we are only interested in the individual distances rk wecan consider L as a nuisance parameter

p($Truek | $k) =

intp($Truek L | $k) middot dL (21)

=

intp($Truek | $k L) middot p(L | $k) middot dL

This integral (known as the marginalisation integral) allows us towrite the posterior we are interested in without having to fix thevalue of L to any particular value Depending on the objective ofthe inference we could have alternatively determined the poste-rior distribution of L by marginalising the individual distanceswith an N-dimensional integral over the rk parameters

Article number page 13 of 20

AampA proofs manuscript no 32964_Arxiv

L

rk

$ σ$

microL

σL

k = 1 2 N

Fig 15 Directed acyclic graph that represents a hierarchical Bayesianmodel of a set of N parallax measurements characterised by uncertain-ties σ$ and true distances drawn from an exponentially decreasing den-sity distribution of distances (see Eq (18)) The scale length of the ex-ponential decrease L is itself a model parameter that we can infer fromthe sample Its prior is defined in this case for the sake of simplicity asa Gaussian distribution of mean microL and standard deviation σL

In parameter spaces of dimensionality greater than 3ndash4 thecomputation of the possibly marginalised posteriors andor ev-idence requires efficient sampling schemes like those inspiredin Markov chain Monte Carlo (MCMC) methods to avoid largenumbers of calculations in regions of parameter space with neg-ligible contributions to the posterior This adds to the highercomputational burden of the Bayesian inference method men-tioned in the previous section

The previous simple example can be extended to includemore levels in the hierarchy and more importantly more pa-rameters and measurement types Section 55 and 56 developin greater detail two examples of hierarchical models of directapplicability in the Gaia context

42 Absolute magnitudes tangential velocities and otherderived quantities

The approaches described in the previous sections can be appliedto any quantity we want to estimate using the parallax For exam-ple if we want to infer the absolute magnitude MG then giventhe measured apparent magnitude G and line-of-sight extinctionAG the true parallax $True is related to MG via the conservationof flux

5 log$True = MG + AG minusG minus 5 (22)

Assuming for simplicity that G and AG are known Bayesrsquotheorem gives the posterior on MG as

P(MG | $G AG) =1Z

P($ | MG AGG)P(MG) (23)

where the likelihood is still the usual Gaussian distribution forthe parallax (Eq (6)) in which the true parallax is given byEq (22) As this expression is non-linear we again obtain anasymmetric posterior PDF over MG the exact shape of whichalso depends on the prior

The inference of other quantities can be approached in thesame way In general we need to consider a multi-dimensionallikelihood to accommodate the measurement uncertainties (andcorrelations) in all observed quantities For instance the set ofparameters θ = r v φ (distance tangential speed and direc-tion measured anticlockwise from north) can be inferred fromthe Gaia astrometric measurements o = $ microαlowast microδ (where microαlowastand microδ are the measured proper motions) using the likelihood

p(o | θ) = N(θΣ) =1

(2π)32|Σ|12exp

(minus

12

(o minus x)T Σminus1(o minus x))

(24)

where N denotes the Gaussian distribution Σ is the full (non-diagonal) covariance matrix provided as part of the Gaia DataRelease and

x =

(1r

v sin(φ)r

v cos(φ)

r

)(25)

is the vector of model parameters geometrically transformed intothe space of observables in noise-free conditions Equation (24)assumes correlated Gaussian uncertainties in the measurements

The posterior distribution can then be obtained by multiply-ing the likelihood by a suitable prior The simplest assumptionwould be a separable prior such that p(θ) = p(r) middot p(v) middot p(φ)where p(v) and p(φ) should reflect our knowledge about thedynamical properties of the population from where the sourceor sources were drawn (eg thin disk thick disk bulge halo)Again hierarchical models can be used in the analysis of sam-ples in order to infer the population properties (prior hyper-parameters) themselves

Similar procedures can be followed to infer kinematic ener-gies or full 3D velocities when the forward model is extendedwith radial velocity measurements

43 Further recommendations

In this subsection we provide some further recommendationsand guidance in the context of the Bayesian approach outlinedabove Although powerful inference with Bayesian methodsusually comes at a large computational cost Some of the rec-ommendations below can also be seen in the light of taking dataanalysis approaches that approximate the Bayesian methodologyand can be much faster

Where possible formulate the problem in the data spaceThe problems caused by the ill-defined uncertainties on quan-tities derived from parallaxes can be avoided by carrying outthe analysis in the data space where the behaviour of the un-certainties is well understood This means that the quantities tobe inferred are treated as parameters in a forward or generativemodel that is used to predict the data Some adjustment processthen leads to estimates of the parameters A very simple forwardmodelling example can be found in Schroumlder et al (2004) whostudied the luminosity calibrations of O-stars by predicting the

Article number page 14 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

expected Hipparcos parallaxes from the assumed luminosity cal-ibration and comparing those to the measured parallaxes A morecomplex example can be found in Lindegren et al (2000) whopresent a kinematic model for clusters which describes the ve-locity field of the cluster members and predicts the proper mo-tions accounting for the astrometric uncertainties and covari-ances As shown in previous sections the Bayesian approachnaturally lends itself to (and in fact requires) forward modellingof the data

Forward modelling has the added advantage that it forces usto consider the proper formulation of the questions asked fromthe astrometric data This naturally leads to the insight that oftenthe explicit knowledge of the distances to sources is not of inter-est For example in the Schroumlder et al (2004) case an assumedluminosity of the O-stars and their known apparent magnitudeis sufficient to predict the observed parallaxes In more complexanalyses the distances to sources can often be treated as nuisanceparameters which in a Bayesian setting can be marginalised outof the posterior

Use all relevant information Although the parallax has a di-rect relation to the distance of a star it is not the only measure-ment that contains distance information The apparent magni-tude and the colour of a star carry significant information on theplausible distances at which it can be located as the colour pro-vides information on plausible absolute magnitude values for thestar This is used in two papers (Leistedt amp Hogg 2017a An-derson et al 2017) in which the information contained in thephotometry of stars is combined with Gaia DR1 parallax in-formation to arrive at more precise representations of the colour-magnitude diagram than can be obtained through the parallaxesalone Similarly proper motions contain distance informationwhich can be used to good effect to derive more precise distancesor luminosities for open cluster members (eg de Bruijne et al2001 Gaia Collaboration et al 2017b) The Bayesian approachnaturally allows for the combination of different types of data inthe modelling of the problem at hand It should be emphasisedhowever that adding additional data necessitates increasing themodel complexity (eg to include the relation among apparentmagnitude absolute magnitude extinction and parallax) whichleads to the need to make more assumptions

Incorporate a proper prior Bailer-Jones (2015) discussed thesimplest case of inferring the distance to a source from a sin-gle observed parallax He showed that if the minimal prior thatr should be positive is used the resulting posterior is not nor-malisable and hence has no mean variance other momentsor percentiles This behaviour of the posterior is not limitedto inference of distance Examples of other quantities that arenon-linearly related to parallax are absolute magnitude (M prop

log10 $) tangential velocity (vT prop 1$) kinetic energy and an-gular momentum (both proportional to 1$2 when determinedrelative to the observer) In all these cases it can be shown thatthe posterior for an improper uniform prior is not normalisableand has no moments The conclusion is that a proper prior onthe parameters to be estimated should be included to ensure thatthe posterior represents a normalised probability distribution Ingeneral using unconstrained non-informative priors in Bayesiandata analysis (such as the one on r above) is bad practice In-evitably there is always a mismatch between the prior and thetrue distribution (if there were not there would be no need todo the inference) This will unavoidably lead to some biases al-though the better the data the smaller these will be We cannot

expect to do much better than our prior knowledge in case weonly have access to poor data The Bayesian approach guaran-tees a graceful transition of the posterior into the prior as thedata quality degrades

The above discussion raises the question of what priors to in-clude for a specific inference problem Simple recipes cannot begiven as the prior to be used depends on the problem at hand andthe information already available When deciding on a prior weshould keep in mind that some information is always availableEven if a parallax is only available for a single star we knowthat it cannot be located at arbitrary distances It is unlikely tobe much closer that 1 pc (although we cannot fully exclude thepresence of faint stars closer than Proxima Centauri) and it mustbe located at a finite distance (we can observe the star) Thiswould suggest a non-informative uniform prior on r with lib-eral lower and upper bounds (such that the prior is normalised)However as pointed out in Bailer-Jones (2015) a uniform distri-bution in r implies a space density of stars that falls of as 1r2

from the position of the Sun which is of course physically im-plausible Hence one should assume a reasonable distribution ofstars in space and construct the prior on r accordingly Bailer-Jones (2015) presents a prior derived from a uniform space den-sity with an exponential cut-off which was used in Astraatmadjaamp Bailer-Jones (2016b) to derive distances to stars for which par-allaxes are listed in Gaia DR1 This prior should not be usedindiscriminately at the very least we should carefully considerthe choice of the scale length L (or leave that as a parameter tobe estimated as described in Sect 412) and in most cases amore tailored prior based on our broad knowledge of the distri-bution of a given stellar population in the Milky Way would bebetter The tutorial cases introduced in the next section containsome more examples of priors on distance and other astrophysi-cal parameters

The next two items discuss simplifications to the Bayesianapproach that nevertheless need to be justified carefully

Maximum likelihood modelling We have seen that priors arethe bridges that allow us to go from the probability of the obser-vations given the unknown parameters to the desired probabil-ity of the parameters given the observations It is only based onthis probability distribution that we can make statements aboutcredible intervals (uncertainties) for the inferred quantities orselect amongst competing models (a topic that is beyond thescope of this paper) However if making prior-free inferences ispreferred then maximising the likelihood is the only alternativeThe kinematic modelling presented in Lindegren et al (2000)is a non-trivial example of this A more complex example canbe found in Palmer et al (2014) The ML approach just as theBayesian framework described above allows the combinationof different types of data and accounts for selection functionsor missing data We have seen in Sect 411 that the maximumlikelihood estimate of the distance given a single parallax mea-surement is ρ = 1

$and this is a poor estimator of the distance

except for subsets of very accurate measurements In generalthe Bayesian and the maximum likelihood estimates coincide inthe limit of very small uncertainties or infinite numbers of mea-surements In such limits the maximum likelihood estimate issimpler to obtain although its computational cost may still belarge as the ML method is often equivalent to a complex optimi-sation problem involving a multi-dimensional function of manyparameters

Article number page 15 of 20

AampA proofs manuscript no 32964_Arxiv

Selecting the lsquobestrsquo data Analyses that use parallax data areoften restricted to positive parallaxes with relative uncertaintiesbelow some limit (typically 20 ) This allows working in aregime where the uncertainties of derived quantities such as dis-tance tangential velocities luminosity etc are thought to bemanageable which allows working in the space of astrophysi-cal variables rather than the data Truncation on relative parallaxerror might be justified in an exploratory phase of the data anal-ysis however there are a number of reasons why this approachis almost never advisable Even at relative uncertainties below02 the quantities calculated from the parallax can be biased andsuffer from a large variance (see Bailer-Jones 2015) More im-portantly however the selection of lsquogoodrsquo parallaxes will biasthe sample studied to nearby andor bright representatives of anystellar population and the selection may lead to discarding avery large fraction of the potential sample Hence any inferencesbased on such data will be severely biased This is clearly illus-trated in Fig 7 where for an even less strict truncation of starswith a relative uncertainties below 50 the distribution of dis-tances of the resulting sample is clearly biased with respect theoriginal sample

Accounting for data selection and incompleteness Althoughthe Gaia survey is designed such that the only selection ofsources is that they are brighter than the survey limit at G = 207the combination of the onboard detection algorithm the Gaiascanning law and the filtering of results of insufficient qualityprior to a data release lead to a complex selection function espe-cially in the early data releases This selection function should betaken into account in any data analysis and this is most naturallydone as part of a Bayesian analysis where the selection func-tion is part of the forward model that predicts the data given themodel parameters Precise prescriptions of the selection func-tions are not foreseen to be part of the data release documenta-tion Hence selection function parameters need to be includedas part of the parameters inferred by the Bayesian analysis or ifthis is not possible the selection functions have to be borne inmind when interpreting the results

Covariances in the uncertainties All the uncertainties on theastrometric data quoted in the Gaia catalogue are presented asfull covariance matrices where the diagonal elements representthe standard uncertainties on the astrometric parameters and theoff-diagonal elements the covariances or correlations betweenthe uncertainties This amounts to treating the astrometric datavector as having been drawn from a multivariate normal distri-bution with a covariance matrix as given in the Gaia catalogueThe covariances are most easily handled in the data space as partof the likelihood (see Lindegren et al 2000 for an example inthe context of kinematic modelling) If the covariances in theastrometric uncertainties are not accounted for we can easily bemisled for example by spurious features in the inferred velocityfield of an open cluster (Brown et al 1997)

The uncertainties are also correlated from one source tothe next especially over small distances on the sky A studyof the star-to-star correlations in the parallax uncertainties inGaia DR1 was done for the Kepler field (Zinn et al 2017)where independent and precise asteroseismic distances to thestars are available enabling the authors to derive an expressionfor the correlation strength and spatial scale This expressioncan be used for studies of the Kepler field but care should betaken when extrapolating to other fields on the sky The func-tional form for the star-to-star correlations used by Zinn et al

(2017) could be introduced as part of the forward model withthe Zinn et al (2017) parameters as a good first guess

For Gaia DR1 the length scale for the star-to-star corre-lations was estimated to vary from subdegree scales to tens ofdegrees on the sky where Zinn et al (2017) derived the corre-lation function over length scales of sim 02 to sim 10 degrees ForGaia DR2 Gaia Collaboration et al (2018c) estimate that thespatial correlations extend over scales of below 1 degree to 10ndash20 degrees

Accounting for non-Gaussian andor systematic uncertain-ties Although the bulk of the sources in the Gaia catalogue havenormally distributed uncertainties there is a significant fractionfor which the uncertainties exhibit non-Gaussian behaviour (egwhen uncertainties are over- or underestimated) This can be ac-counted for in the data analysis by including the uncertainties aspart of the forward model (eg Sesar et al 2017) or by explicitlymodelling an outlier population Similarly systematic uncertain-ties can be included as part of the forward model For exampleSesar et al (2017) include a global parallax zero-point as part oftheir probabilistic model used to analyse the period-luminosityrelation of RR Lyrae stars with Gaia DR1 data An alternativeapproach to the investigation of systematics in the parallaxes (ordistance estimates obtained from photometry or spectroscopyfor example) is presented in Schoumlnrich et al (2012) and is ap-plied to Gaia DR1 in Schoumlnrich amp Aumer (2017) In this casewe can consider that for samples covering a significant fractionof the sky any systematic error in the estimated distances to thestars will show up as correlations in their 3D velocity compo-nents The presence of such correlations can be used to make in-ferences about systematic errors for example in the parallaxes

Systematic uncertainties are more difficult to handle as theymay show variations as a function of source brightness or ce-lestial position they may be correlated across neighbouringsources and they may not be well understood for a given earlyGaia data release In general the information needed to accu-rately model systematic uncertainties or uncertainty correlationsbetween sources may not be readily available This informationcan be obtained from a comparison between Gaia and otherhigh-precision data (eg Zinn et al 2017 Arenou et al 2017Gaia Collaboration et al 2018a) or by examining for exampleplots of the parallax or proper motion averaged over sky regionsfor samples where the true parallax or proper motion values canbe assumed to be known such as zero parallax and proper mo-tion for quasars (see Gaia Collaboration et al 2018c for exam-ples)

Two special cases should be mentioned when the sampleis well distributed over the sky we can safely assume that thelocal systematics vanish and that only the global parallax zero-point need to be subtracted locally we may be interested not bythe absolute value of the parallaxes but by the relative ones inwhich case the difference between parallaxes and their averageremoves part of the systematics

There is no general recipe for dealing with non-Gaussian un-certainties or correlated systematic uncertainties The main ad-vice we can give here is to proceed with the analysis of the astro-metric data as they are but to keep in mind the systematics andcorrelations discussed in Sect 2 when interpreting the resultsAlternatively the forward model can be extended to include sys-tematic and correlation terms for which parameters are also tobe estimated Such models can be guided by the studies of sys-tematic uncertainties mentioned above

Article number page 16 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Testing with simulations Finally we strongly advise that theinference problem at hand should be investigated through simu-lated data and that the simulations performed should be as closeas possible to the real data (in particular correctly modelling theuncertainties and selection effects) The simulations allow theanalysis method to be developed and tested for accuracy How-ever the performance of the analysis method should be inter-preted strictly in terms of how well the assumed model explainsthe simulated observed data That is we should not fall into thetrap of trying to tune the analysis method to get an answer that isas close to the lsquotruthrsquo as possible In real problems we can onlyjudge the adequacy of a model and its parameter values by howwell they predict the observed data (to within the observationaluncertainties it should be stressed as we should avoid lsquoover-fittingrsquo the data)

5 Using astrometric data practical examples

We introduce here a few worked examples to illustrate thepoints that were made in the previous section These ex-amples are available in full detail as online tutorials in theform of source code accompanied by much more exten-sive explanation than can be provided here The source codeand corresponding Python and R Notebooks can be foundat the following URL httpsgithubcomagabrownastrometry-inference-tutorials In all cases the readeris strongly encouraged to download the source code and exper-iment with modifications of the simulated input data andor theparameter choices in the inference methods

51 Comparison of distance and distance modulusestimators

The use of the Bayesian inference with non-informative priorsdescribed in Sect 411 is illustrated and implemented inthe following tutorial httpsgithubcomagabrownastrometry-inference-tutorialstreemastersingle-sourceGraphicalUserInterface The tutorialcompares the performance of Bayesian distance estimationmethods with the SmithndashEichhorn transformation (Smith ampEichhorn 1996) (Sect 341) and the naive parallax inversion

The tutorial contains a Graphical User Interface that easilyvisualises and compares the behaviour of all these estimators fora given parallax and uncertainty For the Bayesian inference es-timations using the mode and the median are provided togetherwith a 90 confidence interval The tutorial also provides a li-brary pyrallaxes with the implementation of all these estima-tors The library can easily be customised to implement otherpriors for the Bayesian inference

Additionally an implementation of the Bayesian distanceestimator using the Exponentially Decreasing Space Densityprior introduced in Bailer-Jones (2015) will be available inTopCat (httpwwwstarlinkacuktopcat) and Stilts(httpwwwstarlinkacukstilts) from respectivelyversions 46 and 31-3 onwards

52 Inferring the distance to a single source using just theparallax

The issues surrounding the use of a parallax to infer a distancewere explored in Bailer-Jones (2015) and applied to simulatedGaia data in Astraatmadja amp Bailer-Jones (2016a) and to TGAS(Gaia DR1 ) in Astraatmadja amp Bailer-Jones (2016b) A

tutorial exploring this is provided at httpsgithubcomagabrownastrometry-inference-tutorialstreemastersingle-sourcetutorial This can be used toinvestigate how the posterior depends on the prior and the dataIt also includes a simple example of a hierarchical model toavoid specifying the exact length scale of a distance prior

53 Inferring the distance to and size of a cluster using justthe parallaxes

In many applications we are more interested in the averagedistance to a cluster (or a group of stars) rather than to theindividual stars In this case a basic mistake to be avoidedis estimating the individual distances (whatever the method)and then averaging these individual values A more correctapproach is to average the individual parallaxes and then toobtain a distance from this average value However a muchbetter solution is to set up a model for the cluster whichwould use as parameters the distance to its centre for exampleand some measure of its size and to infer its parametersThis is explored in the tutorial at httpsgithubcomagabrownastrometry-inference-tutorialstreemastermultiple-source This introduces the overall prob-lem and derives a general solution Code is implemented forthe specific case of a model which assumes a random isotropicdistribution of the true stars from the centre of the cluster Thismodel has two parameters the cluster distance and the clustersize The solution uses a small angle approximation to makethe problem simpler although it is easily extended to the caseof clusters with a significant angular extent It is applied to thePleiades selection from the Gaia DR1 main release paper(Gaia Collaboration et al 2016) The tutorial also considers theproblem of how to accommodate correlations in the measuredparallaxes of different stars Finally it also shows the resultsfrom a classical and a naive combination of stellar parallaxesto estimate the cluster distance The combination of parallaxesand proper motions of individual stars in a cluster into a singlesolution for the mean parallax and proper motion is treated asan iterative least squares problem in Gaia Collaboration et al(2017b) (see their Appendix A for details)

54 Inferring the distance and velocity of a source using theparallax and proper motions

The velocity (speed and direction) of a source in the planeof the sky can be inferred from measurements of its parallaxand two proper motions The uncertainties in all three affectthe inferred velocity and its uncertainty Moreover as theGaia parallaxes and proper motions generally have non-zerocorrelations these must also be taken into account Thiscan be done in a straightforward manner in the Bayesianapproach as is shown in the tutorial at httpsgithubcomagabrownastrometry-inference-tutorialstreemaster3d-distance This sets up a three-parametermodel (distance speed angle) for a source Using the threemeasurements (parallax two proper motions) in a multivariateGaussian likelihood and suitable priors on the parameters wecan compute the trivariate posterior This is sampled in theposterior using an MCMC algorithm for a set of stars

Article number page 17 of 20

AampA proofs manuscript no 32964_Arxiv

55 Luminosity calibration

In this tutorial (httpsgithubcomagabrownastrometry-inference-tutorialstreemasterluminosity-calibration) the problem of inferring (orcalibrating) the mean absolute magnitude of a specific classof stars is treated The measurements at hand are the parallaxand apparent magnitude for each of the stars in the sampleand the task is to infer their mean absolute magnitude microM andthe spread σM around this mean This is very similar to theproblem that Lutz amp Kelker (1973) and Turon Lacarrieu ampCreacutezeacute (1977) treated and a Bayesian approach to solving thisproblem was presented by Brown (2012) (albeit with the use ofimproper priors which we again note is bad practice) A morecomplex version of this problem (accounting for extinction anda contaminating population of stars) and its Bayesian solutionwas presented in Hawkins et al (2017) In this tutorial threeimportant points are illustrated

ndash Often the explicit computation of the distances to stars is notof interest In this example only the mean absolute magni-tude of the stars is to be estimated and the forward mod-elling approach as part of the Bayesian inference avoids theneed to calculate or estimate distances

ndash The data for all the stars carry information on the mean ab-solute magnitude including the negative parallaxes or par-allaxes with large relative errors This information can nat-urally be incorporated in a forward modelling approach (inthis example as part of a Bayesian inference method) thusavoiding the introduction of truncation biases caused by theselection of stars with lsquogoodrsquo parallaxes

ndash If the selection function is known (in this example the surveyis magnitude limited) it can and should be included in theforward modelling This accounts for sample selection biasesthat would otherwise occur

56 Period-luminosity relation

In this tutorial (httpsgithubcomagabrownastrometry-inference-tutorialstreemasterperiod-luminosity-relation) we include a hierarchicalmodel to infer period-luminosity-metallicity relations forclassical pulsating stars The full model can be applied tofundamental mode RR Lyrae stars and the abridged version(without the metallicity dependence) is suitable for samples ofclassical Cepheids We include the data set for the RR Lyraestars described and used for inference in Gaia Collaborationet al (2017a) and Delgado et al (2018) It contains a sampleof 200 stars (including fundamental radial pulsators but alsofundamentalised first overtone pulsators) with measured pe-riods apparent magnitudes in the K-band metallicities andparallaxes from the TGAS catalogue In the tutorial we describethe hierarchical model and discuss potential biases in the dataset Finally we analyse the sensitivity of the results to differentchoices of the priors and related parameters

6 Conclusions

Gaia data releases will provide a huge increase of astrometricdata available for the scientific community More than a billionparallaxes and proper motions allow new openings into many as-tronomical topics In most cases astronomers are exploiting theGaia catalogues to obtain physical quantities such as distanceand velocity Although it is easy to extract from the Gaia data it

is well known that direct inversion of parallax will lead to biaseswhich become more and more significant the larger the relativeparallax uncertainty While Gaia will provide high-quality astro-metric measurements hundreds of millions of stars have preci-sions which require proper statistical treatment in order to avoidbiased conclusions The aim of this paper is to guide the users ofGaia data to handle astrometric data correctly

In this study we summarise methods used to avoid bi-ases when converting astrometric data into physical quantitiesStarting from simple non-recommended sample truncation tomore complex methods the biases associated with the meth-ods are presented The basic recommendation is to treat deriva-tion of physical quantities from astrometric measurements asan inference problem which should be preferably handled withBayesian approach The recommended methods are described inSect 4 with a summary in Sect 43 To aid the users furtherSect 5 contains practical examples with links to Python and Rcode

Gaia will provide fundamental data for many fields of as-tronomy Further data releases will provide more data and moreprecise data Nevertheless for full use of the potential it will al-ways be necessary to pay careful attention to the statistical treat-ment of parallaxes and proper motions The purpose of this paperis to help astronomers find the correct approachAcknowledgements This work has made use of results from the European SpaceAgency (ESA) space mission Gaia the data from which were processed by theGaia Data Processing and Analysis Consortium (DPAC) Funding for the DPAChas been provided by national institutions in particular the institutions partici-pating in the Gaia Multilateral Agreement The Gaia mission website is httpwwwcosmosesaintgaia The authors are current or past members of theESA Gaia mission team and of the Gaia DPAC This work was supported by theMINECO (Spanish Ministry of Economy) through grant ESP2016-80079-C2-1-R (MINECOFEDER UE) and ESP2014-55996-C2-1-R (MINECOFEDERUE) and MDM-2014-0369 of ICCUB (Unidad de Excelencia lsquoMariacutea de Maeztursquo)and by the DLR (German space agency) via grant 50 QG 1403

ReferencesAnderson L Hogg DW Leistedt B Price-Whelan AM Bovy J Jun 2017

ArXiv e-printsArenou F Luri X 1999 In Egret D Heck A (eds) Harmonizing Cosmic

Distance Scales in a Post-HIPPARCOS Era vol 167 of Astronomical Societyof the Pacific Conference Series 13ndash32

Arenou F Luri X Babusiaux C et al Feb 2017 AampA 599 A50Astraatmadja TL Bailer-Jones CAL Dec 2016a ApJ 832 137Astraatmadja TL Bailer-Jones CAL Dec 2016b ApJ 833 119Bailer-Jones CAL Oct 2015 PASP 127 994Bovy J Sep 2017 MNRAS 470 1360Brown AGA Nov 2012 Statistical AstrometryBrown AGA Arenou F van Leeuwen F Lindegren L Luri X Aug 1997

In Bonnet RM Hoslashg E Bernacca PL et al (eds) Hipparcos - Venice rsquo97vol 402 of ESA Special Publication 63ndash68

de Bruijne JHJ Hoogerwerf R de Zeeuw PT Feb 2001 AampA 367 111Delgado HE Sarro LM Clementini G Muraveva T Garofalo A Mar 2018

ArXiv e-printsEddington AS Mar 1913 MNRAS 73 359ESA (ed) 1997 The HIPPARCOS and TYCHO catalogues Astrometric and

photometric star catalogues derived from the ESA HIPPARCOS Space As-trometry Mission vol 1200 of ESA Special Publication (available online athttpswwwcosmosesainthipparcos)

Gaia Collaboration Brown AGA Vallenari A et al Nov 2016 AampA 595A2

Gaia Collaboration Clementini G Eyer L et al Sep 2017a AampA 605 A79Gaia Collaboration van Leeuwen F Vallenari A et al May 2017b AampA 601

A19Gaia Collaboration Arenou F Luri X et al Apr 2018a AampA (special issue

for Gaia DR2)Gaia Collaboration Brown AGA Vallenari A Prusti T et al Apr 2018b

AampA (special issue for Gaia DR2)Gaia Collaboration Lindegren L Hernaacutendez J et al Apr 2018c AampA (special

issue for Gaia DR2)

Article number page 18 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Hawkins K Leistedt B Bovy J Hogg DW Oct 2017 MNRAS 471 722Iorio G Belokurov V Erkal D et al Feb 2018 MNRAS 474 2142Leistedt B Hogg DW Dec 2017a AJ 154 222Leistedt B Hogg DW Mar 2017b ArXiv e-printsLindegren L Madsen S Dravins D Apr 2000 AampA 356 1119Lindegren L Lammers U Hobbs D et al Feb 2012 AampA 538 A78Lindegren L Lammers U Bastian U et al Nov 2016 AampA 595 A4Lutz TE Kelker DH Oct 1973 PASP 85 573Malmquist GK Mar 1920 Meddelanden fran Lunds Astronomiska Observato-

rium Serie II 22 3Palmer M Arenou F Luri X Masana E Apr 2014 AampA 564 A49Robin C Luri X Reyleacute C et al 2012 AampA 543 A100Schoumlnrich R Aumer M Dec 2017 MNRAS 472 3979Schoumlnrich R Binney J Asplund M Feb 2012 MNRAS 420 1281Schroumlder SE Kaper L Lamers HJGLM Brown AGA Dec 2004 AampA

428 149Sesar B Fouesneau M Price-Whelan AM et al Apr 2017 ApJ 838 107Smith H Jr Eichhorn H Jul 1996 MNRAS 281 211Trumpler RJ Weaver HF 1953 Statistical astronomyTuron Lacarrieu C Creacutezeacute M Apr 1977 AampA 56 273Zinn JC Huber D Pinsonneault MH Stello D Aug 2017 ApJ 844 166

Article number page 19 of 20

AampA proofs manuscript no 32964_Arxiv

Appendix A Description of the simulated samplesused in this paper

The data used in this paper is from a Gaia Universe Model Snap-shot (GUMS) simulation (Robin C et al 2012) together withthe Gaia-like uncertainties and an estimation of the observabledata The uncertainties were computed from an implementationof the recipes described in the Gaia Science performance webpage4 provided by the python PyGaia toolkit5 which then haveto be re-scaled to fit the Gaia DR2 expectations

The simulation contains around 109 sources including onlysingle stars ie stars not belonging to multiple systems up toG lt 20 magnitude distributed as shown in Fig A1

0 5 10 15 20 25G magnitude [mag]

100

101

102

103

104

105

106

107

108

109

Num

ber

of

sourc

es

[log]

G magnitude histogram

Fig A1 Histogram of the starsrsquo G magnitude The total number ofsources is 1 069 138 714 which are distributed in bins of size ∆G = 02the sample is limited to G lt 20

To compute the Gaia potential observables standard uncer-tainties must be added to the simulation Because astrometricquantities (positions parallaxes and proper motions) are relateda single formalism to derive its standard errors is needed PyGaiaimplements a simple performance model depending on the VndashICcolour term and the G magnitude to estimate the end-of-missionerrors for the parallax uncertainty

σ$[microas] =radicminus1631 + 680766 z + 32732 z2 [0986+(1 minus 0986) (V-IC)]

(A1)

with

z = max(1004(1209minus15)1004(Gminus15)) (A2)

(see the Gaia science performance web for more details) It alsotakes into account the variation of the uncertainties over the skybecause of the scanning law tabulated as a function of the eclip-tic latitude6 β

However these are end-of-mission uncertainties so theyhave to be scaled by the fraction of mission time completed in or-der to get an estimation of them for Gaia DR2 In the case of the

parallax only the factorradic

5L being L the mission time (years)

included in Gaia DR2 need to be applied this error model is

4 httpswwwcosmosesaintwebgaiascience-performance5 httpspypipythonorgpypiPyGaia6 httpswwwcosmosesaintwebgaiatable-2-with-ascii

further described in Arenou et al (2017) In our case we havealso updated the calibration floor to take into account the prop-erties of the Gaia DR2 formal errors as shown in Fig A2 Thiscalibration floor introduces a minimum formal error stemmingfrom the fact that the calibrations used in the data processing(models and parameters) are at this stage still being refined Thisfloor affects mainly bright stars while for faint stars the photonnoise dominates Fig A2 shows the model used in the simula-tion for the Gaia DR2 parallax uncertainties as a function of theG magnitude

Fig A2 Average estimated errors as function of G The dash-dottedline represents the uncertainty at Gaia DR2 while the dashed linerepresents the end-of-mission uncertainties The solid line representsGaia DR2 errors including systematics

Article number page 20 of 20

  • 1 Introduction
  • 2 Gaia astrometric data
    • 21 Uncertainties
    • 22 Correlations
    • 23 Systematic errors
    • 24 Completeness
      • 3 Critical review of the traditional use of parallaxes
        • 31 Measurement of parallaxes
        • 32 Estimating distance by inverting the parallax
        • 33 Sample truncation
        • 34 Corrections and transformations
          • 341 The SmithndashEichhorn correction
          • 342 The LutzndashKelker correction
            • 35 Astrometry-based luminosity
              • 4 Recommendations for using astrometric data
                • 41 Using Gaia astrometric data how to proceed
                  • 411 Bayesian inference of distances from parallaxes
                  • 412 Bayesian inference of distances from parallaxes and complementary information
                    • 42 Absolute magnitudes tangential velocities and other derived quantities
                    • 43 Further recommendations
                      • 5 Using astrometric data practical examples
                        • 51 Comparison of distance and distance modulus estimators
                        • 52 Inferring the distance to a single source using just the parallax
                        • 53 Inferring the distance to and size of a cluster using just the parallaxes
                        • 54 Inferring the distance and velocity of a source using the parallax and proper motions
                        • 55 Luminosity calibration
                        • 56 Period-luminosity relation
                          • 6 Conclusions
                          • A Description of the simulated samples used in this paper
Page 6: Gaia Data Release 2: Using Gaia parallaxes · The Gaia Data Release 2 (Gaia DR2 ) Gaia Collaboration et al. (2018b) provides precise positions, proper motions, and paral-laxes for

AampA proofs manuscript no 32964_Arxiv

Fig 4 Behaviour of PDF of ρ = 1$ as estimator of the true distance Left Histogram of differences between true parallaxes and observedparallaxes Centre Histogram of differences between true distances and their estimation using ρ Right Comparison of the true distances and theirestimations using ρ The observed parallaxes $ have been simulated using an uncertainty of σ$ = 03 mas

Fig 5 Behaviour of PDF of ρ = 1$ as estimator of the true distance for a simulation of the full Gaia DR2 data set Left Histogram of differencesbetween true parallaxes and observed parallaxes Centre Histogram of differences between true distances and their estimation using ρ RightComparison of the true distances and their estimations using ρ The observed parallaxes $ have been simulated using a realistic Gaia DR2 errormodel described in the Appendix The contour lines correspond to the distribution percentiles of 35 55 90 and 98

tial distribution of the sample has thus been altered and maytherefore bias any analysis based on it

Fig 6 Effect of removing the negative and zero parallaxes from a sim-ulation of Gaia DR2 Distribution of true distances histogram of dis-tances for the complete sample (thick line) histogram of distances forthe sample truncated by removing $ le 0 (thin line)

A stronger version of truncation that has traditionally beenapplied is to remove not only negative parallaxes but also all theparallaxes with a relative error above a given threshold k select-

ing σ$$

lt k This selection tends to favour the removal of starswith small parallaxes The effect is similar to the previous casebut more accentuated as can be seen in Fig 7 Again stars atshort distances are favoured in the sample with respect to distantstars

Fig 7 Effect of removing the positive parallaxes with a relative errorabove 50 as well as negative parallaxes Thick line histogram of dis-tances for the complete sample Thin line histogram of distances forthe sample truncated by removing objects with (|σ$

$| gt 05)

Article number page 6 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Even worse as in the previous case the truncation not onlymakes the distribution of true distances unrepresentative but italso biases the distribution of observed parallaxes stars withpositive errors (making the observed parallax larger than thetrue one) tend to be less removed than stars with negative er-rors (making the observed parallax smaller than the true one)By favouring positive errors with respect to negative errors weare also biasing the overall distribution of parallaxes Figure 8depicts this effect The plots show the difference $ minus$True as afunction of $True We can see in the middle and bottom figureshow the removal of objects is non-symmetrical around the zeroline so that the overall distribution of$minus$True becomes biasedFrom an almost zero bias for the full sample (as expected fromGaia in absence of systematics) we go to significant biases oncewe introduce the truncation and the bias is dependent of the cutvalue we introduce

Furthermore in Gaia the parallax uncertainties vary withthe object magnitude being larger for faint stars (see Fig A2)Therefore a threshold on the relative error will favour brightstars over the faint ones adding to the above described biases

Another type of truncation that has been traditionally appliedis to introduce limits in the observed parallax The effects of sucha limit are closely related to the LutzndashKelker bias discussed inSect 342 Suffice it here to illustrate the effect with a specificexample on a Gaia DR2 -like sample If we take the full sampleand remove stars with $ lt 02 mas we could imagine that weare roughly removing objects further away than 5 kpc Howeverthe net result is depicted in Fig 9 where we can see that insteadof the distribution of true distances of the complete sample up to5 kpc (solid line) we get a distribution with a lack of closer starsand a long tail of stars with greater distances A larger limit inparallax (shorter limiting distance) will produce a less prominenteffect since the relative errors in the parallax will be smaller butthe bias will be nonetheless present

In conclusion our advice to readers is to avoid introducingtruncations when using Gaia data since as illustrated above theycan strongly affect the properties of the sample and thereforeaffect the data analysis If truncation is unavoidable it should beincluded in the Bayesian modelling of the overall problem (seeSect 43)

34 Corrections and transformations

In this section we review proposals in the literature for the useof parallaxes to estimate distances In general they take the formof lsquoremediesrsquo to correct one or another problem on this use Herewe explain why these remedies cannot be recommended

341 The SmithndashEichhorn correction

Smith amp Eichhorn (1996) attempt to compensate for the bias in-troduced by the naive inversion of the observed parallax (andthe associated variance problem) in two different ways The firstinvolves transforming the measured parallaxes into a pseudo-parallax $lowast according to

$lowast equiv β middot σ$

1exp(φ) + exp(minus16$

σ$)

+ φ

(10)

where φ equiv ln(1 + exp( 2$σ$

))2 and β is an adjustable constantThe qualitative effect of the transformation is to map negativeparallaxes into the positive semi-axis R+ and to increase the

Fig 8 Differences between true and observed parallax effect of remov-ing observed negative parallaxes and those above a given relative errorWe start with a representative subsample of 1 million stars (top fig-ure) and truncate it according to the apparent relative parallax precisionTop Complete sample Mean difference $ minus$True is 155 times 10minus5 masMiddle Retaining only objects with positive parallaxes and |σ$

$| lt 05

The mean difference $minus$True is 0164 mas Bottom Retaining onlyobjects with positive parallaxes and |σ$

$| lt 02 The mean difference

$ minus$True is 0026 mas

value of small parallaxes until it asymptotically converges to themeasured value for large $ For $ = 0 φ = ln(2)2 and thepseudo-parallax value $lowast = β middot σ$

(1

1+exp(ln(2)2) +ln(2)

2

)has the

undesirable property of depending on the choice of β and on theparallax uncertainty Thus even in the case of a small parallaxmeasurement $ rarr 0 with a relatively small fractional parallax

Article number page 7 of 20

AampA proofs manuscript no 32964_Arxiv

Fig 9 Effect of introducing a limit on the observed parallax Thick linehistogram of distances up to 5 kpc for the complete sample Thin linehistogram of distances for the sample truncated by removing objectswith $ lt 02 mas (distance estimated as 1$ up to 5 kpc)

uncertainty (eg f = 01) we substitute a perfectly useful andvaluable measurement by an arbitrary value of $lowast

The SmithndashEichhorn transformation is an arbitrary (andrather convolved) choice amongst many such transformationsthat can reduce the bias for certain particular situations Both theanalytical expression and the choice of constants and β are theresult of an unspecified trial-and-error procedure the applicabil-ity of which is unclear Furthermore as stated by the authorsthey introduced a new bias because the transformed parallax isalways larger than the measured parallax This new bias has nophysical interpretation because it is the result of an ad hoc choicefor the analytical expression in Eq (10) It is designed to reducethe bias but it does so by substituting perfectly reasonable directmeasurements (negative and small parallaxes) that we can inter-pret and use for inference by constructed values arising from thechoice

342 The LutzndashKelker correction

Lutz amp Kelker (1973) realised that the spatial distribution ofsources around the observer together with the unavoidable ob-servational errors and a truncation of the sample on the valueof the observed parallax result in systematic biases in the av-erage parallax of certain stellar samples The bias described byLutz amp Kelker (1973) is a manifestation of the truncation bi-ases described above and can be understood if we look at a fewvery simple examples First let us imagine a density of sourcesaround the observer such that the distribution of true parallaxesp($True) is constant in a given interval and zero outside Let usalso imagine that the observation uncertainty σ$ is constant andequal to 03 mas The left panel of Fig 10 shows the distributionof true and observed parallaxes for a simulation of such a situ-ation and 106 sources We see that the observed parallaxes arealso approximately uniform and the departures from uniformityappear near the edges For a given bin of intermediate parallaxwe have as many sources contaminating from neighbouring binsas we have sources lost to other bins due to the observationaluncertainties and the result is a negligible net flux

In the middle panel we show the same histograms exceptthat instead of a constant parallax uncertainty we use a constantvalue of the fractional parallax uncertainty f This is still an un-realistic situation because the distribution of true parallaxes is

not uniform and also because a constant f implies that largerparallaxes are characterised by larger uncertainties However ithelps us to illustrate that even in the case of uniform true par-allaxes we may have a non-zero net flux of sources betweendifferent parallax bins depending on the distribution of parallaxuncertainties In this case noise shifts the value of large paral-laxes more than that of smaller parallaxes (because of the con-stant value of f ) so larger parallaxes get more scattered thussuppressing the distribution more at large parallaxes

Finally the right panel shows the same plot for a realisticdistribution of distances from a Gaia Universe Model Snapshot(GUMS) sample (see Appendix A for a full description) We seethat the effect of a realistic non-uniform distribution of parallaxesand parallax uncertainties results in a net flux in the oppositedirection (smaller true parallaxes become more suppressed andlarger parallax bins are enhanced in both cases the bins of neg-ative parallaxes become populated) This is the root of the LutzndashKelker bias It is important to distinguish between the LutzndashKelker bias and the LutzndashKelker correction The LutzndashKelkerbias is the negative difference for any realistic sample betweenthe average true parallax and the average measured parallax (iethe average true parallax is smaller than the average measuredparallax) This bias has been known at least since the work ofTrumpler amp Weaver (1953) although it was already discussed ina context different from parallaxes as early as Eddington (1913)The LutzndashKelker correction presented in Lutz amp Kelker (1973)and discussed below is an attempt to remedy this bias based ona series of assumptions

In the case of Gaussian uncertainties such as those describedin Eq (6) it is evident that the probability of measuring a value ofthe parallax greater than the true parallax p($ gt $True|$True) =05 The same value holds for the probability that $ lt $Truebecause the Gaussian distribution is symmetrical with respectto the true value of the parallax This is also true for the jointprobability of $ and $True

p($ gt $True) =

S

p($$True) middot d$ middot d$True = 05 (11)

where S is the region of the ($$True) plane where $ gt $TrueHowever the probability distribution of the true parallax

given the observed parallax p($True|$) does not fulfil this seem-ingly desirable property of probability mass equipartition at the$ = $True point We can write the latter probability as

p($True|$) =p($$True)

p($)=

p($|$True) middot p($True)p($)

(12)

using the product rule of probability In the context of inferringthe true parallax from the observed one Eq (12) is the well-known Bayesrsquo theorem where the left-hand side is the posteriorprobability p($ | $True) is the likelihood p($True) is the priorand p($) is the evidence For most realistic prior distributionsp($True) neither the median nor the mode or the mean of theposterior in Eq (12) is at $ = $True Let us take for examplea uniform volume density of sources out to a certain distancelimit In such a distribution the number of sources in a sphericalshell of infinitesimal width at a distance r scales as r2 as doesthe probability distribution of the distances Since

p(r) middot dr = p($True) middot d$True (13)

Article number page 8 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Uniform ϖTrue and constant σϖ

Parallax (mas)

Pro

babi

lity

minus002 002 006 010

05

1015

20Uniform ϖTrue and constant fTrue

Parallax (mas)

Den

sity

minus005 000 005 010 015 020

GUMS simulation

Parallax (mas)

Den

sity

minus1 0 1 2

01

23

45

Fig 10 Histograms of true (grey) and observed (blue) parallaxes for three simulations uniform distribution of parallaxes and constant σ$ = 03mas (left) uniform distribution of parallaxes and constant f = 05 (centre) and the GUMS simulation described in Appendix A (right)

the probability distribution for the true parallax in such a trun-cated constant volume density scenario is proportional to

p($True) prop $minus4True (14)

out to the truncation radius Hence for Gaussian distributed un-certainties we can write p($True|$) as

p($True|$) prop1σ$middot exp(

minus($ minus$True)2

2σ$) middot$minus4

True (15)

The joint distribution p($$True) (ie the non-normalisedposterior plotted as a 2D function of data parallax and param-eter $True) for this particular case of truncated uniform stellarvolume densities is depicted in Fig 11 together with the con-ditional distributions for particular values of $ and $True Itshows graphically the symmetry of the probability distributionp($|$True) (with respect to $True) and the bias and asymmetryof p($True|$)

Lutz amp Kelker (1973) obtain Eq (15) in their Sect ii underthe assumption of uniform stellar volume densities and constantfractional parallax uncertainties (constant f ) They discuss sev-eral distributions for different values of the ratio σ$$True Intheir Sect iii they use the expected value of the true parallaxgiven by the distribution p($True|$) in Eq (15) to infer the ex-pected value of the difference between the true absolute magni-tude MTrue and the value obtained with the naive inversion of theobserved parallax The expected value of this absolute magni-tude error is derived and tabulated for the distribution p($True|$)as a function of the fractional parallax uncertainty f This so-called LutzndashKelker correction is often applied to stellar samplesthat do not fulfil the assumptions under which it was derived be-cause the stellar volume density is far from uniform at scaleslarger than a few tens of parsecs and the samples to which thecorrection is applied are never characterised by a unique valueof f

p(ϖ

Tru

e|ϖ

)0

12

34

5

00 02 04 06 08 10

00 02 04 06 08 10

00

05

10

15

ϖTrue

ϖ

p(ϖ|ϖTrue)

piob

s

0 1 2 3 4

00

05

10

15

minus50

minus40

minus30

minus20

minus10

0

10

Fig 11 (Lower left) Joint probability distribution (in logarithmic scaleto improve visibility) for the random variables $ and $True in the sce-nario of a truncated uniform volume density The colour code is shownto the right of the lower right panel and white marks the region wherethe probability is zero (Lower right) Conditional probability distribu-tion of the observed parallax for $True=05 (Upper left) Conditionalprobability distribution for $True given an observed parallax $ Thefractional parallax uncertainty assumed for the computation of all prob-abilities is f = 02

35 Astrometry-based luminosity

An obvious way to avoid the problems associated with the naiveinversion of observed parallaxes (see Sect 341) is to remainin the space of parallaxes (as opposed to that of distances) in-sofar as this is possible One example of this approach is the

Article number page 9 of 20

AampA proofs manuscript no 32964_Arxiv

astrometry-based luminosity (ABL) method (Arenou amp Luri1999) originating from Malmquist (1920) The ABL methodconsists in substituting the absolute magnitudes by a proxy thatis linearly dependent on the parallax The original proposal was

aV equiv 1002MV = $10mV +5

5 (16)

and has been recently used to obtain maximum likelihood esti-mates of the period-luminosity relation coefficients for Cepheidsand RR Lyrae stars (Gaia Collaboration et al 2017a) and to im-prove the Gaia parallax uncertainties using deconvolved colour-magnitude diagrams as prior (Anderson et al 2017) The newastrometry-based luminosity depends linearly on the parallaxand thus its uncertainty can be expected to have an approxi-mately Gaussian distribution if the fractional uncertainty of theapparent magnitude is negligible This is more often the casethan for fractional parallax uncertainties and is in general a goodapproximation

Unfortunately the astrometry-based luminosity can only beapplied to the study of the luminosity and can do nothing forthe analysis of spatial distributions where distances or tangentialvelocities are inevitable ingredients

4 Recommendations for using astrometric data

In this section we provide specific advice on the use of astro-metric data in astronomical data analysis Although the focus ison the use of Gaia data many of the recommendations hold forthe analysis of astrometric data in general To illustrate the rec-ommendations we also provide a small number of worked ex-amples ranging from very basic demonstrations of the issuesmentioned in Sect 3 to full Bayesian analyses Some of theseexamples are available in the Gaia archive tutorial described inSect 5

41 Using Gaia astrometric data how to proceed

The fundamental quantity sought when measuring a stellar par-allax is the distance to the star in question However as discussedin the previous sections the quantity of interest has a non-linearrelation to the measurement r = 1$True and is constrainedto be positive while the measured parallax can be zero or evennegative Starting from a measured parallax which is normallydistributed about the true parallax this leads to a probability den-sity for the simple distance estimator ρ = 1$ (see Sect 3) forwhich the moments are defined in terms of unknown quantitiesThis means we cannot calculate the variance of the estimator orthe size of a possible bias which renders the estimator uselessfrom the statistical point of view

Our first and main recommendation is thus to alwaystreat the derivation of (astro-)physical parameters from as-trometric data in particular when parallaxes are involvedas an inference problem which should preferably be handledwith a full Bayesian approach

411 Bayesian inference of distances from parallaxes

The Bayesian approach to inference involves estimating a PDFover the quantity of interest given the observables In this casewe want to estimate the distance r given the parallax $ Afuller treatment of this problem has been given in Bailer-Jones

(2015) so only a brief summary will be given here Using Bayesrsquotheorem we can write the posterior as

P(r | $) =1Z

P($ | r)P(r) (17)

Formally everything is also conditioned on the parallax uncer-tainty σ$ and any relevant constraints or assumptions but sym-bols for these are omitted for brevity The quantity P($ | r) isthe likelihood from Eq (12) The prior P(r) incorporates ourassumptions and Z is a normalisation constant

In addition to the likelihood there are two important choiceswhich must be made to estimate a distance the choice of priorand the choice of estimator We will first focus on the formerand start discussing the simplest prior the uniform unboundedprior With a uniform boundless (and thus improper) prior ondistances the posterior is proportional to the likelihood and ifwe choose the mode of the posterior as our estimator then thesolution is mathematically equivalent to maximising the likeli-hood However a boundless uniform prior permits negative dis-tances which are non-physical so we should at least truncate itto exclude these values

The more measurements we have or the more precise themeasurements are the narrower the likelihood and the lower theimpact of the prior should be on the posterior It turns out how-ever that with the unbounded uniform prior the posterior is im-proper ie it is not normalisable Consequently the mean andmedian are not defined The only point estimator is the modeie rest = 1$ (the standard deviation and the other quantilesare likewise undefined) which is rather restrictive Finally thisBayesian distance estimate defined for an unbounded uniformprior reduces to the maximum likelihood estimate and coincideswith the naive inversion discussed in Sect 32 The posterioris ill-defined for the unbounded uniform prior for parallaxesThis prior describes an unrealistic situation where the observer isplaced at the centre of a distribution of sources that is sphericallysymmetric and the volume density of which decreases sharplywith distance

The solution to these problems (non-physical distances im-proper posterior) is to use a more appropriate prior The proper-ties of various priors and estimators have been studied by Bailer-Jones (2015) and Astraatmadja amp Bailer-Jones (2016b) The lat-ter makes a detailed study using a Milky Way model for a priorand also investigates how the estimates change when the Gaiaphotometric measurements are used in addition to the parallaxOne of the least informative priors we can use is the exponen-tially decreasing space density prior

P(r) =

1

2L3 r2eminusrL if r gt 0

0 otherwise (18)

For distances r L this corresponds to a constant space densityof stars with the probability dropping exponentially at distancesmuch larger than the mode (which is at 2L) Examples of theshape of the posterior for parallaxes of different precisions areshown in Bailer-Jones (2015) and Astraatmadja amp Bailer-Jones(2016b)

The posterior obtained for the prior defined in Eq (18) is nor-malised and thus we have a choice of point estimators (meanmedian or mode) Also the distribution is asymmetric andtwo quantiles (5 and 95) rather than the standard deviationare recommended to summarise the uncertainty in the point es-timate The median as a point estimate is guaranteed to lie

Article number page 10 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

between these quantiles Astraatmadja amp Bailer-Jones (2016b)used this prior as well as a Milky Way prior to infer distancesfor the two million TGAS stars in the first Gaia data releaseThe behaviour of the estimates derived from the exponentiallydecreasing space density prior can be explored using the interac-tive tool available in the tutorial described in Sect 51

In general the introduction of reasonable prior probabilitiesaccounts for the LutzndashKelker bias although the inevitable mis-match between the true distribution of parallaxes and the priorused will result in less accurate inferences In any case the ad-vantage with respect to the methods discussed in Sect 3 is cleari) we do not need to tabulate corrections for each prior assumingconstant f ii) we do not need to dispose of non-positive par-allaxes iii) we obtain a proper full posterior distribution withwell-defined moments and credible intervals iv) even simplepriors such as the exponential decreasing volume density willimprove our estimates with respect to the unrealistic prior un-derlying the maximum likelihood solution rest = 1$ and fi-nally v) we obtain estimators that degrade gracefully as the dataquality degrades and with credible intervals that grow with theobservational uncertainties until they reach the typical scales ofthe prior when the observations are non-informative These ad-vantages come at the expense of an inference that is more com-putationally demanding in general (as it requires obtaining theposterior and its summary statistics if needed) the need for athoughtful choice of a prior distribution and the analysis of theinfluence of the prior on the inference results

Figure 12 shows the distribution of means (left) modes (cen-tre) and medians (right) of the posteriors inferred for a simu-lation of 105 sources drawn from an exponentially decreasingspace density distribution This simulation represents the un-likely case where the prior is a perfect representation of the truedistribution

From a Bayesian perspective the full posterior PDF is thefinal result of our inference if we only use parallax measure-ments to infer the distance (see below) and further analysesshould make use of it as a whole This is our recommendationin general avoid expectations and summaries of the posteriorHowever it is often useful to compute summary statistics suchas the mean (expectation) median mode quantiles etc to havean approximate idea of the distribution but we should not usethese summaries for further inference for example to estimateabsolute magnitudes Cartesian velocities etc We recommendinferring the full posterior distributions for these derived quanti-ties using the posterior of the true parallax or of the distance orusing the same Bayesian scheme as for the true parallax as ex-plained in Sect 42 In Fig 13 we show the values of the mean(left) mode (centre) and median (right) that we would obtainfrom a set of 104 simulated observations of a star at 100 parsecswith f = 02 We assume a Gaussian distribution of the obser-vations around the true parallax The posterior distribution is in-ferred using Eq (17) and two priors a uniform volume densityof sources truncated at 1 kpc (results in grey) and a uniform den-sity of sources multiplied by an exponential decay of length scale200 pc as defined in Eq (18) (in blue) The expectation valuesof the histograms are shown as dashed lines of the same colourwith the true value (100 pc) shown as a red dashed line We seein general that i) the truncation has the effect of increasing thenumber of overestimated distances ii) the three estimators arebiased towards larger distances (smaller parallaxes) but the ex-pectation of the mode is significantly closer to the true valueand iii) the abrupt truncation of the prior results in a spuriouspeak of modes at the truncation distance as already discussed inBailer-Jones (2015)

Figure 14 and Table 1 show a comparison of the absolutevalue of the empirical bias and standard deviation associatedwith some distance estimators discussed in this paper as a func-tion of the measured fractional uncertainty in the parallax Wechose the measured value even though it is a very poor and non-robust estimator because as stated in Sect 32 we never have ac-cess to the true fractional parallax uncertainty This figure showsthe results obtained for 107 sources in the Gaia DR2 simulationdescribed in Appendix A for the maximum likelihood estimateρ = 1

$with and without the SmithndashEichhorn correction and for

the mode estimates based on the posterior distribution for twopriors (a uniform distance prior UD with maximum distancerlim = 100 kpc and an exponentially decreasing space densityprior EDSD with L = 135 kpc) neither of which matches thetrue distribution of sources in the simulation Only the mode ofthe posteriors is plotted (but not the mean or the median) for thesake of clarity The conclusions described next are only valid un-der the conditions of the exercise and are provided as a demon-stration of the caveats and problems described in previous sec-tions not as a recommendation of the mode of the posterior in-ferred under the EDSD prior as an estimator At the risk of re-peating ourselves we emphasise the need to adopt priors adaptedto the inference problem at hand Also the conclusions only holdfor the used simulation (where we generate the true distances andhence can calculate the bias and standard deviation) and need notbe representative of the true performance for the real Gaia dataset They can be summarised as follows

ndash the mode of the EDSD prior shows the smallest bias andstandard deviation in practically the entire range of estimatedfractional parallax uncertainties (in particular everywherebeyond the range of fapp represented in the plot)

ndash the SmithndashEichhorn estimate shows pathological biases andstandard deviations in the vicinity of the supposedly best-quality measurements at fapp = 0 Away from this region itprovides the next less biased estimates (averaged over binsof fapp) after the mode of the EDSD posterior

412 Bayesian inference of distances from parallaxes andcomplementary information

The methodology recommended in the previous paragraphs isuseful when we only have observed parallaxes to infer distancesThere are however common situations in astronomy where theparallaxes are only one of many observables and distances arenot the final goal of the inference problem but a means to achieveit In this context we recommend an extension of the classicalBayesian inference methods described in the previous sectionThese problems are characterised by a set of observables andassociated uncertainties (that include but are not restricted toparallaxes) and a series of parameters (the values of which areunknown a priori) with complex interdependence relationshipsamongst them Some of these parameters will be the ultimategoal of the inference process Other parameters do play an im-portant role but we are not interested in their particular valuesand we call them nuisance parameters following the literatureFor example in determining the shape of a stellar associationthe individual stellar distances are not relevant by themselvesbut only insomuch as we need them to achieve our objectiveWe show below how we deal with the nuisance parameters Theinterested reader can find applications of the methodology de-scribed in this section to inferring the coefficients of period-luminosity relations in Gaia Collaboration et al (2017a) andSesar et al (2017) Also the same methodology (a hierarchical

Article number page 11 of 20

AampA proofs manuscript no 32964_Arxiv

Mean

Cou

nts

minus8000 minus4000 0 2000

0e+

002e

minus04

4eminus

046e

minus04

8eminus

04

Mode

Cou

nts

minus8000 minus4000 0 2000

Median

Cou

nts

minus8000 minus4000 0 2000

0e+

002e

minus04

4eminus

046e

minus04

8eminus

04

Fig 12 Probability distribution of the residuals of the Bayesian estimate of the true distance in parsecs for 100000 simulated stars drawn from auniform density plus exponential decay distribution and σ$ = 3 middot 10minus1 mas (orange) 3 mas (blue) and 30 mas (grey)

Mean

Cou

nts

0 200 400 600 800 1000

000

00

005

001

00

015

002

0

Mode

Cou

nts

0 200 400 600 800 1000

Median

Cou

nts

0 200 400 600 800 1000

000

00

005

001

00

015

002

0

Fig 13 Distributions of means (left) modes (centre) and medians (right) for a series of 104 posteriors calculated for a star at a true distance of100 pc and f = 02 and observed parallaxes drawn at random from the corresponding Gaussian distribution Posteriors are inferred with a uniformdensity prior truncated at 1 kpc (grey) or with a uniform density with exponential decay prior and length scale 135 kpc (blue) The red vertical linemarks the true parallax the grey and blue lines correspond to the expected value (mean) of each distribution (same colours as for the histograms)

Bayesian model) is applied in Hawkins et al (2017) where theconstraint on the distances comes not from a period-luminosityrelation but from the relatively small dispersion of the abso-lute magnitudes and colour indices of red clump stars A lastexample of this methodology can be found in Leistedt amp Hogg(2017b) where the constraint comes from modelling the colour-magnitude diagram

Just as in the previous section where we aimed at estimat-ing distances from parallaxes alone the two key elements in thiscase are the definitions of a likelihood and a prior The likeli-hood represents the probability distribution of the observablesgiven the model parameters Typically the likelihood is basedon a generative or forward model for the data Such modelspredict the data from our assumptions about the physical pro-cess that generates the true values (ie the distribution of stars in

space) and our knowledge of the measurement process (eg jus-tifying the assumption of a normal distribution of the observedparallax around its true value) Forward models can be used togenerate arbitrarily large synthetic data sets for a given set ofthe parameters In this case however where we are concernedwith several types of measurements that depend on parametersother than the distance the likelihood term will be in generalmore complex than in Sect 411 and may include probabilisticdependencies between the parameters The term hierarchical ormulti-level model is often used to refer to this kind of model

Let us illustrate the concept of hierarchical models with asimple extension of the Bayesian model described in Sect 411where instead of assuming a fixed value of the prior length scaleL in Eq (18) we make it another parameter of the model and tryto infer it Let us further assume that we have a set of N parallax

Article number page 12 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Table 1 Average bias and standard deviation in three regimes of fapp for four distance estimators discussed in this paper (from left to right) themode of the posterior based on the exponentially decreasing space density (EDSD) prior the mode of the posterior of the uniform distance (UD)distribution the maximum likelihood estimate corrected according to Smith amp Eichhorn (1996) abbreviated as SE and the maximum likelihood(ML) estimate The wider ranges of fapp exclude narrower ranges shown in previous rows of the table

Summary fapp Range EDSD UD SE ML

Bias(-11) -02 97 342 -095(-55) -03 107 -034 -12(-5050) -03 162 -04 -38

Std Deviation(-11) 04 80 6858 05(-55) 04 84 05 195(-5050) 04 106 04 171

fapp

log 1

0(|B

ias|

)

minus4

minus2

02

4

Mode EDSDMode UDSmithminusEichhorn estimateML estimate

fapp

log 1

0(σ)

minus1

12

34

5

minus5 minus3 minus1 1 3 5

Mode EDSDMode UDSmithminusEichhorn estimateML estimate

Fig 14 Bias (top) and standard deviation (bottom) averaged over binsof the estimated fractional parallax uncertainty fapp for four estimatorsof the distance the maximum likelihood (ML) estimator rest = 1

$(or-

ange) the ML estimator corrected as described in Smith amp Eichhorn(1996) (light blue) the mode of the posterior obtained with an expo-nentially decreasing space density (EDSD) prior and L = 135 kpc (darkgreen) and the mode of the posterior obtained with a uniform distance(UD) prior truncated at 100 kpc (red)

measurements $k one for each of a sample of N stars In thiscase the likelihood can be written as

p($k|rk L) = p($k | rk) middot p(rk | L)

= p($k | rk) middotNprod

k=1

p(rk | L) (19)

where rk is the true unknown distance to the kth star We note thatvery often the measured parallaxes are assumed independentand thus p($k | rk) is written as the product

prodNk=1 p($k | rk)

This is incorrect in general for Gaia parallaxes because the par-allax measurements are not independent As described in GaiaCollaboration et al (2018c) and Sect 2 of this paper there areregional correlations amongst them (see Sect 43) but for thesake of simplicity let us assume the sample of N measurementsis spread all over the celestial sphere such that the correlationsaverage out and can be neglected Hence we write

p($k|rk L) =

Nprodk=1

p($k | rk) middot p(rk | L) (20)

Under the assumption of Gaussian uncertainties the first termin the product is given by Eq (6) while the second is given byEq (18)

This likelihood can be represented by a simple directed graph(see Fig 15) that provides information about the conditional de-pendencies amongst the parameters The shaded nodes representthe observations the open circles represent model parametersand the small black circles at the origin of the arrows representmodel constants The arrows denote conditional dependence re-lations and the plate notation indicates repetition over the mea-surements k

The next key element is as in Sect 411 the prior Accord-ing to Fig 15 the only parameter that needs a prior specificationis the one without a parent node L The rest of the arcs in thegraph are defined in the likelihood term (Eq (20)) If the sampleof N stars were representative of the inner Galactic halo for ex-ample we could use a Gaussian prior centred at asymp 30 kpc (seeeg Iorio et al 2018 and references therein) Such a hierarchicalmodel can potentially shrink the individual parallax uncertaintiesby incorporating the constraint on the distribution of distances

If we are only interested in the individual distances rk wecan consider L as a nuisance parameter

p($Truek | $k) =

intp($Truek L | $k) middot dL (21)

=

intp($Truek | $k L) middot p(L | $k) middot dL

This integral (known as the marginalisation integral) allows us towrite the posterior we are interested in without having to fix thevalue of L to any particular value Depending on the objective ofthe inference we could have alternatively determined the poste-rior distribution of L by marginalising the individual distanceswith an N-dimensional integral over the rk parameters

Article number page 13 of 20

AampA proofs manuscript no 32964_Arxiv

L

rk

$ σ$

microL

σL

k = 1 2 N

Fig 15 Directed acyclic graph that represents a hierarchical Bayesianmodel of a set of N parallax measurements characterised by uncertain-ties σ$ and true distances drawn from an exponentially decreasing den-sity distribution of distances (see Eq (18)) The scale length of the ex-ponential decrease L is itself a model parameter that we can infer fromthe sample Its prior is defined in this case for the sake of simplicity asa Gaussian distribution of mean microL and standard deviation σL

In parameter spaces of dimensionality greater than 3ndash4 thecomputation of the possibly marginalised posteriors andor ev-idence requires efficient sampling schemes like those inspiredin Markov chain Monte Carlo (MCMC) methods to avoid largenumbers of calculations in regions of parameter space with neg-ligible contributions to the posterior This adds to the highercomputational burden of the Bayesian inference method men-tioned in the previous section

The previous simple example can be extended to includemore levels in the hierarchy and more importantly more pa-rameters and measurement types Section 55 and 56 developin greater detail two examples of hierarchical models of directapplicability in the Gaia context

42 Absolute magnitudes tangential velocities and otherderived quantities

The approaches described in the previous sections can be appliedto any quantity we want to estimate using the parallax For exam-ple if we want to infer the absolute magnitude MG then giventhe measured apparent magnitude G and line-of-sight extinctionAG the true parallax $True is related to MG via the conservationof flux

5 log$True = MG + AG minusG minus 5 (22)

Assuming for simplicity that G and AG are known Bayesrsquotheorem gives the posterior on MG as

P(MG | $G AG) =1Z

P($ | MG AGG)P(MG) (23)

where the likelihood is still the usual Gaussian distribution forthe parallax (Eq (6)) in which the true parallax is given byEq (22) As this expression is non-linear we again obtain anasymmetric posterior PDF over MG the exact shape of whichalso depends on the prior

The inference of other quantities can be approached in thesame way In general we need to consider a multi-dimensionallikelihood to accommodate the measurement uncertainties (andcorrelations) in all observed quantities For instance the set ofparameters θ = r v φ (distance tangential speed and direc-tion measured anticlockwise from north) can be inferred fromthe Gaia astrometric measurements o = $ microαlowast microδ (where microαlowastand microδ are the measured proper motions) using the likelihood

p(o | θ) = N(θΣ) =1

(2π)32|Σ|12exp

(minus

12

(o minus x)T Σminus1(o minus x))

(24)

where N denotes the Gaussian distribution Σ is the full (non-diagonal) covariance matrix provided as part of the Gaia DataRelease and

x =

(1r

v sin(φ)r

v cos(φ)

r

)(25)

is the vector of model parameters geometrically transformed intothe space of observables in noise-free conditions Equation (24)assumes correlated Gaussian uncertainties in the measurements

The posterior distribution can then be obtained by multiply-ing the likelihood by a suitable prior The simplest assumptionwould be a separable prior such that p(θ) = p(r) middot p(v) middot p(φ)where p(v) and p(φ) should reflect our knowledge about thedynamical properties of the population from where the sourceor sources were drawn (eg thin disk thick disk bulge halo)Again hierarchical models can be used in the analysis of sam-ples in order to infer the population properties (prior hyper-parameters) themselves

Similar procedures can be followed to infer kinematic ener-gies or full 3D velocities when the forward model is extendedwith radial velocity measurements

43 Further recommendations

In this subsection we provide some further recommendationsand guidance in the context of the Bayesian approach outlinedabove Although powerful inference with Bayesian methodsusually comes at a large computational cost Some of the rec-ommendations below can also be seen in the light of taking dataanalysis approaches that approximate the Bayesian methodologyand can be much faster

Where possible formulate the problem in the data spaceThe problems caused by the ill-defined uncertainties on quan-tities derived from parallaxes can be avoided by carrying outthe analysis in the data space where the behaviour of the un-certainties is well understood This means that the quantities tobe inferred are treated as parameters in a forward or generativemodel that is used to predict the data Some adjustment processthen leads to estimates of the parameters A very simple forwardmodelling example can be found in Schroumlder et al (2004) whostudied the luminosity calibrations of O-stars by predicting the

Article number page 14 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

expected Hipparcos parallaxes from the assumed luminosity cal-ibration and comparing those to the measured parallaxes A morecomplex example can be found in Lindegren et al (2000) whopresent a kinematic model for clusters which describes the ve-locity field of the cluster members and predicts the proper mo-tions accounting for the astrometric uncertainties and covari-ances As shown in previous sections the Bayesian approachnaturally lends itself to (and in fact requires) forward modellingof the data

Forward modelling has the added advantage that it forces usto consider the proper formulation of the questions asked fromthe astrometric data This naturally leads to the insight that oftenthe explicit knowledge of the distances to sources is not of inter-est For example in the Schroumlder et al (2004) case an assumedluminosity of the O-stars and their known apparent magnitudeis sufficient to predict the observed parallaxes In more complexanalyses the distances to sources can often be treated as nuisanceparameters which in a Bayesian setting can be marginalised outof the posterior

Use all relevant information Although the parallax has a di-rect relation to the distance of a star it is not the only measure-ment that contains distance information The apparent magni-tude and the colour of a star carry significant information on theplausible distances at which it can be located as the colour pro-vides information on plausible absolute magnitude values for thestar This is used in two papers (Leistedt amp Hogg 2017a An-derson et al 2017) in which the information contained in thephotometry of stars is combined with Gaia DR1 parallax in-formation to arrive at more precise representations of the colour-magnitude diagram than can be obtained through the parallaxesalone Similarly proper motions contain distance informationwhich can be used to good effect to derive more precise distancesor luminosities for open cluster members (eg de Bruijne et al2001 Gaia Collaboration et al 2017b) The Bayesian approachnaturally allows for the combination of different types of data inthe modelling of the problem at hand It should be emphasisedhowever that adding additional data necessitates increasing themodel complexity (eg to include the relation among apparentmagnitude absolute magnitude extinction and parallax) whichleads to the need to make more assumptions

Incorporate a proper prior Bailer-Jones (2015) discussed thesimplest case of inferring the distance to a source from a sin-gle observed parallax He showed that if the minimal prior thatr should be positive is used the resulting posterior is not nor-malisable and hence has no mean variance other momentsor percentiles This behaviour of the posterior is not limitedto inference of distance Examples of other quantities that arenon-linearly related to parallax are absolute magnitude (M prop

log10 $) tangential velocity (vT prop 1$) kinetic energy and an-gular momentum (both proportional to 1$2 when determinedrelative to the observer) In all these cases it can be shown thatthe posterior for an improper uniform prior is not normalisableand has no moments The conclusion is that a proper prior onthe parameters to be estimated should be included to ensure thatthe posterior represents a normalised probability distribution Ingeneral using unconstrained non-informative priors in Bayesiandata analysis (such as the one on r above) is bad practice In-evitably there is always a mismatch between the prior and thetrue distribution (if there were not there would be no need todo the inference) This will unavoidably lead to some biases al-though the better the data the smaller these will be We cannot

expect to do much better than our prior knowledge in case weonly have access to poor data The Bayesian approach guaran-tees a graceful transition of the posterior into the prior as thedata quality degrades

The above discussion raises the question of what priors to in-clude for a specific inference problem Simple recipes cannot begiven as the prior to be used depends on the problem at hand andthe information already available When deciding on a prior weshould keep in mind that some information is always availableEven if a parallax is only available for a single star we knowthat it cannot be located at arbitrary distances It is unlikely tobe much closer that 1 pc (although we cannot fully exclude thepresence of faint stars closer than Proxima Centauri) and it mustbe located at a finite distance (we can observe the star) Thiswould suggest a non-informative uniform prior on r with lib-eral lower and upper bounds (such that the prior is normalised)However as pointed out in Bailer-Jones (2015) a uniform distri-bution in r implies a space density of stars that falls of as 1r2

from the position of the Sun which is of course physically im-plausible Hence one should assume a reasonable distribution ofstars in space and construct the prior on r accordingly Bailer-Jones (2015) presents a prior derived from a uniform space den-sity with an exponential cut-off which was used in Astraatmadjaamp Bailer-Jones (2016b) to derive distances to stars for which par-allaxes are listed in Gaia DR1 This prior should not be usedindiscriminately at the very least we should carefully considerthe choice of the scale length L (or leave that as a parameter tobe estimated as described in Sect 412) and in most cases amore tailored prior based on our broad knowledge of the distri-bution of a given stellar population in the Milky Way would bebetter The tutorial cases introduced in the next section containsome more examples of priors on distance and other astrophysi-cal parameters

The next two items discuss simplifications to the Bayesianapproach that nevertheless need to be justified carefully

Maximum likelihood modelling We have seen that priors arethe bridges that allow us to go from the probability of the obser-vations given the unknown parameters to the desired probabil-ity of the parameters given the observations It is only based onthis probability distribution that we can make statements aboutcredible intervals (uncertainties) for the inferred quantities orselect amongst competing models (a topic that is beyond thescope of this paper) However if making prior-free inferences ispreferred then maximising the likelihood is the only alternativeThe kinematic modelling presented in Lindegren et al (2000)is a non-trivial example of this A more complex example canbe found in Palmer et al (2014) The ML approach just as theBayesian framework described above allows the combinationof different types of data and accounts for selection functionsor missing data We have seen in Sect 411 that the maximumlikelihood estimate of the distance given a single parallax mea-surement is ρ = 1

$and this is a poor estimator of the distance

except for subsets of very accurate measurements In generalthe Bayesian and the maximum likelihood estimates coincide inthe limit of very small uncertainties or infinite numbers of mea-surements In such limits the maximum likelihood estimate issimpler to obtain although its computational cost may still belarge as the ML method is often equivalent to a complex optimi-sation problem involving a multi-dimensional function of manyparameters

Article number page 15 of 20

AampA proofs manuscript no 32964_Arxiv

Selecting the lsquobestrsquo data Analyses that use parallax data areoften restricted to positive parallaxes with relative uncertaintiesbelow some limit (typically 20 ) This allows working in aregime where the uncertainties of derived quantities such as dis-tance tangential velocities luminosity etc are thought to bemanageable which allows working in the space of astrophysi-cal variables rather than the data Truncation on relative parallaxerror might be justified in an exploratory phase of the data anal-ysis however there are a number of reasons why this approachis almost never advisable Even at relative uncertainties below02 the quantities calculated from the parallax can be biased andsuffer from a large variance (see Bailer-Jones 2015) More im-portantly however the selection of lsquogoodrsquo parallaxes will biasthe sample studied to nearby andor bright representatives of anystellar population and the selection may lead to discarding avery large fraction of the potential sample Hence any inferencesbased on such data will be severely biased This is clearly illus-trated in Fig 7 where for an even less strict truncation of starswith a relative uncertainties below 50 the distribution of dis-tances of the resulting sample is clearly biased with respect theoriginal sample

Accounting for data selection and incompleteness Althoughthe Gaia survey is designed such that the only selection ofsources is that they are brighter than the survey limit at G = 207the combination of the onboard detection algorithm the Gaiascanning law and the filtering of results of insufficient qualityprior to a data release lead to a complex selection function espe-cially in the early data releases This selection function should betaken into account in any data analysis and this is most naturallydone as part of a Bayesian analysis where the selection func-tion is part of the forward model that predicts the data given themodel parameters Precise prescriptions of the selection func-tions are not foreseen to be part of the data release documenta-tion Hence selection function parameters need to be includedas part of the parameters inferred by the Bayesian analysis or ifthis is not possible the selection functions have to be borne inmind when interpreting the results

Covariances in the uncertainties All the uncertainties on theastrometric data quoted in the Gaia catalogue are presented asfull covariance matrices where the diagonal elements representthe standard uncertainties on the astrometric parameters and theoff-diagonal elements the covariances or correlations betweenthe uncertainties This amounts to treating the astrometric datavector as having been drawn from a multivariate normal distri-bution with a covariance matrix as given in the Gaia catalogueThe covariances are most easily handled in the data space as partof the likelihood (see Lindegren et al 2000 for an example inthe context of kinematic modelling) If the covariances in theastrometric uncertainties are not accounted for we can easily bemisled for example by spurious features in the inferred velocityfield of an open cluster (Brown et al 1997)

The uncertainties are also correlated from one source tothe next especially over small distances on the sky A studyof the star-to-star correlations in the parallax uncertainties inGaia DR1 was done for the Kepler field (Zinn et al 2017)where independent and precise asteroseismic distances to thestars are available enabling the authors to derive an expressionfor the correlation strength and spatial scale This expressioncan be used for studies of the Kepler field but care should betaken when extrapolating to other fields on the sky The func-tional form for the star-to-star correlations used by Zinn et al

(2017) could be introduced as part of the forward model withthe Zinn et al (2017) parameters as a good first guess

For Gaia DR1 the length scale for the star-to-star corre-lations was estimated to vary from subdegree scales to tens ofdegrees on the sky where Zinn et al (2017) derived the corre-lation function over length scales of sim 02 to sim 10 degrees ForGaia DR2 Gaia Collaboration et al (2018c) estimate that thespatial correlations extend over scales of below 1 degree to 10ndash20 degrees

Accounting for non-Gaussian andor systematic uncertain-ties Although the bulk of the sources in the Gaia catalogue havenormally distributed uncertainties there is a significant fractionfor which the uncertainties exhibit non-Gaussian behaviour (egwhen uncertainties are over- or underestimated) This can be ac-counted for in the data analysis by including the uncertainties aspart of the forward model (eg Sesar et al 2017) or by explicitlymodelling an outlier population Similarly systematic uncertain-ties can be included as part of the forward model For exampleSesar et al (2017) include a global parallax zero-point as part oftheir probabilistic model used to analyse the period-luminosityrelation of RR Lyrae stars with Gaia DR1 data An alternativeapproach to the investigation of systematics in the parallaxes (ordistance estimates obtained from photometry or spectroscopyfor example) is presented in Schoumlnrich et al (2012) and is ap-plied to Gaia DR1 in Schoumlnrich amp Aumer (2017) In this casewe can consider that for samples covering a significant fractionof the sky any systematic error in the estimated distances to thestars will show up as correlations in their 3D velocity compo-nents The presence of such correlations can be used to make in-ferences about systematic errors for example in the parallaxes

Systematic uncertainties are more difficult to handle as theymay show variations as a function of source brightness or ce-lestial position they may be correlated across neighbouringsources and they may not be well understood for a given earlyGaia data release In general the information needed to accu-rately model systematic uncertainties or uncertainty correlationsbetween sources may not be readily available This informationcan be obtained from a comparison between Gaia and otherhigh-precision data (eg Zinn et al 2017 Arenou et al 2017Gaia Collaboration et al 2018a) or by examining for exampleplots of the parallax or proper motion averaged over sky regionsfor samples where the true parallax or proper motion values canbe assumed to be known such as zero parallax and proper mo-tion for quasars (see Gaia Collaboration et al 2018c for exam-ples)

Two special cases should be mentioned when the sampleis well distributed over the sky we can safely assume that thelocal systematics vanish and that only the global parallax zero-point need to be subtracted locally we may be interested not bythe absolute value of the parallaxes but by the relative ones inwhich case the difference between parallaxes and their averageremoves part of the systematics

There is no general recipe for dealing with non-Gaussian un-certainties or correlated systematic uncertainties The main ad-vice we can give here is to proceed with the analysis of the astro-metric data as they are but to keep in mind the systematics andcorrelations discussed in Sect 2 when interpreting the resultsAlternatively the forward model can be extended to include sys-tematic and correlation terms for which parameters are also tobe estimated Such models can be guided by the studies of sys-tematic uncertainties mentioned above

Article number page 16 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Testing with simulations Finally we strongly advise that theinference problem at hand should be investigated through simu-lated data and that the simulations performed should be as closeas possible to the real data (in particular correctly modelling theuncertainties and selection effects) The simulations allow theanalysis method to be developed and tested for accuracy How-ever the performance of the analysis method should be inter-preted strictly in terms of how well the assumed model explainsthe simulated observed data That is we should not fall into thetrap of trying to tune the analysis method to get an answer that isas close to the lsquotruthrsquo as possible In real problems we can onlyjudge the adequacy of a model and its parameter values by howwell they predict the observed data (to within the observationaluncertainties it should be stressed as we should avoid lsquoover-fittingrsquo the data)

5 Using astrometric data practical examples

We introduce here a few worked examples to illustrate thepoints that were made in the previous section These ex-amples are available in full detail as online tutorials in theform of source code accompanied by much more exten-sive explanation than can be provided here The source codeand corresponding Python and R Notebooks can be foundat the following URL httpsgithubcomagabrownastrometry-inference-tutorials In all cases the readeris strongly encouraged to download the source code and exper-iment with modifications of the simulated input data andor theparameter choices in the inference methods

51 Comparison of distance and distance modulusestimators

The use of the Bayesian inference with non-informative priorsdescribed in Sect 411 is illustrated and implemented inthe following tutorial httpsgithubcomagabrownastrometry-inference-tutorialstreemastersingle-sourceGraphicalUserInterface The tutorialcompares the performance of Bayesian distance estimationmethods with the SmithndashEichhorn transformation (Smith ampEichhorn 1996) (Sect 341) and the naive parallax inversion

The tutorial contains a Graphical User Interface that easilyvisualises and compares the behaviour of all these estimators fora given parallax and uncertainty For the Bayesian inference es-timations using the mode and the median are provided togetherwith a 90 confidence interval The tutorial also provides a li-brary pyrallaxes with the implementation of all these estima-tors The library can easily be customised to implement otherpriors for the Bayesian inference

Additionally an implementation of the Bayesian distanceestimator using the Exponentially Decreasing Space Densityprior introduced in Bailer-Jones (2015) will be available inTopCat (httpwwwstarlinkacuktopcat) and Stilts(httpwwwstarlinkacukstilts) from respectivelyversions 46 and 31-3 onwards

52 Inferring the distance to a single source using just theparallax

The issues surrounding the use of a parallax to infer a distancewere explored in Bailer-Jones (2015) and applied to simulatedGaia data in Astraatmadja amp Bailer-Jones (2016a) and to TGAS(Gaia DR1 ) in Astraatmadja amp Bailer-Jones (2016b) A

tutorial exploring this is provided at httpsgithubcomagabrownastrometry-inference-tutorialstreemastersingle-sourcetutorial This can be used toinvestigate how the posterior depends on the prior and the dataIt also includes a simple example of a hierarchical model toavoid specifying the exact length scale of a distance prior

53 Inferring the distance to and size of a cluster using justthe parallaxes

In many applications we are more interested in the averagedistance to a cluster (or a group of stars) rather than to theindividual stars In this case a basic mistake to be avoidedis estimating the individual distances (whatever the method)and then averaging these individual values A more correctapproach is to average the individual parallaxes and then toobtain a distance from this average value However a muchbetter solution is to set up a model for the cluster whichwould use as parameters the distance to its centre for exampleand some measure of its size and to infer its parametersThis is explored in the tutorial at httpsgithubcomagabrownastrometry-inference-tutorialstreemastermultiple-source This introduces the overall prob-lem and derives a general solution Code is implemented forthe specific case of a model which assumes a random isotropicdistribution of the true stars from the centre of the cluster Thismodel has two parameters the cluster distance and the clustersize The solution uses a small angle approximation to makethe problem simpler although it is easily extended to the caseof clusters with a significant angular extent It is applied to thePleiades selection from the Gaia DR1 main release paper(Gaia Collaboration et al 2016) The tutorial also considers theproblem of how to accommodate correlations in the measuredparallaxes of different stars Finally it also shows the resultsfrom a classical and a naive combination of stellar parallaxesto estimate the cluster distance The combination of parallaxesand proper motions of individual stars in a cluster into a singlesolution for the mean parallax and proper motion is treated asan iterative least squares problem in Gaia Collaboration et al(2017b) (see their Appendix A for details)

54 Inferring the distance and velocity of a source using theparallax and proper motions

The velocity (speed and direction) of a source in the planeof the sky can be inferred from measurements of its parallaxand two proper motions The uncertainties in all three affectthe inferred velocity and its uncertainty Moreover as theGaia parallaxes and proper motions generally have non-zerocorrelations these must also be taken into account Thiscan be done in a straightforward manner in the Bayesianapproach as is shown in the tutorial at httpsgithubcomagabrownastrometry-inference-tutorialstreemaster3d-distance This sets up a three-parametermodel (distance speed angle) for a source Using the threemeasurements (parallax two proper motions) in a multivariateGaussian likelihood and suitable priors on the parameters wecan compute the trivariate posterior This is sampled in theposterior using an MCMC algorithm for a set of stars

Article number page 17 of 20

AampA proofs manuscript no 32964_Arxiv

55 Luminosity calibration

In this tutorial (httpsgithubcomagabrownastrometry-inference-tutorialstreemasterluminosity-calibration) the problem of inferring (orcalibrating) the mean absolute magnitude of a specific classof stars is treated The measurements at hand are the parallaxand apparent magnitude for each of the stars in the sampleand the task is to infer their mean absolute magnitude microM andthe spread σM around this mean This is very similar to theproblem that Lutz amp Kelker (1973) and Turon Lacarrieu ampCreacutezeacute (1977) treated and a Bayesian approach to solving thisproblem was presented by Brown (2012) (albeit with the use ofimproper priors which we again note is bad practice) A morecomplex version of this problem (accounting for extinction anda contaminating population of stars) and its Bayesian solutionwas presented in Hawkins et al (2017) In this tutorial threeimportant points are illustrated

ndash Often the explicit computation of the distances to stars is notof interest In this example only the mean absolute magni-tude of the stars is to be estimated and the forward mod-elling approach as part of the Bayesian inference avoids theneed to calculate or estimate distances

ndash The data for all the stars carry information on the mean ab-solute magnitude including the negative parallaxes or par-allaxes with large relative errors This information can nat-urally be incorporated in a forward modelling approach (inthis example as part of a Bayesian inference method) thusavoiding the introduction of truncation biases caused by theselection of stars with lsquogoodrsquo parallaxes

ndash If the selection function is known (in this example the surveyis magnitude limited) it can and should be included in theforward modelling This accounts for sample selection biasesthat would otherwise occur

56 Period-luminosity relation

In this tutorial (httpsgithubcomagabrownastrometry-inference-tutorialstreemasterperiod-luminosity-relation) we include a hierarchicalmodel to infer period-luminosity-metallicity relations forclassical pulsating stars The full model can be applied tofundamental mode RR Lyrae stars and the abridged version(without the metallicity dependence) is suitable for samples ofclassical Cepheids We include the data set for the RR Lyraestars described and used for inference in Gaia Collaborationet al (2017a) and Delgado et al (2018) It contains a sampleof 200 stars (including fundamental radial pulsators but alsofundamentalised first overtone pulsators) with measured pe-riods apparent magnitudes in the K-band metallicities andparallaxes from the TGAS catalogue In the tutorial we describethe hierarchical model and discuss potential biases in the dataset Finally we analyse the sensitivity of the results to differentchoices of the priors and related parameters

6 Conclusions

Gaia data releases will provide a huge increase of astrometricdata available for the scientific community More than a billionparallaxes and proper motions allow new openings into many as-tronomical topics In most cases astronomers are exploiting theGaia catalogues to obtain physical quantities such as distanceand velocity Although it is easy to extract from the Gaia data it

is well known that direct inversion of parallax will lead to biaseswhich become more and more significant the larger the relativeparallax uncertainty While Gaia will provide high-quality astro-metric measurements hundreds of millions of stars have preci-sions which require proper statistical treatment in order to avoidbiased conclusions The aim of this paper is to guide the users ofGaia data to handle astrometric data correctly

In this study we summarise methods used to avoid bi-ases when converting astrometric data into physical quantitiesStarting from simple non-recommended sample truncation tomore complex methods the biases associated with the meth-ods are presented The basic recommendation is to treat deriva-tion of physical quantities from astrometric measurements asan inference problem which should be preferably handled withBayesian approach The recommended methods are described inSect 4 with a summary in Sect 43 To aid the users furtherSect 5 contains practical examples with links to Python and Rcode

Gaia will provide fundamental data for many fields of as-tronomy Further data releases will provide more data and moreprecise data Nevertheless for full use of the potential it will al-ways be necessary to pay careful attention to the statistical treat-ment of parallaxes and proper motions The purpose of this paperis to help astronomers find the correct approachAcknowledgements This work has made use of results from the European SpaceAgency (ESA) space mission Gaia the data from which were processed by theGaia Data Processing and Analysis Consortium (DPAC) Funding for the DPAChas been provided by national institutions in particular the institutions partici-pating in the Gaia Multilateral Agreement The Gaia mission website is httpwwwcosmosesaintgaia The authors are current or past members of theESA Gaia mission team and of the Gaia DPAC This work was supported by theMINECO (Spanish Ministry of Economy) through grant ESP2016-80079-C2-1-R (MINECOFEDER UE) and ESP2014-55996-C2-1-R (MINECOFEDERUE) and MDM-2014-0369 of ICCUB (Unidad de Excelencia lsquoMariacutea de Maeztursquo)and by the DLR (German space agency) via grant 50 QG 1403

ReferencesAnderson L Hogg DW Leistedt B Price-Whelan AM Bovy J Jun 2017

ArXiv e-printsArenou F Luri X 1999 In Egret D Heck A (eds) Harmonizing Cosmic

Distance Scales in a Post-HIPPARCOS Era vol 167 of Astronomical Societyof the Pacific Conference Series 13ndash32

Arenou F Luri X Babusiaux C et al Feb 2017 AampA 599 A50Astraatmadja TL Bailer-Jones CAL Dec 2016a ApJ 832 137Astraatmadja TL Bailer-Jones CAL Dec 2016b ApJ 833 119Bailer-Jones CAL Oct 2015 PASP 127 994Bovy J Sep 2017 MNRAS 470 1360Brown AGA Nov 2012 Statistical AstrometryBrown AGA Arenou F van Leeuwen F Lindegren L Luri X Aug 1997

In Bonnet RM Hoslashg E Bernacca PL et al (eds) Hipparcos - Venice rsquo97vol 402 of ESA Special Publication 63ndash68

de Bruijne JHJ Hoogerwerf R de Zeeuw PT Feb 2001 AampA 367 111Delgado HE Sarro LM Clementini G Muraveva T Garofalo A Mar 2018

ArXiv e-printsEddington AS Mar 1913 MNRAS 73 359ESA (ed) 1997 The HIPPARCOS and TYCHO catalogues Astrometric and

photometric star catalogues derived from the ESA HIPPARCOS Space As-trometry Mission vol 1200 of ESA Special Publication (available online athttpswwwcosmosesainthipparcos)

Gaia Collaboration Brown AGA Vallenari A et al Nov 2016 AampA 595A2

Gaia Collaboration Clementini G Eyer L et al Sep 2017a AampA 605 A79Gaia Collaboration van Leeuwen F Vallenari A et al May 2017b AampA 601

A19Gaia Collaboration Arenou F Luri X et al Apr 2018a AampA (special issue

for Gaia DR2)Gaia Collaboration Brown AGA Vallenari A Prusti T et al Apr 2018b

AampA (special issue for Gaia DR2)Gaia Collaboration Lindegren L Hernaacutendez J et al Apr 2018c AampA (special

issue for Gaia DR2)

Article number page 18 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Hawkins K Leistedt B Bovy J Hogg DW Oct 2017 MNRAS 471 722Iorio G Belokurov V Erkal D et al Feb 2018 MNRAS 474 2142Leistedt B Hogg DW Dec 2017a AJ 154 222Leistedt B Hogg DW Mar 2017b ArXiv e-printsLindegren L Madsen S Dravins D Apr 2000 AampA 356 1119Lindegren L Lammers U Hobbs D et al Feb 2012 AampA 538 A78Lindegren L Lammers U Bastian U et al Nov 2016 AampA 595 A4Lutz TE Kelker DH Oct 1973 PASP 85 573Malmquist GK Mar 1920 Meddelanden fran Lunds Astronomiska Observato-

rium Serie II 22 3Palmer M Arenou F Luri X Masana E Apr 2014 AampA 564 A49Robin C Luri X Reyleacute C et al 2012 AampA 543 A100Schoumlnrich R Aumer M Dec 2017 MNRAS 472 3979Schoumlnrich R Binney J Asplund M Feb 2012 MNRAS 420 1281Schroumlder SE Kaper L Lamers HJGLM Brown AGA Dec 2004 AampA

428 149Sesar B Fouesneau M Price-Whelan AM et al Apr 2017 ApJ 838 107Smith H Jr Eichhorn H Jul 1996 MNRAS 281 211Trumpler RJ Weaver HF 1953 Statistical astronomyTuron Lacarrieu C Creacutezeacute M Apr 1977 AampA 56 273Zinn JC Huber D Pinsonneault MH Stello D Aug 2017 ApJ 844 166

Article number page 19 of 20

AampA proofs manuscript no 32964_Arxiv

Appendix A Description of the simulated samplesused in this paper

The data used in this paper is from a Gaia Universe Model Snap-shot (GUMS) simulation (Robin C et al 2012) together withthe Gaia-like uncertainties and an estimation of the observabledata The uncertainties were computed from an implementationof the recipes described in the Gaia Science performance webpage4 provided by the python PyGaia toolkit5 which then haveto be re-scaled to fit the Gaia DR2 expectations

The simulation contains around 109 sources including onlysingle stars ie stars not belonging to multiple systems up toG lt 20 magnitude distributed as shown in Fig A1

0 5 10 15 20 25G magnitude [mag]

100

101

102

103

104

105

106

107

108

109

Num

ber

of

sourc

es

[log]

G magnitude histogram

Fig A1 Histogram of the starsrsquo G magnitude The total number ofsources is 1 069 138 714 which are distributed in bins of size ∆G = 02the sample is limited to G lt 20

To compute the Gaia potential observables standard uncer-tainties must be added to the simulation Because astrometricquantities (positions parallaxes and proper motions) are relateda single formalism to derive its standard errors is needed PyGaiaimplements a simple performance model depending on the VndashICcolour term and the G magnitude to estimate the end-of-missionerrors for the parallax uncertainty

σ$[microas] =radicminus1631 + 680766 z + 32732 z2 [0986+(1 minus 0986) (V-IC)]

(A1)

with

z = max(1004(1209minus15)1004(Gminus15)) (A2)

(see the Gaia science performance web for more details) It alsotakes into account the variation of the uncertainties over the skybecause of the scanning law tabulated as a function of the eclip-tic latitude6 β

However these are end-of-mission uncertainties so theyhave to be scaled by the fraction of mission time completed in or-der to get an estimation of them for Gaia DR2 In the case of the

parallax only the factorradic

5L being L the mission time (years)

included in Gaia DR2 need to be applied this error model is

4 httpswwwcosmosesaintwebgaiascience-performance5 httpspypipythonorgpypiPyGaia6 httpswwwcosmosesaintwebgaiatable-2-with-ascii

further described in Arenou et al (2017) In our case we havealso updated the calibration floor to take into account the prop-erties of the Gaia DR2 formal errors as shown in Fig A2 Thiscalibration floor introduces a minimum formal error stemmingfrom the fact that the calibrations used in the data processing(models and parameters) are at this stage still being refined Thisfloor affects mainly bright stars while for faint stars the photonnoise dominates Fig A2 shows the model used in the simula-tion for the Gaia DR2 parallax uncertainties as a function of theG magnitude

Fig A2 Average estimated errors as function of G The dash-dottedline represents the uncertainty at Gaia DR2 while the dashed linerepresents the end-of-mission uncertainties The solid line representsGaia DR2 errors including systematics

Article number page 20 of 20

  • 1 Introduction
  • 2 Gaia astrometric data
    • 21 Uncertainties
    • 22 Correlations
    • 23 Systematic errors
    • 24 Completeness
      • 3 Critical review of the traditional use of parallaxes
        • 31 Measurement of parallaxes
        • 32 Estimating distance by inverting the parallax
        • 33 Sample truncation
        • 34 Corrections and transformations
          • 341 The SmithndashEichhorn correction
          • 342 The LutzndashKelker correction
            • 35 Astrometry-based luminosity
              • 4 Recommendations for using astrometric data
                • 41 Using Gaia astrometric data how to proceed
                  • 411 Bayesian inference of distances from parallaxes
                  • 412 Bayesian inference of distances from parallaxes and complementary information
                    • 42 Absolute magnitudes tangential velocities and other derived quantities
                    • 43 Further recommendations
                      • 5 Using astrometric data practical examples
                        • 51 Comparison of distance and distance modulus estimators
                        • 52 Inferring the distance to a single source using just the parallax
                        • 53 Inferring the distance to and size of a cluster using just the parallaxes
                        • 54 Inferring the distance and velocity of a source using the parallax and proper motions
                        • 55 Luminosity calibration
                        • 56 Period-luminosity relation
                          • 6 Conclusions
                          • A Description of the simulated samples used in this paper
Page 7: Gaia Data Release 2: Using Gaia parallaxes · The Gaia Data Release 2 (Gaia DR2 ) Gaia Collaboration et al. (2018b) provides precise positions, proper motions, and paral-laxes for

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Even worse as in the previous case the truncation not onlymakes the distribution of true distances unrepresentative but italso biases the distribution of observed parallaxes stars withpositive errors (making the observed parallax larger than thetrue one) tend to be less removed than stars with negative er-rors (making the observed parallax smaller than the true one)By favouring positive errors with respect to negative errors weare also biasing the overall distribution of parallaxes Figure 8depicts this effect The plots show the difference $ minus$True as afunction of $True We can see in the middle and bottom figureshow the removal of objects is non-symmetrical around the zeroline so that the overall distribution of$minus$True becomes biasedFrom an almost zero bias for the full sample (as expected fromGaia in absence of systematics) we go to significant biases oncewe introduce the truncation and the bias is dependent of the cutvalue we introduce

Furthermore in Gaia the parallax uncertainties vary withthe object magnitude being larger for faint stars (see Fig A2)Therefore a threshold on the relative error will favour brightstars over the faint ones adding to the above described biases

Another type of truncation that has been traditionally appliedis to introduce limits in the observed parallax The effects of sucha limit are closely related to the LutzndashKelker bias discussed inSect 342 Suffice it here to illustrate the effect with a specificexample on a Gaia DR2 -like sample If we take the full sampleand remove stars with $ lt 02 mas we could imagine that weare roughly removing objects further away than 5 kpc Howeverthe net result is depicted in Fig 9 where we can see that insteadof the distribution of true distances of the complete sample up to5 kpc (solid line) we get a distribution with a lack of closer starsand a long tail of stars with greater distances A larger limit inparallax (shorter limiting distance) will produce a less prominenteffect since the relative errors in the parallax will be smaller butthe bias will be nonetheless present

In conclusion our advice to readers is to avoid introducingtruncations when using Gaia data since as illustrated above theycan strongly affect the properties of the sample and thereforeaffect the data analysis If truncation is unavoidable it should beincluded in the Bayesian modelling of the overall problem (seeSect 43)

34 Corrections and transformations

In this section we review proposals in the literature for the useof parallaxes to estimate distances In general they take the formof lsquoremediesrsquo to correct one or another problem on this use Herewe explain why these remedies cannot be recommended

341 The SmithndashEichhorn correction

Smith amp Eichhorn (1996) attempt to compensate for the bias in-troduced by the naive inversion of the observed parallax (andthe associated variance problem) in two different ways The firstinvolves transforming the measured parallaxes into a pseudo-parallax $lowast according to

$lowast equiv β middot σ$

1exp(φ) + exp(minus16$

σ$)

+ φ

(10)

where φ equiv ln(1 + exp( 2$σ$

))2 and β is an adjustable constantThe qualitative effect of the transformation is to map negativeparallaxes into the positive semi-axis R+ and to increase the

Fig 8 Differences between true and observed parallax effect of remov-ing observed negative parallaxes and those above a given relative errorWe start with a representative subsample of 1 million stars (top fig-ure) and truncate it according to the apparent relative parallax precisionTop Complete sample Mean difference $ minus$True is 155 times 10minus5 masMiddle Retaining only objects with positive parallaxes and |σ$

$| lt 05

The mean difference $minus$True is 0164 mas Bottom Retaining onlyobjects with positive parallaxes and |σ$

$| lt 02 The mean difference

$ minus$True is 0026 mas

value of small parallaxes until it asymptotically converges to themeasured value for large $ For $ = 0 φ = ln(2)2 and thepseudo-parallax value $lowast = β middot σ$

(1

1+exp(ln(2)2) +ln(2)

2

)has the

undesirable property of depending on the choice of β and on theparallax uncertainty Thus even in the case of a small parallaxmeasurement $ rarr 0 with a relatively small fractional parallax

Article number page 7 of 20

AampA proofs manuscript no 32964_Arxiv

Fig 9 Effect of introducing a limit on the observed parallax Thick linehistogram of distances up to 5 kpc for the complete sample Thin linehistogram of distances for the sample truncated by removing objectswith $ lt 02 mas (distance estimated as 1$ up to 5 kpc)

uncertainty (eg f = 01) we substitute a perfectly useful andvaluable measurement by an arbitrary value of $lowast

The SmithndashEichhorn transformation is an arbitrary (andrather convolved) choice amongst many such transformationsthat can reduce the bias for certain particular situations Both theanalytical expression and the choice of constants and β are theresult of an unspecified trial-and-error procedure the applicabil-ity of which is unclear Furthermore as stated by the authorsthey introduced a new bias because the transformed parallax isalways larger than the measured parallax This new bias has nophysical interpretation because it is the result of an ad hoc choicefor the analytical expression in Eq (10) It is designed to reducethe bias but it does so by substituting perfectly reasonable directmeasurements (negative and small parallaxes) that we can inter-pret and use for inference by constructed values arising from thechoice

342 The LutzndashKelker correction

Lutz amp Kelker (1973) realised that the spatial distribution ofsources around the observer together with the unavoidable ob-servational errors and a truncation of the sample on the valueof the observed parallax result in systematic biases in the av-erage parallax of certain stellar samples The bias described byLutz amp Kelker (1973) is a manifestation of the truncation bi-ases described above and can be understood if we look at a fewvery simple examples First let us imagine a density of sourcesaround the observer such that the distribution of true parallaxesp($True) is constant in a given interval and zero outside Let usalso imagine that the observation uncertainty σ$ is constant andequal to 03 mas The left panel of Fig 10 shows the distributionof true and observed parallaxes for a simulation of such a situ-ation and 106 sources We see that the observed parallaxes arealso approximately uniform and the departures from uniformityappear near the edges For a given bin of intermediate parallaxwe have as many sources contaminating from neighbouring binsas we have sources lost to other bins due to the observationaluncertainties and the result is a negligible net flux

In the middle panel we show the same histograms exceptthat instead of a constant parallax uncertainty we use a constantvalue of the fractional parallax uncertainty f This is still an un-realistic situation because the distribution of true parallaxes is

not uniform and also because a constant f implies that largerparallaxes are characterised by larger uncertainties However ithelps us to illustrate that even in the case of uniform true par-allaxes we may have a non-zero net flux of sources betweendifferent parallax bins depending on the distribution of parallaxuncertainties In this case noise shifts the value of large paral-laxes more than that of smaller parallaxes (because of the con-stant value of f ) so larger parallaxes get more scattered thussuppressing the distribution more at large parallaxes

Finally the right panel shows the same plot for a realisticdistribution of distances from a Gaia Universe Model Snapshot(GUMS) sample (see Appendix A for a full description) We seethat the effect of a realistic non-uniform distribution of parallaxesand parallax uncertainties results in a net flux in the oppositedirection (smaller true parallaxes become more suppressed andlarger parallax bins are enhanced in both cases the bins of neg-ative parallaxes become populated) This is the root of the LutzndashKelker bias It is important to distinguish between the LutzndashKelker bias and the LutzndashKelker correction The LutzndashKelkerbias is the negative difference for any realistic sample betweenthe average true parallax and the average measured parallax (iethe average true parallax is smaller than the average measuredparallax) This bias has been known at least since the work ofTrumpler amp Weaver (1953) although it was already discussed ina context different from parallaxes as early as Eddington (1913)The LutzndashKelker correction presented in Lutz amp Kelker (1973)and discussed below is an attempt to remedy this bias based ona series of assumptions

In the case of Gaussian uncertainties such as those describedin Eq (6) it is evident that the probability of measuring a value ofthe parallax greater than the true parallax p($ gt $True|$True) =05 The same value holds for the probability that $ lt $Truebecause the Gaussian distribution is symmetrical with respectto the true value of the parallax This is also true for the jointprobability of $ and $True

p($ gt $True) =

S

p($$True) middot d$ middot d$True = 05 (11)

where S is the region of the ($$True) plane where $ gt $TrueHowever the probability distribution of the true parallax

given the observed parallax p($True|$) does not fulfil this seem-ingly desirable property of probability mass equipartition at the$ = $True point We can write the latter probability as

p($True|$) =p($$True)

p($)=

p($|$True) middot p($True)p($)

(12)

using the product rule of probability In the context of inferringthe true parallax from the observed one Eq (12) is the well-known Bayesrsquo theorem where the left-hand side is the posteriorprobability p($ | $True) is the likelihood p($True) is the priorand p($) is the evidence For most realistic prior distributionsp($True) neither the median nor the mode or the mean of theposterior in Eq (12) is at $ = $True Let us take for examplea uniform volume density of sources out to a certain distancelimit In such a distribution the number of sources in a sphericalshell of infinitesimal width at a distance r scales as r2 as doesthe probability distribution of the distances Since

p(r) middot dr = p($True) middot d$True (13)

Article number page 8 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Uniform ϖTrue and constant σϖ

Parallax (mas)

Pro

babi

lity

minus002 002 006 010

05

1015

20Uniform ϖTrue and constant fTrue

Parallax (mas)

Den

sity

minus005 000 005 010 015 020

GUMS simulation

Parallax (mas)

Den

sity

minus1 0 1 2

01

23

45

Fig 10 Histograms of true (grey) and observed (blue) parallaxes for three simulations uniform distribution of parallaxes and constant σ$ = 03mas (left) uniform distribution of parallaxes and constant f = 05 (centre) and the GUMS simulation described in Appendix A (right)

the probability distribution for the true parallax in such a trun-cated constant volume density scenario is proportional to

p($True) prop $minus4True (14)

out to the truncation radius Hence for Gaussian distributed un-certainties we can write p($True|$) as

p($True|$) prop1σ$middot exp(

minus($ minus$True)2

2σ$) middot$minus4

True (15)

The joint distribution p($$True) (ie the non-normalisedposterior plotted as a 2D function of data parallax and param-eter $True) for this particular case of truncated uniform stellarvolume densities is depicted in Fig 11 together with the con-ditional distributions for particular values of $ and $True Itshows graphically the symmetry of the probability distributionp($|$True) (with respect to $True) and the bias and asymmetryof p($True|$)

Lutz amp Kelker (1973) obtain Eq (15) in their Sect ii underthe assumption of uniform stellar volume densities and constantfractional parallax uncertainties (constant f ) They discuss sev-eral distributions for different values of the ratio σ$$True Intheir Sect iii they use the expected value of the true parallaxgiven by the distribution p($True|$) in Eq (15) to infer the ex-pected value of the difference between the true absolute magni-tude MTrue and the value obtained with the naive inversion of theobserved parallax The expected value of this absolute magni-tude error is derived and tabulated for the distribution p($True|$)as a function of the fractional parallax uncertainty f This so-called LutzndashKelker correction is often applied to stellar samplesthat do not fulfil the assumptions under which it was derived be-cause the stellar volume density is far from uniform at scaleslarger than a few tens of parsecs and the samples to which thecorrection is applied are never characterised by a unique valueof f

p(ϖ

Tru

e|ϖ

)0

12

34

5

00 02 04 06 08 10

00 02 04 06 08 10

00

05

10

15

ϖTrue

ϖ

p(ϖ|ϖTrue)

piob

s

0 1 2 3 4

00

05

10

15

minus50

minus40

minus30

minus20

minus10

0

10

Fig 11 (Lower left) Joint probability distribution (in logarithmic scaleto improve visibility) for the random variables $ and $True in the sce-nario of a truncated uniform volume density The colour code is shownto the right of the lower right panel and white marks the region wherethe probability is zero (Lower right) Conditional probability distribu-tion of the observed parallax for $True=05 (Upper left) Conditionalprobability distribution for $True given an observed parallax $ Thefractional parallax uncertainty assumed for the computation of all prob-abilities is f = 02

35 Astrometry-based luminosity

An obvious way to avoid the problems associated with the naiveinversion of observed parallaxes (see Sect 341) is to remainin the space of parallaxes (as opposed to that of distances) in-sofar as this is possible One example of this approach is the

Article number page 9 of 20

AampA proofs manuscript no 32964_Arxiv

astrometry-based luminosity (ABL) method (Arenou amp Luri1999) originating from Malmquist (1920) The ABL methodconsists in substituting the absolute magnitudes by a proxy thatis linearly dependent on the parallax The original proposal was

aV equiv 1002MV = $10mV +5

5 (16)

and has been recently used to obtain maximum likelihood esti-mates of the period-luminosity relation coefficients for Cepheidsand RR Lyrae stars (Gaia Collaboration et al 2017a) and to im-prove the Gaia parallax uncertainties using deconvolved colour-magnitude diagrams as prior (Anderson et al 2017) The newastrometry-based luminosity depends linearly on the parallaxand thus its uncertainty can be expected to have an approxi-mately Gaussian distribution if the fractional uncertainty of theapparent magnitude is negligible This is more often the casethan for fractional parallax uncertainties and is in general a goodapproximation

Unfortunately the astrometry-based luminosity can only beapplied to the study of the luminosity and can do nothing forthe analysis of spatial distributions where distances or tangentialvelocities are inevitable ingredients

4 Recommendations for using astrometric data

In this section we provide specific advice on the use of astro-metric data in astronomical data analysis Although the focus ison the use of Gaia data many of the recommendations hold forthe analysis of astrometric data in general To illustrate the rec-ommendations we also provide a small number of worked ex-amples ranging from very basic demonstrations of the issuesmentioned in Sect 3 to full Bayesian analyses Some of theseexamples are available in the Gaia archive tutorial described inSect 5

41 Using Gaia astrometric data how to proceed

The fundamental quantity sought when measuring a stellar par-allax is the distance to the star in question However as discussedin the previous sections the quantity of interest has a non-linearrelation to the measurement r = 1$True and is constrainedto be positive while the measured parallax can be zero or evennegative Starting from a measured parallax which is normallydistributed about the true parallax this leads to a probability den-sity for the simple distance estimator ρ = 1$ (see Sect 3) forwhich the moments are defined in terms of unknown quantitiesThis means we cannot calculate the variance of the estimator orthe size of a possible bias which renders the estimator uselessfrom the statistical point of view

Our first and main recommendation is thus to alwaystreat the derivation of (astro-)physical parameters from as-trometric data in particular when parallaxes are involvedas an inference problem which should preferably be handledwith a full Bayesian approach

411 Bayesian inference of distances from parallaxes

The Bayesian approach to inference involves estimating a PDFover the quantity of interest given the observables In this casewe want to estimate the distance r given the parallax $ Afuller treatment of this problem has been given in Bailer-Jones

(2015) so only a brief summary will be given here Using Bayesrsquotheorem we can write the posterior as

P(r | $) =1Z

P($ | r)P(r) (17)

Formally everything is also conditioned on the parallax uncer-tainty σ$ and any relevant constraints or assumptions but sym-bols for these are omitted for brevity The quantity P($ | r) isthe likelihood from Eq (12) The prior P(r) incorporates ourassumptions and Z is a normalisation constant

In addition to the likelihood there are two important choiceswhich must be made to estimate a distance the choice of priorand the choice of estimator We will first focus on the formerand start discussing the simplest prior the uniform unboundedprior With a uniform boundless (and thus improper) prior ondistances the posterior is proportional to the likelihood and ifwe choose the mode of the posterior as our estimator then thesolution is mathematically equivalent to maximising the likeli-hood However a boundless uniform prior permits negative dis-tances which are non-physical so we should at least truncate itto exclude these values

The more measurements we have or the more precise themeasurements are the narrower the likelihood and the lower theimpact of the prior should be on the posterior It turns out how-ever that with the unbounded uniform prior the posterior is im-proper ie it is not normalisable Consequently the mean andmedian are not defined The only point estimator is the modeie rest = 1$ (the standard deviation and the other quantilesare likewise undefined) which is rather restrictive Finally thisBayesian distance estimate defined for an unbounded uniformprior reduces to the maximum likelihood estimate and coincideswith the naive inversion discussed in Sect 32 The posterioris ill-defined for the unbounded uniform prior for parallaxesThis prior describes an unrealistic situation where the observer isplaced at the centre of a distribution of sources that is sphericallysymmetric and the volume density of which decreases sharplywith distance

The solution to these problems (non-physical distances im-proper posterior) is to use a more appropriate prior The proper-ties of various priors and estimators have been studied by Bailer-Jones (2015) and Astraatmadja amp Bailer-Jones (2016b) The lat-ter makes a detailed study using a Milky Way model for a priorand also investigates how the estimates change when the Gaiaphotometric measurements are used in addition to the parallaxOne of the least informative priors we can use is the exponen-tially decreasing space density prior

P(r) =

1

2L3 r2eminusrL if r gt 0

0 otherwise (18)

For distances r L this corresponds to a constant space densityof stars with the probability dropping exponentially at distancesmuch larger than the mode (which is at 2L) Examples of theshape of the posterior for parallaxes of different precisions areshown in Bailer-Jones (2015) and Astraatmadja amp Bailer-Jones(2016b)

The posterior obtained for the prior defined in Eq (18) is nor-malised and thus we have a choice of point estimators (meanmedian or mode) Also the distribution is asymmetric andtwo quantiles (5 and 95) rather than the standard deviationare recommended to summarise the uncertainty in the point es-timate The median as a point estimate is guaranteed to lie

Article number page 10 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

between these quantiles Astraatmadja amp Bailer-Jones (2016b)used this prior as well as a Milky Way prior to infer distancesfor the two million TGAS stars in the first Gaia data releaseThe behaviour of the estimates derived from the exponentiallydecreasing space density prior can be explored using the interac-tive tool available in the tutorial described in Sect 51

In general the introduction of reasonable prior probabilitiesaccounts for the LutzndashKelker bias although the inevitable mis-match between the true distribution of parallaxes and the priorused will result in less accurate inferences In any case the ad-vantage with respect to the methods discussed in Sect 3 is cleari) we do not need to tabulate corrections for each prior assumingconstant f ii) we do not need to dispose of non-positive par-allaxes iii) we obtain a proper full posterior distribution withwell-defined moments and credible intervals iv) even simplepriors such as the exponential decreasing volume density willimprove our estimates with respect to the unrealistic prior un-derlying the maximum likelihood solution rest = 1$ and fi-nally v) we obtain estimators that degrade gracefully as the dataquality degrades and with credible intervals that grow with theobservational uncertainties until they reach the typical scales ofthe prior when the observations are non-informative These ad-vantages come at the expense of an inference that is more com-putationally demanding in general (as it requires obtaining theposterior and its summary statistics if needed) the need for athoughtful choice of a prior distribution and the analysis of theinfluence of the prior on the inference results

Figure 12 shows the distribution of means (left) modes (cen-tre) and medians (right) of the posteriors inferred for a simu-lation of 105 sources drawn from an exponentially decreasingspace density distribution This simulation represents the un-likely case where the prior is a perfect representation of the truedistribution

From a Bayesian perspective the full posterior PDF is thefinal result of our inference if we only use parallax measure-ments to infer the distance (see below) and further analysesshould make use of it as a whole This is our recommendationin general avoid expectations and summaries of the posteriorHowever it is often useful to compute summary statistics suchas the mean (expectation) median mode quantiles etc to havean approximate idea of the distribution but we should not usethese summaries for further inference for example to estimateabsolute magnitudes Cartesian velocities etc We recommendinferring the full posterior distributions for these derived quanti-ties using the posterior of the true parallax or of the distance orusing the same Bayesian scheme as for the true parallax as ex-plained in Sect 42 In Fig 13 we show the values of the mean(left) mode (centre) and median (right) that we would obtainfrom a set of 104 simulated observations of a star at 100 parsecswith f = 02 We assume a Gaussian distribution of the obser-vations around the true parallax The posterior distribution is in-ferred using Eq (17) and two priors a uniform volume densityof sources truncated at 1 kpc (results in grey) and a uniform den-sity of sources multiplied by an exponential decay of length scale200 pc as defined in Eq (18) (in blue) The expectation valuesof the histograms are shown as dashed lines of the same colourwith the true value (100 pc) shown as a red dashed line We seein general that i) the truncation has the effect of increasing thenumber of overestimated distances ii) the three estimators arebiased towards larger distances (smaller parallaxes) but the ex-pectation of the mode is significantly closer to the true valueand iii) the abrupt truncation of the prior results in a spuriouspeak of modes at the truncation distance as already discussed inBailer-Jones (2015)

Figure 14 and Table 1 show a comparison of the absolutevalue of the empirical bias and standard deviation associatedwith some distance estimators discussed in this paper as a func-tion of the measured fractional uncertainty in the parallax Wechose the measured value even though it is a very poor and non-robust estimator because as stated in Sect 32 we never have ac-cess to the true fractional parallax uncertainty This figure showsthe results obtained for 107 sources in the Gaia DR2 simulationdescribed in Appendix A for the maximum likelihood estimateρ = 1

$with and without the SmithndashEichhorn correction and for

the mode estimates based on the posterior distribution for twopriors (a uniform distance prior UD with maximum distancerlim = 100 kpc and an exponentially decreasing space densityprior EDSD with L = 135 kpc) neither of which matches thetrue distribution of sources in the simulation Only the mode ofthe posteriors is plotted (but not the mean or the median) for thesake of clarity The conclusions described next are only valid un-der the conditions of the exercise and are provided as a demon-stration of the caveats and problems described in previous sec-tions not as a recommendation of the mode of the posterior in-ferred under the EDSD prior as an estimator At the risk of re-peating ourselves we emphasise the need to adopt priors adaptedto the inference problem at hand Also the conclusions only holdfor the used simulation (where we generate the true distances andhence can calculate the bias and standard deviation) and need notbe representative of the true performance for the real Gaia dataset They can be summarised as follows

ndash the mode of the EDSD prior shows the smallest bias andstandard deviation in practically the entire range of estimatedfractional parallax uncertainties (in particular everywherebeyond the range of fapp represented in the plot)

ndash the SmithndashEichhorn estimate shows pathological biases andstandard deviations in the vicinity of the supposedly best-quality measurements at fapp = 0 Away from this region itprovides the next less biased estimates (averaged over binsof fapp) after the mode of the EDSD posterior

412 Bayesian inference of distances from parallaxes andcomplementary information

The methodology recommended in the previous paragraphs isuseful when we only have observed parallaxes to infer distancesThere are however common situations in astronomy where theparallaxes are only one of many observables and distances arenot the final goal of the inference problem but a means to achieveit In this context we recommend an extension of the classicalBayesian inference methods described in the previous sectionThese problems are characterised by a set of observables andassociated uncertainties (that include but are not restricted toparallaxes) and a series of parameters (the values of which areunknown a priori) with complex interdependence relationshipsamongst them Some of these parameters will be the ultimategoal of the inference process Other parameters do play an im-portant role but we are not interested in their particular valuesand we call them nuisance parameters following the literatureFor example in determining the shape of a stellar associationthe individual stellar distances are not relevant by themselvesbut only insomuch as we need them to achieve our objectiveWe show below how we deal with the nuisance parameters Theinterested reader can find applications of the methodology de-scribed in this section to inferring the coefficients of period-luminosity relations in Gaia Collaboration et al (2017a) andSesar et al (2017) Also the same methodology (a hierarchical

Article number page 11 of 20

AampA proofs manuscript no 32964_Arxiv

Mean

Cou

nts

minus8000 minus4000 0 2000

0e+

002e

minus04

4eminus

046e

minus04

8eminus

04

Mode

Cou

nts

minus8000 minus4000 0 2000

Median

Cou

nts

minus8000 minus4000 0 2000

0e+

002e

minus04

4eminus

046e

minus04

8eminus

04

Fig 12 Probability distribution of the residuals of the Bayesian estimate of the true distance in parsecs for 100000 simulated stars drawn from auniform density plus exponential decay distribution and σ$ = 3 middot 10minus1 mas (orange) 3 mas (blue) and 30 mas (grey)

Mean

Cou

nts

0 200 400 600 800 1000

000

00

005

001

00

015

002

0

Mode

Cou

nts

0 200 400 600 800 1000

Median

Cou

nts

0 200 400 600 800 1000

000

00

005

001

00

015

002

0

Fig 13 Distributions of means (left) modes (centre) and medians (right) for a series of 104 posteriors calculated for a star at a true distance of100 pc and f = 02 and observed parallaxes drawn at random from the corresponding Gaussian distribution Posteriors are inferred with a uniformdensity prior truncated at 1 kpc (grey) or with a uniform density with exponential decay prior and length scale 135 kpc (blue) The red vertical linemarks the true parallax the grey and blue lines correspond to the expected value (mean) of each distribution (same colours as for the histograms)

Bayesian model) is applied in Hawkins et al (2017) where theconstraint on the distances comes not from a period-luminosityrelation but from the relatively small dispersion of the abso-lute magnitudes and colour indices of red clump stars A lastexample of this methodology can be found in Leistedt amp Hogg(2017b) where the constraint comes from modelling the colour-magnitude diagram

Just as in the previous section where we aimed at estimat-ing distances from parallaxes alone the two key elements in thiscase are the definitions of a likelihood and a prior The likeli-hood represents the probability distribution of the observablesgiven the model parameters Typically the likelihood is basedon a generative or forward model for the data Such modelspredict the data from our assumptions about the physical pro-cess that generates the true values (ie the distribution of stars in

space) and our knowledge of the measurement process (eg jus-tifying the assumption of a normal distribution of the observedparallax around its true value) Forward models can be used togenerate arbitrarily large synthetic data sets for a given set ofthe parameters In this case however where we are concernedwith several types of measurements that depend on parametersother than the distance the likelihood term will be in generalmore complex than in Sect 411 and may include probabilisticdependencies between the parameters The term hierarchical ormulti-level model is often used to refer to this kind of model

Let us illustrate the concept of hierarchical models with asimple extension of the Bayesian model described in Sect 411where instead of assuming a fixed value of the prior length scaleL in Eq (18) we make it another parameter of the model and tryto infer it Let us further assume that we have a set of N parallax

Article number page 12 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Table 1 Average bias and standard deviation in three regimes of fapp for four distance estimators discussed in this paper (from left to right) themode of the posterior based on the exponentially decreasing space density (EDSD) prior the mode of the posterior of the uniform distance (UD)distribution the maximum likelihood estimate corrected according to Smith amp Eichhorn (1996) abbreviated as SE and the maximum likelihood(ML) estimate The wider ranges of fapp exclude narrower ranges shown in previous rows of the table

Summary fapp Range EDSD UD SE ML

Bias(-11) -02 97 342 -095(-55) -03 107 -034 -12(-5050) -03 162 -04 -38

Std Deviation(-11) 04 80 6858 05(-55) 04 84 05 195(-5050) 04 106 04 171

fapp

log 1

0(|B

ias|

)

minus4

minus2

02

4

Mode EDSDMode UDSmithminusEichhorn estimateML estimate

fapp

log 1

0(σ)

minus1

12

34

5

minus5 minus3 minus1 1 3 5

Mode EDSDMode UDSmithminusEichhorn estimateML estimate

Fig 14 Bias (top) and standard deviation (bottom) averaged over binsof the estimated fractional parallax uncertainty fapp for four estimatorsof the distance the maximum likelihood (ML) estimator rest = 1

$(or-

ange) the ML estimator corrected as described in Smith amp Eichhorn(1996) (light blue) the mode of the posterior obtained with an expo-nentially decreasing space density (EDSD) prior and L = 135 kpc (darkgreen) and the mode of the posterior obtained with a uniform distance(UD) prior truncated at 100 kpc (red)

measurements $k one for each of a sample of N stars In thiscase the likelihood can be written as

p($k|rk L) = p($k | rk) middot p(rk | L)

= p($k | rk) middotNprod

k=1

p(rk | L) (19)

where rk is the true unknown distance to the kth star We note thatvery often the measured parallaxes are assumed independentand thus p($k | rk) is written as the product

prodNk=1 p($k | rk)

This is incorrect in general for Gaia parallaxes because the par-allax measurements are not independent As described in GaiaCollaboration et al (2018c) and Sect 2 of this paper there areregional correlations amongst them (see Sect 43) but for thesake of simplicity let us assume the sample of N measurementsis spread all over the celestial sphere such that the correlationsaverage out and can be neglected Hence we write

p($k|rk L) =

Nprodk=1

p($k | rk) middot p(rk | L) (20)

Under the assumption of Gaussian uncertainties the first termin the product is given by Eq (6) while the second is given byEq (18)

This likelihood can be represented by a simple directed graph(see Fig 15) that provides information about the conditional de-pendencies amongst the parameters The shaded nodes representthe observations the open circles represent model parametersand the small black circles at the origin of the arrows representmodel constants The arrows denote conditional dependence re-lations and the plate notation indicates repetition over the mea-surements k

The next key element is as in Sect 411 the prior Accord-ing to Fig 15 the only parameter that needs a prior specificationis the one without a parent node L The rest of the arcs in thegraph are defined in the likelihood term (Eq (20)) If the sampleof N stars were representative of the inner Galactic halo for ex-ample we could use a Gaussian prior centred at asymp 30 kpc (seeeg Iorio et al 2018 and references therein) Such a hierarchicalmodel can potentially shrink the individual parallax uncertaintiesby incorporating the constraint on the distribution of distances

If we are only interested in the individual distances rk wecan consider L as a nuisance parameter

p($Truek | $k) =

intp($Truek L | $k) middot dL (21)

=

intp($Truek | $k L) middot p(L | $k) middot dL

This integral (known as the marginalisation integral) allows us towrite the posterior we are interested in without having to fix thevalue of L to any particular value Depending on the objective ofthe inference we could have alternatively determined the poste-rior distribution of L by marginalising the individual distanceswith an N-dimensional integral over the rk parameters

Article number page 13 of 20

AampA proofs manuscript no 32964_Arxiv

L

rk

$ σ$

microL

σL

k = 1 2 N

Fig 15 Directed acyclic graph that represents a hierarchical Bayesianmodel of a set of N parallax measurements characterised by uncertain-ties σ$ and true distances drawn from an exponentially decreasing den-sity distribution of distances (see Eq (18)) The scale length of the ex-ponential decrease L is itself a model parameter that we can infer fromthe sample Its prior is defined in this case for the sake of simplicity asa Gaussian distribution of mean microL and standard deviation σL

In parameter spaces of dimensionality greater than 3ndash4 thecomputation of the possibly marginalised posteriors andor ev-idence requires efficient sampling schemes like those inspiredin Markov chain Monte Carlo (MCMC) methods to avoid largenumbers of calculations in regions of parameter space with neg-ligible contributions to the posterior This adds to the highercomputational burden of the Bayesian inference method men-tioned in the previous section

The previous simple example can be extended to includemore levels in the hierarchy and more importantly more pa-rameters and measurement types Section 55 and 56 developin greater detail two examples of hierarchical models of directapplicability in the Gaia context

42 Absolute magnitudes tangential velocities and otherderived quantities

The approaches described in the previous sections can be appliedto any quantity we want to estimate using the parallax For exam-ple if we want to infer the absolute magnitude MG then giventhe measured apparent magnitude G and line-of-sight extinctionAG the true parallax $True is related to MG via the conservationof flux

5 log$True = MG + AG minusG minus 5 (22)

Assuming for simplicity that G and AG are known Bayesrsquotheorem gives the posterior on MG as

P(MG | $G AG) =1Z

P($ | MG AGG)P(MG) (23)

where the likelihood is still the usual Gaussian distribution forthe parallax (Eq (6)) in which the true parallax is given byEq (22) As this expression is non-linear we again obtain anasymmetric posterior PDF over MG the exact shape of whichalso depends on the prior

The inference of other quantities can be approached in thesame way In general we need to consider a multi-dimensionallikelihood to accommodate the measurement uncertainties (andcorrelations) in all observed quantities For instance the set ofparameters θ = r v φ (distance tangential speed and direc-tion measured anticlockwise from north) can be inferred fromthe Gaia astrometric measurements o = $ microαlowast microδ (where microαlowastand microδ are the measured proper motions) using the likelihood

p(o | θ) = N(θΣ) =1

(2π)32|Σ|12exp

(minus

12

(o minus x)T Σminus1(o minus x))

(24)

where N denotes the Gaussian distribution Σ is the full (non-diagonal) covariance matrix provided as part of the Gaia DataRelease and

x =

(1r

v sin(φ)r

v cos(φ)

r

)(25)

is the vector of model parameters geometrically transformed intothe space of observables in noise-free conditions Equation (24)assumes correlated Gaussian uncertainties in the measurements

The posterior distribution can then be obtained by multiply-ing the likelihood by a suitable prior The simplest assumptionwould be a separable prior such that p(θ) = p(r) middot p(v) middot p(φ)where p(v) and p(φ) should reflect our knowledge about thedynamical properties of the population from where the sourceor sources were drawn (eg thin disk thick disk bulge halo)Again hierarchical models can be used in the analysis of sam-ples in order to infer the population properties (prior hyper-parameters) themselves

Similar procedures can be followed to infer kinematic ener-gies or full 3D velocities when the forward model is extendedwith radial velocity measurements

43 Further recommendations

In this subsection we provide some further recommendationsand guidance in the context of the Bayesian approach outlinedabove Although powerful inference with Bayesian methodsusually comes at a large computational cost Some of the rec-ommendations below can also be seen in the light of taking dataanalysis approaches that approximate the Bayesian methodologyand can be much faster

Where possible formulate the problem in the data spaceThe problems caused by the ill-defined uncertainties on quan-tities derived from parallaxes can be avoided by carrying outthe analysis in the data space where the behaviour of the un-certainties is well understood This means that the quantities tobe inferred are treated as parameters in a forward or generativemodel that is used to predict the data Some adjustment processthen leads to estimates of the parameters A very simple forwardmodelling example can be found in Schroumlder et al (2004) whostudied the luminosity calibrations of O-stars by predicting the

Article number page 14 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

expected Hipparcos parallaxes from the assumed luminosity cal-ibration and comparing those to the measured parallaxes A morecomplex example can be found in Lindegren et al (2000) whopresent a kinematic model for clusters which describes the ve-locity field of the cluster members and predicts the proper mo-tions accounting for the astrometric uncertainties and covari-ances As shown in previous sections the Bayesian approachnaturally lends itself to (and in fact requires) forward modellingof the data

Forward modelling has the added advantage that it forces usto consider the proper formulation of the questions asked fromthe astrometric data This naturally leads to the insight that oftenthe explicit knowledge of the distances to sources is not of inter-est For example in the Schroumlder et al (2004) case an assumedluminosity of the O-stars and their known apparent magnitudeis sufficient to predict the observed parallaxes In more complexanalyses the distances to sources can often be treated as nuisanceparameters which in a Bayesian setting can be marginalised outof the posterior

Use all relevant information Although the parallax has a di-rect relation to the distance of a star it is not the only measure-ment that contains distance information The apparent magni-tude and the colour of a star carry significant information on theplausible distances at which it can be located as the colour pro-vides information on plausible absolute magnitude values for thestar This is used in two papers (Leistedt amp Hogg 2017a An-derson et al 2017) in which the information contained in thephotometry of stars is combined with Gaia DR1 parallax in-formation to arrive at more precise representations of the colour-magnitude diagram than can be obtained through the parallaxesalone Similarly proper motions contain distance informationwhich can be used to good effect to derive more precise distancesor luminosities for open cluster members (eg de Bruijne et al2001 Gaia Collaboration et al 2017b) The Bayesian approachnaturally allows for the combination of different types of data inthe modelling of the problem at hand It should be emphasisedhowever that adding additional data necessitates increasing themodel complexity (eg to include the relation among apparentmagnitude absolute magnitude extinction and parallax) whichleads to the need to make more assumptions

Incorporate a proper prior Bailer-Jones (2015) discussed thesimplest case of inferring the distance to a source from a sin-gle observed parallax He showed that if the minimal prior thatr should be positive is used the resulting posterior is not nor-malisable and hence has no mean variance other momentsor percentiles This behaviour of the posterior is not limitedto inference of distance Examples of other quantities that arenon-linearly related to parallax are absolute magnitude (M prop

log10 $) tangential velocity (vT prop 1$) kinetic energy and an-gular momentum (both proportional to 1$2 when determinedrelative to the observer) In all these cases it can be shown thatthe posterior for an improper uniform prior is not normalisableand has no moments The conclusion is that a proper prior onthe parameters to be estimated should be included to ensure thatthe posterior represents a normalised probability distribution Ingeneral using unconstrained non-informative priors in Bayesiandata analysis (such as the one on r above) is bad practice In-evitably there is always a mismatch between the prior and thetrue distribution (if there were not there would be no need todo the inference) This will unavoidably lead to some biases al-though the better the data the smaller these will be We cannot

expect to do much better than our prior knowledge in case weonly have access to poor data The Bayesian approach guaran-tees a graceful transition of the posterior into the prior as thedata quality degrades

The above discussion raises the question of what priors to in-clude for a specific inference problem Simple recipes cannot begiven as the prior to be used depends on the problem at hand andthe information already available When deciding on a prior weshould keep in mind that some information is always availableEven if a parallax is only available for a single star we knowthat it cannot be located at arbitrary distances It is unlikely tobe much closer that 1 pc (although we cannot fully exclude thepresence of faint stars closer than Proxima Centauri) and it mustbe located at a finite distance (we can observe the star) Thiswould suggest a non-informative uniform prior on r with lib-eral lower and upper bounds (such that the prior is normalised)However as pointed out in Bailer-Jones (2015) a uniform distri-bution in r implies a space density of stars that falls of as 1r2

from the position of the Sun which is of course physically im-plausible Hence one should assume a reasonable distribution ofstars in space and construct the prior on r accordingly Bailer-Jones (2015) presents a prior derived from a uniform space den-sity with an exponential cut-off which was used in Astraatmadjaamp Bailer-Jones (2016b) to derive distances to stars for which par-allaxes are listed in Gaia DR1 This prior should not be usedindiscriminately at the very least we should carefully considerthe choice of the scale length L (or leave that as a parameter tobe estimated as described in Sect 412) and in most cases amore tailored prior based on our broad knowledge of the distri-bution of a given stellar population in the Milky Way would bebetter The tutorial cases introduced in the next section containsome more examples of priors on distance and other astrophysi-cal parameters

The next two items discuss simplifications to the Bayesianapproach that nevertheless need to be justified carefully

Maximum likelihood modelling We have seen that priors arethe bridges that allow us to go from the probability of the obser-vations given the unknown parameters to the desired probabil-ity of the parameters given the observations It is only based onthis probability distribution that we can make statements aboutcredible intervals (uncertainties) for the inferred quantities orselect amongst competing models (a topic that is beyond thescope of this paper) However if making prior-free inferences ispreferred then maximising the likelihood is the only alternativeThe kinematic modelling presented in Lindegren et al (2000)is a non-trivial example of this A more complex example canbe found in Palmer et al (2014) The ML approach just as theBayesian framework described above allows the combinationof different types of data and accounts for selection functionsor missing data We have seen in Sect 411 that the maximumlikelihood estimate of the distance given a single parallax mea-surement is ρ = 1

$and this is a poor estimator of the distance

except for subsets of very accurate measurements In generalthe Bayesian and the maximum likelihood estimates coincide inthe limit of very small uncertainties or infinite numbers of mea-surements In such limits the maximum likelihood estimate issimpler to obtain although its computational cost may still belarge as the ML method is often equivalent to a complex optimi-sation problem involving a multi-dimensional function of manyparameters

Article number page 15 of 20

AampA proofs manuscript no 32964_Arxiv

Selecting the lsquobestrsquo data Analyses that use parallax data areoften restricted to positive parallaxes with relative uncertaintiesbelow some limit (typically 20 ) This allows working in aregime where the uncertainties of derived quantities such as dis-tance tangential velocities luminosity etc are thought to bemanageable which allows working in the space of astrophysi-cal variables rather than the data Truncation on relative parallaxerror might be justified in an exploratory phase of the data anal-ysis however there are a number of reasons why this approachis almost never advisable Even at relative uncertainties below02 the quantities calculated from the parallax can be biased andsuffer from a large variance (see Bailer-Jones 2015) More im-portantly however the selection of lsquogoodrsquo parallaxes will biasthe sample studied to nearby andor bright representatives of anystellar population and the selection may lead to discarding avery large fraction of the potential sample Hence any inferencesbased on such data will be severely biased This is clearly illus-trated in Fig 7 where for an even less strict truncation of starswith a relative uncertainties below 50 the distribution of dis-tances of the resulting sample is clearly biased with respect theoriginal sample

Accounting for data selection and incompleteness Althoughthe Gaia survey is designed such that the only selection ofsources is that they are brighter than the survey limit at G = 207the combination of the onboard detection algorithm the Gaiascanning law and the filtering of results of insufficient qualityprior to a data release lead to a complex selection function espe-cially in the early data releases This selection function should betaken into account in any data analysis and this is most naturallydone as part of a Bayesian analysis where the selection func-tion is part of the forward model that predicts the data given themodel parameters Precise prescriptions of the selection func-tions are not foreseen to be part of the data release documenta-tion Hence selection function parameters need to be includedas part of the parameters inferred by the Bayesian analysis or ifthis is not possible the selection functions have to be borne inmind when interpreting the results

Covariances in the uncertainties All the uncertainties on theastrometric data quoted in the Gaia catalogue are presented asfull covariance matrices where the diagonal elements representthe standard uncertainties on the astrometric parameters and theoff-diagonal elements the covariances or correlations betweenthe uncertainties This amounts to treating the astrometric datavector as having been drawn from a multivariate normal distri-bution with a covariance matrix as given in the Gaia catalogueThe covariances are most easily handled in the data space as partof the likelihood (see Lindegren et al 2000 for an example inthe context of kinematic modelling) If the covariances in theastrometric uncertainties are not accounted for we can easily bemisled for example by spurious features in the inferred velocityfield of an open cluster (Brown et al 1997)

The uncertainties are also correlated from one source tothe next especially over small distances on the sky A studyof the star-to-star correlations in the parallax uncertainties inGaia DR1 was done for the Kepler field (Zinn et al 2017)where independent and precise asteroseismic distances to thestars are available enabling the authors to derive an expressionfor the correlation strength and spatial scale This expressioncan be used for studies of the Kepler field but care should betaken when extrapolating to other fields on the sky The func-tional form for the star-to-star correlations used by Zinn et al

(2017) could be introduced as part of the forward model withthe Zinn et al (2017) parameters as a good first guess

For Gaia DR1 the length scale for the star-to-star corre-lations was estimated to vary from subdegree scales to tens ofdegrees on the sky where Zinn et al (2017) derived the corre-lation function over length scales of sim 02 to sim 10 degrees ForGaia DR2 Gaia Collaboration et al (2018c) estimate that thespatial correlations extend over scales of below 1 degree to 10ndash20 degrees

Accounting for non-Gaussian andor systematic uncertain-ties Although the bulk of the sources in the Gaia catalogue havenormally distributed uncertainties there is a significant fractionfor which the uncertainties exhibit non-Gaussian behaviour (egwhen uncertainties are over- or underestimated) This can be ac-counted for in the data analysis by including the uncertainties aspart of the forward model (eg Sesar et al 2017) or by explicitlymodelling an outlier population Similarly systematic uncertain-ties can be included as part of the forward model For exampleSesar et al (2017) include a global parallax zero-point as part oftheir probabilistic model used to analyse the period-luminosityrelation of RR Lyrae stars with Gaia DR1 data An alternativeapproach to the investigation of systematics in the parallaxes (ordistance estimates obtained from photometry or spectroscopyfor example) is presented in Schoumlnrich et al (2012) and is ap-plied to Gaia DR1 in Schoumlnrich amp Aumer (2017) In this casewe can consider that for samples covering a significant fractionof the sky any systematic error in the estimated distances to thestars will show up as correlations in their 3D velocity compo-nents The presence of such correlations can be used to make in-ferences about systematic errors for example in the parallaxes

Systematic uncertainties are more difficult to handle as theymay show variations as a function of source brightness or ce-lestial position they may be correlated across neighbouringsources and they may not be well understood for a given earlyGaia data release In general the information needed to accu-rately model systematic uncertainties or uncertainty correlationsbetween sources may not be readily available This informationcan be obtained from a comparison between Gaia and otherhigh-precision data (eg Zinn et al 2017 Arenou et al 2017Gaia Collaboration et al 2018a) or by examining for exampleplots of the parallax or proper motion averaged over sky regionsfor samples where the true parallax or proper motion values canbe assumed to be known such as zero parallax and proper mo-tion for quasars (see Gaia Collaboration et al 2018c for exam-ples)

Two special cases should be mentioned when the sampleis well distributed over the sky we can safely assume that thelocal systematics vanish and that only the global parallax zero-point need to be subtracted locally we may be interested not bythe absolute value of the parallaxes but by the relative ones inwhich case the difference between parallaxes and their averageremoves part of the systematics

There is no general recipe for dealing with non-Gaussian un-certainties or correlated systematic uncertainties The main ad-vice we can give here is to proceed with the analysis of the astro-metric data as they are but to keep in mind the systematics andcorrelations discussed in Sect 2 when interpreting the resultsAlternatively the forward model can be extended to include sys-tematic and correlation terms for which parameters are also tobe estimated Such models can be guided by the studies of sys-tematic uncertainties mentioned above

Article number page 16 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Testing with simulations Finally we strongly advise that theinference problem at hand should be investigated through simu-lated data and that the simulations performed should be as closeas possible to the real data (in particular correctly modelling theuncertainties and selection effects) The simulations allow theanalysis method to be developed and tested for accuracy How-ever the performance of the analysis method should be inter-preted strictly in terms of how well the assumed model explainsthe simulated observed data That is we should not fall into thetrap of trying to tune the analysis method to get an answer that isas close to the lsquotruthrsquo as possible In real problems we can onlyjudge the adequacy of a model and its parameter values by howwell they predict the observed data (to within the observationaluncertainties it should be stressed as we should avoid lsquoover-fittingrsquo the data)

5 Using astrometric data practical examples

We introduce here a few worked examples to illustrate thepoints that were made in the previous section These ex-amples are available in full detail as online tutorials in theform of source code accompanied by much more exten-sive explanation than can be provided here The source codeand corresponding Python and R Notebooks can be foundat the following URL httpsgithubcomagabrownastrometry-inference-tutorials In all cases the readeris strongly encouraged to download the source code and exper-iment with modifications of the simulated input data andor theparameter choices in the inference methods

51 Comparison of distance and distance modulusestimators

The use of the Bayesian inference with non-informative priorsdescribed in Sect 411 is illustrated and implemented inthe following tutorial httpsgithubcomagabrownastrometry-inference-tutorialstreemastersingle-sourceGraphicalUserInterface The tutorialcompares the performance of Bayesian distance estimationmethods with the SmithndashEichhorn transformation (Smith ampEichhorn 1996) (Sect 341) and the naive parallax inversion

The tutorial contains a Graphical User Interface that easilyvisualises and compares the behaviour of all these estimators fora given parallax and uncertainty For the Bayesian inference es-timations using the mode and the median are provided togetherwith a 90 confidence interval The tutorial also provides a li-brary pyrallaxes with the implementation of all these estima-tors The library can easily be customised to implement otherpriors for the Bayesian inference

Additionally an implementation of the Bayesian distanceestimator using the Exponentially Decreasing Space Densityprior introduced in Bailer-Jones (2015) will be available inTopCat (httpwwwstarlinkacuktopcat) and Stilts(httpwwwstarlinkacukstilts) from respectivelyversions 46 and 31-3 onwards

52 Inferring the distance to a single source using just theparallax

The issues surrounding the use of a parallax to infer a distancewere explored in Bailer-Jones (2015) and applied to simulatedGaia data in Astraatmadja amp Bailer-Jones (2016a) and to TGAS(Gaia DR1 ) in Astraatmadja amp Bailer-Jones (2016b) A

tutorial exploring this is provided at httpsgithubcomagabrownastrometry-inference-tutorialstreemastersingle-sourcetutorial This can be used toinvestigate how the posterior depends on the prior and the dataIt also includes a simple example of a hierarchical model toavoid specifying the exact length scale of a distance prior

53 Inferring the distance to and size of a cluster using justthe parallaxes

In many applications we are more interested in the averagedistance to a cluster (or a group of stars) rather than to theindividual stars In this case a basic mistake to be avoidedis estimating the individual distances (whatever the method)and then averaging these individual values A more correctapproach is to average the individual parallaxes and then toobtain a distance from this average value However a muchbetter solution is to set up a model for the cluster whichwould use as parameters the distance to its centre for exampleand some measure of its size and to infer its parametersThis is explored in the tutorial at httpsgithubcomagabrownastrometry-inference-tutorialstreemastermultiple-source This introduces the overall prob-lem and derives a general solution Code is implemented forthe specific case of a model which assumes a random isotropicdistribution of the true stars from the centre of the cluster Thismodel has two parameters the cluster distance and the clustersize The solution uses a small angle approximation to makethe problem simpler although it is easily extended to the caseof clusters with a significant angular extent It is applied to thePleiades selection from the Gaia DR1 main release paper(Gaia Collaboration et al 2016) The tutorial also considers theproblem of how to accommodate correlations in the measuredparallaxes of different stars Finally it also shows the resultsfrom a classical and a naive combination of stellar parallaxesto estimate the cluster distance The combination of parallaxesand proper motions of individual stars in a cluster into a singlesolution for the mean parallax and proper motion is treated asan iterative least squares problem in Gaia Collaboration et al(2017b) (see their Appendix A for details)

54 Inferring the distance and velocity of a source using theparallax and proper motions

The velocity (speed and direction) of a source in the planeof the sky can be inferred from measurements of its parallaxand two proper motions The uncertainties in all three affectthe inferred velocity and its uncertainty Moreover as theGaia parallaxes and proper motions generally have non-zerocorrelations these must also be taken into account Thiscan be done in a straightforward manner in the Bayesianapproach as is shown in the tutorial at httpsgithubcomagabrownastrometry-inference-tutorialstreemaster3d-distance This sets up a three-parametermodel (distance speed angle) for a source Using the threemeasurements (parallax two proper motions) in a multivariateGaussian likelihood and suitable priors on the parameters wecan compute the trivariate posterior This is sampled in theposterior using an MCMC algorithm for a set of stars

Article number page 17 of 20

AampA proofs manuscript no 32964_Arxiv

55 Luminosity calibration

In this tutorial (httpsgithubcomagabrownastrometry-inference-tutorialstreemasterluminosity-calibration) the problem of inferring (orcalibrating) the mean absolute magnitude of a specific classof stars is treated The measurements at hand are the parallaxand apparent magnitude for each of the stars in the sampleand the task is to infer their mean absolute magnitude microM andthe spread σM around this mean This is very similar to theproblem that Lutz amp Kelker (1973) and Turon Lacarrieu ampCreacutezeacute (1977) treated and a Bayesian approach to solving thisproblem was presented by Brown (2012) (albeit with the use ofimproper priors which we again note is bad practice) A morecomplex version of this problem (accounting for extinction anda contaminating population of stars) and its Bayesian solutionwas presented in Hawkins et al (2017) In this tutorial threeimportant points are illustrated

ndash Often the explicit computation of the distances to stars is notof interest In this example only the mean absolute magni-tude of the stars is to be estimated and the forward mod-elling approach as part of the Bayesian inference avoids theneed to calculate or estimate distances

ndash The data for all the stars carry information on the mean ab-solute magnitude including the negative parallaxes or par-allaxes with large relative errors This information can nat-urally be incorporated in a forward modelling approach (inthis example as part of a Bayesian inference method) thusavoiding the introduction of truncation biases caused by theselection of stars with lsquogoodrsquo parallaxes

ndash If the selection function is known (in this example the surveyis magnitude limited) it can and should be included in theforward modelling This accounts for sample selection biasesthat would otherwise occur

56 Period-luminosity relation

In this tutorial (httpsgithubcomagabrownastrometry-inference-tutorialstreemasterperiod-luminosity-relation) we include a hierarchicalmodel to infer period-luminosity-metallicity relations forclassical pulsating stars The full model can be applied tofundamental mode RR Lyrae stars and the abridged version(without the metallicity dependence) is suitable for samples ofclassical Cepheids We include the data set for the RR Lyraestars described and used for inference in Gaia Collaborationet al (2017a) and Delgado et al (2018) It contains a sampleof 200 stars (including fundamental radial pulsators but alsofundamentalised first overtone pulsators) with measured pe-riods apparent magnitudes in the K-band metallicities andparallaxes from the TGAS catalogue In the tutorial we describethe hierarchical model and discuss potential biases in the dataset Finally we analyse the sensitivity of the results to differentchoices of the priors and related parameters

6 Conclusions

Gaia data releases will provide a huge increase of astrometricdata available for the scientific community More than a billionparallaxes and proper motions allow new openings into many as-tronomical topics In most cases astronomers are exploiting theGaia catalogues to obtain physical quantities such as distanceand velocity Although it is easy to extract from the Gaia data it

is well known that direct inversion of parallax will lead to biaseswhich become more and more significant the larger the relativeparallax uncertainty While Gaia will provide high-quality astro-metric measurements hundreds of millions of stars have preci-sions which require proper statistical treatment in order to avoidbiased conclusions The aim of this paper is to guide the users ofGaia data to handle astrometric data correctly

In this study we summarise methods used to avoid bi-ases when converting astrometric data into physical quantitiesStarting from simple non-recommended sample truncation tomore complex methods the biases associated with the meth-ods are presented The basic recommendation is to treat deriva-tion of physical quantities from astrometric measurements asan inference problem which should be preferably handled withBayesian approach The recommended methods are described inSect 4 with a summary in Sect 43 To aid the users furtherSect 5 contains practical examples with links to Python and Rcode

Gaia will provide fundamental data for many fields of as-tronomy Further data releases will provide more data and moreprecise data Nevertheless for full use of the potential it will al-ways be necessary to pay careful attention to the statistical treat-ment of parallaxes and proper motions The purpose of this paperis to help astronomers find the correct approachAcknowledgements This work has made use of results from the European SpaceAgency (ESA) space mission Gaia the data from which were processed by theGaia Data Processing and Analysis Consortium (DPAC) Funding for the DPAChas been provided by national institutions in particular the institutions partici-pating in the Gaia Multilateral Agreement The Gaia mission website is httpwwwcosmosesaintgaia The authors are current or past members of theESA Gaia mission team and of the Gaia DPAC This work was supported by theMINECO (Spanish Ministry of Economy) through grant ESP2016-80079-C2-1-R (MINECOFEDER UE) and ESP2014-55996-C2-1-R (MINECOFEDERUE) and MDM-2014-0369 of ICCUB (Unidad de Excelencia lsquoMariacutea de Maeztursquo)and by the DLR (German space agency) via grant 50 QG 1403

ReferencesAnderson L Hogg DW Leistedt B Price-Whelan AM Bovy J Jun 2017

ArXiv e-printsArenou F Luri X 1999 In Egret D Heck A (eds) Harmonizing Cosmic

Distance Scales in a Post-HIPPARCOS Era vol 167 of Astronomical Societyof the Pacific Conference Series 13ndash32

Arenou F Luri X Babusiaux C et al Feb 2017 AampA 599 A50Astraatmadja TL Bailer-Jones CAL Dec 2016a ApJ 832 137Astraatmadja TL Bailer-Jones CAL Dec 2016b ApJ 833 119Bailer-Jones CAL Oct 2015 PASP 127 994Bovy J Sep 2017 MNRAS 470 1360Brown AGA Nov 2012 Statistical AstrometryBrown AGA Arenou F van Leeuwen F Lindegren L Luri X Aug 1997

In Bonnet RM Hoslashg E Bernacca PL et al (eds) Hipparcos - Venice rsquo97vol 402 of ESA Special Publication 63ndash68

de Bruijne JHJ Hoogerwerf R de Zeeuw PT Feb 2001 AampA 367 111Delgado HE Sarro LM Clementini G Muraveva T Garofalo A Mar 2018

ArXiv e-printsEddington AS Mar 1913 MNRAS 73 359ESA (ed) 1997 The HIPPARCOS and TYCHO catalogues Astrometric and

photometric star catalogues derived from the ESA HIPPARCOS Space As-trometry Mission vol 1200 of ESA Special Publication (available online athttpswwwcosmosesainthipparcos)

Gaia Collaboration Brown AGA Vallenari A et al Nov 2016 AampA 595A2

Gaia Collaboration Clementini G Eyer L et al Sep 2017a AampA 605 A79Gaia Collaboration van Leeuwen F Vallenari A et al May 2017b AampA 601

A19Gaia Collaboration Arenou F Luri X et al Apr 2018a AampA (special issue

for Gaia DR2)Gaia Collaboration Brown AGA Vallenari A Prusti T et al Apr 2018b

AampA (special issue for Gaia DR2)Gaia Collaboration Lindegren L Hernaacutendez J et al Apr 2018c AampA (special

issue for Gaia DR2)

Article number page 18 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Hawkins K Leistedt B Bovy J Hogg DW Oct 2017 MNRAS 471 722Iorio G Belokurov V Erkal D et al Feb 2018 MNRAS 474 2142Leistedt B Hogg DW Dec 2017a AJ 154 222Leistedt B Hogg DW Mar 2017b ArXiv e-printsLindegren L Madsen S Dravins D Apr 2000 AampA 356 1119Lindegren L Lammers U Hobbs D et al Feb 2012 AampA 538 A78Lindegren L Lammers U Bastian U et al Nov 2016 AampA 595 A4Lutz TE Kelker DH Oct 1973 PASP 85 573Malmquist GK Mar 1920 Meddelanden fran Lunds Astronomiska Observato-

rium Serie II 22 3Palmer M Arenou F Luri X Masana E Apr 2014 AampA 564 A49Robin C Luri X Reyleacute C et al 2012 AampA 543 A100Schoumlnrich R Aumer M Dec 2017 MNRAS 472 3979Schoumlnrich R Binney J Asplund M Feb 2012 MNRAS 420 1281Schroumlder SE Kaper L Lamers HJGLM Brown AGA Dec 2004 AampA

428 149Sesar B Fouesneau M Price-Whelan AM et al Apr 2017 ApJ 838 107Smith H Jr Eichhorn H Jul 1996 MNRAS 281 211Trumpler RJ Weaver HF 1953 Statistical astronomyTuron Lacarrieu C Creacutezeacute M Apr 1977 AampA 56 273Zinn JC Huber D Pinsonneault MH Stello D Aug 2017 ApJ 844 166

Article number page 19 of 20

AampA proofs manuscript no 32964_Arxiv

Appendix A Description of the simulated samplesused in this paper

The data used in this paper is from a Gaia Universe Model Snap-shot (GUMS) simulation (Robin C et al 2012) together withthe Gaia-like uncertainties and an estimation of the observabledata The uncertainties were computed from an implementationof the recipes described in the Gaia Science performance webpage4 provided by the python PyGaia toolkit5 which then haveto be re-scaled to fit the Gaia DR2 expectations

The simulation contains around 109 sources including onlysingle stars ie stars not belonging to multiple systems up toG lt 20 magnitude distributed as shown in Fig A1

0 5 10 15 20 25G magnitude [mag]

100

101

102

103

104

105

106

107

108

109

Num

ber

of

sourc

es

[log]

G magnitude histogram

Fig A1 Histogram of the starsrsquo G magnitude The total number ofsources is 1 069 138 714 which are distributed in bins of size ∆G = 02the sample is limited to G lt 20

To compute the Gaia potential observables standard uncer-tainties must be added to the simulation Because astrometricquantities (positions parallaxes and proper motions) are relateda single formalism to derive its standard errors is needed PyGaiaimplements a simple performance model depending on the VndashICcolour term and the G magnitude to estimate the end-of-missionerrors for the parallax uncertainty

σ$[microas] =radicminus1631 + 680766 z + 32732 z2 [0986+(1 minus 0986) (V-IC)]

(A1)

with

z = max(1004(1209minus15)1004(Gminus15)) (A2)

(see the Gaia science performance web for more details) It alsotakes into account the variation of the uncertainties over the skybecause of the scanning law tabulated as a function of the eclip-tic latitude6 β

However these are end-of-mission uncertainties so theyhave to be scaled by the fraction of mission time completed in or-der to get an estimation of them for Gaia DR2 In the case of the

parallax only the factorradic

5L being L the mission time (years)

included in Gaia DR2 need to be applied this error model is

4 httpswwwcosmosesaintwebgaiascience-performance5 httpspypipythonorgpypiPyGaia6 httpswwwcosmosesaintwebgaiatable-2-with-ascii

further described in Arenou et al (2017) In our case we havealso updated the calibration floor to take into account the prop-erties of the Gaia DR2 formal errors as shown in Fig A2 Thiscalibration floor introduces a minimum formal error stemmingfrom the fact that the calibrations used in the data processing(models and parameters) are at this stage still being refined Thisfloor affects mainly bright stars while for faint stars the photonnoise dominates Fig A2 shows the model used in the simula-tion for the Gaia DR2 parallax uncertainties as a function of theG magnitude

Fig A2 Average estimated errors as function of G The dash-dottedline represents the uncertainty at Gaia DR2 while the dashed linerepresents the end-of-mission uncertainties The solid line representsGaia DR2 errors including systematics

Article number page 20 of 20

  • 1 Introduction
  • 2 Gaia astrometric data
    • 21 Uncertainties
    • 22 Correlations
    • 23 Systematic errors
    • 24 Completeness
      • 3 Critical review of the traditional use of parallaxes
        • 31 Measurement of parallaxes
        • 32 Estimating distance by inverting the parallax
        • 33 Sample truncation
        • 34 Corrections and transformations
          • 341 The SmithndashEichhorn correction
          • 342 The LutzndashKelker correction
            • 35 Astrometry-based luminosity
              • 4 Recommendations for using astrometric data
                • 41 Using Gaia astrometric data how to proceed
                  • 411 Bayesian inference of distances from parallaxes
                  • 412 Bayesian inference of distances from parallaxes and complementary information
                    • 42 Absolute magnitudes tangential velocities and other derived quantities
                    • 43 Further recommendations
                      • 5 Using astrometric data practical examples
                        • 51 Comparison of distance and distance modulus estimators
                        • 52 Inferring the distance to a single source using just the parallax
                        • 53 Inferring the distance to and size of a cluster using just the parallaxes
                        • 54 Inferring the distance and velocity of a source using the parallax and proper motions
                        • 55 Luminosity calibration
                        • 56 Period-luminosity relation
                          • 6 Conclusions
                          • A Description of the simulated samples used in this paper
Page 8: Gaia Data Release 2: Using Gaia parallaxes · The Gaia Data Release 2 (Gaia DR2 ) Gaia Collaboration et al. (2018b) provides precise positions, proper motions, and paral-laxes for

AampA proofs manuscript no 32964_Arxiv

Fig 9 Effect of introducing a limit on the observed parallax Thick linehistogram of distances up to 5 kpc for the complete sample Thin linehistogram of distances for the sample truncated by removing objectswith $ lt 02 mas (distance estimated as 1$ up to 5 kpc)

uncertainty (eg f = 01) we substitute a perfectly useful andvaluable measurement by an arbitrary value of $lowast

The SmithndashEichhorn transformation is an arbitrary (andrather convolved) choice amongst many such transformationsthat can reduce the bias for certain particular situations Both theanalytical expression and the choice of constants and β are theresult of an unspecified trial-and-error procedure the applicabil-ity of which is unclear Furthermore as stated by the authorsthey introduced a new bias because the transformed parallax isalways larger than the measured parallax This new bias has nophysical interpretation because it is the result of an ad hoc choicefor the analytical expression in Eq (10) It is designed to reducethe bias but it does so by substituting perfectly reasonable directmeasurements (negative and small parallaxes) that we can inter-pret and use for inference by constructed values arising from thechoice

342 The LutzndashKelker correction

Lutz amp Kelker (1973) realised that the spatial distribution ofsources around the observer together with the unavoidable ob-servational errors and a truncation of the sample on the valueof the observed parallax result in systematic biases in the av-erage parallax of certain stellar samples The bias described byLutz amp Kelker (1973) is a manifestation of the truncation bi-ases described above and can be understood if we look at a fewvery simple examples First let us imagine a density of sourcesaround the observer such that the distribution of true parallaxesp($True) is constant in a given interval and zero outside Let usalso imagine that the observation uncertainty σ$ is constant andequal to 03 mas The left panel of Fig 10 shows the distributionof true and observed parallaxes for a simulation of such a situ-ation and 106 sources We see that the observed parallaxes arealso approximately uniform and the departures from uniformityappear near the edges For a given bin of intermediate parallaxwe have as many sources contaminating from neighbouring binsas we have sources lost to other bins due to the observationaluncertainties and the result is a negligible net flux

In the middle panel we show the same histograms exceptthat instead of a constant parallax uncertainty we use a constantvalue of the fractional parallax uncertainty f This is still an un-realistic situation because the distribution of true parallaxes is

not uniform and also because a constant f implies that largerparallaxes are characterised by larger uncertainties However ithelps us to illustrate that even in the case of uniform true par-allaxes we may have a non-zero net flux of sources betweendifferent parallax bins depending on the distribution of parallaxuncertainties In this case noise shifts the value of large paral-laxes more than that of smaller parallaxes (because of the con-stant value of f ) so larger parallaxes get more scattered thussuppressing the distribution more at large parallaxes

Finally the right panel shows the same plot for a realisticdistribution of distances from a Gaia Universe Model Snapshot(GUMS) sample (see Appendix A for a full description) We seethat the effect of a realistic non-uniform distribution of parallaxesand parallax uncertainties results in a net flux in the oppositedirection (smaller true parallaxes become more suppressed andlarger parallax bins are enhanced in both cases the bins of neg-ative parallaxes become populated) This is the root of the LutzndashKelker bias It is important to distinguish between the LutzndashKelker bias and the LutzndashKelker correction The LutzndashKelkerbias is the negative difference for any realistic sample betweenthe average true parallax and the average measured parallax (iethe average true parallax is smaller than the average measuredparallax) This bias has been known at least since the work ofTrumpler amp Weaver (1953) although it was already discussed ina context different from parallaxes as early as Eddington (1913)The LutzndashKelker correction presented in Lutz amp Kelker (1973)and discussed below is an attempt to remedy this bias based ona series of assumptions

In the case of Gaussian uncertainties such as those describedin Eq (6) it is evident that the probability of measuring a value ofthe parallax greater than the true parallax p($ gt $True|$True) =05 The same value holds for the probability that $ lt $Truebecause the Gaussian distribution is symmetrical with respectto the true value of the parallax This is also true for the jointprobability of $ and $True

p($ gt $True) =

S

p($$True) middot d$ middot d$True = 05 (11)

where S is the region of the ($$True) plane where $ gt $TrueHowever the probability distribution of the true parallax

given the observed parallax p($True|$) does not fulfil this seem-ingly desirable property of probability mass equipartition at the$ = $True point We can write the latter probability as

p($True|$) =p($$True)

p($)=

p($|$True) middot p($True)p($)

(12)

using the product rule of probability In the context of inferringthe true parallax from the observed one Eq (12) is the well-known Bayesrsquo theorem where the left-hand side is the posteriorprobability p($ | $True) is the likelihood p($True) is the priorand p($) is the evidence For most realistic prior distributionsp($True) neither the median nor the mode or the mean of theposterior in Eq (12) is at $ = $True Let us take for examplea uniform volume density of sources out to a certain distancelimit In such a distribution the number of sources in a sphericalshell of infinitesimal width at a distance r scales as r2 as doesthe probability distribution of the distances Since

p(r) middot dr = p($True) middot d$True (13)

Article number page 8 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Uniform ϖTrue and constant σϖ

Parallax (mas)

Pro

babi

lity

minus002 002 006 010

05

1015

20Uniform ϖTrue and constant fTrue

Parallax (mas)

Den

sity

minus005 000 005 010 015 020

GUMS simulation

Parallax (mas)

Den

sity

minus1 0 1 2

01

23

45

Fig 10 Histograms of true (grey) and observed (blue) parallaxes for three simulations uniform distribution of parallaxes and constant σ$ = 03mas (left) uniform distribution of parallaxes and constant f = 05 (centre) and the GUMS simulation described in Appendix A (right)

the probability distribution for the true parallax in such a trun-cated constant volume density scenario is proportional to

p($True) prop $minus4True (14)

out to the truncation radius Hence for Gaussian distributed un-certainties we can write p($True|$) as

p($True|$) prop1σ$middot exp(

minus($ minus$True)2

2σ$) middot$minus4

True (15)

The joint distribution p($$True) (ie the non-normalisedposterior plotted as a 2D function of data parallax and param-eter $True) for this particular case of truncated uniform stellarvolume densities is depicted in Fig 11 together with the con-ditional distributions for particular values of $ and $True Itshows graphically the symmetry of the probability distributionp($|$True) (with respect to $True) and the bias and asymmetryof p($True|$)

Lutz amp Kelker (1973) obtain Eq (15) in their Sect ii underthe assumption of uniform stellar volume densities and constantfractional parallax uncertainties (constant f ) They discuss sev-eral distributions for different values of the ratio σ$$True Intheir Sect iii they use the expected value of the true parallaxgiven by the distribution p($True|$) in Eq (15) to infer the ex-pected value of the difference between the true absolute magni-tude MTrue and the value obtained with the naive inversion of theobserved parallax The expected value of this absolute magni-tude error is derived and tabulated for the distribution p($True|$)as a function of the fractional parallax uncertainty f This so-called LutzndashKelker correction is often applied to stellar samplesthat do not fulfil the assumptions under which it was derived be-cause the stellar volume density is far from uniform at scaleslarger than a few tens of parsecs and the samples to which thecorrection is applied are never characterised by a unique valueof f

p(ϖ

Tru

e|ϖ

)0

12

34

5

00 02 04 06 08 10

00 02 04 06 08 10

00

05

10

15

ϖTrue

ϖ

p(ϖ|ϖTrue)

piob

s

0 1 2 3 4

00

05

10

15

minus50

minus40

minus30

minus20

minus10

0

10

Fig 11 (Lower left) Joint probability distribution (in logarithmic scaleto improve visibility) for the random variables $ and $True in the sce-nario of a truncated uniform volume density The colour code is shownto the right of the lower right panel and white marks the region wherethe probability is zero (Lower right) Conditional probability distribu-tion of the observed parallax for $True=05 (Upper left) Conditionalprobability distribution for $True given an observed parallax $ Thefractional parallax uncertainty assumed for the computation of all prob-abilities is f = 02

35 Astrometry-based luminosity

An obvious way to avoid the problems associated with the naiveinversion of observed parallaxes (see Sect 341) is to remainin the space of parallaxes (as opposed to that of distances) in-sofar as this is possible One example of this approach is the

Article number page 9 of 20

AampA proofs manuscript no 32964_Arxiv

astrometry-based luminosity (ABL) method (Arenou amp Luri1999) originating from Malmquist (1920) The ABL methodconsists in substituting the absolute magnitudes by a proxy thatis linearly dependent on the parallax The original proposal was

aV equiv 1002MV = $10mV +5

5 (16)

and has been recently used to obtain maximum likelihood esti-mates of the period-luminosity relation coefficients for Cepheidsand RR Lyrae stars (Gaia Collaboration et al 2017a) and to im-prove the Gaia parallax uncertainties using deconvolved colour-magnitude diagrams as prior (Anderson et al 2017) The newastrometry-based luminosity depends linearly on the parallaxand thus its uncertainty can be expected to have an approxi-mately Gaussian distribution if the fractional uncertainty of theapparent magnitude is negligible This is more often the casethan for fractional parallax uncertainties and is in general a goodapproximation

Unfortunately the astrometry-based luminosity can only beapplied to the study of the luminosity and can do nothing forthe analysis of spatial distributions where distances or tangentialvelocities are inevitable ingredients

4 Recommendations for using astrometric data

In this section we provide specific advice on the use of astro-metric data in astronomical data analysis Although the focus ison the use of Gaia data many of the recommendations hold forthe analysis of astrometric data in general To illustrate the rec-ommendations we also provide a small number of worked ex-amples ranging from very basic demonstrations of the issuesmentioned in Sect 3 to full Bayesian analyses Some of theseexamples are available in the Gaia archive tutorial described inSect 5

41 Using Gaia astrometric data how to proceed

The fundamental quantity sought when measuring a stellar par-allax is the distance to the star in question However as discussedin the previous sections the quantity of interest has a non-linearrelation to the measurement r = 1$True and is constrainedto be positive while the measured parallax can be zero or evennegative Starting from a measured parallax which is normallydistributed about the true parallax this leads to a probability den-sity for the simple distance estimator ρ = 1$ (see Sect 3) forwhich the moments are defined in terms of unknown quantitiesThis means we cannot calculate the variance of the estimator orthe size of a possible bias which renders the estimator uselessfrom the statistical point of view

Our first and main recommendation is thus to alwaystreat the derivation of (astro-)physical parameters from as-trometric data in particular when parallaxes are involvedas an inference problem which should preferably be handledwith a full Bayesian approach

411 Bayesian inference of distances from parallaxes

The Bayesian approach to inference involves estimating a PDFover the quantity of interest given the observables In this casewe want to estimate the distance r given the parallax $ Afuller treatment of this problem has been given in Bailer-Jones

(2015) so only a brief summary will be given here Using Bayesrsquotheorem we can write the posterior as

P(r | $) =1Z

P($ | r)P(r) (17)

Formally everything is also conditioned on the parallax uncer-tainty σ$ and any relevant constraints or assumptions but sym-bols for these are omitted for brevity The quantity P($ | r) isthe likelihood from Eq (12) The prior P(r) incorporates ourassumptions and Z is a normalisation constant

In addition to the likelihood there are two important choiceswhich must be made to estimate a distance the choice of priorand the choice of estimator We will first focus on the formerand start discussing the simplest prior the uniform unboundedprior With a uniform boundless (and thus improper) prior ondistances the posterior is proportional to the likelihood and ifwe choose the mode of the posterior as our estimator then thesolution is mathematically equivalent to maximising the likeli-hood However a boundless uniform prior permits negative dis-tances which are non-physical so we should at least truncate itto exclude these values

The more measurements we have or the more precise themeasurements are the narrower the likelihood and the lower theimpact of the prior should be on the posterior It turns out how-ever that with the unbounded uniform prior the posterior is im-proper ie it is not normalisable Consequently the mean andmedian are not defined The only point estimator is the modeie rest = 1$ (the standard deviation and the other quantilesare likewise undefined) which is rather restrictive Finally thisBayesian distance estimate defined for an unbounded uniformprior reduces to the maximum likelihood estimate and coincideswith the naive inversion discussed in Sect 32 The posterioris ill-defined for the unbounded uniform prior for parallaxesThis prior describes an unrealistic situation where the observer isplaced at the centre of a distribution of sources that is sphericallysymmetric and the volume density of which decreases sharplywith distance

The solution to these problems (non-physical distances im-proper posterior) is to use a more appropriate prior The proper-ties of various priors and estimators have been studied by Bailer-Jones (2015) and Astraatmadja amp Bailer-Jones (2016b) The lat-ter makes a detailed study using a Milky Way model for a priorand also investigates how the estimates change when the Gaiaphotometric measurements are used in addition to the parallaxOne of the least informative priors we can use is the exponen-tially decreasing space density prior

P(r) =

1

2L3 r2eminusrL if r gt 0

0 otherwise (18)

For distances r L this corresponds to a constant space densityof stars with the probability dropping exponentially at distancesmuch larger than the mode (which is at 2L) Examples of theshape of the posterior for parallaxes of different precisions areshown in Bailer-Jones (2015) and Astraatmadja amp Bailer-Jones(2016b)

The posterior obtained for the prior defined in Eq (18) is nor-malised and thus we have a choice of point estimators (meanmedian or mode) Also the distribution is asymmetric andtwo quantiles (5 and 95) rather than the standard deviationare recommended to summarise the uncertainty in the point es-timate The median as a point estimate is guaranteed to lie

Article number page 10 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

between these quantiles Astraatmadja amp Bailer-Jones (2016b)used this prior as well as a Milky Way prior to infer distancesfor the two million TGAS stars in the first Gaia data releaseThe behaviour of the estimates derived from the exponentiallydecreasing space density prior can be explored using the interac-tive tool available in the tutorial described in Sect 51

In general the introduction of reasonable prior probabilitiesaccounts for the LutzndashKelker bias although the inevitable mis-match between the true distribution of parallaxes and the priorused will result in less accurate inferences In any case the ad-vantage with respect to the methods discussed in Sect 3 is cleari) we do not need to tabulate corrections for each prior assumingconstant f ii) we do not need to dispose of non-positive par-allaxes iii) we obtain a proper full posterior distribution withwell-defined moments and credible intervals iv) even simplepriors such as the exponential decreasing volume density willimprove our estimates with respect to the unrealistic prior un-derlying the maximum likelihood solution rest = 1$ and fi-nally v) we obtain estimators that degrade gracefully as the dataquality degrades and with credible intervals that grow with theobservational uncertainties until they reach the typical scales ofthe prior when the observations are non-informative These ad-vantages come at the expense of an inference that is more com-putationally demanding in general (as it requires obtaining theposterior and its summary statistics if needed) the need for athoughtful choice of a prior distribution and the analysis of theinfluence of the prior on the inference results

Figure 12 shows the distribution of means (left) modes (cen-tre) and medians (right) of the posteriors inferred for a simu-lation of 105 sources drawn from an exponentially decreasingspace density distribution This simulation represents the un-likely case where the prior is a perfect representation of the truedistribution

From a Bayesian perspective the full posterior PDF is thefinal result of our inference if we only use parallax measure-ments to infer the distance (see below) and further analysesshould make use of it as a whole This is our recommendationin general avoid expectations and summaries of the posteriorHowever it is often useful to compute summary statistics suchas the mean (expectation) median mode quantiles etc to havean approximate idea of the distribution but we should not usethese summaries for further inference for example to estimateabsolute magnitudes Cartesian velocities etc We recommendinferring the full posterior distributions for these derived quanti-ties using the posterior of the true parallax or of the distance orusing the same Bayesian scheme as for the true parallax as ex-plained in Sect 42 In Fig 13 we show the values of the mean(left) mode (centre) and median (right) that we would obtainfrom a set of 104 simulated observations of a star at 100 parsecswith f = 02 We assume a Gaussian distribution of the obser-vations around the true parallax The posterior distribution is in-ferred using Eq (17) and two priors a uniform volume densityof sources truncated at 1 kpc (results in grey) and a uniform den-sity of sources multiplied by an exponential decay of length scale200 pc as defined in Eq (18) (in blue) The expectation valuesof the histograms are shown as dashed lines of the same colourwith the true value (100 pc) shown as a red dashed line We seein general that i) the truncation has the effect of increasing thenumber of overestimated distances ii) the three estimators arebiased towards larger distances (smaller parallaxes) but the ex-pectation of the mode is significantly closer to the true valueand iii) the abrupt truncation of the prior results in a spuriouspeak of modes at the truncation distance as already discussed inBailer-Jones (2015)

Figure 14 and Table 1 show a comparison of the absolutevalue of the empirical bias and standard deviation associatedwith some distance estimators discussed in this paper as a func-tion of the measured fractional uncertainty in the parallax Wechose the measured value even though it is a very poor and non-robust estimator because as stated in Sect 32 we never have ac-cess to the true fractional parallax uncertainty This figure showsthe results obtained for 107 sources in the Gaia DR2 simulationdescribed in Appendix A for the maximum likelihood estimateρ = 1

$with and without the SmithndashEichhorn correction and for

the mode estimates based on the posterior distribution for twopriors (a uniform distance prior UD with maximum distancerlim = 100 kpc and an exponentially decreasing space densityprior EDSD with L = 135 kpc) neither of which matches thetrue distribution of sources in the simulation Only the mode ofthe posteriors is plotted (but not the mean or the median) for thesake of clarity The conclusions described next are only valid un-der the conditions of the exercise and are provided as a demon-stration of the caveats and problems described in previous sec-tions not as a recommendation of the mode of the posterior in-ferred under the EDSD prior as an estimator At the risk of re-peating ourselves we emphasise the need to adopt priors adaptedto the inference problem at hand Also the conclusions only holdfor the used simulation (where we generate the true distances andhence can calculate the bias and standard deviation) and need notbe representative of the true performance for the real Gaia dataset They can be summarised as follows

ndash the mode of the EDSD prior shows the smallest bias andstandard deviation in practically the entire range of estimatedfractional parallax uncertainties (in particular everywherebeyond the range of fapp represented in the plot)

ndash the SmithndashEichhorn estimate shows pathological biases andstandard deviations in the vicinity of the supposedly best-quality measurements at fapp = 0 Away from this region itprovides the next less biased estimates (averaged over binsof fapp) after the mode of the EDSD posterior

412 Bayesian inference of distances from parallaxes andcomplementary information

The methodology recommended in the previous paragraphs isuseful when we only have observed parallaxes to infer distancesThere are however common situations in astronomy where theparallaxes are only one of many observables and distances arenot the final goal of the inference problem but a means to achieveit In this context we recommend an extension of the classicalBayesian inference methods described in the previous sectionThese problems are characterised by a set of observables andassociated uncertainties (that include but are not restricted toparallaxes) and a series of parameters (the values of which areunknown a priori) with complex interdependence relationshipsamongst them Some of these parameters will be the ultimategoal of the inference process Other parameters do play an im-portant role but we are not interested in their particular valuesand we call them nuisance parameters following the literatureFor example in determining the shape of a stellar associationthe individual stellar distances are not relevant by themselvesbut only insomuch as we need them to achieve our objectiveWe show below how we deal with the nuisance parameters Theinterested reader can find applications of the methodology de-scribed in this section to inferring the coefficients of period-luminosity relations in Gaia Collaboration et al (2017a) andSesar et al (2017) Also the same methodology (a hierarchical

Article number page 11 of 20

AampA proofs manuscript no 32964_Arxiv

Mean

Cou

nts

minus8000 minus4000 0 2000

0e+

002e

minus04

4eminus

046e

minus04

8eminus

04

Mode

Cou

nts

minus8000 minus4000 0 2000

Median

Cou

nts

minus8000 minus4000 0 2000

0e+

002e

minus04

4eminus

046e

minus04

8eminus

04

Fig 12 Probability distribution of the residuals of the Bayesian estimate of the true distance in parsecs for 100000 simulated stars drawn from auniform density plus exponential decay distribution and σ$ = 3 middot 10minus1 mas (orange) 3 mas (blue) and 30 mas (grey)

Mean

Cou

nts

0 200 400 600 800 1000

000

00

005

001

00

015

002

0

Mode

Cou

nts

0 200 400 600 800 1000

Median

Cou

nts

0 200 400 600 800 1000

000

00

005

001

00

015

002

0

Fig 13 Distributions of means (left) modes (centre) and medians (right) for a series of 104 posteriors calculated for a star at a true distance of100 pc and f = 02 and observed parallaxes drawn at random from the corresponding Gaussian distribution Posteriors are inferred with a uniformdensity prior truncated at 1 kpc (grey) or with a uniform density with exponential decay prior and length scale 135 kpc (blue) The red vertical linemarks the true parallax the grey and blue lines correspond to the expected value (mean) of each distribution (same colours as for the histograms)

Bayesian model) is applied in Hawkins et al (2017) where theconstraint on the distances comes not from a period-luminosityrelation but from the relatively small dispersion of the abso-lute magnitudes and colour indices of red clump stars A lastexample of this methodology can be found in Leistedt amp Hogg(2017b) where the constraint comes from modelling the colour-magnitude diagram

Just as in the previous section where we aimed at estimat-ing distances from parallaxes alone the two key elements in thiscase are the definitions of a likelihood and a prior The likeli-hood represents the probability distribution of the observablesgiven the model parameters Typically the likelihood is basedon a generative or forward model for the data Such modelspredict the data from our assumptions about the physical pro-cess that generates the true values (ie the distribution of stars in

space) and our knowledge of the measurement process (eg jus-tifying the assumption of a normal distribution of the observedparallax around its true value) Forward models can be used togenerate arbitrarily large synthetic data sets for a given set ofthe parameters In this case however where we are concernedwith several types of measurements that depend on parametersother than the distance the likelihood term will be in generalmore complex than in Sect 411 and may include probabilisticdependencies between the parameters The term hierarchical ormulti-level model is often used to refer to this kind of model

Let us illustrate the concept of hierarchical models with asimple extension of the Bayesian model described in Sect 411where instead of assuming a fixed value of the prior length scaleL in Eq (18) we make it another parameter of the model and tryto infer it Let us further assume that we have a set of N parallax

Article number page 12 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Table 1 Average bias and standard deviation in three regimes of fapp for four distance estimators discussed in this paper (from left to right) themode of the posterior based on the exponentially decreasing space density (EDSD) prior the mode of the posterior of the uniform distance (UD)distribution the maximum likelihood estimate corrected according to Smith amp Eichhorn (1996) abbreviated as SE and the maximum likelihood(ML) estimate The wider ranges of fapp exclude narrower ranges shown in previous rows of the table

Summary fapp Range EDSD UD SE ML

Bias(-11) -02 97 342 -095(-55) -03 107 -034 -12(-5050) -03 162 -04 -38

Std Deviation(-11) 04 80 6858 05(-55) 04 84 05 195(-5050) 04 106 04 171

fapp

log 1

0(|B

ias|

)

minus4

minus2

02

4

Mode EDSDMode UDSmithminusEichhorn estimateML estimate

fapp

log 1

0(σ)

minus1

12

34

5

minus5 minus3 minus1 1 3 5

Mode EDSDMode UDSmithminusEichhorn estimateML estimate

Fig 14 Bias (top) and standard deviation (bottom) averaged over binsof the estimated fractional parallax uncertainty fapp for four estimatorsof the distance the maximum likelihood (ML) estimator rest = 1

$(or-

ange) the ML estimator corrected as described in Smith amp Eichhorn(1996) (light blue) the mode of the posterior obtained with an expo-nentially decreasing space density (EDSD) prior and L = 135 kpc (darkgreen) and the mode of the posterior obtained with a uniform distance(UD) prior truncated at 100 kpc (red)

measurements $k one for each of a sample of N stars In thiscase the likelihood can be written as

p($k|rk L) = p($k | rk) middot p(rk | L)

= p($k | rk) middotNprod

k=1

p(rk | L) (19)

where rk is the true unknown distance to the kth star We note thatvery often the measured parallaxes are assumed independentand thus p($k | rk) is written as the product

prodNk=1 p($k | rk)

This is incorrect in general for Gaia parallaxes because the par-allax measurements are not independent As described in GaiaCollaboration et al (2018c) and Sect 2 of this paper there areregional correlations amongst them (see Sect 43) but for thesake of simplicity let us assume the sample of N measurementsis spread all over the celestial sphere such that the correlationsaverage out and can be neglected Hence we write

p($k|rk L) =

Nprodk=1

p($k | rk) middot p(rk | L) (20)

Under the assumption of Gaussian uncertainties the first termin the product is given by Eq (6) while the second is given byEq (18)

This likelihood can be represented by a simple directed graph(see Fig 15) that provides information about the conditional de-pendencies amongst the parameters The shaded nodes representthe observations the open circles represent model parametersand the small black circles at the origin of the arrows representmodel constants The arrows denote conditional dependence re-lations and the plate notation indicates repetition over the mea-surements k

The next key element is as in Sect 411 the prior Accord-ing to Fig 15 the only parameter that needs a prior specificationis the one without a parent node L The rest of the arcs in thegraph are defined in the likelihood term (Eq (20)) If the sampleof N stars were representative of the inner Galactic halo for ex-ample we could use a Gaussian prior centred at asymp 30 kpc (seeeg Iorio et al 2018 and references therein) Such a hierarchicalmodel can potentially shrink the individual parallax uncertaintiesby incorporating the constraint on the distribution of distances

If we are only interested in the individual distances rk wecan consider L as a nuisance parameter

p($Truek | $k) =

intp($Truek L | $k) middot dL (21)

=

intp($Truek | $k L) middot p(L | $k) middot dL

This integral (known as the marginalisation integral) allows us towrite the posterior we are interested in without having to fix thevalue of L to any particular value Depending on the objective ofthe inference we could have alternatively determined the poste-rior distribution of L by marginalising the individual distanceswith an N-dimensional integral over the rk parameters

Article number page 13 of 20

AampA proofs manuscript no 32964_Arxiv

L

rk

$ σ$

microL

σL

k = 1 2 N

Fig 15 Directed acyclic graph that represents a hierarchical Bayesianmodel of a set of N parallax measurements characterised by uncertain-ties σ$ and true distances drawn from an exponentially decreasing den-sity distribution of distances (see Eq (18)) The scale length of the ex-ponential decrease L is itself a model parameter that we can infer fromthe sample Its prior is defined in this case for the sake of simplicity asa Gaussian distribution of mean microL and standard deviation σL

In parameter spaces of dimensionality greater than 3ndash4 thecomputation of the possibly marginalised posteriors andor ev-idence requires efficient sampling schemes like those inspiredin Markov chain Monte Carlo (MCMC) methods to avoid largenumbers of calculations in regions of parameter space with neg-ligible contributions to the posterior This adds to the highercomputational burden of the Bayesian inference method men-tioned in the previous section

The previous simple example can be extended to includemore levels in the hierarchy and more importantly more pa-rameters and measurement types Section 55 and 56 developin greater detail two examples of hierarchical models of directapplicability in the Gaia context

42 Absolute magnitudes tangential velocities and otherderived quantities

The approaches described in the previous sections can be appliedto any quantity we want to estimate using the parallax For exam-ple if we want to infer the absolute magnitude MG then giventhe measured apparent magnitude G and line-of-sight extinctionAG the true parallax $True is related to MG via the conservationof flux

5 log$True = MG + AG minusG minus 5 (22)

Assuming for simplicity that G and AG are known Bayesrsquotheorem gives the posterior on MG as

P(MG | $G AG) =1Z

P($ | MG AGG)P(MG) (23)

where the likelihood is still the usual Gaussian distribution forthe parallax (Eq (6)) in which the true parallax is given byEq (22) As this expression is non-linear we again obtain anasymmetric posterior PDF over MG the exact shape of whichalso depends on the prior

The inference of other quantities can be approached in thesame way In general we need to consider a multi-dimensionallikelihood to accommodate the measurement uncertainties (andcorrelations) in all observed quantities For instance the set ofparameters θ = r v φ (distance tangential speed and direc-tion measured anticlockwise from north) can be inferred fromthe Gaia astrometric measurements o = $ microαlowast microδ (where microαlowastand microδ are the measured proper motions) using the likelihood

p(o | θ) = N(θΣ) =1

(2π)32|Σ|12exp

(minus

12

(o minus x)T Σminus1(o minus x))

(24)

where N denotes the Gaussian distribution Σ is the full (non-diagonal) covariance matrix provided as part of the Gaia DataRelease and

x =

(1r

v sin(φ)r

v cos(φ)

r

)(25)

is the vector of model parameters geometrically transformed intothe space of observables in noise-free conditions Equation (24)assumes correlated Gaussian uncertainties in the measurements

The posterior distribution can then be obtained by multiply-ing the likelihood by a suitable prior The simplest assumptionwould be a separable prior such that p(θ) = p(r) middot p(v) middot p(φ)where p(v) and p(φ) should reflect our knowledge about thedynamical properties of the population from where the sourceor sources were drawn (eg thin disk thick disk bulge halo)Again hierarchical models can be used in the analysis of sam-ples in order to infer the population properties (prior hyper-parameters) themselves

Similar procedures can be followed to infer kinematic ener-gies or full 3D velocities when the forward model is extendedwith radial velocity measurements

43 Further recommendations

In this subsection we provide some further recommendationsand guidance in the context of the Bayesian approach outlinedabove Although powerful inference with Bayesian methodsusually comes at a large computational cost Some of the rec-ommendations below can also be seen in the light of taking dataanalysis approaches that approximate the Bayesian methodologyand can be much faster

Where possible formulate the problem in the data spaceThe problems caused by the ill-defined uncertainties on quan-tities derived from parallaxes can be avoided by carrying outthe analysis in the data space where the behaviour of the un-certainties is well understood This means that the quantities tobe inferred are treated as parameters in a forward or generativemodel that is used to predict the data Some adjustment processthen leads to estimates of the parameters A very simple forwardmodelling example can be found in Schroumlder et al (2004) whostudied the luminosity calibrations of O-stars by predicting the

Article number page 14 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

expected Hipparcos parallaxes from the assumed luminosity cal-ibration and comparing those to the measured parallaxes A morecomplex example can be found in Lindegren et al (2000) whopresent a kinematic model for clusters which describes the ve-locity field of the cluster members and predicts the proper mo-tions accounting for the astrometric uncertainties and covari-ances As shown in previous sections the Bayesian approachnaturally lends itself to (and in fact requires) forward modellingof the data

Forward modelling has the added advantage that it forces usto consider the proper formulation of the questions asked fromthe astrometric data This naturally leads to the insight that oftenthe explicit knowledge of the distances to sources is not of inter-est For example in the Schroumlder et al (2004) case an assumedluminosity of the O-stars and their known apparent magnitudeis sufficient to predict the observed parallaxes In more complexanalyses the distances to sources can often be treated as nuisanceparameters which in a Bayesian setting can be marginalised outof the posterior

Use all relevant information Although the parallax has a di-rect relation to the distance of a star it is not the only measure-ment that contains distance information The apparent magni-tude and the colour of a star carry significant information on theplausible distances at which it can be located as the colour pro-vides information on plausible absolute magnitude values for thestar This is used in two papers (Leistedt amp Hogg 2017a An-derson et al 2017) in which the information contained in thephotometry of stars is combined with Gaia DR1 parallax in-formation to arrive at more precise representations of the colour-magnitude diagram than can be obtained through the parallaxesalone Similarly proper motions contain distance informationwhich can be used to good effect to derive more precise distancesor luminosities for open cluster members (eg de Bruijne et al2001 Gaia Collaboration et al 2017b) The Bayesian approachnaturally allows for the combination of different types of data inthe modelling of the problem at hand It should be emphasisedhowever that adding additional data necessitates increasing themodel complexity (eg to include the relation among apparentmagnitude absolute magnitude extinction and parallax) whichleads to the need to make more assumptions

Incorporate a proper prior Bailer-Jones (2015) discussed thesimplest case of inferring the distance to a source from a sin-gle observed parallax He showed that if the minimal prior thatr should be positive is used the resulting posterior is not nor-malisable and hence has no mean variance other momentsor percentiles This behaviour of the posterior is not limitedto inference of distance Examples of other quantities that arenon-linearly related to parallax are absolute magnitude (M prop

log10 $) tangential velocity (vT prop 1$) kinetic energy and an-gular momentum (both proportional to 1$2 when determinedrelative to the observer) In all these cases it can be shown thatthe posterior for an improper uniform prior is not normalisableand has no moments The conclusion is that a proper prior onthe parameters to be estimated should be included to ensure thatthe posterior represents a normalised probability distribution Ingeneral using unconstrained non-informative priors in Bayesiandata analysis (such as the one on r above) is bad practice In-evitably there is always a mismatch between the prior and thetrue distribution (if there were not there would be no need todo the inference) This will unavoidably lead to some biases al-though the better the data the smaller these will be We cannot

expect to do much better than our prior knowledge in case weonly have access to poor data The Bayesian approach guaran-tees a graceful transition of the posterior into the prior as thedata quality degrades

The above discussion raises the question of what priors to in-clude for a specific inference problem Simple recipes cannot begiven as the prior to be used depends on the problem at hand andthe information already available When deciding on a prior weshould keep in mind that some information is always availableEven if a parallax is only available for a single star we knowthat it cannot be located at arbitrary distances It is unlikely tobe much closer that 1 pc (although we cannot fully exclude thepresence of faint stars closer than Proxima Centauri) and it mustbe located at a finite distance (we can observe the star) Thiswould suggest a non-informative uniform prior on r with lib-eral lower and upper bounds (such that the prior is normalised)However as pointed out in Bailer-Jones (2015) a uniform distri-bution in r implies a space density of stars that falls of as 1r2

from the position of the Sun which is of course physically im-plausible Hence one should assume a reasonable distribution ofstars in space and construct the prior on r accordingly Bailer-Jones (2015) presents a prior derived from a uniform space den-sity with an exponential cut-off which was used in Astraatmadjaamp Bailer-Jones (2016b) to derive distances to stars for which par-allaxes are listed in Gaia DR1 This prior should not be usedindiscriminately at the very least we should carefully considerthe choice of the scale length L (or leave that as a parameter tobe estimated as described in Sect 412) and in most cases amore tailored prior based on our broad knowledge of the distri-bution of a given stellar population in the Milky Way would bebetter The tutorial cases introduced in the next section containsome more examples of priors on distance and other astrophysi-cal parameters

The next two items discuss simplifications to the Bayesianapproach that nevertheless need to be justified carefully

Maximum likelihood modelling We have seen that priors arethe bridges that allow us to go from the probability of the obser-vations given the unknown parameters to the desired probabil-ity of the parameters given the observations It is only based onthis probability distribution that we can make statements aboutcredible intervals (uncertainties) for the inferred quantities orselect amongst competing models (a topic that is beyond thescope of this paper) However if making prior-free inferences ispreferred then maximising the likelihood is the only alternativeThe kinematic modelling presented in Lindegren et al (2000)is a non-trivial example of this A more complex example canbe found in Palmer et al (2014) The ML approach just as theBayesian framework described above allows the combinationof different types of data and accounts for selection functionsor missing data We have seen in Sect 411 that the maximumlikelihood estimate of the distance given a single parallax mea-surement is ρ = 1

$and this is a poor estimator of the distance

except for subsets of very accurate measurements In generalthe Bayesian and the maximum likelihood estimates coincide inthe limit of very small uncertainties or infinite numbers of mea-surements In such limits the maximum likelihood estimate issimpler to obtain although its computational cost may still belarge as the ML method is often equivalent to a complex optimi-sation problem involving a multi-dimensional function of manyparameters

Article number page 15 of 20

AampA proofs manuscript no 32964_Arxiv

Selecting the lsquobestrsquo data Analyses that use parallax data areoften restricted to positive parallaxes with relative uncertaintiesbelow some limit (typically 20 ) This allows working in aregime where the uncertainties of derived quantities such as dis-tance tangential velocities luminosity etc are thought to bemanageable which allows working in the space of astrophysi-cal variables rather than the data Truncation on relative parallaxerror might be justified in an exploratory phase of the data anal-ysis however there are a number of reasons why this approachis almost never advisable Even at relative uncertainties below02 the quantities calculated from the parallax can be biased andsuffer from a large variance (see Bailer-Jones 2015) More im-portantly however the selection of lsquogoodrsquo parallaxes will biasthe sample studied to nearby andor bright representatives of anystellar population and the selection may lead to discarding avery large fraction of the potential sample Hence any inferencesbased on such data will be severely biased This is clearly illus-trated in Fig 7 where for an even less strict truncation of starswith a relative uncertainties below 50 the distribution of dis-tances of the resulting sample is clearly biased with respect theoriginal sample

Accounting for data selection and incompleteness Althoughthe Gaia survey is designed such that the only selection ofsources is that they are brighter than the survey limit at G = 207the combination of the onboard detection algorithm the Gaiascanning law and the filtering of results of insufficient qualityprior to a data release lead to a complex selection function espe-cially in the early data releases This selection function should betaken into account in any data analysis and this is most naturallydone as part of a Bayesian analysis where the selection func-tion is part of the forward model that predicts the data given themodel parameters Precise prescriptions of the selection func-tions are not foreseen to be part of the data release documenta-tion Hence selection function parameters need to be includedas part of the parameters inferred by the Bayesian analysis or ifthis is not possible the selection functions have to be borne inmind when interpreting the results

Covariances in the uncertainties All the uncertainties on theastrometric data quoted in the Gaia catalogue are presented asfull covariance matrices where the diagonal elements representthe standard uncertainties on the astrometric parameters and theoff-diagonal elements the covariances or correlations betweenthe uncertainties This amounts to treating the astrometric datavector as having been drawn from a multivariate normal distri-bution with a covariance matrix as given in the Gaia catalogueThe covariances are most easily handled in the data space as partof the likelihood (see Lindegren et al 2000 for an example inthe context of kinematic modelling) If the covariances in theastrometric uncertainties are not accounted for we can easily bemisled for example by spurious features in the inferred velocityfield of an open cluster (Brown et al 1997)

The uncertainties are also correlated from one source tothe next especially over small distances on the sky A studyof the star-to-star correlations in the parallax uncertainties inGaia DR1 was done for the Kepler field (Zinn et al 2017)where independent and precise asteroseismic distances to thestars are available enabling the authors to derive an expressionfor the correlation strength and spatial scale This expressioncan be used for studies of the Kepler field but care should betaken when extrapolating to other fields on the sky The func-tional form for the star-to-star correlations used by Zinn et al

(2017) could be introduced as part of the forward model withthe Zinn et al (2017) parameters as a good first guess

For Gaia DR1 the length scale for the star-to-star corre-lations was estimated to vary from subdegree scales to tens ofdegrees on the sky where Zinn et al (2017) derived the corre-lation function over length scales of sim 02 to sim 10 degrees ForGaia DR2 Gaia Collaboration et al (2018c) estimate that thespatial correlations extend over scales of below 1 degree to 10ndash20 degrees

Accounting for non-Gaussian andor systematic uncertain-ties Although the bulk of the sources in the Gaia catalogue havenormally distributed uncertainties there is a significant fractionfor which the uncertainties exhibit non-Gaussian behaviour (egwhen uncertainties are over- or underestimated) This can be ac-counted for in the data analysis by including the uncertainties aspart of the forward model (eg Sesar et al 2017) or by explicitlymodelling an outlier population Similarly systematic uncertain-ties can be included as part of the forward model For exampleSesar et al (2017) include a global parallax zero-point as part oftheir probabilistic model used to analyse the period-luminosityrelation of RR Lyrae stars with Gaia DR1 data An alternativeapproach to the investigation of systematics in the parallaxes (ordistance estimates obtained from photometry or spectroscopyfor example) is presented in Schoumlnrich et al (2012) and is ap-plied to Gaia DR1 in Schoumlnrich amp Aumer (2017) In this casewe can consider that for samples covering a significant fractionof the sky any systematic error in the estimated distances to thestars will show up as correlations in their 3D velocity compo-nents The presence of such correlations can be used to make in-ferences about systematic errors for example in the parallaxes

Systematic uncertainties are more difficult to handle as theymay show variations as a function of source brightness or ce-lestial position they may be correlated across neighbouringsources and they may not be well understood for a given earlyGaia data release In general the information needed to accu-rately model systematic uncertainties or uncertainty correlationsbetween sources may not be readily available This informationcan be obtained from a comparison between Gaia and otherhigh-precision data (eg Zinn et al 2017 Arenou et al 2017Gaia Collaboration et al 2018a) or by examining for exampleplots of the parallax or proper motion averaged over sky regionsfor samples where the true parallax or proper motion values canbe assumed to be known such as zero parallax and proper mo-tion for quasars (see Gaia Collaboration et al 2018c for exam-ples)

Two special cases should be mentioned when the sampleis well distributed over the sky we can safely assume that thelocal systematics vanish and that only the global parallax zero-point need to be subtracted locally we may be interested not bythe absolute value of the parallaxes but by the relative ones inwhich case the difference between parallaxes and their averageremoves part of the systematics

There is no general recipe for dealing with non-Gaussian un-certainties or correlated systematic uncertainties The main ad-vice we can give here is to proceed with the analysis of the astro-metric data as they are but to keep in mind the systematics andcorrelations discussed in Sect 2 when interpreting the resultsAlternatively the forward model can be extended to include sys-tematic and correlation terms for which parameters are also tobe estimated Such models can be guided by the studies of sys-tematic uncertainties mentioned above

Article number page 16 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Testing with simulations Finally we strongly advise that theinference problem at hand should be investigated through simu-lated data and that the simulations performed should be as closeas possible to the real data (in particular correctly modelling theuncertainties and selection effects) The simulations allow theanalysis method to be developed and tested for accuracy How-ever the performance of the analysis method should be inter-preted strictly in terms of how well the assumed model explainsthe simulated observed data That is we should not fall into thetrap of trying to tune the analysis method to get an answer that isas close to the lsquotruthrsquo as possible In real problems we can onlyjudge the adequacy of a model and its parameter values by howwell they predict the observed data (to within the observationaluncertainties it should be stressed as we should avoid lsquoover-fittingrsquo the data)

5 Using astrometric data practical examples

We introduce here a few worked examples to illustrate thepoints that were made in the previous section These ex-amples are available in full detail as online tutorials in theform of source code accompanied by much more exten-sive explanation than can be provided here The source codeand corresponding Python and R Notebooks can be foundat the following URL httpsgithubcomagabrownastrometry-inference-tutorials In all cases the readeris strongly encouraged to download the source code and exper-iment with modifications of the simulated input data andor theparameter choices in the inference methods

51 Comparison of distance and distance modulusestimators

The use of the Bayesian inference with non-informative priorsdescribed in Sect 411 is illustrated and implemented inthe following tutorial httpsgithubcomagabrownastrometry-inference-tutorialstreemastersingle-sourceGraphicalUserInterface The tutorialcompares the performance of Bayesian distance estimationmethods with the SmithndashEichhorn transformation (Smith ampEichhorn 1996) (Sect 341) and the naive parallax inversion

The tutorial contains a Graphical User Interface that easilyvisualises and compares the behaviour of all these estimators fora given parallax and uncertainty For the Bayesian inference es-timations using the mode and the median are provided togetherwith a 90 confidence interval The tutorial also provides a li-brary pyrallaxes with the implementation of all these estima-tors The library can easily be customised to implement otherpriors for the Bayesian inference

Additionally an implementation of the Bayesian distanceestimator using the Exponentially Decreasing Space Densityprior introduced in Bailer-Jones (2015) will be available inTopCat (httpwwwstarlinkacuktopcat) and Stilts(httpwwwstarlinkacukstilts) from respectivelyversions 46 and 31-3 onwards

52 Inferring the distance to a single source using just theparallax

The issues surrounding the use of a parallax to infer a distancewere explored in Bailer-Jones (2015) and applied to simulatedGaia data in Astraatmadja amp Bailer-Jones (2016a) and to TGAS(Gaia DR1 ) in Astraatmadja amp Bailer-Jones (2016b) A

tutorial exploring this is provided at httpsgithubcomagabrownastrometry-inference-tutorialstreemastersingle-sourcetutorial This can be used toinvestigate how the posterior depends on the prior and the dataIt also includes a simple example of a hierarchical model toavoid specifying the exact length scale of a distance prior

53 Inferring the distance to and size of a cluster using justthe parallaxes

In many applications we are more interested in the averagedistance to a cluster (or a group of stars) rather than to theindividual stars In this case a basic mistake to be avoidedis estimating the individual distances (whatever the method)and then averaging these individual values A more correctapproach is to average the individual parallaxes and then toobtain a distance from this average value However a muchbetter solution is to set up a model for the cluster whichwould use as parameters the distance to its centre for exampleand some measure of its size and to infer its parametersThis is explored in the tutorial at httpsgithubcomagabrownastrometry-inference-tutorialstreemastermultiple-source This introduces the overall prob-lem and derives a general solution Code is implemented forthe specific case of a model which assumes a random isotropicdistribution of the true stars from the centre of the cluster Thismodel has two parameters the cluster distance and the clustersize The solution uses a small angle approximation to makethe problem simpler although it is easily extended to the caseof clusters with a significant angular extent It is applied to thePleiades selection from the Gaia DR1 main release paper(Gaia Collaboration et al 2016) The tutorial also considers theproblem of how to accommodate correlations in the measuredparallaxes of different stars Finally it also shows the resultsfrom a classical and a naive combination of stellar parallaxesto estimate the cluster distance The combination of parallaxesand proper motions of individual stars in a cluster into a singlesolution for the mean parallax and proper motion is treated asan iterative least squares problem in Gaia Collaboration et al(2017b) (see their Appendix A for details)

54 Inferring the distance and velocity of a source using theparallax and proper motions

The velocity (speed and direction) of a source in the planeof the sky can be inferred from measurements of its parallaxand two proper motions The uncertainties in all three affectthe inferred velocity and its uncertainty Moreover as theGaia parallaxes and proper motions generally have non-zerocorrelations these must also be taken into account Thiscan be done in a straightforward manner in the Bayesianapproach as is shown in the tutorial at httpsgithubcomagabrownastrometry-inference-tutorialstreemaster3d-distance This sets up a three-parametermodel (distance speed angle) for a source Using the threemeasurements (parallax two proper motions) in a multivariateGaussian likelihood and suitable priors on the parameters wecan compute the trivariate posterior This is sampled in theposterior using an MCMC algorithm for a set of stars

Article number page 17 of 20

AampA proofs manuscript no 32964_Arxiv

55 Luminosity calibration

In this tutorial (httpsgithubcomagabrownastrometry-inference-tutorialstreemasterluminosity-calibration) the problem of inferring (orcalibrating) the mean absolute magnitude of a specific classof stars is treated The measurements at hand are the parallaxand apparent magnitude for each of the stars in the sampleand the task is to infer their mean absolute magnitude microM andthe spread σM around this mean This is very similar to theproblem that Lutz amp Kelker (1973) and Turon Lacarrieu ampCreacutezeacute (1977) treated and a Bayesian approach to solving thisproblem was presented by Brown (2012) (albeit with the use ofimproper priors which we again note is bad practice) A morecomplex version of this problem (accounting for extinction anda contaminating population of stars) and its Bayesian solutionwas presented in Hawkins et al (2017) In this tutorial threeimportant points are illustrated

ndash Often the explicit computation of the distances to stars is notof interest In this example only the mean absolute magni-tude of the stars is to be estimated and the forward mod-elling approach as part of the Bayesian inference avoids theneed to calculate or estimate distances

ndash The data for all the stars carry information on the mean ab-solute magnitude including the negative parallaxes or par-allaxes with large relative errors This information can nat-urally be incorporated in a forward modelling approach (inthis example as part of a Bayesian inference method) thusavoiding the introduction of truncation biases caused by theselection of stars with lsquogoodrsquo parallaxes

ndash If the selection function is known (in this example the surveyis magnitude limited) it can and should be included in theforward modelling This accounts for sample selection biasesthat would otherwise occur

56 Period-luminosity relation

In this tutorial (httpsgithubcomagabrownastrometry-inference-tutorialstreemasterperiod-luminosity-relation) we include a hierarchicalmodel to infer period-luminosity-metallicity relations forclassical pulsating stars The full model can be applied tofundamental mode RR Lyrae stars and the abridged version(without the metallicity dependence) is suitable for samples ofclassical Cepheids We include the data set for the RR Lyraestars described and used for inference in Gaia Collaborationet al (2017a) and Delgado et al (2018) It contains a sampleof 200 stars (including fundamental radial pulsators but alsofundamentalised first overtone pulsators) with measured pe-riods apparent magnitudes in the K-band metallicities andparallaxes from the TGAS catalogue In the tutorial we describethe hierarchical model and discuss potential biases in the dataset Finally we analyse the sensitivity of the results to differentchoices of the priors and related parameters

6 Conclusions

Gaia data releases will provide a huge increase of astrometricdata available for the scientific community More than a billionparallaxes and proper motions allow new openings into many as-tronomical topics In most cases astronomers are exploiting theGaia catalogues to obtain physical quantities such as distanceand velocity Although it is easy to extract from the Gaia data it

is well known that direct inversion of parallax will lead to biaseswhich become more and more significant the larger the relativeparallax uncertainty While Gaia will provide high-quality astro-metric measurements hundreds of millions of stars have preci-sions which require proper statistical treatment in order to avoidbiased conclusions The aim of this paper is to guide the users ofGaia data to handle astrometric data correctly

In this study we summarise methods used to avoid bi-ases when converting astrometric data into physical quantitiesStarting from simple non-recommended sample truncation tomore complex methods the biases associated with the meth-ods are presented The basic recommendation is to treat deriva-tion of physical quantities from astrometric measurements asan inference problem which should be preferably handled withBayesian approach The recommended methods are described inSect 4 with a summary in Sect 43 To aid the users furtherSect 5 contains practical examples with links to Python and Rcode

Gaia will provide fundamental data for many fields of as-tronomy Further data releases will provide more data and moreprecise data Nevertheless for full use of the potential it will al-ways be necessary to pay careful attention to the statistical treat-ment of parallaxes and proper motions The purpose of this paperis to help astronomers find the correct approachAcknowledgements This work has made use of results from the European SpaceAgency (ESA) space mission Gaia the data from which were processed by theGaia Data Processing and Analysis Consortium (DPAC) Funding for the DPAChas been provided by national institutions in particular the institutions partici-pating in the Gaia Multilateral Agreement The Gaia mission website is httpwwwcosmosesaintgaia The authors are current or past members of theESA Gaia mission team and of the Gaia DPAC This work was supported by theMINECO (Spanish Ministry of Economy) through grant ESP2016-80079-C2-1-R (MINECOFEDER UE) and ESP2014-55996-C2-1-R (MINECOFEDERUE) and MDM-2014-0369 of ICCUB (Unidad de Excelencia lsquoMariacutea de Maeztursquo)and by the DLR (German space agency) via grant 50 QG 1403

ReferencesAnderson L Hogg DW Leistedt B Price-Whelan AM Bovy J Jun 2017

ArXiv e-printsArenou F Luri X 1999 In Egret D Heck A (eds) Harmonizing Cosmic

Distance Scales in a Post-HIPPARCOS Era vol 167 of Astronomical Societyof the Pacific Conference Series 13ndash32

Arenou F Luri X Babusiaux C et al Feb 2017 AampA 599 A50Astraatmadja TL Bailer-Jones CAL Dec 2016a ApJ 832 137Astraatmadja TL Bailer-Jones CAL Dec 2016b ApJ 833 119Bailer-Jones CAL Oct 2015 PASP 127 994Bovy J Sep 2017 MNRAS 470 1360Brown AGA Nov 2012 Statistical AstrometryBrown AGA Arenou F van Leeuwen F Lindegren L Luri X Aug 1997

In Bonnet RM Hoslashg E Bernacca PL et al (eds) Hipparcos - Venice rsquo97vol 402 of ESA Special Publication 63ndash68

de Bruijne JHJ Hoogerwerf R de Zeeuw PT Feb 2001 AampA 367 111Delgado HE Sarro LM Clementini G Muraveva T Garofalo A Mar 2018

ArXiv e-printsEddington AS Mar 1913 MNRAS 73 359ESA (ed) 1997 The HIPPARCOS and TYCHO catalogues Astrometric and

photometric star catalogues derived from the ESA HIPPARCOS Space As-trometry Mission vol 1200 of ESA Special Publication (available online athttpswwwcosmosesainthipparcos)

Gaia Collaboration Brown AGA Vallenari A et al Nov 2016 AampA 595A2

Gaia Collaboration Clementini G Eyer L et al Sep 2017a AampA 605 A79Gaia Collaboration van Leeuwen F Vallenari A et al May 2017b AampA 601

A19Gaia Collaboration Arenou F Luri X et al Apr 2018a AampA (special issue

for Gaia DR2)Gaia Collaboration Brown AGA Vallenari A Prusti T et al Apr 2018b

AampA (special issue for Gaia DR2)Gaia Collaboration Lindegren L Hernaacutendez J et al Apr 2018c AampA (special

issue for Gaia DR2)

Article number page 18 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Hawkins K Leistedt B Bovy J Hogg DW Oct 2017 MNRAS 471 722Iorio G Belokurov V Erkal D et al Feb 2018 MNRAS 474 2142Leistedt B Hogg DW Dec 2017a AJ 154 222Leistedt B Hogg DW Mar 2017b ArXiv e-printsLindegren L Madsen S Dravins D Apr 2000 AampA 356 1119Lindegren L Lammers U Hobbs D et al Feb 2012 AampA 538 A78Lindegren L Lammers U Bastian U et al Nov 2016 AampA 595 A4Lutz TE Kelker DH Oct 1973 PASP 85 573Malmquist GK Mar 1920 Meddelanden fran Lunds Astronomiska Observato-

rium Serie II 22 3Palmer M Arenou F Luri X Masana E Apr 2014 AampA 564 A49Robin C Luri X Reyleacute C et al 2012 AampA 543 A100Schoumlnrich R Aumer M Dec 2017 MNRAS 472 3979Schoumlnrich R Binney J Asplund M Feb 2012 MNRAS 420 1281Schroumlder SE Kaper L Lamers HJGLM Brown AGA Dec 2004 AampA

428 149Sesar B Fouesneau M Price-Whelan AM et al Apr 2017 ApJ 838 107Smith H Jr Eichhorn H Jul 1996 MNRAS 281 211Trumpler RJ Weaver HF 1953 Statistical astronomyTuron Lacarrieu C Creacutezeacute M Apr 1977 AampA 56 273Zinn JC Huber D Pinsonneault MH Stello D Aug 2017 ApJ 844 166

Article number page 19 of 20

AampA proofs manuscript no 32964_Arxiv

Appendix A Description of the simulated samplesused in this paper

The data used in this paper is from a Gaia Universe Model Snap-shot (GUMS) simulation (Robin C et al 2012) together withthe Gaia-like uncertainties and an estimation of the observabledata The uncertainties were computed from an implementationof the recipes described in the Gaia Science performance webpage4 provided by the python PyGaia toolkit5 which then haveto be re-scaled to fit the Gaia DR2 expectations

The simulation contains around 109 sources including onlysingle stars ie stars not belonging to multiple systems up toG lt 20 magnitude distributed as shown in Fig A1

0 5 10 15 20 25G magnitude [mag]

100

101

102

103

104

105

106

107

108

109

Num

ber

of

sourc

es

[log]

G magnitude histogram

Fig A1 Histogram of the starsrsquo G magnitude The total number ofsources is 1 069 138 714 which are distributed in bins of size ∆G = 02the sample is limited to G lt 20

To compute the Gaia potential observables standard uncer-tainties must be added to the simulation Because astrometricquantities (positions parallaxes and proper motions) are relateda single formalism to derive its standard errors is needed PyGaiaimplements a simple performance model depending on the VndashICcolour term and the G magnitude to estimate the end-of-missionerrors for the parallax uncertainty

σ$[microas] =radicminus1631 + 680766 z + 32732 z2 [0986+(1 minus 0986) (V-IC)]

(A1)

with

z = max(1004(1209minus15)1004(Gminus15)) (A2)

(see the Gaia science performance web for more details) It alsotakes into account the variation of the uncertainties over the skybecause of the scanning law tabulated as a function of the eclip-tic latitude6 β

However these are end-of-mission uncertainties so theyhave to be scaled by the fraction of mission time completed in or-der to get an estimation of them for Gaia DR2 In the case of the

parallax only the factorradic

5L being L the mission time (years)

included in Gaia DR2 need to be applied this error model is

4 httpswwwcosmosesaintwebgaiascience-performance5 httpspypipythonorgpypiPyGaia6 httpswwwcosmosesaintwebgaiatable-2-with-ascii

further described in Arenou et al (2017) In our case we havealso updated the calibration floor to take into account the prop-erties of the Gaia DR2 formal errors as shown in Fig A2 Thiscalibration floor introduces a minimum formal error stemmingfrom the fact that the calibrations used in the data processing(models and parameters) are at this stage still being refined Thisfloor affects mainly bright stars while for faint stars the photonnoise dominates Fig A2 shows the model used in the simula-tion for the Gaia DR2 parallax uncertainties as a function of theG magnitude

Fig A2 Average estimated errors as function of G The dash-dottedline represents the uncertainty at Gaia DR2 while the dashed linerepresents the end-of-mission uncertainties The solid line representsGaia DR2 errors including systematics

Article number page 20 of 20

  • 1 Introduction
  • 2 Gaia astrometric data
    • 21 Uncertainties
    • 22 Correlations
    • 23 Systematic errors
    • 24 Completeness
      • 3 Critical review of the traditional use of parallaxes
        • 31 Measurement of parallaxes
        • 32 Estimating distance by inverting the parallax
        • 33 Sample truncation
        • 34 Corrections and transformations
          • 341 The SmithndashEichhorn correction
          • 342 The LutzndashKelker correction
            • 35 Astrometry-based luminosity
              • 4 Recommendations for using astrometric data
                • 41 Using Gaia astrometric data how to proceed
                  • 411 Bayesian inference of distances from parallaxes
                  • 412 Bayesian inference of distances from parallaxes and complementary information
                    • 42 Absolute magnitudes tangential velocities and other derived quantities
                    • 43 Further recommendations
                      • 5 Using astrometric data practical examples
                        • 51 Comparison of distance and distance modulus estimators
                        • 52 Inferring the distance to a single source using just the parallax
                        • 53 Inferring the distance to and size of a cluster using just the parallaxes
                        • 54 Inferring the distance and velocity of a source using the parallax and proper motions
                        • 55 Luminosity calibration
                        • 56 Period-luminosity relation
                          • 6 Conclusions
                          • A Description of the simulated samples used in this paper
Page 9: Gaia Data Release 2: Using Gaia parallaxes · The Gaia Data Release 2 (Gaia DR2 ) Gaia Collaboration et al. (2018b) provides precise positions, proper motions, and paral-laxes for

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Uniform ϖTrue and constant σϖ

Parallax (mas)

Pro

babi

lity

minus002 002 006 010

05

1015

20Uniform ϖTrue and constant fTrue

Parallax (mas)

Den

sity

minus005 000 005 010 015 020

GUMS simulation

Parallax (mas)

Den

sity

minus1 0 1 2

01

23

45

Fig 10 Histograms of true (grey) and observed (blue) parallaxes for three simulations uniform distribution of parallaxes and constant σ$ = 03mas (left) uniform distribution of parallaxes and constant f = 05 (centre) and the GUMS simulation described in Appendix A (right)

the probability distribution for the true parallax in such a trun-cated constant volume density scenario is proportional to

p($True) prop $minus4True (14)

out to the truncation radius Hence for Gaussian distributed un-certainties we can write p($True|$) as

p($True|$) prop1σ$middot exp(

minus($ minus$True)2

2σ$) middot$minus4

True (15)

The joint distribution p($$True) (ie the non-normalisedposterior plotted as a 2D function of data parallax and param-eter $True) for this particular case of truncated uniform stellarvolume densities is depicted in Fig 11 together with the con-ditional distributions for particular values of $ and $True Itshows graphically the symmetry of the probability distributionp($|$True) (with respect to $True) and the bias and asymmetryof p($True|$)

Lutz amp Kelker (1973) obtain Eq (15) in their Sect ii underthe assumption of uniform stellar volume densities and constantfractional parallax uncertainties (constant f ) They discuss sev-eral distributions for different values of the ratio σ$$True Intheir Sect iii they use the expected value of the true parallaxgiven by the distribution p($True|$) in Eq (15) to infer the ex-pected value of the difference between the true absolute magni-tude MTrue and the value obtained with the naive inversion of theobserved parallax The expected value of this absolute magni-tude error is derived and tabulated for the distribution p($True|$)as a function of the fractional parallax uncertainty f This so-called LutzndashKelker correction is often applied to stellar samplesthat do not fulfil the assumptions under which it was derived be-cause the stellar volume density is far from uniform at scaleslarger than a few tens of parsecs and the samples to which thecorrection is applied are never characterised by a unique valueof f

p(ϖ

Tru

e|ϖ

)0

12

34

5

00 02 04 06 08 10

00 02 04 06 08 10

00

05

10

15

ϖTrue

ϖ

p(ϖ|ϖTrue)

piob

s

0 1 2 3 4

00

05

10

15

minus50

minus40

minus30

minus20

minus10

0

10

Fig 11 (Lower left) Joint probability distribution (in logarithmic scaleto improve visibility) for the random variables $ and $True in the sce-nario of a truncated uniform volume density The colour code is shownto the right of the lower right panel and white marks the region wherethe probability is zero (Lower right) Conditional probability distribu-tion of the observed parallax for $True=05 (Upper left) Conditionalprobability distribution for $True given an observed parallax $ Thefractional parallax uncertainty assumed for the computation of all prob-abilities is f = 02

35 Astrometry-based luminosity

An obvious way to avoid the problems associated with the naiveinversion of observed parallaxes (see Sect 341) is to remainin the space of parallaxes (as opposed to that of distances) in-sofar as this is possible One example of this approach is the

Article number page 9 of 20

AampA proofs manuscript no 32964_Arxiv

astrometry-based luminosity (ABL) method (Arenou amp Luri1999) originating from Malmquist (1920) The ABL methodconsists in substituting the absolute magnitudes by a proxy thatis linearly dependent on the parallax The original proposal was

aV equiv 1002MV = $10mV +5

5 (16)

and has been recently used to obtain maximum likelihood esti-mates of the period-luminosity relation coefficients for Cepheidsand RR Lyrae stars (Gaia Collaboration et al 2017a) and to im-prove the Gaia parallax uncertainties using deconvolved colour-magnitude diagrams as prior (Anderson et al 2017) The newastrometry-based luminosity depends linearly on the parallaxand thus its uncertainty can be expected to have an approxi-mately Gaussian distribution if the fractional uncertainty of theapparent magnitude is negligible This is more often the casethan for fractional parallax uncertainties and is in general a goodapproximation

Unfortunately the astrometry-based luminosity can only beapplied to the study of the luminosity and can do nothing forthe analysis of spatial distributions where distances or tangentialvelocities are inevitable ingredients

4 Recommendations for using astrometric data

In this section we provide specific advice on the use of astro-metric data in astronomical data analysis Although the focus ison the use of Gaia data many of the recommendations hold forthe analysis of astrometric data in general To illustrate the rec-ommendations we also provide a small number of worked ex-amples ranging from very basic demonstrations of the issuesmentioned in Sect 3 to full Bayesian analyses Some of theseexamples are available in the Gaia archive tutorial described inSect 5

41 Using Gaia astrometric data how to proceed

The fundamental quantity sought when measuring a stellar par-allax is the distance to the star in question However as discussedin the previous sections the quantity of interest has a non-linearrelation to the measurement r = 1$True and is constrainedto be positive while the measured parallax can be zero or evennegative Starting from a measured parallax which is normallydistributed about the true parallax this leads to a probability den-sity for the simple distance estimator ρ = 1$ (see Sect 3) forwhich the moments are defined in terms of unknown quantitiesThis means we cannot calculate the variance of the estimator orthe size of a possible bias which renders the estimator uselessfrom the statistical point of view

Our first and main recommendation is thus to alwaystreat the derivation of (astro-)physical parameters from as-trometric data in particular when parallaxes are involvedas an inference problem which should preferably be handledwith a full Bayesian approach

411 Bayesian inference of distances from parallaxes

The Bayesian approach to inference involves estimating a PDFover the quantity of interest given the observables In this casewe want to estimate the distance r given the parallax $ Afuller treatment of this problem has been given in Bailer-Jones

(2015) so only a brief summary will be given here Using Bayesrsquotheorem we can write the posterior as

P(r | $) =1Z

P($ | r)P(r) (17)

Formally everything is also conditioned on the parallax uncer-tainty σ$ and any relevant constraints or assumptions but sym-bols for these are omitted for brevity The quantity P($ | r) isthe likelihood from Eq (12) The prior P(r) incorporates ourassumptions and Z is a normalisation constant

In addition to the likelihood there are two important choiceswhich must be made to estimate a distance the choice of priorand the choice of estimator We will first focus on the formerand start discussing the simplest prior the uniform unboundedprior With a uniform boundless (and thus improper) prior ondistances the posterior is proportional to the likelihood and ifwe choose the mode of the posterior as our estimator then thesolution is mathematically equivalent to maximising the likeli-hood However a boundless uniform prior permits negative dis-tances which are non-physical so we should at least truncate itto exclude these values

The more measurements we have or the more precise themeasurements are the narrower the likelihood and the lower theimpact of the prior should be on the posterior It turns out how-ever that with the unbounded uniform prior the posterior is im-proper ie it is not normalisable Consequently the mean andmedian are not defined The only point estimator is the modeie rest = 1$ (the standard deviation and the other quantilesare likewise undefined) which is rather restrictive Finally thisBayesian distance estimate defined for an unbounded uniformprior reduces to the maximum likelihood estimate and coincideswith the naive inversion discussed in Sect 32 The posterioris ill-defined for the unbounded uniform prior for parallaxesThis prior describes an unrealistic situation where the observer isplaced at the centre of a distribution of sources that is sphericallysymmetric and the volume density of which decreases sharplywith distance

The solution to these problems (non-physical distances im-proper posterior) is to use a more appropriate prior The proper-ties of various priors and estimators have been studied by Bailer-Jones (2015) and Astraatmadja amp Bailer-Jones (2016b) The lat-ter makes a detailed study using a Milky Way model for a priorand also investigates how the estimates change when the Gaiaphotometric measurements are used in addition to the parallaxOne of the least informative priors we can use is the exponen-tially decreasing space density prior

P(r) =

1

2L3 r2eminusrL if r gt 0

0 otherwise (18)

For distances r L this corresponds to a constant space densityof stars with the probability dropping exponentially at distancesmuch larger than the mode (which is at 2L) Examples of theshape of the posterior for parallaxes of different precisions areshown in Bailer-Jones (2015) and Astraatmadja amp Bailer-Jones(2016b)

The posterior obtained for the prior defined in Eq (18) is nor-malised and thus we have a choice of point estimators (meanmedian or mode) Also the distribution is asymmetric andtwo quantiles (5 and 95) rather than the standard deviationare recommended to summarise the uncertainty in the point es-timate The median as a point estimate is guaranteed to lie

Article number page 10 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

between these quantiles Astraatmadja amp Bailer-Jones (2016b)used this prior as well as a Milky Way prior to infer distancesfor the two million TGAS stars in the first Gaia data releaseThe behaviour of the estimates derived from the exponentiallydecreasing space density prior can be explored using the interac-tive tool available in the tutorial described in Sect 51

In general the introduction of reasonable prior probabilitiesaccounts for the LutzndashKelker bias although the inevitable mis-match between the true distribution of parallaxes and the priorused will result in less accurate inferences In any case the ad-vantage with respect to the methods discussed in Sect 3 is cleari) we do not need to tabulate corrections for each prior assumingconstant f ii) we do not need to dispose of non-positive par-allaxes iii) we obtain a proper full posterior distribution withwell-defined moments and credible intervals iv) even simplepriors such as the exponential decreasing volume density willimprove our estimates with respect to the unrealistic prior un-derlying the maximum likelihood solution rest = 1$ and fi-nally v) we obtain estimators that degrade gracefully as the dataquality degrades and with credible intervals that grow with theobservational uncertainties until they reach the typical scales ofthe prior when the observations are non-informative These ad-vantages come at the expense of an inference that is more com-putationally demanding in general (as it requires obtaining theposterior and its summary statistics if needed) the need for athoughtful choice of a prior distribution and the analysis of theinfluence of the prior on the inference results

Figure 12 shows the distribution of means (left) modes (cen-tre) and medians (right) of the posteriors inferred for a simu-lation of 105 sources drawn from an exponentially decreasingspace density distribution This simulation represents the un-likely case where the prior is a perfect representation of the truedistribution

From a Bayesian perspective the full posterior PDF is thefinal result of our inference if we only use parallax measure-ments to infer the distance (see below) and further analysesshould make use of it as a whole This is our recommendationin general avoid expectations and summaries of the posteriorHowever it is often useful to compute summary statistics suchas the mean (expectation) median mode quantiles etc to havean approximate idea of the distribution but we should not usethese summaries for further inference for example to estimateabsolute magnitudes Cartesian velocities etc We recommendinferring the full posterior distributions for these derived quanti-ties using the posterior of the true parallax or of the distance orusing the same Bayesian scheme as for the true parallax as ex-plained in Sect 42 In Fig 13 we show the values of the mean(left) mode (centre) and median (right) that we would obtainfrom a set of 104 simulated observations of a star at 100 parsecswith f = 02 We assume a Gaussian distribution of the obser-vations around the true parallax The posterior distribution is in-ferred using Eq (17) and two priors a uniform volume densityof sources truncated at 1 kpc (results in grey) and a uniform den-sity of sources multiplied by an exponential decay of length scale200 pc as defined in Eq (18) (in blue) The expectation valuesof the histograms are shown as dashed lines of the same colourwith the true value (100 pc) shown as a red dashed line We seein general that i) the truncation has the effect of increasing thenumber of overestimated distances ii) the three estimators arebiased towards larger distances (smaller parallaxes) but the ex-pectation of the mode is significantly closer to the true valueand iii) the abrupt truncation of the prior results in a spuriouspeak of modes at the truncation distance as already discussed inBailer-Jones (2015)

Figure 14 and Table 1 show a comparison of the absolutevalue of the empirical bias and standard deviation associatedwith some distance estimators discussed in this paper as a func-tion of the measured fractional uncertainty in the parallax Wechose the measured value even though it is a very poor and non-robust estimator because as stated in Sect 32 we never have ac-cess to the true fractional parallax uncertainty This figure showsthe results obtained for 107 sources in the Gaia DR2 simulationdescribed in Appendix A for the maximum likelihood estimateρ = 1

$with and without the SmithndashEichhorn correction and for

the mode estimates based on the posterior distribution for twopriors (a uniform distance prior UD with maximum distancerlim = 100 kpc and an exponentially decreasing space densityprior EDSD with L = 135 kpc) neither of which matches thetrue distribution of sources in the simulation Only the mode ofthe posteriors is plotted (but not the mean or the median) for thesake of clarity The conclusions described next are only valid un-der the conditions of the exercise and are provided as a demon-stration of the caveats and problems described in previous sec-tions not as a recommendation of the mode of the posterior in-ferred under the EDSD prior as an estimator At the risk of re-peating ourselves we emphasise the need to adopt priors adaptedto the inference problem at hand Also the conclusions only holdfor the used simulation (where we generate the true distances andhence can calculate the bias and standard deviation) and need notbe representative of the true performance for the real Gaia dataset They can be summarised as follows

ndash the mode of the EDSD prior shows the smallest bias andstandard deviation in practically the entire range of estimatedfractional parallax uncertainties (in particular everywherebeyond the range of fapp represented in the plot)

ndash the SmithndashEichhorn estimate shows pathological biases andstandard deviations in the vicinity of the supposedly best-quality measurements at fapp = 0 Away from this region itprovides the next less biased estimates (averaged over binsof fapp) after the mode of the EDSD posterior

412 Bayesian inference of distances from parallaxes andcomplementary information

The methodology recommended in the previous paragraphs isuseful when we only have observed parallaxes to infer distancesThere are however common situations in astronomy where theparallaxes are only one of many observables and distances arenot the final goal of the inference problem but a means to achieveit In this context we recommend an extension of the classicalBayesian inference methods described in the previous sectionThese problems are characterised by a set of observables andassociated uncertainties (that include but are not restricted toparallaxes) and a series of parameters (the values of which areunknown a priori) with complex interdependence relationshipsamongst them Some of these parameters will be the ultimategoal of the inference process Other parameters do play an im-portant role but we are not interested in their particular valuesand we call them nuisance parameters following the literatureFor example in determining the shape of a stellar associationthe individual stellar distances are not relevant by themselvesbut only insomuch as we need them to achieve our objectiveWe show below how we deal with the nuisance parameters Theinterested reader can find applications of the methodology de-scribed in this section to inferring the coefficients of period-luminosity relations in Gaia Collaboration et al (2017a) andSesar et al (2017) Also the same methodology (a hierarchical

Article number page 11 of 20

AampA proofs manuscript no 32964_Arxiv

Mean

Cou

nts

minus8000 minus4000 0 2000

0e+

002e

minus04

4eminus

046e

minus04

8eminus

04

Mode

Cou

nts

minus8000 minus4000 0 2000

Median

Cou

nts

minus8000 minus4000 0 2000

0e+

002e

minus04

4eminus

046e

minus04

8eminus

04

Fig 12 Probability distribution of the residuals of the Bayesian estimate of the true distance in parsecs for 100000 simulated stars drawn from auniform density plus exponential decay distribution and σ$ = 3 middot 10minus1 mas (orange) 3 mas (blue) and 30 mas (grey)

Mean

Cou

nts

0 200 400 600 800 1000

000

00

005

001

00

015

002

0

Mode

Cou

nts

0 200 400 600 800 1000

Median

Cou

nts

0 200 400 600 800 1000

000

00

005

001

00

015

002

0

Fig 13 Distributions of means (left) modes (centre) and medians (right) for a series of 104 posteriors calculated for a star at a true distance of100 pc and f = 02 and observed parallaxes drawn at random from the corresponding Gaussian distribution Posteriors are inferred with a uniformdensity prior truncated at 1 kpc (grey) or with a uniform density with exponential decay prior and length scale 135 kpc (blue) The red vertical linemarks the true parallax the grey and blue lines correspond to the expected value (mean) of each distribution (same colours as for the histograms)

Bayesian model) is applied in Hawkins et al (2017) where theconstraint on the distances comes not from a period-luminosityrelation but from the relatively small dispersion of the abso-lute magnitudes and colour indices of red clump stars A lastexample of this methodology can be found in Leistedt amp Hogg(2017b) where the constraint comes from modelling the colour-magnitude diagram

Just as in the previous section where we aimed at estimat-ing distances from parallaxes alone the two key elements in thiscase are the definitions of a likelihood and a prior The likeli-hood represents the probability distribution of the observablesgiven the model parameters Typically the likelihood is basedon a generative or forward model for the data Such modelspredict the data from our assumptions about the physical pro-cess that generates the true values (ie the distribution of stars in

space) and our knowledge of the measurement process (eg jus-tifying the assumption of a normal distribution of the observedparallax around its true value) Forward models can be used togenerate arbitrarily large synthetic data sets for a given set ofthe parameters In this case however where we are concernedwith several types of measurements that depend on parametersother than the distance the likelihood term will be in generalmore complex than in Sect 411 and may include probabilisticdependencies between the parameters The term hierarchical ormulti-level model is often used to refer to this kind of model

Let us illustrate the concept of hierarchical models with asimple extension of the Bayesian model described in Sect 411where instead of assuming a fixed value of the prior length scaleL in Eq (18) we make it another parameter of the model and tryto infer it Let us further assume that we have a set of N parallax

Article number page 12 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Table 1 Average bias and standard deviation in three regimes of fapp for four distance estimators discussed in this paper (from left to right) themode of the posterior based on the exponentially decreasing space density (EDSD) prior the mode of the posterior of the uniform distance (UD)distribution the maximum likelihood estimate corrected according to Smith amp Eichhorn (1996) abbreviated as SE and the maximum likelihood(ML) estimate The wider ranges of fapp exclude narrower ranges shown in previous rows of the table

Summary fapp Range EDSD UD SE ML

Bias(-11) -02 97 342 -095(-55) -03 107 -034 -12(-5050) -03 162 -04 -38

Std Deviation(-11) 04 80 6858 05(-55) 04 84 05 195(-5050) 04 106 04 171

fapp

log 1

0(|B

ias|

)

minus4

minus2

02

4

Mode EDSDMode UDSmithminusEichhorn estimateML estimate

fapp

log 1

0(σ)

minus1

12

34

5

minus5 minus3 minus1 1 3 5

Mode EDSDMode UDSmithminusEichhorn estimateML estimate

Fig 14 Bias (top) and standard deviation (bottom) averaged over binsof the estimated fractional parallax uncertainty fapp for four estimatorsof the distance the maximum likelihood (ML) estimator rest = 1

$(or-

ange) the ML estimator corrected as described in Smith amp Eichhorn(1996) (light blue) the mode of the posterior obtained with an expo-nentially decreasing space density (EDSD) prior and L = 135 kpc (darkgreen) and the mode of the posterior obtained with a uniform distance(UD) prior truncated at 100 kpc (red)

measurements $k one for each of a sample of N stars In thiscase the likelihood can be written as

p($k|rk L) = p($k | rk) middot p(rk | L)

= p($k | rk) middotNprod

k=1

p(rk | L) (19)

where rk is the true unknown distance to the kth star We note thatvery often the measured parallaxes are assumed independentand thus p($k | rk) is written as the product

prodNk=1 p($k | rk)

This is incorrect in general for Gaia parallaxes because the par-allax measurements are not independent As described in GaiaCollaboration et al (2018c) and Sect 2 of this paper there areregional correlations amongst them (see Sect 43) but for thesake of simplicity let us assume the sample of N measurementsis spread all over the celestial sphere such that the correlationsaverage out and can be neglected Hence we write

p($k|rk L) =

Nprodk=1

p($k | rk) middot p(rk | L) (20)

Under the assumption of Gaussian uncertainties the first termin the product is given by Eq (6) while the second is given byEq (18)

This likelihood can be represented by a simple directed graph(see Fig 15) that provides information about the conditional de-pendencies amongst the parameters The shaded nodes representthe observations the open circles represent model parametersand the small black circles at the origin of the arrows representmodel constants The arrows denote conditional dependence re-lations and the plate notation indicates repetition over the mea-surements k

The next key element is as in Sect 411 the prior Accord-ing to Fig 15 the only parameter that needs a prior specificationis the one without a parent node L The rest of the arcs in thegraph are defined in the likelihood term (Eq (20)) If the sampleof N stars were representative of the inner Galactic halo for ex-ample we could use a Gaussian prior centred at asymp 30 kpc (seeeg Iorio et al 2018 and references therein) Such a hierarchicalmodel can potentially shrink the individual parallax uncertaintiesby incorporating the constraint on the distribution of distances

If we are only interested in the individual distances rk wecan consider L as a nuisance parameter

p($Truek | $k) =

intp($Truek L | $k) middot dL (21)

=

intp($Truek | $k L) middot p(L | $k) middot dL

This integral (known as the marginalisation integral) allows us towrite the posterior we are interested in without having to fix thevalue of L to any particular value Depending on the objective ofthe inference we could have alternatively determined the poste-rior distribution of L by marginalising the individual distanceswith an N-dimensional integral over the rk parameters

Article number page 13 of 20

AampA proofs manuscript no 32964_Arxiv

L

rk

$ σ$

microL

σL

k = 1 2 N

Fig 15 Directed acyclic graph that represents a hierarchical Bayesianmodel of a set of N parallax measurements characterised by uncertain-ties σ$ and true distances drawn from an exponentially decreasing den-sity distribution of distances (see Eq (18)) The scale length of the ex-ponential decrease L is itself a model parameter that we can infer fromthe sample Its prior is defined in this case for the sake of simplicity asa Gaussian distribution of mean microL and standard deviation σL

In parameter spaces of dimensionality greater than 3ndash4 thecomputation of the possibly marginalised posteriors andor ev-idence requires efficient sampling schemes like those inspiredin Markov chain Monte Carlo (MCMC) methods to avoid largenumbers of calculations in regions of parameter space with neg-ligible contributions to the posterior This adds to the highercomputational burden of the Bayesian inference method men-tioned in the previous section

The previous simple example can be extended to includemore levels in the hierarchy and more importantly more pa-rameters and measurement types Section 55 and 56 developin greater detail two examples of hierarchical models of directapplicability in the Gaia context

42 Absolute magnitudes tangential velocities and otherderived quantities

The approaches described in the previous sections can be appliedto any quantity we want to estimate using the parallax For exam-ple if we want to infer the absolute magnitude MG then giventhe measured apparent magnitude G and line-of-sight extinctionAG the true parallax $True is related to MG via the conservationof flux

5 log$True = MG + AG minusG minus 5 (22)

Assuming for simplicity that G and AG are known Bayesrsquotheorem gives the posterior on MG as

P(MG | $G AG) =1Z

P($ | MG AGG)P(MG) (23)

where the likelihood is still the usual Gaussian distribution forthe parallax (Eq (6)) in which the true parallax is given byEq (22) As this expression is non-linear we again obtain anasymmetric posterior PDF over MG the exact shape of whichalso depends on the prior

The inference of other quantities can be approached in thesame way In general we need to consider a multi-dimensionallikelihood to accommodate the measurement uncertainties (andcorrelations) in all observed quantities For instance the set ofparameters θ = r v φ (distance tangential speed and direc-tion measured anticlockwise from north) can be inferred fromthe Gaia astrometric measurements o = $ microαlowast microδ (where microαlowastand microδ are the measured proper motions) using the likelihood

p(o | θ) = N(θΣ) =1

(2π)32|Σ|12exp

(minus

12

(o minus x)T Σminus1(o minus x))

(24)

where N denotes the Gaussian distribution Σ is the full (non-diagonal) covariance matrix provided as part of the Gaia DataRelease and

x =

(1r

v sin(φ)r

v cos(φ)

r

)(25)

is the vector of model parameters geometrically transformed intothe space of observables in noise-free conditions Equation (24)assumes correlated Gaussian uncertainties in the measurements

The posterior distribution can then be obtained by multiply-ing the likelihood by a suitable prior The simplest assumptionwould be a separable prior such that p(θ) = p(r) middot p(v) middot p(φ)where p(v) and p(φ) should reflect our knowledge about thedynamical properties of the population from where the sourceor sources were drawn (eg thin disk thick disk bulge halo)Again hierarchical models can be used in the analysis of sam-ples in order to infer the population properties (prior hyper-parameters) themselves

Similar procedures can be followed to infer kinematic ener-gies or full 3D velocities when the forward model is extendedwith radial velocity measurements

43 Further recommendations

In this subsection we provide some further recommendationsand guidance in the context of the Bayesian approach outlinedabove Although powerful inference with Bayesian methodsusually comes at a large computational cost Some of the rec-ommendations below can also be seen in the light of taking dataanalysis approaches that approximate the Bayesian methodologyand can be much faster

Where possible formulate the problem in the data spaceThe problems caused by the ill-defined uncertainties on quan-tities derived from parallaxes can be avoided by carrying outthe analysis in the data space where the behaviour of the un-certainties is well understood This means that the quantities tobe inferred are treated as parameters in a forward or generativemodel that is used to predict the data Some adjustment processthen leads to estimates of the parameters A very simple forwardmodelling example can be found in Schroumlder et al (2004) whostudied the luminosity calibrations of O-stars by predicting the

Article number page 14 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

expected Hipparcos parallaxes from the assumed luminosity cal-ibration and comparing those to the measured parallaxes A morecomplex example can be found in Lindegren et al (2000) whopresent a kinematic model for clusters which describes the ve-locity field of the cluster members and predicts the proper mo-tions accounting for the astrometric uncertainties and covari-ances As shown in previous sections the Bayesian approachnaturally lends itself to (and in fact requires) forward modellingof the data

Forward modelling has the added advantage that it forces usto consider the proper formulation of the questions asked fromthe astrometric data This naturally leads to the insight that oftenthe explicit knowledge of the distances to sources is not of inter-est For example in the Schroumlder et al (2004) case an assumedluminosity of the O-stars and their known apparent magnitudeis sufficient to predict the observed parallaxes In more complexanalyses the distances to sources can often be treated as nuisanceparameters which in a Bayesian setting can be marginalised outof the posterior

Use all relevant information Although the parallax has a di-rect relation to the distance of a star it is not the only measure-ment that contains distance information The apparent magni-tude and the colour of a star carry significant information on theplausible distances at which it can be located as the colour pro-vides information on plausible absolute magnitude values for thestar This is used in two papers (Leistedt amp Hogg 2017a An-derson et al 2017) in which the information contained in thephotometry of stars is combined with Gaia DR1 parallax in-formation to arrive at more precise representations of the colour-magnitude diagram than can be obtained through the parallaxesalone Similarly proper motions contain distance informationwhich can be used to good effect to derive more precise distancesor luminosities for open cluster members (eg de Bruijne et al2001 Gaia Collaboration et al 2017b) The Bayesian approachnaturally allows for the combination of different types of data inthe modelling of the problem at hand It should be emphasisedhowever that adding additional data necessitates increasing themodel complexity (eg to include the relation among apparentmagnitude absolute magnitude extinction and parallax) whichleads to the need to make more assumptions

Incorporate a proper prior Bailer-Jones (2015) discussed thesimplest case of inferring the distance to a source from a sin-gle observed parallax He showed that if the minimal prior thatr should be positive is used the resulting posterior is not nor-malisable and hence has no mean variance other momentsor percentiles This behaviour of the posterior is not limitedto inference of distance Examples of other quantities that arenon-linearly related to parallax are absolute magnitude (M prop

log10 $) tangential velocity (vT prop 1$) kinetic energy and an-gular momentum (both proportional to 1$2 when determinedrelative to the observer) In all these cases it can be shown thatthe posterior for an improper uniform prior is not normalisableand has no moments The conclusion is that a proper prior onthe parameters to be estimated should be included to ensure thatthe posterior represents a normalised probability distribution Ingeneral using unconstrained non-informative priors in Bayesiandata analysis (such as the one on r above) is bad practice In-evitably there is always a mismatch between the prior and thetrue distribution (if there were not there would be no need todo the inference) This will unavoidably lead to some biases al-though the better the data the smaller these will be We cannot

expect to do much better than our prior knowledge in case weonly have access to poor data The Bayesian approach guaran-tees a graceful transition of the posterior into the prior as thedata quality degrades

The above discussion raises the question of what priors to in-clude for a specific inference problem Simple recipes cannot begiven as the prior to be used depends on the problem at hand andthe information already available When deciding on a prior weshould keep in mind that some information is always availableEven if a parallax is only available for a single star we knowthat it cannot be located at arbitrary distances It is unlikely tobe much closer that 1 pc (although we cannot fully exclude thepresence of faint stars closer than Proxima Centauri) and it mustbe located at a finite distance (we can observe the star) Thiswould suggest a non-informative uniform prior on r with lib-eral lower and upper bounds (such that the prior is normalised)However as pointed out in Bailer-Jones (2015) a uniform distri-bution in r implies a space density of stars that falls of as 1r2

from the position of the Sun which is of course physically im-plausible Hence one should assume a reasonable distribution ofstars in space and construct the prior on r accordingly Bailer-Jones (2015) presents a prior derived from a uniform space den-sity with an exponential cut-off which was used in Astraatmadjaamp Bailer-Jones (2016b) to derive distances to stars for which par-allaxes are listed in Gaia DR1 This prior should not be usedindiscriminately at the very least we should carefully considerthe choice of the scale length L (or leave that as a parameter tobe estimated as described in Sect 412) and in most cases amore tailored prior based on our broad knowledge of the distri-bution of a given stellar population in the Milky Way would bebetter The tutorial cases introduced in the next section containsome more examples of priors on distance and other astrophysi-cal parameters

The next two items discuss simplifications to the Bayesianapproach that nevertheless need to be justified carefully

Maximum likelihood modelling We have seen that priors arethe bridges that allow us to go from the probability of the obser-vations given the unknown parameters to the desired probabil-ity of the parameters given the observations It is only based onthis probability distribution that we can make statements aboutcredible intervals (uncertainties) for the inferred quantities orselect amongst competing models (a topic that is beyond thescope of this paper) However if making prior-free inferences ispreferred then maximising the likelihood is the only alternativeThe kinematic modelling presented in Lindegren et al (2000)is a non-trivial example of this A more complex example canbe found in Palmer et al (2014) The ML approach just as theBayesian framework described above allows the combinationof different types of data and accounts for selection functionsor missing data We have seen in Sect 411 that the maximumlikelihood estimate of the distance given a single parallax mea-surement is ρ = 1

$and this is a poor estimator of the distance

except for subsets of very accurate measurements In generalthe Bayesian and the maximum likelihood estimates coincide inthe limit of very small uncertainties or infinite numbers of mea-surements In such limits the maximum likelihood estimate issimpler to obtain although its computational cost may still belarge as the ML method is often equivalent to a complex optimi-sation problem involving a multi-dimensional function of manyparameters

Article number page 15 of 20

AampA proofs manuscript no 32964_Arxiv

Selecting the lsquobestrsquo data Analyses that use parallax data areoften restricted to positive parallaxes with relative uncertaintiesbelow some limit (typically 20 ) This allows working in aregime where the uncertainties of derived quantities such as dis-tance tangential velocities luminosity etc are thought to bemanageable which allows working in the space of astrophysi-cal variables rather than the data Truncation on relative parallaxerror might be justified in an exploratory phase of the data anal-ysis however there are a number of reasons why this approachis almost never advisable Even at relative uncertainties below02 the quantities calculated from the parallax can be biased andsuffer from a large variance (see Bailer-Jones 2015) More im-portantly however the selection of lsquogoodrsquo parallaxes will biasthe sample studied to nearby andor bright representatives of anystellar population and the selection may lead to discarding avery large fraction of the potential sample Hence any inferencesbased on such data will be severely biased This is clearly illus-trated in Fig 7 where for an even less strict truncation of starswith a relative uncertainties below 50 the distribution of dis-tances of the resulting sample is clearly biased with respect theoriginal sample

Accounting for data selection and incompleteness Althoughthe Gaia survey is designed such that the only selection ofsources is that they are brighter than the survey limit at G = 207the combination of the onboard detection algorithm the Gaiascanning law and the filtering of results of insufficient qualityprior to a data release lead to a complex selection function espe-cially in the early data releases This selection function should betaken into account in any data analysis and this is most naturallydone as part of a Bayesian analysis where the selection func-tion is part of the forward model that predicts the data given themodel parameters Precise prescriptions of the selection func-tions are not foreseen to be part of the data release documenta-tion Hence selection function parameters need to be includedas part of the parameters inferred by the Bayesian analysis or ifthis is not possible the selection functions have to be borne inmind when interpreting the results

Covariances in the uncertainties All the uncertainties on theastrometric data quoted in the Gaia catalogue are presented asfull covariance matrices where the diagonal elements representthe standard uncertainties on the astrometric parameters and theoff-diagonal elements the covariances or correlations betweenthe uncertainties This amounts to treating the astrometric datavector as having been drawn from a multivariate normal distri-bution with a covariance matrix as given in the Gaia catalogueThe covariances are most easily handled in the data space as partof the likelihood (see Lindegren et al 2000 for an example inthe context of kinematic modelling) If the covariances in theastrometric uncertainties are not accounted for we can easily bemisled for example by spurious features in the inferred velocityfield of an open cluster (Brown et al 1997)

The uncertainties are also correlated from one source tothe next especially over small distances on the sky A studyof the star-to-star correlations in the parallax uncertainties inGaia DR1 was done for the Kepler field (Zinn et al 2017)where independent and precise asteroseismic distances to thestars are available enabling the authors to derive an expressionfor the correlation strength and spatial scale This expressioncan be used for studies of the Kepler field but care should betaken when extrapolating to other fields on the sky The func-tional form for the star-to-star correlations used by Zinn et al

(2017) could be introduced as part of the forward model withthe Zinn et al (2017) parameters as a good first guess

For Gaia DR1 the length scale for the star-to-star corre-lations was estimated to vary from subdegree scales to tens ofdegrees on the sky where Zinn et al (2017) derived the corre-lation function over length scales of sim 02 to sim 10 degrees ForGaia DR2 Gaia Collaboration et al (2018c) estimate that thespatial correlations extend over scales of below 1 degree to 10ndash20 degrees

Accounting for non-Gaussian andor systematic uncertain-ties Although the bulk of the sources in the Gaia catalogue havenormally distributed uncertainties there is a significant fractionfor which the uncertainties exhibit non-Gaussian behaviour (egwhen uncertainties are over- or underestimated) This can be ac-counted for in the data analysis by including the uncertainties aspart of the forward model (eg Sesar et al 2017) or by explicitlymodelling an outlier population Similarly systematic uncertain-ties can be included as part of the forward model For exampleSesar et al (2017) include a global parallax zero-point as part oftheir probabilistic model used to analyse the period-luminosityrelation of RR Lyrae stars with Gaia DR1 data An alternativeapproach to the investigation of systematics in the parallaxes (ordistance estimates obtained from photometry or spectroscopyfor example) is presented in Schoumlnrich et al (2012) and is ap-plied to Gaia DR1 in Schoumlnrich amp Aumer (2017) In this casewe can consider that for samples covering a significant fractionof the sky any systematic error in the estimated distances to thestars will show up as correlations in their 3D velocity compo-nents The presence of such correlations can be used to make in-ferences about systematic errors for example in the parallaxes

Systematic uncertainties are more difficult to handle as theymay show variations as a function of source brightness or ce-lestial position they may be correlated across neighbouringsources and they may not be well understood for a given earlyGaia data release In general the information needed to accu-rately model systematic uncertainties or uncertainty correlationsbetween sources may not be readily available This informationcan be obtained from a comparison between Gaia and otherhigh-precision data (eg Zinn et al 2017 Arenou et al 2017Gaia Collaboration et al 2018a) or by examining for exampleplots of the parallax or proper motion averaged over sky regionsfor samples where the true parallax or proper motion values canbe assumed to be known such as zero parallax and proper mo-tion for quasars (see Gaia Collaboration et al 2018c for exam-ples)

Two special cases should be mentioned when the sampleis well distributed over the sky we can safely assume that thelocal systematics vanish and that only the global parallax zero-point need to be subtracted locally we may be interested not bythe absolute value of the parallaxes but by the relative ones inwhich case the difference between parallaxes and their averageremoves part of the systematics

There is no general recipe for dealing with non-Gaussian un-certainties or correlated systematic uncertainties The main ad-vice we can give here is to proceed with the analysis of the astro-metric data as they are but to keep in mind the systematics andcorrelations discussed in Sect 2 when interpreting the resultsAlternatively the forward model can be extended to include sys-tematic and correlation terms for which parameters are also tobe estimated Such models can be guided by the studies of sys-tematic uncertainties mentioned above

Article number page 16 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Testing with simulations Finally we strongly advise that theinference problem at hand should be investigated through simu-lated data and that the simulations performed should be as closeas possible to the real data (in particular correctly modelling theuncertainties and selection effects) The simulations allow theanalysis method to be developed and tested for accuracy How-ever the performance of the analysis method should be inter-preted strictly in terms of how well the assumed model explainsthe simulated observed data That is we should not fall into thetrap of trying to tune the analysis method to get an answer that isas close to the lsquotruthrsquo as possible In real problems we can onlyjudge the adequacy of a model and its parameter values by howwell they predict the observed data (to within the observationaluncertainties it should be stressed as we should avoid lsquoover-fittingrsquo the data)

5 Using astrometric data practical examples

We introduce here a few worked examples to illustrate thepoints that were made in the previous section These ex-amples are available in full detail as online tutorials in theform of source code accompanied by much more exten-sive explanation than can be provided here The source codeand corresponding Python and R Notebooks can be foundat the following URL httpsgithubcomagabrownastrometry-inference-tutorials In all cases the readeris strongly encouraged to download the source code and exper-iment with modifications of the simulated input data andor theparameter choices in the inference methods

51 Comparison of distance and distance modulusestimators

The use of the Bayesian inference with non-informative priorsdescribed in Sect 411 is illustrated and implemented inthe following tutorial httpsgithubcomagabrownastrometry-inference-tutorialstreemastersingle-sourceGraphicalUserInterface The tutorialcompares the performance of Bayesian distance estimationmethods with the SmithndashEichhorn transformation (Smith ampEichhorn 1996) (Sect 341) and the naive parallax inversion

The tutorial contains a Graphical User Interface that easilyvisualises and compares the behaviour of all these estimators fora given parallax and uncertainty For the Bayesian inference es-timations using the mode and the median are provided togetherwith a 90 confidence interval The tutorial also provides a li-brary pyrallaxes with the implementation of all these estima-tors The library can easily be customised to implement otherpriors for the Bayesian inference

Additionally an implementation of the Bayesian distanceestimator using the Exponentially Decreasing Space Densityprior introduced in Bailer-Jones (2015) will be available inTopCat (httpwwwstarlinkacuktopcat) and Stilts(httpwwwstarlinkacukstilts) from respectivelyversions 46 and 31-3 onwards

52 Inferring the distance to a single source using just theparallax

The issues surrounding the use of a parallax to infer a distancewere explored in Bailer-Jones (2015) and applied to simulatedGaia data in Astraatmadja amp Bailer-Jones (2016a) and to TGAS(Gaia DR1 ) in Astraatmadja amp Bailer-Jones (2016b) A

tutorial exploring this is provided at httpsgithubcomagabrownastrometry-inference-tutorialstreemastersingle-sourcetutorial This can be used toinvestigate how the posterior depends on the prior and the dataIt also includes a simple example of a hierarchical model toavoid specifying the exact length scale of a distance prior

53 Inferring the distance to and size of a cluster using justthe parallaxes

In many applications we are more interested in the averagedistance to a cluster (or a group of stars) rather than to theindividual stars In this case a basic mistake to be avoidedis estimating the individual distances (whatever the method)and then averaging these individual values A more correctapproach is to average the individual parallaxes and then toobtain a distance from this average value However a muchbetter solution is to set up a model for the cluster whichwould use as parameters the distance to its centre for exampleand some measure of its size and to infer its parametersThis is explored in the tutorial at httpsgithubcomagabrownastrometry-inference-tutorialstreemastermultiple-source This introduces the overall prob-lem and derives a general solution Code is implemented forthe specific case of a model which assumes a random isotropicdistribution of the true stars from the centre of the cluster Thismodel has two parameters the cluster distance and the clustersize The solution uses a small angle approximation to makethe problem simpler although it is easily extended to the caseof clusters with a significant angular extent It is applied to thePleiades selection from the Gaia DR1 main release paper(Gaia Collaboration et al 2016) The tutorial also considers theproblem of how to accommodate correlations in the measuredparallaxes of different stars Finally it also shows the resultsfrom a classical and a naive combination of stellar parallaxesto estimate the cluster distance The combination of parallaxesand proper motions of individual stars in a cluster into a singlesolution for the mean parallax and proper motion is treated asan iterative least squares problem in Gaia Collaboration et al(2017b) (see their Appendix A for details)

54 Inferring the distance and velocity of a source using theparallax and proper motions

The velocity (speed and direction) of a source in the planeof the sky can be inferred from measurements of its parallaxand two proper motions The uncertainties in all three affectthe inferred velocity and its uncertainty Moreover as theGaia parallaxes and proper motions generally have non-zerocorrelations these must also be taken into account Thiscan be done in a straightforward manner in the Bayesianapproach as is shown in the tutorial at httpsgithubcomagabrownastrometry-inference-tutorialstreemaster3d-distance This sets up a three-parametermodel (distance speed angle) for a source Using the threemeasurements (parallax two proper motions) in a multivariateGaussian likelihood and suitable priors on the parameters wecan compute the trivariate posterior This is sampled in theposterior using an MCMC algorithm for a set of stars

Article number page 17 of 20

AampA proofs manuscript no 32964_Arxiv

55 Luminosity calibration

In this tutorial (httpsgithubcomagabrownastrometry-inference-tutorialstreemasterluminosity-calibration) the problem of inferring (orcalibrating) the mean absolute magnitude of a specific classof stars is treated The measurements at hand are the parallaxand apparent magnitude for each of the stars in the sampleand the task is to infer their mean absolute magnitude microM andthe spread σM around this mean This is very similar to theproblem that Lutz amp Kelker (1973) and Turon Lacarrieu ampCreacutezeacute (1977) treated and a Bayesian approach to solving thisproblem was presented by Brown (2012) (albeit with the use ofimproper priors which we again note is bad practice) A morecomplex version of this problem (accounting for extinction anda contaminating population of stars) and its Bayesian solutionwas presented in Hawkins et al (2017) In this tutorial threeimportant points are illustrated

ndash Often the explicit computation of the distances to stars is notof interest In this example only the mean absolute magni-tude of the stars is to be estimated and the forward mod-elling approach as part of the Bayesian inference avoids theneed to calculate or estimate distances

ndash The data for all the stars carry information on the mean ab-solute magnitude including the negative parallaxes or par-allaxes with large relative errors This information can nat-urally be incorporated in a forward modelling approach (inthis example as part of a Bayesian inference method) thusavoiding the introduction of truncation biases caused by theselection of stars with lsquogoodrsquo parallaxes

ndash If the selection function is known (in this example the surveyis magnitude limited) it can and should be included in theforward modelling This accounts for sample selection biasesthat would otherwise occur

56 Period-luminosity relation

In this tutorial (httpsgithubcomagabrownastrometry-inference-tutorialstreemasterperiod-luminosity-relation) we include a hierarchicalmodel to infer period-luminosity-metallicity relations forclassical pulsating stars The full model can be applied tofundamental mode RR Lyrae stars and the abridged version(without the metallicity dependence) is suitable for samples ofclassical Cepheids We include the data set for the RR Lyraestars described and used for inference in Gaia Collaborationet al (2017a) and Delgado et al (2018) It contains a sampleof 200 stars (including fundamental radial pulsators but alsofundamentalised first overtone pulsators) with measured pe-riods apparent magnitudes in the K-band metallicities andparallaxes from the TGAS catalogue In the tutorial we describethe hierarchical model and discuss potential biases in the dataset Finally we analyse the sensitivity of the results to differentchoices of the priors and related parameters

6 Conclusions

Gaia data releases will provide a huge increase of astrometricdata available for the scientific community More than a billionparallaxes and proper motions allow new openings into many as-tronomical topics In most cases astronomers are exploiting theGaia catalogues to obtain physical quantities such as distanceand velocity Although it is easy to extract from the Gaia data it

is well known that direct inversion of parallax will lead to biaseswhich become more and more significant the larger the relativeparallax uncertainty While Gaia will provide high-quality astro-metric measurements hundreds of millions of stars have preci-sions which require proper statistical treatment in order to avoidbiased conclusions The aim of this paper is to guide the users ofGaia data to handle astrometric data correctly

In this study we summarise methods used to avoid bi-ases when converting astrometric data into physical quantitiesStarting from simple non-recommended sample truncation tomore complex methods the biases associated with the meth-ods are presented The basic recommendation is to treat deriva-tion of physical quantities from astrometric measurements asan inference problem which should be preferably handled withBayesian approach The recommended methods are described inSect 4 with a summary in Sect 43 To aid the users furtherSect 5 contains practical examples with links to Python and Rcode

Gaia will provide fundamental data for many fields of as-tronomy Further data releases will provide more data and moreprecise data Nevertheless for full use of the potential it will al-ways be necessary to pay careful attention to the statistical treat-ment of parallaxes and proper motions The purpose of this paperis to help astronomers find the correct approachAcknowledgements This work has made use of results from the European SpaceAgency (ESA) space mission Gaia the data from which were processed by theGaia Data Processing and Analysis Consortium (DPAC) Funding for the DPAChas been provided by national institutions in particular the institutions partici-pating in the Gaia Multilateral Agreement The Gaia mission website is httpwwwcosmosesaintgaia The authors are current or past members of theESA Gaia mission team and of the Gaia DPAC This work was supported by theMINECO (Spanish Ministry of Economy) through grant ESP2016-80079-C2-1-R (MINECOFEDER UE) and ESP2014-55996-C2-1-R (MINECOFEDERUE) and MDM-2014-0369 of ICCUB (Unidad de Excelencia lsquoMariacutea de Maeztursquo)and by the DLR (German space agency) via grant 50 QG 1403

ReferencesAnderson L Hogg DW Leistedt B Price-Whelan AM Bovy J Jun 2017

ArXiv e-printsArenou F Luri X 1999 In Egret D Heck A (eds) Harmonizing Cosmic

Distance Scales in a Post-HIPPARCOS Era vol 167 of Astronomical Societyof the Pacific Conference Series 13ndash32

Arenou F Luri X Babusiaux C et al Feb 2017 AampA 599 A50Astraatmadja TL Bailer-Jones CAL Dec 2016a ApJ 832 137Astraatmadja TL Bailer-Jones CAL Dec 2016b ApJ 833 119Bailer-Jones CAL Oct 2015 PASP 127 994Bovy J Sep 2017 MNRAS 470 1360Brown AGA Nov 2012 Statistical AstrometryBrown AGA Arenou F van Leeuwen F Lindegren L Luri X Aug 1997

In Bonnet RM Hoslashg E Bernacca PL et al (eds) Hipparcos - Venice rsquo97vol 402 of ESA Special Publication 63ndash68

de Bruijne JHJ Hoogerwerf R de Zeeuw PT Feb 2001 AampA 367 111Delgado HE Sarro LM Clementini G Muraveva T Garofalo A Mar 2018

ArXiv e-printsEddington AS Mar 1913 MNRAS 73 359ESA (ed) 1997 The HIPPARCOS and TYCHO catalogues Astrometric and

photometric star catalogues derived from the ESA HIPPARCOS Space As-trometry Mission vol 1200 of ESA Special Publication (available online athttpswwwcosmosesainthipparcos)

Gaia Collaboration Brown AGA Vallenari A et al Nov 2016 AampA 595A2

Gaia Collaboration Clementini G Eyer L et al Sep 2017a AampA 605 A79Gaia Collaboration van Leeuwen F Vallenari A et al May 2017b AampA 601

A19Gaia Collaboration Arenou F Luri X et al Apr 2018a AampA (special issue

for Gaia DR2)Gaia Collaboration Brown AGA Vallenari A Prusti T et al Apr 2018b

AampA (special issue for Gaia DR2)Gaia Collaboration Lindegren L Hernaacutendez J et al Apr 2018c AampA (special

issue for Gaia DR2)

Article number page 18 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Hawkins K Leistedt B Bovy J Hogg DW Oct 2017 MNRAS 471 722Iorio G Belokurov V Erkal D et al Feb 2018 MNRAS 474 2142Leistedt B Hogg DW Dec 2017a AJ 154 222Leistedt B Hogg DW Mar 2017b ArXiv e-printsLindegren L Madsen S Dravins D Apr 2000 AampA 356 1119Lindegren L Lammers U Hobbs D et al Feb 2012 AampA 538 A78Lindegren L Lammers U Bastian U et al Nov 2016 AampA 595 A4Lutz TE Kelker DH Oct 1973 PASP 85 573Malmquist GK Mar 1920 Meddelanden fran Lunds Astronomiska Observato-

rium Serie II 22 3Palmer M Arenou F Luri X Masana E Apr 2014 AampA 564 A49Robin C Luri X Reyleacute C et al 2012 AampA 543 A100Schoumlnrich R Aumer M Dec 2017 MNRAS 472 3979Schoumlnrich R Binney J Asplund M Feb 2012 MNRAS 420 1281Schroumlder SE Kaper L Lamers HJGLM Brown AGA Dec 2004 AampA

428 149Sesar B Fouesneau M Price-Whelan AM et al Apr 2017 ApJ 838 107Smith H Jr Eichhorn H Jul 1996 MNRAS 281 211Trumpler RJ Weaver HF 1953 Statistical astronomyTuron Lacarrieu C Creacutezeacute M Apr 1977 AampA 56 273Zinn JC Huber D Pinsonneault MH Stello D Aug 2017 ApJ 844 166

Article number page 19 of 20

AampA proofs manuscript no 32964_Arxiv

Appendix A Description of the simulated samplesused in this paper

The data used in this paper is from a Gaia Universe Model Snap-shot (GUMS) simulation (Robin C et al 2012) together withthe Gaia-like uncertainties and an estimation of the observabledata The uncertainties were computed from an implementationof the recipes described in the Gaia Science performance webpage4 provided by the python PyGaia toolkit5 which then haveto be re-scaled to fit the Gaia DR2 expectations

The simulation contains around 109 sources including onlysingle stars ie stars not belonging to multiple systems up toG lt 20 magnitude distributed as shown in Fig A1

0 5 10 15 20 25G magnitude [mag]

100

101

102

103

104

105

106

107

108

109

Num

ber

of

sourc

es

[log]

G magnitude histogram

Fig A1 Histogram of the starsrsquo G magnitude The total number ofsources is 1 069 138 714 which are distributed in bins of size ∆G = 02the sample is limited to G lt 20

To compute the Gaia potential observables standard uncer-tainties must be added to the simulation Because astrometricquantities (positions parallaxes and proper motions) are relateda single formalism to derive its standard errors is needed PyGaiaimplements a simple performance model depending on the VndashICcolour term and the G magnitude to estimate the end-of-missionerrors for the parallax uncertainty

σ$[microas] =radicminus1631 + 680766 z + 32732 z2 [0986+(1 minus 0986) (V-IC)]

(A1)

with

z = max(1004(1209minus15)1004(Gminus15)) (A2)

(see the Gaia science performance web for more details) It alsotakes into account the variation of the uncertainties over the skybecause of the scanning law tabulated as a function of the eclip-tic latitude6 β

However these are end-of-mission uncertainties so theyhave to be scaled by the fraction of mission time completed in or-der to get an estimation of them for Gaia DR2 In the case of the

parallax only the factorradic

5L being L the mission time (years)

included in Gaia DR2 need to be applied this error model is

4 httpswwwcosmosesaintwebgaiascience-performance5 httpspypipythonorgpypiPyGaia6 httpswwwcosmosesaintwebgaiatable-2-with-ascii

further described in Arenou et al (2017) In our case we havealso updated the calibration floor to take into account the prop-erties of the Gaia DR2 formal errors as shown in Fig A2 Thiscalibration floor introduces a minimum formal error stemmingfrom the fact that the calibrations used in the data processing(models and parameters) are at this stage still being refined Thisfloor affects mainly bright stars while for faint stars the photonnoise dominates Fig A2 shows the model used in the simula-tion for the Gaia DR2 parallax uncertainties as a function of theG magnitude

Fig A2 Average estimated errors as function of G The dash-dottedline represents the uncertainty at Gaia DR2 while the dashed linerepresents the end-of-mission uncertainties The solid line representsGaia DR2 errors including systematics

Article number page 20 of 20

  • 1 Introduction
  • 2 Gaia astrometric data
    • 21 Uncertainties
    • 22 Correlations
    • 23 Systematic errors
    • 24 Completeness
      • 3 Critical review of the traditional use of parallaxes
        • 31 Measurement of parallaxes
        • 32 Estimating distance by inverting the parallax
        • 33 Sample truncation
        • 34 Corrections and transformations
          • 341 The SmithndashEichhorn correction
          • 342 The LutzndashKelker correction
            • 35 Astrometry-based luminosity
              • 4 Recommendations for using astrometric data
                • 41 Using Gaia astrometric data how to proceed
                  • 411 Bayesian inference of distances from parallaxes
                  • 412 Bayesian inference of distances from parallaxes and complementary information
                    • 42 Absolute magnitudes tangential velocities and other derived quantities
                    • 43 Further recommendations
                      • 5 Using astrometric data practical examples
                        • 51 Comparison of distance and distance modulus estimators
                        • 52 Inferring the distance to a single source using just the parallax
                        • 53 Inferring the distance to and size of a cluster using just the parallaxes
                        • 54 Inferring the distance and velocity of a source using the parallax and proper motions
                        • 55 Luminosity calibration
                        • 56 Period-luminosity relation
                          • 6 Conclusions
                          • A Description of the simulated samples used in this paper
Page 10: Gaia Data Release 2: Using Gaia parallaxes · The Gaia Data Release 2 (Gaia DR2 ) Gaia Collaboration et al. (2018b) provides precise positions, proper motions, and paral-laxes for

AampA proofs manuscript no 32964_Arxiv

astrometry-based luminosity (ABL) method (Arenou amp Luri1999) originating from Malmquist (1920) The ABL methodconsists in substituting the absolute magnitudes by a proxy thatis linearly dependent on the parallax The original proposal was

aV equiv 1002MV = $10mV +5

5 (16)

and has been recently used to obtain maximum likelihood esti-mates of the period-luminosity relation coefficients for Cepheidsand RR Lyrae stars (Gaia Collaboration et al 2017a) and to im-prove the Gaia parallax uncertainties using deconvolved colour-magnitude diagrams as prior (Anderson et al 2017) The newastrometry-based luminosity depends linearly on the parallaxand thus its uncertainty can be expected to have an approxi-mately Gaussian distribution if the fractional uncertainty of theapparent magnitude is negligible This is more often the casethan for fractional parallax uncertainties and is in general a goodapproximation

Unfortunately the astrometry-based luminosity can only beapplied to the study of the luminosity and can do nothing forthe analysis of spatial distributions where distances or tangentialvelocities are inevitable ingredients

4 Recommendations for using astrometric data

In this section we provide specific advice on the use of astro-metric data in astronomical data analysis Although the focus ison the use of Gaia data many of the recommendations hold forthe analysis of astrometric data in general To illustrate the rec-ommendations we also provide a small number of worked ex-amples ranging from very basic demonstrations of the issuesmentioned in Sect 3 to full Bayesian analyses Some of theseexamples are available in the Gaia archive tutorial described inSect 5

41 Using Gaia astrometric data how to proceed

The fundamental quantity sought when measuring a stellar par-allax is the distance to the star in question However as discussedin the previous sections the quantity of interest has a non-linearrelation to the measurement r = 1$True and is constrainedto be positive while the measured parallax can be zero or evennegative Starting from a measured parallax which is normallydistributed about the true parallax this leads to a probability den-sity for the simple distance estimator ρ = 1$ (see Sect 3) forwhich the moments are defined in terms of unknown quantitiesThis means we cannot calculate the variance of the estimator orthe size of a possible bias which renders the estimator uselessfrom the statistical point of view

Our first and main recommendation is thus to alwaystreat the derivation of (astro-)physical parameters from as-trometric data in particular when parallaxes are involvedas an inference problem which should preferably be handledwith a full Bayesian approach

411 Bayesian inference of distances from parallaxes

The Bayesian approach to inference involves estimating a PDFover the quantity of interest given the observables In this casewe want to estimate the distance r given the parallax $ Afuller treatment of this problem has been given in Bailer-Jones

(2015) so only a brief summary will be given here Using Bayesrsquotheorem we can write the posterior as

P(r | $) =1Z

P($ | r)P(r) (17)

Formally everything is also conditioned on the parallax uncer-tainty σ$ and any relevant constraints or assumptions but sym-bols for these are omitted for brevity The quantity P($ | r) isthe likelihood from Eq (12) The prior P(r) incorporates ourassumptions and Z is a normalisation constant

In addition to the likelihood there are two important choiceswhich must be made to estimate a distance the choice of priorand the choice of estimator We will first focus on the formerand start discussing the simplest prior the uniform unboundedprior With a uniform boundless (and thus improper) prior ondistances the posterior is proportional to the likelihood and ifwe choose the mode of the posterior as our estimator then thesolution is mathematically equivalent to maximising the likeli-hood However a boundless uniform prior permits negative dis-tances which are non-physical so we should at least truncate itto exclude these values

The more measurements we have or the more precise themeasurements are the narrower the likelihood and the lower theimpact of the prior should be on the posterior It turns out how-ever that with the unbounded uniform prior the posterior is im-proper ie it is not normalisable Consequently the mean andmedian are not defined The only point estimator is the modeie rest = 1$ (the standard deviation and the other quantilesare likewise undefined) which is rather restrictive Finally thisBayesian distance estimate defined for an unbounded uniformprior reduces to the maximum likelihood estimate and coincideswith the naive inversion discussed in Sect 32 The posterioris ill-defined for the unbounded uniform prior for parallaxesThis prior describes an unrealistic situation where the observer isplaced at the centre of a distribution of sources that is sphericallysymmetric and the volume density of which decreases sharplywith distance

The solution to these problems (non-physical distances im-proper posterior) is to use a more appropriate prior The proper-ties of various priors and estimators have been studied by Bailer-Jones (2015) and Astraatmadja amp Bailer-Jones (2016b) The lat-ter makes a detailed study using a Milky Way model for a priorand also investigates how the estimates change when the Gaiaphotometric measurements are used in addition to the parallaxOne of the least informative priors we can use is the exponen-tially decreasing space density prior

P(r) =

1

2L3 r2eminusrL if r gt 0

0 otherwise (18)

For distances r L this corresponds to a constant space densityof stars with the probability dropping exponentially at distancesmuch larger than the mode (which is at 2L) Examples of theshape of the posterior for parallaxes of different precisions areshown in Bailer-Jones (2015) and Astraatmadja amp Bailer-Jones(2016b)

The posterior obtained for the prior defined in Eq (18) is nor-malised and thus we have a choice of point estimators (meanmedian or mode) Also the distribution is asymmetric andtwo quantiles (5 and 95) rather than the standard deviationare recommended to summarise the uncertainty in the point es-timate The median as a point estimate is guaranteed to lie

Article number page 10 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

between these quantiles Astraatmadja amp Bailer-Jones (2016b)used this prior as well as a Milky Way prior to infer distancesfor the two million TGAS stars in the first Gaia data releaseThe behaviour of the estimates derived from the exponentiallydecreasing space density prior can be explored using the interac-tive tool available in the tutorial described in Sect 51

In general the introduction of reasonable prior probabilitiesaccounts for the LutzndashKelker bias although the inevitable mis-match between the true distribution of parallaxes and the priorused will result in less accurate inferences In any case the ad-vantage with respect to the methods discussed in Sect 3 is cleari) we do not need to tabulate corrections for each prior assumingconstant f ii) we do not need to dispose of non-positive par-allaxes iii) we obtain a proper full posterior distribution withwell-defined moments and credible intervals iv) even simplepriors such as the exponential decreasing volume density willimprove our estimates with respect to the unrealistic prior un-derlying the maximum likelihood solution rest = 1$ and fi-nally v) we obtain estimators that degrade gracefully as the dataquality degrades and with credible intervals that grow with theobservational uncertainties until they reach the typical scales ofthe prior when the observations are non-informative These ad-vantages come at the expense of an inference that is more com-putationally demanding in general (as it requires obtaining theposterior and its summary statistics if needed) the need for athoughtful choice of a prior distribution and the analysis of theinfluence of the prior on the inference results

Figure 12 shows the distribution of means (left) modes (cen-tre) and medians (right) of the posteriors inferred for a simu-lation of 105 sources drawn from an exponentially decreasingspace density distribution This simulation represents the un-likely case where the prior is a perfect representation of the truedistribution

From a Bayesian perspective the full posterior PDF is thefinal result of our inference if we only use parallax measure-ments to infer the distance (see below) and further analysesshould make use of it as a whole This is our recommendationin general avoid expectations and summaries of the posteriorHowever it is often useful to compute summary statistics suchas the mean (expectation) median mode quantiles etc to havean approximate idea of the distribution but we should not usethese summaries for further inference for example to estimateabsolute magnitudes Cartesian velocities etc We recommendinferring the full posterior distributions for these derived quanti-ties using the posterior of the true parallax or of the distance orusing the same Bayesian scheme as for the true parallax as ex-plained in Sect 42 In Fig 13 we show the values of the mean(left) mode (centre) and median (right) that we would obtainfrom a set of 104 simulated observations of a star at 100 parsecswith f = 02 We assume a Gaussian distribution of the obser-vations around the true parallax The posterior distribution is in-ferred using Eq (17) and two priors a uniform volume densityof sources truncated at 1 kpc (results in grey) and a uniform den-sity of sources multiplied by an exponential decay of length scale200 pc as defined in Eq (18) (in blue) The expectation valuesof the histograms are shown as dashed lines of the same colourwith the true value (100 pc) shown as a red dashed line We seein general that i) the truncation has the effect of increasing thenumber of overestimated distances ii) the three estimators arebiased towards larger distances (smaller parallaxes) but the ex-pectation of the mode is significantly closer to the true valueand iii) the abrupt truncation of the prior results in a spuriouspeak of modes at the truncation distance as already discussed inBailer-Jones (2015)

Figure 14 and Table 1 show a comparison of the absolutevalue of the empirical bias and standard deviation associatedwith some distance estimators discussed in this paper as a func-tion of the measured fractional uncertainty in the parallax Wechose the measured value even though it is a very poor and non-robust estimator because as stated in Sect 32 we never have ac-cess to the true fractional parallax uncertainty This figure showsthe results obtained for 107 sources in the Gaia DR2 simulationdescribed in Appendix A for the maximum likelihood estimateρ = 1

$with and without the SmithndashEichhorn correction and for

the mode estimates based on the posterior distribution for twopriors (a uniform distance prior UD with maximum distancerlim = 100 kpc and an exponentially decreasing space densityprior EDSD with L = 135 kpc) neither of which matches thetrue distribution of sources in the simulation Only the mode ofthe posteriors is plotted (but not the mean or the median) for thesake of clarity The conclusions described next are only valid un-der the conditions of the exercise and are provided as a demon-stration of the caveats and problems described in previous sec-tions not as a recommendation of the mode of the posterior in-ferred under the EDSD prior as an estimator At the risk of re-peating ourselves we emphasise the need to adopt priors adaptedto the inference problem at hand Also the conclusions only holdfor the used simulation (where we generate the true distances andhence can calculate the bias and standard deviation) and need notbe representative of the true performance for the real Gaia dataset They can be summarised as follows

ndash the mode of the EDSD prior shows the smallest bias andstandard deviation in practically the entire range of estimatedfractional parallax uncertainties (in particular everywherebeyond the range of fapp represented in the plot)

ndash the SmithndashEichhorn estimate shows pathological biases andstandard deviations in the vicinity of the supposedly best-quality measurements at fapp = 0 Away from this region itprovides the next less biased estimates (averaged over binsof fapp) after the mode of the EDSD posterior

412 Bayesian inference of distances from parallaxes andcomplementary information

The methodology recommended in the previous paragraphs isuseful when we only have observed parallaxes to infer distancesThere are however common situations in astronomy where theparallaxes are only one of many observables and distances arenot the final goal of the inference problem but a means to achieveit In this context we recommend an extension of the classicalBayesian inference methods described in the previous sectionThese problems are characterised by a set of observables andassociated uncertainties (that include but are not restricted toparallaxes) and a series of parameters (the values of which areunknown a priori) with complex interdependence relationshipsamongst them Some of these parameters will be the ultimategoal of the inference process Other parameters do play an im-portant role but we are not interested in their particular valuesand we call them nuisance parameters following the literatureFor example in determining the shape of a stellar associationthe individual stellar distances are not relevant by themselvesbut only insomuch as we need them to achieve our objectiveWe show below how we deal with the nuisance parameters Theinterested reader can find applications of the methodology de-scribed in this section to inferring the coefficients of period-luminosity relations in Gaia Collaboration et al (2017a) andSesar et al (2017) Also the same methodology (a hierarchical

Article number page 11 of 20

AampA proofs manuscript no 32964_Arxiv

Mean

Cou

nts

minus8000 minus4000 0 2000

0e+

002e

minus04

4eminus

046e

minus04

8eminus

04

Mode

Cou

nts

minus8000 minus4000 0 2000

Median

Cou

nts

minus8000 minus4000 0 2000

0e+

002e

minus04

4eminus

046e

minus04

8eminus

04

Fig 12 Probability distribution of the residuals of the Bayesian estimate of the true distance in parsecs for 100000 simulated stars drawn from auniform density plus exponential decay distribution and σ$ = 3 middot 10minus1 mas (orange) 3 mas (blue) and 30 mas (grey)

Mean

Cou

nts

0 200 400 600 800 1000

000

00

005

001

00

015

002

0

Mode

Cou

nts

0 200 400 600 800 1000

Median

Cou

nts

0 200 400 600 800 1000

000

00

005

001

00

015

002

0

Fig 13 Distributions of means (left) modes (centre) and medians (right) for a series of 104 posteriors calculated for a star at a true distance of100 pc and f = 02 and observed parallaxes drawn at random from the corresponding Gaussian distribution Posteriors are inferred with a uniformdensity prior truncated at 1 kpc (grey) or with a uniform density with exponential decay prior and length scale 135 kpc (blue) The red vertical linemarks the true parallax the grey and blue lines correspond to the expected value (mean) of each distribution (same colours as for the histograms)

Bayesian model) is applied in Hawkins et al (2017) where theconstraint on the distances comes not from a period-luminosityrelation but from the relatively small dispersion of the abso-lute magnitudes and colour indices of red clump stars A lastexample of this methodology can be found in Leistedt amp Hogg(2017b) where the constraint comes from modelling the colour-magnitude diagram

Just as in the previous section where we aimed at estimat-ing distances from parallaxes alone the two key elements in thiscase are the definitions of a likelihood and a prior The likeli-hood represents the probability distribution of the observablesgiven the model parameters Typically the likelihood is basedon a generative or forward model for the data Such modelspredict the data from our assumptions about the physical pro-cess that generates the true values (ie the distribution of stars in

space) and our knowledge of the measurement process (eg jus-tifying the assumption of a normal distribution of the observedparallax around its true value) Forward models can be used togenerate arbitrarily large synthetic data sets for a given set ofthe parameters In this case however where we are concernedwith several types of measurements that depend on parametersother than the distance the likelihood term will be in generalmore complex than in Sect 411 and may include probabilisticdependencies between the parameters The term hierarchical ormulti-level model is often used to refer to this kind of model

Let us illustrate the concept of hierarchical models with asimple extension of the Bayesian model described in Sect 411where instead of assuming a fixed value of the prior length scaleL in Eq (18) we make it another parameter of the model and tryto infer it Let us further assume that we have a set of N parallax

Article number page 12 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Table 1 Average bias and standard deviation in three regimes of fapp for four distance estimators discussed in this paper (from left to right) themode of the posterior based on the exponentially decreasing space density (EDSD) prior the mode of the posterior of the uniform distance (UD)distribution the maximum likelihood estimate corrected according to Smith amp Eichhorn (1996) abbreviated as SE and the maximum likelihood(ML) estimate The wider ranges of fapp exclude narrower ranges shown in previous rows of the table

Summary fapp Range EDSD UD SE ML

Bias(-11) -02 97 342 -095(-55) -03 107 -034 -12(-5050) -03 162 -04 -38

Std Deviation(-11) 04 80 6858 05(-55) 04 84 05 195(-5050) 04 106 04 171

fapp

log 1

0(|B

ias|

)

minus4

minus2

02

4

Mode EDSDMode UDSmithminusEichhorn estimateML estimate

fapp

log 1

0(σ)

minus1

12

34

5

minus5 minus3 minus1 1 3 5

Mode EDSDMode UDSmithminusEichhorn estimateML estimate

Fig 14 Bias (top) and standard deviation (bottom) averaged over binsof the estimated fractional parallax uncertainty fapp for four estimatorsof the distance the maximum likelihood (ML) estimator rest = 1

$(or-

ange) the ML estimator corrected as described in Smith amp Eichhorn(1996) (light blue) the mode of the posterior obtained with an expo-nentially decreasing space density (EDSD) prior and L = 135 kpc (darkgreen) and the mode of the posterior obtained with a uniform distance(UD) prior truncated at 100 kpc (red)

measurements $k one for each of a sample of N stars In thiscase the likelihood can be written as

p($k|rk L) = p($k | rk) middot p(rk | L)

= p($k | rk) middotNprod

k=1

p(rk | L) (19)

where rk is the true unknown distance to the kth star We note thatvery often the measured parallaxes are assumed independentand thus p($k | rk) is written as the product

prodNk=1 p($k | rk)

This is incorrect in general for Gaia parallaxes because the par-allax measurements are not independent As described in GaiaCollaboration et al (2018c) and Sect 2 of this paper there areregional correlations amongst them (see Sect 43) but for thesake of simplicity let us assume the sample of N measurementsis spread all over the celestial sphere such that the correlationsaverage out and can be neglected Hence we write

p($k|rk L) =

Nprodk=1

p($k | rk) middot p(rk | L) (20)

Under the assumption of Gaussian uncertainties the first termin the product is given by Eq (6) while the second is given byEq (18)

This likelihood can be represented by a simple directed graph(see Fig 15) that provides information about the conditional de-pendencies amongst the parameters The shaded nodes representthe observations the open circles represent model parametersand the small black circles at the origin of the arrows representmodel constants The arrows denote conditional dependence re-lations and the plate notation indicates repetition over the mea-surements k

The next key element is as in Sect 411 the prior Accord-ing to Fig 15 the only parameter that needs a prior specificationis the one without a parent node L The rest of the arcs in thegraph are defined in the likelihood term (Eq (20)) If the sampleof N stars were representative of the inner Galactic halo for ex-ample we could use a Gaussian prior centred at asymp 30 kpc (seeeg Iorio et al 2018 and references therein) Such a hierarchicalmodel can potentially shrink the individual parallax uncertaintiesby incorporating the constraint on the distribution of distances

If we are only interested in the individual distances rk wecan consider L as a nuisance parameter

p($Truek | $k) =

intp($Truek L | $k) middot dL (21)

=

intp($Truek | $k L) middot p(L | $k) middot dL

This integral (known as the marginalisation integral) allows us towrite the posterior we are interested in without having to fix thevalue of L to any particular value Depending on the objective ofthe inference we could have alternatively determined the poste-rior distribution of L by marginalising the individual distanceswith an N-dimensional integral over the rk parameters

Article number page 13 of 20

AampA proofs manuscript no 32964_Arxiv

L

rk

$ σ$

microL

σL

k = 1 2 N

Fig 15 Directed acyclic graph that represents a hierarchical Bayesianmodel of a set of N parallax measurements characterised by uncertain-ties σ$ and true distances drawn from an exponentially decreasing den-sity distribution of distances (see Eq (18)) The scale length of the ex-ponential decrease L is itself a model parameter that we can infer fromthe sample Its prior is defined in this case for the sake of simplicity asa Gaussian distribution of mean microL and standard deviation σL

In parameter spaces of dimensionality greater than 3ndash4 thecomputation of the possibly marginalised posteriors andor ev-idence requires efficient sampling schemes like those inspiredin Markov chain Monte Carlo (MCMC) methods to avoid largenumbers of calculations in regions of parameter space with neg-ligible contributions to the posterior This adds to the highercomputational burden of the Bayesian inference method men-tioned in the previous section

The previous simple example can be extended to includemore levels in the hierarchy and more importantly more pa-rameters and measurement types Section 55 and 56 developin greater detail two examples of hierarchical models of directapplicability in the Gaia context

42 Absolute magnitudes tangential velocities and otherderived quantities

The approaches described in the previous sections can be appliedto any quantity we want to estimate using the parallax For exam-ple if we want to infer the absolute magnitude MG then giventhe measured apparent magnitude G and line-of-sight extinctionAG the true parallax $True is related to MG via the conservationof flux

5 log$True = MG + AG minusG minus 5 (22)

Assuming for simplicity that G and AG are known Bayesrsquotheorem gives the posterior on MG as

P(MG | $G AG) =1Z

P($ | MG AGG)P(MG) (23)

where the likelihood is still the usual Gaussian distribution forthe parallax (Eq (6)) in which the true parallax is given byEq (22) As this expression is non-linear we again obtain anasymmetric posterior PDF over MG the exact shape of whichalso depends on the prior

The inference of other quantities can be approached in thesame way In general we need to consider a multi-dimensionallikelihood to accommodate the measurement uncertainties (andcorrelations) in all observed quantities For instance the set ofparameters θ = r v φ (distance tangential speed and direc-tion measured anticlockwise from north) can be inferred fromthe Gaia astrometric measurements o = $ microαlowast microδ (where microαlowastand microδ are the measured proper motions) using the likelihood

p(o | θ) = N(θΣ) =1

(2π)32|Σ|12exp

(minus

12

(o minus x)T Σminus1(o minus x))

(24)

where N denotes the Gaussian distribution Σ is the full (non-diagonal) covariance matrix provided as part of the Gaia DataRelease and

x =

(1r

v sin(φ)r

v cos(φ)

r

)(25)

is the vector of model parameters geometrically transformed intothe space of observables in noise-free conditions Equation (24)assumes correlated Gaussian uncertainties in the measurements

The posterior distribution can then be obtained by multiply-ing the likelihood by a suitable prior The simplest assumptionwould be a separable prior such that p(θ) = p(r) middot p(v) middot p(φ)where p(v) and p(φ) should reflect our knowledge about thedynamical properties of the population from where the sourceor sources were drawn (eg thin disk thick disk bulge halo)Again hierarchical models can be used in the analysis of sam-ples in order to infer the population properties (prior hyper-parameters) themselves

Similar procedures can be followed to infer kinematic ener-gies or full 3D velocities when the forward model is extendedwith radial velocity measurements

43 Further recommendations

In this subsection we provide some further recommendationsand guidance in the context of the Bayesian approach outlinedabove Although powerful inference with Bayesian methodsusually comes at a large computational cost Some of the rec-ommendations below can also be seen in the light of taking dataanalysis approaches that approximate the Bayesian methodologyand can be much faster

Where possible formulate the problem in the data spaceThe problems caused by the ill-defined uncertainties on quan-tities derived from parallaxes can be avoided by carrying outthe analysis in the data space where the behaviour of the un-certainties is well understood This means that the quantities tobe inferred are treated as parameters in a forward or generativemodel that is used to predict the data Some adjustment processthen leads to estimates of the parameters A very simple forwardmodelling example can be found in Schroumlder et al (2004) whostudied the luminosity calibrations of O-stars by predicting the

Article number page 14 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

expected Hipparcos parallaxes from the assumed luminosity cal-ibration and comparing those to the measured parallaxes A morecomplex example can be found in Lindegren et al (2000) whopresent a kinematic model for clusters which describes the ve-locity field of the cluster members and predicts the proper mo-tions accounting for the astrometric uncertainties and covari-ances As shown in previous sections the Bayesian approachnaturally lends itself to (and in fact requires) forward modellingof the data

Forward modelling has the added advantage that it forces usto consider the proper formulation of the questions asked fromthe astrometric data This naturally leads to the insight that oftenthe explicit knowledge of the distances to sources is not of inter-est For example in the Schroumlder et al (2004) case an assumedluminosity of the O-stars and their known apparent magnitudeis sufficient to predict the observed parallaxes In more complexanalyses the distances to sources can often be treated as nuisanceparameters which in a Bayesian setting can be marginalised outof the posterior

Use all relevant information Although the parallax has a di-rect relation to the distance of a star it is not the only measure-ment that contains distance information The apparent magni-tude and the colour of a star carry significant information on theplausible distances at which it can be located as the colour pro-vides information on plausible absolute magnitude values for thestar This is used in two papers (Leistedt amp Hogg 2017a An-derson et al 2017) in which the information contained in thephotometry of stars is combined with Gaia DR1 parallax in-formation to arrive at more precise representations of the colour-magnitude diagram than can be obtained through the parallaxesalone Similarly proper motions contain distance informationwhich can be used to good effect to derive more precise distancesor luminosities for open cluster members (eg de Bruijne et al2001 Gaia Collaboration et al 2017b) The Bayesian approachnaturally allows for the combination of different types of data inthe modelling of the problem at hand It should be emphasisedhowever that adding additional data necessitates increasing themodel complexity (eg to include the relation among apparentmagnitude absolute magnitude extinction and parallax) whichleads to the need to make more assumptions

Incorporate a proper prior Bailer-Jones (2015) discussed thesimplest case of inferring the distance to a source from a sin-gle observed parallax He showed that if the minimal prior thatr should be positive is used the resulting posterior is not nor-malisable and hence has no mean variance other momentsor percentiles This behaviour of the posterior is not limitedto inference of distance Examples of other quantities that arenon-linearly related to parallax are absolute magnitude (M prop

log10 $) tangential velocity (vT prop 1$) kinetic energy and an-gular momentum (both proportional to 1$2 when determinedrelative to the observer) In all these cases it can be shown thatthe posterior for an improper uniform prior is not normalisableand has no moments The conclusion is that a proper prior onthe parameters to be estimated should be included to ensure thatthe posterior represents a normalised probability distribution Ingeneral using unconstrained non-informative priors in Bayesiandata analysis (such as the one on r above) is bad practice In-evitably there is always a mismatch between the prior and thetrue distribution (if there were not there would be no need todo the inference) This will unavoidably lead to some biases al-though the better the data the smaller these will be We cannot

expect to do much better than our prior knowledge in case weonly have access to poor data The Bayesian approach guaran-tees a graceful transition of the posterior into the prior as thedata quality degrades

The above discussion raises the question of what priors to in-clude for a specific inference problem Simple recipes cannot begiven as the prior to be used depends on the problem at hand andthe information already available When deciding on a prior weshould keep in mind that some information is always availableEven if a parallax is only available for a single star we knowthat it cannot be located at arbitrary distances It is unlikely tobe much closer that 1 pc (although we cannot fully exclude thepresence of faint stars closer than Proxima Centauri) and it mustbe located at a finite distance (we can observe the star) Thiswould suggest a non-informative uniform prior on r with lib-eral lower and upper bounds (such that the prior is normalised)However as pointed out in Bailer-Jones (2015) a uniform distri-bution in r implies a space density of stars that falls of as 1r2

from the position of the Sun which is of course physically im-plausible Hence one should assume a reasonable distribution ofstars in space and construct the prior on r accordingly Bailer-Jones (2015) presents a prior derived from a uniform space den-sity with an exponential cut-off which was used in Astraatmadjaamp Bailer-Jones (2016b) to derive distances to stars for which par-allaxes are listed in Gaia DR1 This prior should not be usedindiscriminately at the very least we should carefully considerthe choice of the scale length L (or leave that as a parameter tobe estimated as described in Sect 412) and in most cases amore tailored prior based on our broad knowledge of the distri-bution of a given stellar population in the Milky Way would bebetter The tutorial cases introduced in the next section containsome more examples of priors on distance and other astrophysi-cal parameters

The next two items discuss simplifications to the Bayesianapproach that nevertheless need to be justified carefully

Maximum likelihood modelling We have seen that priors arethe bridges that allow us to go from the probability of the obser-vations given the unknown parameters to the desired probabil-ity of the parameters given the observations It is only based onthis probability distribution that we can make statements aboutcredible intervals (uncertainties) for the inferred quantities orselect amongst competing models (a topic that is beyond thescope of this paper) However if making prior-free inferences ispreferred then maximising the likelihood is the only alternativeThe kinematic modelling presented in Lindegren et al (2000)is a non-trivial example of this A more complex example canbe found in Palmer et al (2014) The ML approach just as theBayesian framework described above allows the combinationof different types of data and accounts for selection functionsor missing data We have seen in Sect 411 that the maximumlikelihood estimate of the distance given a single parallax mea-surement is ρ = 1

$and this is a poor estimator of the distance

except for subsets of very accurate measurements In generalthe Bayesian and the maximum likelihood estimates coincide inthe limit of very small uncertainties or infinite numbers of mea-surements In such limits the maximum likelihood estimate issimpler to obtain although its computational cost may still belarge as the ML method is often equivalent to a complex optimi-sation problem involving a multi-dimensional function of manyparameters

Article number page 15 of 20

AampA proofs manuscript no 32964_Arxiv

Selecting the lsquobestrsquo data Analyses that use parallax data areoften restricted to positive parallaxes with relative uncertaintiesbelow some limit (typically 20 ) This allows working in aregime where the uncertainties of derived quantities such as dis-tance tangential velocities luminosity etc are thought to bemanageable which allows working in the space of astrophysi-cal variables rather than the data Truncation on relative parallaxerror might be justified in an exploratory phase of the data anal-ysis however there are a number of reasons why this approachis almost never advisable Even at relative uncertainties below02 the quantities calculated from the parallax can be biased andsuffer from a large variance (see Bailer-Jones 2015) More im-portantly however the selection of lsquogoodrsquo parallaxes will biasthe sample studied to nearby andor bright representatives of anystellar population and the selection may lead to discarding avery large fraction of the potential sample Hence any inferencesbased on such data will be severely biased This is clearly illus-trated in Fig 7 where for an even less strict truncation of starswith a relative uncertainties below 50 the distribution of dis-tances of the resulting sample is clearly biased with respect theoriginal sample

Accounting for data selection and incompleteness Althoughthe Gaia survey is designed such that the only selection ofsources is that they are brighter than the survey limit at G = 207the combination of the onboard detection algorithm the Gaiascanning law and the filtering of results of insufficient qualityprior to a data release lead to a complex selection function espe-cially in the early data releases This selection function should betaken into account in any data analysis and this is most naturallydone as part of a Bayesian analysis where the selection func-tion is part of the forward model that predicts the data given themodel parameters Precise prescriptions of the selection func-tions are not foreseen to be part of the data release documenta-tion Hence selection function parameters need to be includedas part of the parameters inferred by the Bayesian analysis or ifthis is not possible the selection functions have to be borne inmind when interpreting the results

Covariances in the uncertainties All the uncertainties on theastrometric data quoted in the Gaia catalogue are presented asfull covariance matrices where the diagonal elements representthe standard uncertainties on the astrometric parameters and theoff-diagonal elements the covariances or correlations betweenthe uncertainties This amounts to treating the astrometric datavector as having been drawn from a multivariate normal distri-bution with a covariance matrix as given in the Gaia catalogueThe covariances are most easily handled in the data space as partof the likelihood (see Lindegren et al 2000 for an example inthe context of kinematic modelling) If the covariances in theastrometric uncertainties are not accounted for we can easily bemisled for example by spurious features in the inferred velocityfield of an open cluster (Brown et al 1997)

The uncertainties are also correlated from one source tothe next especially over small distances on the sky A studyof the star-to-star correlations in the parallax uncertainties inGaia DR1 was done for the Kepler field (Zinn et al 2017)where independent and precise asteroseismic distances to thestars are available enabling the authors to derive an expressionfor the correlation strength and spatial scale This expressioncan be used for studies of the Kepler field but care should betaken when extrapolating to other fields on the sky The func-tional form for the star-to-star correlations used by Zinn et al

(2017) could be introduced as part of the forward model withthe Zinn et al (2017) parameters as a good first guess

For Gaia DR1 the length scale for the star-to-star corre-lations was estimated to vary from subdegree scales to tens ofdegrees on the sky where Zinn et al (2017) derived the corre-lation function over length scales of sim 02 to sim 10 degrees ForGaia DR2 Gaia Collaboration et al (2018c) estimate that thespatial correlations extend over scales of below 1 degree to 10ndash20 degrees

Accounting for non-Gaussian andor systematic uncertain-ties Although the bulk of the sources in the Gaia catalogue havenormally distributed uncertainties there is a significant fractionfor which the uncertainties exhibit non-Gaussian behaviour (egwhen uncertainties are over- or underestimated) This can be ac-counted for in the data analysis by including the uncertainties aspart of the forward model (eg Sesar et al 2017) or by explicitlymodelling an outlier population Similarly systematic uncertain-ties can be included as part of the forward model For exampleSesar et al (2017) include a global parallax zero-point as part oftheir probabilistic model used to analyse the period-luminosityrelation of RR Lyrae stars with Gaia DR1 data An alternativeapproach to the investigation of systematics in the parallaxes (ordistance estimates obtained from photometry or spectroscopyfor example) is presented in Schoumlnrich et al (2012) and is ap-plied to Gaia DR1 in Schoumlnrich amp Aumer (2017) In this casewe can consider that for samples covering a significant fractionof the sky any systematic error in the estimated distances to thestars will show up as correlations in their 3D velocity compo-nents The presence of such correlations can be used to make in-ferences about systematic errors for example in the parallaxes

Systematic uncertainties are more difficult to handle as theymay show variations as a function of source brightness or ce-lestial position they may be correlated across neighbouringsources and they may not be well understood for a given earlyGaia data release In general the information needed to accu-rately model systematic uncertainties or uncertainty correlationsbetween sources may not be readily available This informationcan be obtained from a comparison between Gaia and otherhigh-precision data (eg Zinn et al 2017 Arenou et al 2017Gaia Collaboration et al 2018a) or by examining for exampleplots of the parallax or proper motion averaged over sky regionsfor samples where the true parallax or proper motion values canbe assumed to be known such as zero parallax and proper mo-tion for quasars (see Gaia Collaboration et al 2018c for exam-ples)

Two special cases should be mentioned when the sampleis well distributed over the sky we can safely assume that thelocal systematics vanish and that only the global parallax zero-point need to be subtracted locally we may be interested not bythe absolute value of the parallaxes but by the relative ones inwhich case the difference between parallaxes and their averageremoves part of the systematics

There is no general recipe for dealing with non-Gaussian un-certainties or correlated systematic uncertainties The main ad-vice we can give here is to proceed with the analysis of the astro-metric data as they are but to keep in mind the systematics andcorrelations discussed in Sect 2 when interpreting the resultsAlternatively the forward model can be extended to include sys-tematic and correlation terms for which parameters are also tobe estimated Such models can be guided by the studies of sys-tematic uncertainties mentioned above

Article number page 16 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Testing with simulations Finally we strongly advise that theinference problem at hand should be investigated through simu-lated data and that the simulations performed should be as closeas possible to the real data (in particular correctly modelling theuncertainties and selection effects) The simulations allow theanalysis method to be developed and tested for accuracy How-ever the performance of the analysis method should be inter-preted strictly in terms of how well the assumed model explainsthe simulated observed data That is we should not fall into thetrap of trying to tune the analysis method to get an answer that isas close to the lsquotruthrsquo as possible In real problems we can onlyjudge the adequacy of a model and its parameter values by howwell they predict the observed data (to within the observationaluncertainties it should be stressed as we should avoid lsquoover-fittingrsquo the data)

5 Using astrometric data practical examples

We introduce here a few worked examples to illustrate thepoints that were made in the previous section These ex-amples are available in full detail as online tutorials in theform of source code accompanied by much more exten-sive explanation than can be provided here The source codeand corresponding Python and R Notebooks can be foundat the following URL httpsgithubcomagabrownastrometry-inference-tutorials In all cases the readeris strongly encouraged to download the source code and exper-iment with modifications of the simulated input data andor theparameter choices in the inference methods

51 Comparison of distance and distance modulusestimators

The use of the Bayesian inference with non-informative priorsdescribed in Sect 411 is illustrated and implemented inthe following tutorial httpsgithubcomagabrownastrometry-inference-tutorialstreemastersingle-sourceGraphicalUserInterface The tutorialcompares the performance of Bayesian distance estimationmethods with the SmithndashEichhorn transformation (Smith ampEichhorn 1996) (Sect 341) and the naive parallax inversion

The tutorial contains a Graphical User Interface that easilyvisualises and compares the behaviour of all these estimators fora given parallax and uncertainty For the Bayesian inference es-timations using the mode and the median are provided togetherwith a 90 confidence interval The tutorial also provides a li-brary pyrallaxes with the implementation of all these estima-tors The library can easily be customised to implement otherpriors for the Bayesian inference

Additionally an implementation of the Bayesian distanceestimator using the Exponentially Decreasing Space Densityprior introduced in Bailer-Jones (2015) will be available inTopCat (httpwwwstarlinkacuktopcat) and Stilts(httpwwwstarlinkacukstilts) from respectivelyversions 46 and 31-3 onwards

52 Inferring the distance to a single source using just theparallax

The issues surrounding the use of a parallax to infer a distancewere explored in Bailer-Jones (2015) and applied to simulatedGaia data in Astraatmadja amp Bailer-Jones (2016a) and to TGAS(Gaia DR1 ) in Astraatmadja amp Bailer-Jones (2016b) A

tutorial exploring this is provided at httpsgithubcomagabrownastrometry-inference-tutorialstreemastersingle-sourcetutorial This can be used toinvestigate how the posterior depends on the prior and the dataIt also includes a simple example of a hierarchical model toavoid specifying the exact length scale of a distance prior

53 Inferring the distance to and size of a cluster using justthe parallaxes

In many applications we are more interested in the averagedistance to a cluster (or a group of stars) rather than to theindividual stars In this case a basic mistake to be avoidedis estimating the individual distances (whatever the method)and then averaging these individual values A more correctapproach is to average the individual parallaxes and then toobtain a distance from this average value However a muchbetter solution is to set up a model for the cluster whichwould use as parameters the distance to its centre for exampleand some measure of its size and to infer its parametersThis is explored in the tutorial at httpsgithubcomagabrownastrometry-inference-tutorialstreemastermultiple-source This introduces the overall prob-lem and derives a general solution Code is implemented forthe specific case of a model which assumes a random isotropicdistribution of the true stars from the centre of the cluster Thismodel has two parameters the cluster distance and the clustersize The solution uses a small angle approximation to makethe problem simpler although it is easily extended to the caseof clusters with a significant angular extent It is applied to thePleiades selection from the Gaia DR1 main release paper(Gaia Collaboration et al 2016) The tutorial also considers theproblem of how to accommodate correlations in the measuredparallaxes of different stars Finally it also shows the resultsfrom a classical and a naive combination of stellar parallaxesto estimate the cluster distance The combination of parallaxesand proper motions of individual stars in a cluster into a singlesolution for the mean parallax and proper motion is treated asan iterative least squares problem in Gaia Collaboration et al(2017b) (see their Appendix A for details)

54 Inferring the distance and velocity of a source using theparallax and proper motions

The velocity (speed and direction) of a source in the planeof the sky can be inferred from measurements of its parallaxand two proper motions The uncertainties in all three affectthe inferred velocity and its uncertainty Moreover as theGaia parallaxes and proper motions generally have non-zerocorrelations these must also be taken into account Thiscan be done in a straightforward manner in the Bayesianapproach as is shown in the tutorial at httpsgithubcomagabrownastrometry-inference-tutorialstreemaster3d-distance This sets up a three-parametermodel (distance speed angle) for a source Using the threemeasurements (parallax two proper motions) in a multivariateGaussian likelihood and suitable priors on the parameters wecan compute the trivariate posterior This is sampled in theposterior using an MCMC algorithm for a set of stars

Article number page 17 of 20

AampA proofs manuscript no 32964_Arxiv

55 Luminosity calibration

In this tutorial (httpsgithubcomagabrownastrometry-inference-tutorialstreemasterluminosity-calibration) the problem of inferring (orcalibrating) the mean absolute magnitude of a specific classof stars is treated The measurements at hand are the parallaxand apparent magnitude for each of the stars in the sampleand the task is to infer their mean absolute magnitude microM andthe spread σM around this mean This is very similar to theproblem that Lutz amp Kelker (1973) and Turon Lacarrieu ampCreacutezeacute (1977) treated and a Bayesian approach to solving thisproblem was presented by Brown (2012) (albeit with the use ofimproper priors which we again note is bad practice) A morecomplex version of this problem (accounting for extinction anda contaminating population of stars) and its Bayesian solutionwas presented in Hawkins et al (2017) In this tutorial threeimportant points are illustrated

ndash Often the explicit computation of the distances to stars is notof interest In this example only the mean absolute magni-tude of the stars is to be estimated and the forward mod-elling approach as part of the Bayesian inference avoids theneed to calculate or estimate distances

ndash The data for all the stars carry information on the mean ab-solute magnitude including the negative parallaxes or par-allaxes with large relative errors This information can nat-urally be incorporated in a forward modelling approach (inthis example as part of a Bayesian inference method) thusavoiding the introduction of truncation biases caused by theselection of stars with lsquogoodrsquo parallaxes

ndash If the selection function is known (in this example the surveyis magnitude limited) it can and should be included in theforward modelling This accounts for sample selection biasesthat would otherwise occur

56 Period-luminosity relation

In this tutorial (httpsgithubcomagabrownastrometry-inference-tutorialstreemasterperiod-luminosity-relation) we include a hierarchicalmodel to infer period-luminosity-metallicity relations forclassical pulsating stars The full model can be applied tofundamental mode RR Lyrae stars and the abridged version(without the metallicity dependence) is suitable for samples ofclassical Cepheids We include the data set for the RR Lyraestars described and used for inference in Gaia Collaborationet al (2017a) and Delgado et al (2018) It contains a sampleof 200 stars (including fundamental radial pulsators but alsofundamentalised first overtone pulsators) with measured pe-riods apparent magnitudes in the K-band metallicities andparallaxes from the TGAS catalogue In the tutorial we describethe hierarchical model and discuss potential biases in the dataset Finally we analyse the sensitivity of the results to differentchoices of the priors and related parameters

6 Conclusions

Gaia data releases will provide a huge increase of astrometricdata available for the scientific community More than a billionparallaxes and proper motions allow new openings into many as-tronomical topics In most cases astronomers are exploiting theGaia catalogues to obtain physical quantities such as distanceand velocity Although it is easy to extract from the Gaia data it

is well known that direct inversion of parallax will lead to biaseswhich become more and more significant the larger the relativeparallax uncertainty While Gaia will provide high-quality astro-metric measurements hundreds of millions of stars have preci-sions which require proper statistical treatment in order to avoidbiased conclusions The aim of this paper is to guide the users ofGaia data to handle astrometric data correctly

In this study we summarise methods used to avoid bi-ases when converting astrometric data into physical quantitiesStarting from simple non-recommended sample truncation tomore complex methods the biases associated with the meth-ods are presented The basic recommendation is to treat deriva-tion of physical quantities from astrometric measurements asan inference problem which should be preferably handled withBayesian approach The recommended methods are described inSect 4 with a summary in Sect 43 To aid the users furtherSect 5 contains practical examples with links to Python and Rcode

Gaia will provide fundamental data for many fields of as-tronomy Further data releases will provide more data and moreprecise data Nevertheless for full use of the potential it will al-ways be necessary to pay careful attention to the statistical treat-ment of parallaxes and proper motions The purpose of this paperis to help astronomers find the correct approachAcknowledgements This work has made use of results from the European SpaceAgency (ESA) space mission Gaia the data from which were processed by theGaia Data Processing and Analysis Consortium (DPAC) Funding for the DPAChas been provided by national institutions in particular the institutions partici-pating in the Gaia Multilateral Agreement The Gaia mission website is httpwwwcosmosesaintgaia The authors are current or past members of theESA Gaia mission team and of the Gaia DPAC This work was supported by theMINECO (Spanish Ministry of Economy) through grant ESP2016-80079-C2-1-R (MINECOFEDER UE) and ESP2014-55996-C2-1-R (MINECOFEDERUE) and MDM-2014-0369 of ICCUB (Unidad de Excelencia lsquoMariacutea de Maeztursquo)and by the DLR (German space agency) via grant 50 QG 1403

ReferencesAnderson L Hogg DW Leistedt B Price-Whelan AM Bovy J Jun 2017

ArXiv e-printsArenou F Luri X 1999 In Egret D Heck A (eds) Harmonizing Cosmic

Distance Scales in a Post-HIPPARCOS Era vol 167 of Astronomical Societyof the Pacific Conference Series 13ndash32

Arenou F Luri X Babusiaux C et al Feb 2017 AampA 599 A50Astraatmadja TL Bailer-Jones CAL Dec 2016a ApJ 832 137Astraatmadja TL Bailer-Jones CAL Dec 2016b ApJ 833 119Bailer-Jones CAL Oct 2015 PASP 127 994Bovy J Sep 2017 MNRAS 470 1360Brown AGA Nov 2012 Statistical AstrometryBrown AGA Arenou F van Leeuwen F Lindegren L Luri X Aug 1997

In Bonnet RM Hoslashg E Bernacca PL et al (eds) Hipparcos - Venice rsquo97vol 402 of ESA Special Publication 63ndash68

de Bruijne JHJ Hoogerwerf R de Zeeuw PT Feb 2001 AampA 367 111Delgado HE Sarro LM Clementini G Muraveva T Garofalo A Mar 2018

ArXiv e-printsEddington AS Mar 1913 MNRAS 73 359ESA (ed) 1997 The HIPPARCOS and TYCHO catalogues Astrometric and

photometric star catalogues derived from the ESA HIPPARCOS Space As-trometry Mission vol 1200 of ESA Special Publication (available online athttpswwwcosmosesainthipparcos)

Gaia Collaboration Brown AGA Vallenari A et al Nov 2016 AampA 595A2

Gaia Collaboration Clementini G Eyer L et al Sep 2017a AampA 605 A79Gaia Collaboration van Leeuwen F Vallenari A et al May 2017b AampA 601

A19Gaia Collaboration Arenou F Luri X et al Apr 2018a AampA (special issue

for Gaia DR2)Gaia Collaboration Brown AGA Vallenari A Prusti T et al Apr 2018b

AampA (special issue for Gaia DR2)Gaia Collaboration Lindegren L Hernaacutendez J et al Apr 2018c AampA (special

issue for Gaia DR2)

Article number page 18 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Hawkins K Leistedt B Bovy J Hogg DW Oct 2017 MNRAS 471 722Iorio G Belokurov V Erkal D et al Feb 2018 MNRAS 474 2142Leistedt B Hogg DW Dec 2017a AJ 154 222Leistedt B Hogg DW Mar 2017b ArXiv e-printsLindegren L Madsen S Dravins D Apr 2000 AampA 356 1119Lindegren L Lammers U Hobbs D et al Feb 2012 AampA 538 A78Lindegren L Lammers U Bastian U et al Nov 2016 AampA 595 A4Lutz TE Kelker DH Oct 1973 PASP 85 573Malmquist GK Mar 1920 Meddelanden fran Lunds Astronomiska Observato-

rium Serie II 22 3Palmer M Arenou F Luri X Masana E Apr 2014 AampA 564 A49Robin C Luri X Reyleacute C et al 2012 AampA 543 A100Schoumlnrich R Aumer M Dec 2017 MNRAS 472 3979Schoumlnrich R Binney J Asplund M Feb 2012 MNRAS 420 1281Schroumlder SE Kaper L Lamers HJGLM Brown AGA Dec 2004 AampA

428 149Sesar B Fouesneau M Price-Whelan AM et al Apr 2017 ApJ 838 107Smith H Jr Eichhorn H Jul 1996 MNRAS 281 211Trumpler RJ Weaver HF 1953 Statistical astronomyTuron Lacarrieu C Creacutezeacute M Apr 1977 AampA 56 273Zinn JC Huber D Pinsonneault MH Stello D Aug 2017 ApJ 844 166

Article number page 19 of 20

AampA proofs manuscript no 32964_Arxiv

Appendix A Description of the simulated samplesused in this paper

The data used in this paper is from a Gaia Universe Model Snap-shot (GUMS) simulation (Robin C et al 2012) together withthe Gaia-like uncertainties and an estimation of the observabledata The uncertainties were computed from an implementationof the recipes described in the Gaia Science performance webpage4 provided by the python PyGaia toolkit5 which then haveto be re-scaled to fit the Gaia DR2 expectations

The simulation contains around 109 sources including onlysingle stars ie stars not belonging to multiple systems up toG lt 20 magnitude distributed as shown in Fig A1

0 5 10 15 20 25G magnitude [mag]

100

101

102

103

104

105

106

107

108

109

Num

ber

of

sourc

es

[log]

G magnitude histogram

Fig A1 Histogram of the starsrsquo G magnitude The total number ofsources is 1 069 138 714 which are distributed in bins of size ∆G = 02the sample is limited to G lt 20

To compute the Gaia potential observables standard uncer-tainties must be added to the simulation Because astrometricquantities (positions parallaxes and proper motions) are relateda single formalism to derive its standard errors is needed PyGaiaimplements a simple performance model depending on the VndashICcolour term and the G magnitude to estimate the end-of-missionerrors for the parallax uncertainty

σ$[microas] =radicminus1631 + 680766 z + 32732 z2 [0986+(1 minus 0986) (V-IC)]

(A1)

with

z = max(1004(1209minus15)1004(Gminus15)) (A2)

(see the Gaia science performance web for more details) It alsotakes into account the variation of the uncertainties over the skybecause of the scanning law tabulated as a function of the eclip-tic latitude6 β

However these are end-of-mission uncertainties so theyhave to be scaled by the fraction of mission time completed in or-der to get an estimation of them for Gaia DR2 In the case of the

parallax only the factorradic

5L being L the mission time (years)

included in Gaia DR2 need to be applied this error model is

4 httpswwwcosmosesaintwebgaiascience-performance5 httpspypipythonorgpypiPyGaia6 httpswwwcosmosesaintwebgaiatable-2-with-ascii

further described in Arenou et al (2017) In our case we havealso updated the calibration floor to take into account the prop-erties of the Gaia DR2 formal errors as shown in Fig A2 Thiscalibration floor introduces a minimum formal error stemmingfrom the fact that the calibrations used in the data processing(models and parameters) are at this stage still being refined Thisfloor affects mainly bright stars while for faint stars the photonnoise dominates Fig A2 shows the model used in the simula-tion for the Gaia DR2 parallax uncertainties as a function of theG magnitude

Fig A2 Average estimated errors as function of G The dash-dottedline represents the uncertainty at Gaia DR2 while the dashed linerepresents the end-of-mission uncertainties The solid line representsGaia DR2 errors including systematics

Article number page 20 of 20

  • 1 Introduction
  • 2 Gaia astrometric data
    • 21 Uncertainties
    • 22 Correlations
    • 23 Systematic errors
    • 24 Completeness
      • 3 Critical review of the traditional use of parallaxes
        • 31 Measurement of parallaxes
        • 32 Estimating distance by inverting the parallax
        • 33 Sample truncation
        • 34 Corrections and transformations
          • 341 The SmithndashEichhorn correction
          • 342 The LutzndashKelker correction
            • 35 Astrometry-based luminosity
              • 4 Recommendations for using astrometric data
                • 41 Using Gaia astrometric data how to proceed
                  • 411 Bayesian inference of distances from parallaxes
                  • 412 Bayesian inference of distances from parallaxes and complementary information
                    • 42 Absolute magnitudes tangential velocities and other derived quantities
                    • 43 Further recommendations
                      • 5 Using astrometric data practical examples
                        • 51 Comparison of distance and distance modulus estimators
                        • 52 Inferring the distance to a single source using just the parallax
                        • 53 Inferring the distance to and size of a cluster using just the parallaxes
                        • 54 Inferring the distance and velocity of a source using the parallax and proper motions
                        • 55 Luminosity calibration
                        • 56 Period-luminosity relation
                          • 6 Conclusions
                          • A Description of the simulated samples used in this paper
Page 11: Gaia Data Release 2: Using Gaia parallaxes · The Gaia Data Release 2 (Gaia DR2 ) Gaia Collaboration et al. (2018b) provides precise positions, proper motions, and paral-laxes for

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

between these quantiles Astraatmadja amp Bailer-Jones (2016b)used this prior as well as a Milky Way prior to infer distancesfor the two million TGAS stars in the first Gaia data releaseThe behaviour of the estimates derived from the exponentiallydecreasing space density prior can be explored using the interac-tive tool available in the tutorial described in Sect 51

In general the introduction of reasonable prior probabilitiesaccounts for the LutzndashKelker bias although the inevitable mis-match between the true distribution of parallaxes and the priorused will result in less accurate inferences In any case the ad-vantage with respect to the methods discussed in Sect 3 is cleari) we do not need to tabulate corrections for each prior assumingconstant f ii) we do not need to dispose of non-positive par-allaxes iii) we obtain a proper full posterior distribution withwell-defined moments and credible intervals iv) even simplepriors such as the exponential decreasing volume density willimprove our estimates with respect to the unrealistic prior un-derlying the maximum likelihood solution rest = 1$ and fi-nally v) we obtain estimators that degrade gracefully as the dataquality degrades and with credible intervals that grow with theobservational uncertainties until they reach the typical scales ofthe prior when the observations are non-informative These ad-vantages come at the expense of an inference that is more com-putationally demanding in general (as it requires obtaining theposterior and its summary statistics if needed) the need for athoughtful choice of a prior distribution and the analysis of theinfluence of the prior on the inference results

Figure 12 shows the distribution of means (left) modes (cen-tre) and medians (right) of the posteriors inferred for a simu-lation of 105 sources drawn from an exponentially decreasingspace density distribution This simulation represents the un-likely case where the prior is a perfect representation of the truedistribution

From a Bayesian perspective the full posterior PDF is thefinal result of our inference if we only use parallax measure-ments to infer the distance (see below) and further analysesshould make use of it as a whole This is our recommendationin general avoid expectations and summaries of the posteriorHowever it is often useful to compute summary statistics suchas the mean (expectation) median mode quantiles etc to havean approximate idea of the distribution but we should not usethese summaries for further inference for example to estimateabsolute magnitudes Cartesian velocities etc We recommendinferring the full posterior distributions for these derived quanti-ties using the posterior of the true parallax or of the distance orusing the same Bayesian scheme as for the true parallax as ex-plained in Sect 42 In Fig 13 we show the values of the mean(left) mode (centre) and median (right) that we would obtainfrom a set of 104 simulated observations of a star at 100 parsecswith f = 02 We assume a Gaussian distribution of the obser-vations around the true parallax The posterior distribution is in-ferred using Eq (17) and two priors a uniform volume densityof sources truncated at 1 kpc (results in grey) and a uniform den-sity of sources multiplied by an exponential decay of length scale200 pc as defined in Eq (18) (in blue) The expectation valuesof the histograms are shown as dashed lines of the same colourwith the true value (100 pc) shown as a red dashed line We seein general that i) the truncation has the effect of increasing thenumber of overestimated distances ii) the three estimators arebiased towards larger distances (smaller parallaxes) but the ex-pectation of the mode is significantly closer to the true valueand iii) the abrupt truncation of the prior results in a spuriouspeak of modes at the truncation distance as already discussed inBailer-Jones (2015)

Figure 14 and Table 1 show a comparison of the absolutevalue of the empirical bias and standard deviation associatedwith some distance estimators discussed in this paper as a func-tion of the measured fractional uncertainty in the parallax Wechose the measured value even though it is a very poor and non-robust estimator because as stated in Sect 32 we never have ac-cess to the true fractional parallax uncertainty This figure showsthe results obtained for 107 sources in the Gaia DR2 simulationdescribed in Appendix A for the maximum likelihood estimateρ = 1

$with and without the SmithndashEichhorn correction and for

the mode estimates based on the posterior distribution for twopriors (a uniform distance prior UD with maximum distancerlim = 100 kpc and an exponentially decreasing space densityprior EDSD with L = 135 kpc) neither of which matches thetrue distribution of sources in the simulation Only the mode ofthe posteriors is plotted (but not the mean or the median) for thesake of clarity The conclusions described next are only valid un-der the conditions of the exercise and are provided as a demon-stration of the caveats and problems described in previous sec-tions not as a recommendation of the mode of the posterior in-ferred under the EDSD prior as an estimator At the risk of re-peating ourselves we emphasise the need to adopt priors adaptedto the inference problem at hand Also the conclusions only holdfor the used simulation (where we generate the true distances andhence can calculate the bias and standard deviation) and need notbe representative of the true performance for the real Gaia dataset They can be summarised as follows

ndash the mode of the EDSD prior shows the smallest bias andstandard deviation in practically the entire range of estimatedfractional parallax uncertainties (in particular everywherebeyond the range of fapp represented in the plot)

ndash the SmithndashEichhorn estimate shows pathological biases andstandard deviations in the vicinity of the supposedly best-quality measurements at fapp = 0 Away from this region itprovides the next less biased estimates (averaged over binsof fapp) after the mode of the EDSD posterior

412 Bayesian inference of distances from parallaxes andcomplementary information

The methodology recommended in the previous paragraphs isuseful when we only have observed parallaxes to infer distancesThere are however common situations in astronomy where theparallaxes are only one of many observables and distances arenot the final goal of the inference problem but a means to achieveit In this context we recommend an extension of the classicalBayesian inference methods described in the previous sectionThese problems are characterised by a set of observables andassociated uncertainties (that include but are not restricted toparallaxes) and a series of parameters (the values of which areunknown a priori) with complex interdependence relationshipsamongst them Some of these parameters will be the ultimategoal of the inference process Other parameters do play an im-portant role but we are not interested in their particular valuesand we call them nuisance parameters following the literatureFor example in determining the shape of a stellar associationthe individual stellar distances are not relevant by themselvesbut only insomuch as we need them to achieve our objectiveWe show below how we deal with the nuisance parameters Theinterested reader can find applications of the methodology de-scribed in this section to inferring the coefficients of period-luminosity relations in Gaia Collaboration et al (2017a) andSesar et al (2017) Also the same methodology (a hierarchical

Article number page 11 of 20

AampA proofs manuscript no 32964_Arxiv

Mean

Cou

nts

minus8000 minus4000 0 2000

0e+

002e

minus04

4eminus

046e

minus04

8eminus

04

Mode

Cou

nts

minus8000 minus4000 0 2000

Median

Cou

nts

minus8000 minus4000 0 2000

0e+

002e

minus04

4eminus

046e

minus04

8eminus

04

Fig 12 Probability distribution of the residuals of the Bayesian estimate of the true distance in parsecs for 100000 simulated stars drawn from auniform density plus exponential decay distribution and σ$ = 3 middot 10minus1 mas (orange) 3 mas (blue) and 30 mas (grey)

Mean

Cou

nts

0 200 400 600 800 1000

000

00

005

001

00

015

002

0

Mode

Cou

nts

0 200 400 600 800 1000

Median

Cou

nts

0 200 400 600 800 1000

000

00

005

001

00

015

002

0

Fig 13 Distributions of means (left) modes (centre) and medians (right) for a series of 104 posteriors calculated for a star at a true distance of100 pc and f = 02 and observed parallaxes drawn at random from the corresponding Gaussian distribution Posteriors are inferred with a uniformdensity prior truncated at 1 kpc (grey) or with a uniform density with exponential decay prior and length scale 135 kpc (blue) The red vertical linemarks the true parallax the grey and blue lines correspond to the expected value (mean) of each distribution (same colours as for the histograms)

Bayesian model) is applied in Hawkins et al (2017) where theconstraint on the distances comes not from a period-luminosityrelation but from the relatively small dispersion of the abso-lute magnitudes and colour indices of red clump stars A lastexample of this methodology can be found in Leistedt amp Hogg(2017b) where the constraint comes from modelling the colour-magnitude diagram

Just as in the previous section where we aimed at estimat-ing distances from parallaxes alone the two key elements in thiscase are the definitions of a likelihood and a prior The likeli-hood represents the probability distribution of the observablesgiven the model parameters Typically the likelihood is basedon a generative or forward model for the data Such modelspredict the data from our assumptions about the physical pro-cess that generates the true values (ie the distribution of stars in

space) and our knowledge of the measurement process (eg jus-tifying the assumption of a normal distribution of the observedparallax around its true value) Forward models can be used togenerate arbitrarily large synthetic data sets for a given set ofthe parameters In this case however where we are concernedwith several types of measurements that depend on parametersother than the distance the likelihood term will be in generalmore complex than in Sect 411 and may include probabilisticdependencies between the parameters The term hierarchical ormulti-level model is often used to refer to this kind of model

Let us illustrate the concept of hierarchical models with asimple extension of the Bayesian model described in Sect 411where instead of assuming a fixed value of the prior length scaleL in Eq (18) we make it another parameter of the model and tryto infer it Let us further assume that we have a set of N parallax

Article number page 12 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Table 1 Average bias and standard deviation in three regimes of fapp for four distance estimators discussed in this paper (from left to right) themode of the posterior based on the exponentially decreasing space density (EDSD) prior the mode of the posterior of the uniform distance (UD)distribution the maximum likelihood estimate corrected according to Smith amp Eichhorn (1996) abbreviated as SE and the maximum likelihood(ML) estimate The wider ranges of fapp exclude narrower ranges shown in previous rows of the table

Summary fapp Range EDSD UD SE ML

Bias(-11) -02 97 342 -095(-55) -03 107 -034 -12(-5050) -03 162 -04 -38

Std Deviation(-11) 04 80 6858 05(-55) 04 84 05 195(-5050) 04 106 04 171

fapp

log 1

0(|B

ias|

)

minus4

minus2

02

4

Mode EDSDMode UDSmithminusEichhorn estimateML estimate

fapp

log 1

0(σ)

minus1

12

34

5

minus5 minus3 minus1 1 3 5

Mode EDSDMode UDSmithminusEichhorn estimateML estimate

Fig 14 Bias (top) and standard deviation (bottom) averaged over binsof the estimated fractional parallax uncertainty fapp for four estimatorsof the distance the maximum likelihood (ML) estimator rest = 1

$(or-

ange) the ML estimator corrected as described in Smith amp Eichhorn(1996) (light blue) the mode of the posterior obtained with an expo-nentially decreasing space density (EDSD) prior and L = 135 kpc (darkgreen) and the mode of the posterior obtained with a uniform distance(UD) prior truncated at 100 kpc (red)

measurements $k one for each of a sample of N stars In thiscase the likelihood can be written as

p($k|rk L) = p($k | rk) middot p(rk | L)

= p($k | rk) middotNprod

k=1

p(rk | L) (19)

where rk is the true unknown distance to the kth star We note thatvery often the measured parallaxes are assumed independentand thus p($k | rk) is written as the product

prodNk=1 p($k | rk)

This is incorrect in general for Gaia parallaxes because the par-allax measurements are not independent As described in GaiaCollaboration et al (2018c) and Sect 2 of this paper there areregional correlations amongst them (see Sect 43) but for thesake of simplicity let us assume the sample of N measurementsis spread all over the celestial sphere such that the correlationsaverage out and can be neglected Hence we write

p($k|rk L) =

Nprodk=1

p($k | rk) middot p(rk | L) (20)

Under the assumption of Gaussian uncertainties the first termin the product is given by Eq (6) while the second is given byEq (18)

This likelihood can be represented by a simple directed graph(see Fig 15) that provides information about the conditional de-pendencies amongst the parameters The shaded nodes representthe observations the open circles represent model parametersand the small black circles at the origin of the arrows representmodel constants The arrows denote conditional dependence re-lations and the plate notation indicates repetition over the mea-surements k

The next key element is as in Sect 411 the prior Accord-ing to Fig 15 the only parameter that needs a prior specificationis the one without a parent node L The rest of the arcs in thegraph are defined in the likelihood term (Eq (20)) If the sampleof N stars were representative of the inner Galactic halo for ex-ample we could use a Gaussian prior centred at asymp 30 kpc (seeeg Iorio et al 2018 and references therein) Such a hierarchicalmodel can potentially shrink the individual parallax uncertaintiesby incorporating the constraint on the distribution of distances

If we are only interested in the individual distances rk wecan consider L as a nuisance parameter

p($Truek | $k) =

intp($Truek L | $k) middot dL (21)

=

intp($Truek | $k L) middot p(L | $k) middot dL

This integral (known as the marginalisation integral) allows us towrite the posterior we are interested in without having to fix thevalue of L to any particular value Depending on the objective ofthe inference we could have alternatively determined the poste-rior distribution of L by marginalising the individual distanceswith an N-dimensional integral over the rk parameters

Article number page 13 of 20

AampA proofs manuscript no 32964_Arxiv

L

rk

$ σ$

microL

σL

k = 1 2 N

Fig 15 Directed acyclic graph that represents a hierarchical Bayesianmodel of a set of N parallax measurements characterised by uncertain-ties σ$ and true distances drawn from an exponentially decreasing den-sity distribution of distances (see Eq (18)) The scale length of the ex-ponential decrease L is itself a model parameter that we can infer fromthe sample Its prior is defined in this case for the sake of simplicity asa Gaussian distribution of mean microL and standard deviation σL

In parameter spaces of dimensionality greater than 3ndash4 thecomputation of the possibly marginalised posteriors andor ev-idence requires efficient sampling schemes like those inspiredin Markov chain Monte Carlo (MCMC) methods to avoid largenumbers of calculations in regions of parameter space with neg-ligible contributions to the posterior This adds to the highercomputational burden of the Bayesian inference method men-tioned in the previous section

The previous simple example can be extended to includemore levels in the hierarchy and more importantly more pa-rameters and measurement types Section 55 and 56 developin greater detail two examples of hierarchical models of directapplicability in the Gaia context

42 Absolute magnitudes tangential velocities and otherderived quantities

The approaches described in the previous sections can be appliedto any quantity we want to estimate using the parallax For exam-ple if we want to infer the absolute magnitude MG then giventhe measured apparent magnitude G and line-of-sight extinctionAG the true parallax $True is related to MG via the conservationof flux

5 log$True = MG + AG minusG minus 5 (22)

Assuming for simplicity that G and AG are known Bayesrsquotheorem gives the posterior on MG as

P(MG | $G AG) =1Z

P($ | MG AGG)P(MG) (23)

where the likelihood is still the usual Gaussian distribution forthe parallax (Eq (6)) in which the true parallax is given byEq (22) As this expression is non-linear we again obtain anasymmetric posterior PDF over MG the exact shape of whichalso depends on the prior

The inference of other quantities can be approached in thesame way In general we need to consider a multi-dimensionallikelihood to accommodate the measurement uncertainties (andcorrelations) in all observed quantities For instance the set ofparameters θ = r v φ (distance tangential speed and direc-tion measured anticlockwise from north) can be inferred fromthe Gaia astrometric measurements o = $ microαlowast microδ (where microαlowastand microδ are the measured proper motions) using the likelihood

p(o | θ) = N(θΣ) =1

(2π)32|Σ|12exp

(minus

12

(o minus x)T Σminus1(o minus x))

(24)

where N denotes the Gaussian distribution Σ is the full (non-diagonal) covariance matrix provided as part of the Gaia DataRelease and

x =

(1r

v sin(φ)r

v cos(φ)

r

)(25)

is the vector of model parameters geometrically transformed intothe space of observables in noise-free conditions Equation (24)assumes correlated Gaussian uncertainties in the measurements

The posterior distribution can then be obtained by multiply-ing the likelihood by a suitable prior The simplest assumptionwould be a separable prior such that p(θ) = p(r) middot p(v) middot p(φ)where p(v) and p(φ) should reflect our knowledge about thedynamical properties of the population from where the sourceor sources were drawn (eg thin disk thick disk bulge halo)Again hierarchical models can be used in the analysis of sam-ples in order to infer the population properties (prior hyper-parameters) themselves

Similar procedures can be followed to infer kinematic ener-gies or full 3D velocities when the forward model is extendedwith radial velocity measurements

43 Further recommendations

In this subsection we provide some further recommendationsand guidance in the context of the Bayesian approach outlinedabove Although powerful inference with Bayesian methodsusually comes at a large computational cost Some of the rec-ommendations below can also be seen in the light of taking dataanalysis approaches that approximate the Bayesian methodologyand can be much faster

Where possible formulate the problem in the data spaceThe problems caused by the ill-defined uncertainties on quan-tities derived from parallaxes can be avoided by carrying outthe analysis in the data space where the behaviour of the un-certainties is well understood This means that the quantities tobe inferred are treated as parameters in a forward or generativemodel that is used to predict the data Some adjustment processthen leads to estimates of the parameters A very simple forwardmodelling example can be found in Schroumlder et al (2004) whostudied the luminosity calibrations of O-stars by predicting the

Article number page 14 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

expected Hipparcos parallaxes from the assumed luminosity cal-ibration and comparing those to the measured parallaxes A morecomplex example can be found in Lindegren et al (2000) whopresent a kinematic model for clusters which describes the ve-locity field of the cluster members and predicts the proper mo-tions accounting for the astrometric uncertainties and covari-ances As shown in previous sections the Bayesian approachnaturally lends itself to (and in fact requires) forward modellingof the data

Forward modelling has the added advantage that it forces usto consider the proper formulation of the questions asked fromthe astrometric data This naturally leads to the insight that oftenthe explicit knowledge of the distances to sources is not of inter-est For example in the Schroumlder et al (2004) case an assumedluminosity of the O-stars and their known apparent magnitudeis sufficient to predict the observed parallaxes In more complexanalyses the distances to sources can often be treated as nuisanceparameters which in a Bayesian setting can be marginalised outof the posterior

Use all relevant information Although the parallax has a di-rect relation to the distance of a star it is not the only measure-ment that contains distance information The apparent magni-tude and the colour of a star carry significant information on theplausible distances at which it can be located as the colour pro-vides information on plausible absolute magnitude values for thestar This is used in two papers (Leistedt amp Hogg 2017a An-derson et al 2017) in which the information contained in thephotometry of stars is combined with Gaia DR1 parallax in-formation to arrive at more precise representations of the colour-magnitude diagram than can be obtained through the parallaxesalone Similarly proper motions contain distance informationwhich can be used to good effect to derive more precise distancesor luminosities for open cluster members (eg de Bruijne et al2001 Gaia Collaboration et al 2017b) The Bayesian approachnaturally allows for the combination of different types of data inthe modelling of the problem at hand It should be emphasisedhowever that adding additional data necessitates increasing themodel complexity (eg to include the relation among apparentmagnitude absolute magnitude extinction and parallax) whichleads to the need to make more assumptions

Incorporate a proper prior Bailer-Jones (2015) discussed thesimplest case of inferring the distance to a source from a sin-gle observed parallax He showed that if the minimal prior thatr should be positive is used the resulting posterior is not nor-malisable and hence has no mean variance other momentsor percentiles This behaviour of the posterior is not limitedto inference of distance Examples of other quantities that arenon-linearly related to parallax are absolute magnitude (M prop

log10 $) tangential velocity (vT prop 1$) kinetic energy and an-gular momentum (both proportional to 1$2 when determinedrelative to the observer) In all these cases it can be shown thatthe posterior for an improper uniform prior is not normalisableand has no moments The conclusion is that a proper prior onthe parameters to be estimated should be included to ensure thatthe posterior represents a normalised probability distribution Ingeneral using unconstrained non-informative priors in Bayesiandata analysis (such as the one on r above) is bad practice In-evitably there is always a mismatch between the prior and thetrue distribution (if there were not there would be no need todo the inference) This will unavoidably lead to some biases al-though the better the data the smaller these will be We cannot

expect to do much better than our prior knowledge in case weonly have access to poor data The Bayesian approach guaran-tees a graceful transition of the posterior into the prior as thedata quality degrades

The above discussion raises the question of what priors to in-clude for a specific inference problem Simple recipes cannot begiven as the prior to be used depends on the problem at hand andthe information already available When deciding on a prior weshould keep in mind that some information is always availableEven if a parallax is only available for a single star we knowthat it cannot be located at arbitrary distances It is unlikely tobe much closer that 1 pc (although we cannot fully exclude thepresence of faint stars closer than Proxima Centauri) and it mustbe located at a finite distance (we can observe the star) Thiswould suggest a non-informative uniform prior on r with lib-eral lower and upper bounds (such that the prior is normalised)However as pointed out in Bailer-Jones (2015) a uniform distri-bution in r implies a space density of stars that falls of as 1r2

from the position of the Sun which is of course physically im-plausible Hence one should assume a reasonable distribution ofstars in space and construct the prior on r accordingly Bailer-Jones (2015) presents a prior derived from a uniform space den-sity with an exponential cut-off which was used in Astraatmadjaamp Bailer-Jones (2016b) to derive distances to stars for which par-allaxes are listed in Gaia DR1 This prior should not be usedindiscriminately at the very least we should carefully considerthe choice of the scale length L (or leave that as a parameter tobe estimated as described in Sect 412) and in most cases amore tailored prior based on our broad knowledge of the distri-bution of a given stellar population in the Milky Way would bebetter The tutorial cases introduced in the next section containsome more examples of priors on distance and other astrophysi-cal parameters

The next two items discuss simplifications to the Bayesianapproach that nevertheless need to be justified carefully

Maximum likelihood modelling We have seen that priors arethe bridges that allow us to go from the probability of the obser-vations given the unknown parameters to the desired probabil-ity of the parameters given the observations It is only based onthis probability distribution that we can make statements aboutcredible intervals (uncertainties) for the inferred quantities orselect amongst competing models (a topic that is beyond thescope of this paper) However if making prior-free inferences ispreferred then maximising the likelihood is the only alternativeThe kinematic modelling presented in Lindegren et al (2000)is a non-trivial example of this A more complex example canbe found in Palmer et al (2014) The ML approach just as theBayesian framework described above allows the combinationof different types of data and accounts for selection functionsor missing data We have seen in Sect 411 that the maximumlikelihood estimate of the distance given a single parallax mea-surement is ρ = 1

$and this is a poor estimator of the distance

except for subsets of very accurate measurements In generalthe Bayesian and the maximum likelihood estimates coincide inthe limit of very small uncertainties or infinite numbers of mea-surements In such limits the maximum likelihood estimate issimpler to obtain although its computational cost may still belarge as the ML method is often equivalent to a complex optimi-sation problem involving a multi-dimensional function of manyparameters

Article number page 15 of 20

AampA proofs manuscript no 32964_Arxiv

Selecting the lsquobestrsquo data Analyses that use parallax data areoften restricted to positive parallaxes with relative uncertaintiesbelow some limit (typically 20 ) This allows working in aregime where the uncertainties of derived quantities such as dis-tance tangential velocities luminosity etc are thought to bemanageable which allows working in the space of astrophysi-cal variables rather than the data Truncation on relative parallaxerror might be justified in an exploratory phase of the data anal-ysis however there are a number of reasons why this approachis almost never advisable Even at relative uncertainties below02 the quantities calculated from the parallax can be biased andsuffer from a large variance (see Bailer-Jones 2015) More im-portantly however the selection of lsquogoodrsquo parallaxes will biasthe sample studied to nearby andor bright representatives of anystellar population and the selection may lead to discarding avery large fraction of the potential sample Hence any inferencesbased on such data will be severely biased This is clearly illus-trated in Fig 7 where for an even less strict truncation of starswith a relative uncertainties below 50 the distribution of dis-tances of the resulting sample is clearly biased with respect theoriginal sample

Accounting for data selection and incompleteness Althoughthe Gaia survey is designed such that the only selection ofsources is that they are brighter than the survey limit at G = 207the combination of the onboard detection algorithm the Gaiascanning law and the filtering of results of insufficient qualityprior to a data release lead to a complex selection function espe-cially in the early data releases This selection function should betaken into account in any data analysis and this is most naturallydone as part of a Bayesian analysis where the selection func-tion is part of the forward model that predicts the data given themodel parameters Precise prescriptions of the selection func-tions are not foreseen to be part of the data release documenta-tion Hence selection function parameters need to be includedas part of the parameters inferred by the Bayesian analysis or ifthis is not possible the selection functions have to be borne inmind when interpreting the results

Covariances in the uncertainties All the uncertainties on theastrometric data quoted in the Gaia catalogue are presented asfull covariance matrices where the diagonal elements representthe standard uncertainties on the astrometric parameters and theoff-diagonal elements the covariances or correlations betweenthe uncertainties This amounts to treating the astrometric datavector as having been drawn from a multivariate normal distri-bution with a covariance matrix as given in the Gaia catalogueThe covariances are most easily handled in the data space as partof the likelihood (see Lindegren et al 2000 for an example inthe context of kinematic modelling) If the covariances in theastrometric uncertainties are not accounted for we can easily bemisled for example by spurious features in the inferred velocityfield of an open cluster (Brown et al 1997)

The uncertainties are also correlated from one source tothe next especially over small distances on the sky A studyof the star-to-star correlations in the parallax uncertainties inGaia DR1 was done for the Kepler field (Zinn et al 2017)where independent and precise asteroseismic distances to thestars are available enabling the authors to derive an expressionfor the correlation strength and spatial scale This expressioncan be used for studies of the Kepler field but care should betaken when extrapolating to other fields on the sky The func-tional form for the star-to-star correlations used by Zinn et al

(2017) could be introduced as part of the forward model withthe Zinn et al (2017) parameters as a good first guess

For Gaia DR1 the length scale for the star-to-star corre-lations was estimated to vary from subdegree scales to tens ofdegrees on the sky where Zinn et al (2017) derived the corre-lation function over length scales of sim 02 to sim 10 degrees ForGaia DR2 Gaia Collaboration et al (2018c) estimate that thespatial correlations extend over scales of below 1 degree to 10ndash20 degrees

Accounting for non-Gaussian andor systematic uncertain-ties Although the bulk of the sources in the Gaia catalogue havenormally distributed uncertainties there is a significant fractionfor which the uncertainties exhibit non-Gaussian behaviour (egwhen uncertainties are over- or underestimated) This can be ac-counted for in the data analysis by including the uncertainties aspart of the forward model (eg Sesar et al 2017) or by explicitlymodelling an outlier population Similarly systematic uncertain-ties can be included as part of the forward model For exampleSesar et al (2017) include a global parallax zero-point as part oftheir probabilistic model used to analyse the period-luminosityrelation of RR Lyrae stars with Gaia DR1 data An alternativeapproach to the investigation of systematics in the parallaxes (ordistance estimates obtained from photometry or spectroscopyfor example) is presented in Schoumlnrich et al (2012) and is ap-plied to Gaia DR1 in Schoumlnrich amp Aumer (2017) In this casewe can consider that for samples covering a significant fractionof the sky any systematic error in the estimated distances to thestars will show up as correlations in their 3D velocity compo-nents The presence of such correlations can be used to make in-ferences about systematic errors for example in the parallaxes

Systematic uncertainties are more difficult to handle as theymay show variations as a function of source brightness or ce-lestial position they may be correlated across neighbouringsources and they may not be well understood for a given earlyGaia data release In general the information needed to accu-rately model systematic uncertainties or uncertainty correlationsbetween sources may not be readily available This informationcan be obtained from a comparison between Gaia and otherhigh-precision data (eg Zinn et al 2017 Arenou et al 2017Gaia Collaboration et al 2018a) or by examining for exampleplots of the parallax or proper motion averaged over sky regionsfor samples where the true parallax or proper motion values canbe assumed to be known such as zero parallax and proper mo-tion for quasars (see Gaia Collaboration et al 2018c for exam-ples)

Two special cases should be mentioned when the sampleis well distributed over the sky we can safely assume that thelocal systematics vanish and that only the global parallax zero-point need to be subtracted locally we may be interested not bythe absolute value of the parallaxes but by the relative ones inwhich case the difference between parallaxes and their averageremoves part of the systematics

There is no general recipe for dealing with non-Gaussian un-certainties or correlated systematic uncertainties The main ad-vice we can give here is to proceed with the analysis of the astro-metric data as they are but to keep in mind the systematics andcorrelations discussed in Sect 2 when interpreting the resultsAlternatively the forward model can be extended to include sys-tematic and correlation terms for which parameters are also tobe estimated Such models can be guided by the studies of sys-tematic uncertainties mentioned above

Article number page 16 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Testing with simulations Finally we strongly advise that theinference problem at hand should be investigated through simu-lated data and that the simulations performed should be as closeas possible to the real data (in particular correctly modelling theuncertainties and selection effects) The simulations allow theanalysis method to be developed and tested for accuracy How-ever the performance of the analysis method should be inter-preted strictly in terms of how well the assumed model explainsthe simulated observed data That is we should not fall into thetrap of trying to tune the analysis method to get an answer that isas close to the lsquotruthrsquo as possible In real problems we can onlyjudge the adequacy of a model and its parameter values by howwell they predict the observed data (to within the observationaluncertainties it should be stressed as we should avoid lsquoover-fittingrsquo the data)

5 Using astrometric data practical examples

We introduce here a few worked examples to illustrate thepoints that were made in the previous section These ex-amples are available in full detail as online tutorials in theform of source code accompanied by much more exten-sive explanation than can be provided here The source codeand corresponding Python and R Notebooks can be foundat the following URL httpsgithubcomagabrownastrometry-inference-tutorials In all cases the readeris strongly encouraged to download the source code and exper-iment with modifications of the simulated input data andor theparameter choices in the inference methods

51 Comparison of distance and distance modulusestimators

The use of the Bayesian inference with non-informative priorsdescribed in Sect 411 is illustrated and implemented inthe following tutorial httpsgithubcomagabrownastrometry-inference-tutorialstreemastersingle-sourceGraphicalUserInterface The tutorialcompares the performance of Bayesian distance estimationmethods with the SmithndashEichhorn transformation (Smith ampEichhorn 1996) (Sect 341) and the naive parallax inversion

The tutorial contains a Graphical User Interface that easilyvisualises and compares the behaviour of all these estimators fora given parallax and uncertainty For the Bayesian inference es-timations using the mode and the median are provided togetherwith a 90 confidence interval The tutorial also provides a li-brary pyrallaxes with the implementation of all these estima-tors The library can easily be customised to implement otherpriors for the Bayesian inference

Additionally an implementation of the Bayesian distanceestimator using the Exponentially Decreasing Space Densityprior introduced in Bailer-Jones (2015) will be available inTopCat (httpwwwstarlinkacuktopcat) and Stilts(httpwwwstarlinkacukstilts) from respectivelyversions 46 and 31-3 onwards

52 Inferring the distance to a single source using just theparallax

The issues surrounding the use of a parallax to infer a distancewere explored in Bailer-Jones (2015) and applied to simulatedGaia data in Astraatmadja amp Bailer-Jones (2016a) and to TGAS(Gaia DR1 ) in Astraatmadja amp Bailer-Jones (2016b) A

tutorial exploring this is provided at httpsgithubcomagabrownastrometry-inference-tutorialstreemastersingle-sourcetutorial This can be used toinvestigate how the posterior depends on the prior and the dataIt also includes a simple example of a hierarchical model toavoid specifying the exact length scale of a distance prior

53 Inferring the distance to and size of a cluster using justthe parallaxes

In many applications we are more interested in the averagedistance to a cluster (or a group of stars) rather than to theindividual stars In this case a basic mistake to be avoidedis estimating the individual distances (whatever the method)and then averaging these individual values A more correctapproach is to average the individual parallaxes and then toobtain a distance from this average value However a muchbetter solution is to set up a model for the cluster whichwould use as parameters the distance to its centre for exampleand some measure of its size and to infer its parametersThis is explored in the tutorial at httpsgithubcomagabrownastrometry-inference-tutorialstreemastermultiple-source This introduces the overall prob-lem and derives a general solution Code is implemented forthe specific case of a model which assumes a random isotropicdistribution of the true stars from the centre of the cluster Thismodel has two parameters the cluster distance and the clustersize The solution uses a small angle approximation to makethe problem simpler although it is easily extended to the caseof clusters with a significant angular extent It is applied to thePleiades selection from the Gaia DR1 main release paper(Gaia Collaboration et al 2016) The tutorial also considers theproblem of how to accommodate correlations in the measuredparallaxes of different stars Finally it also shows the resultsfrom a classical and a naive combination of stellar parallaxesto estimate the cluster distance The combination of parallaxesand proper motions of individual stars in a cluster into a singlesolution for the mean parallax and proper motion is treated asan iterative least squares problem in Gaia Collaboration et al(2017b) (see their Appendix A for details)

54 Inferring the distance and velocity of a source using theparallax and proper motions

The velocity (speed and direction) of a source in the planeof the sky can be inferred from measurements of its parallaxand two proper motions The uncertainties in all three affectthe inferred velocity and its uncertainty Moreover as theGaia parallaxes and proper motions generally have non-zerocorrelations these must also be taken into account Thiscan be done in a straightforward manner in the Bayesianapproach as is shown in the tutorial at httpsgithubcomagabrownastrometry-inference-tutorialstreemaster3d-distance This sets up a three-parametermodel (distance speed angle) for a source Using the threemeasurements (parallax two proper motions) in a multivariateGaussian likelihood and suitable priors on the parameters wecan compute the trivariate posterior This is sampled in theposterior using an MCMC algorithm for a set of stars

Article number page 17 of 20

AampA proofs manuscript no 32964_Arxiv

55 Luminosity calibration

In this tutorial (httpsgithubcomagabrownastrometry-inference-tutorialstreemasterluminosity-calibration) the problem of inferring (orcalibrating) the mean absolute magnitude of a specific classof stars is treated The measurements at hand are the parallaxand apparent magnitude for each of the stars in the sampleand the task is to infer their mean absolute magnitude microM andthe spread σM around this mean This is very similar to theproblem that Lutz amp Kelker (1973) and Turon Lacarrieu ampCreacutezeacute (1977) treated and a Bayesian approach to solving thisproblem was presented by Brown (2012) (albeit with the use ofimproper priors which we again note is bad practice) A morecomplex version of this problem (accounting for extinction anda contaminating population of stars) and its Bayesian solutionwas presented in Hawkins et al (2017) In this tutorial threeimportant points are illustrated

ndash Often the explicit computation of the distances to stars is notof interest In this example only the mean absolute magni-tude of the stars is to be estimated and the forward mod-elling approach as part of the Bayesian inference avoids theneed to calculate or estimate distances

ndash The data for all the stars carry information on the mean ab-solute magnitude including the negative parallaxes or par-allaxes with large relative errors This information can nat-urally be incorporated in a forward modelling approach (inthis example as part of a Bayesian inference method) thusavoiding the introduction of truncation biases caused by theselection of stars with lsquogoodrsquo parallaxes

ndash If the selection function is known (in this example the surveyis magnitude limited) it can and should be included in theforward modelling This accounts for sample selection biasesthat would otherwise occur

56 Period-luminosity relation

In this tutorial (httpsgithubcomagabrownastrometry-inference-tutorialstreemasterperiod-luminosity-relation) we include a hierarchicalmodel to infer period-luminosity-metallicity relations forclassical pulsating stars The full model can be applied tofundamental mode RR Lyrae stars and the abridged version(without the metallicity dependence) is suitable for samples ofclassical Cepheids We include the data set for the RR Lyraestars described and used for inference in Gaia Collaborationet al (2017a) and Delgado et al (2018) It contains a sampleof 200 stars (including fundamental radial pulsators but alsofundamentalised first overtone pulsators) with measured pe-riods apparent magnitudes in the K-band metallicities andparallaxes from the TGAS catalogue In the tutorial we describethe hierarchical model and discuss potential biases in the dataset Finally we analyse the sensitivity of the results to differentchoices of the priors and related parameters

6 Conclusions

Gaia data releases will provide a huge increase of astrometricdata available for the scientific community More than a billionparallaxes and proper motions allow new openings into many as-tronomical topics In most cases astronomers are exploiting theGaia catalogues to obtain physical quantities such as distanceand velocity Although it is easy to extract from the Gaia data it

is well known that direct inversion of parallax will lead to biaseswhich become more and more significant the larger the relativeparallax uncertainty While Gaia will provide high-quality astro-metric measurements hundreds of millions of stars have preci-sions which require proper statistical treatment in order to avoidbiased conclusions The aim of this paper is to guide the users ofGaia data to handle astrometric data correctly

In this study we summarise methods used to avoid bi-ases when converting astrometric data into physical quantitiesStarting from simple non-recommended sample truncation tomore complex methods the biases associated with the meth-ods are presented The basic recommendation is to treat deriva-tion of physical quantities from astrometric measurements asan inference problem which should be preferably handled withBayesian approach The recommended methods are described inSect 4 with a summary in Sect 43 To aid the users furtherSect 5 contains practical examples with links to Python and Rcode

Gaia will provide fundamental data for many fields of as-tronomy Further data releases will provide more data and moreprecise data Nevertheless for full use of the potential it will al-ways be necessary to pay careful attention to the statistical treat-ment of parallaxes and proper motions The purpose of this paperis to help astronomers find the correct approachAcknowledgements This work has made use of results from the European SpaceAgency (ESA) space mission Gaia the data from which were processed by theGaia Data Processing and Analysis Consortium (DPAC) Funding for the DPAChas been provided by national institutions in particular the institutions partici-pating in the Gaia Multilateral Agreement The Gaia mission website is httpwwwcosmosesaintgaia The authors are current or past members of theESA Gaia mission team and of the Gaia DPAC This work was supported by theMINECO (Spanish Ministry of Economy) through grant ESP2016-80079-C2-1-R (MINECOFEDER UE) and ESP2014-55996-C2-1-R (MINECOFEDERUE) and MDM-2014-0369 of ICCUB (Unidad de Excelencia lsquoMariacutea de Maeztursquo)and by the DLR (German space agency) via grant 50 QG 1403

ReferencesAnderson L Hogg DW Leistedt B Price-Whelan AM Bovy J Jun 2017

ArXiv e-printsArenou F Luri X 1999 In Egret D Heck A (eds) Harmonizing Cosmic

Distance Scales in a Post-HIPPARCOS Era vol 167 of Astronomical Societyof the Pacific Conference Series 13ndash32

Arenou F Luri X Babusiaux C et al Feb 2017 AampA 599 A50Astraatmadja TL Bailer-Jones CAL Dec 2016a ApJ 832 137Astraatmadja TL Bailer-Jones CAL Dec 2016b ApJ 833 119Bailer-Jones CAL Oct 2015 PASP 127 994Bovy J Sep 2017 MNRAS 470 1360Brown AGA Nov 2012 Statistical AstrometryBrown AGA Arenou F van Leeuwen F Lindegren L Luri X Aug 1997

In Bonnet RM Hoslashg E Bernacca PL et al (eds) Hipparcos - Venice rsquo97vol 402 of ESA Special Publication 63ndash68

de Bruijne JHJ Hoogerwerf R de Zeeuw PT Feb 2001 AampA 367 111Delgado HE Sarro LM Clementini G Muraveva T Garofalo A Mar 2018

ArXiv e-printsEddington AS Mar 1913 MNRAS 73 359ESA (ed) 1997 The HIPPARCOS and TYCHO catalogues Astrometric and

photometric star catalogues derived from the ESA HIPPARCOS Space As-trometry Mission vol 1200 of ESA Special Publication (available online athttpswwwcosmosesainthipparcos)

Gaia Collaboration Brown AGA Vallenari A et al Nov 2016 AampA 595A2

Gaia Collaboration Clementini G Eyer L et al Sep 2017a AampA 605 A79Gaia Collaboration van Leeuwen F Vallenari A et al May 2017b AampA 601

A19Gaia Collaboration Arenou F Luri X et al Apr 2018a AampA (special issue

for Gaia DR2)Gaia Collaboration Brown AGA Vallenari A Prusti T et al Apr 2018b

AampA (special issue for Gaia DR2)Gaia Collaboration Lindegren L Hernaacutendez J et al Apr 2018c AampA (special

issue for Gaia DR2)

Article number page 18 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Hawkins K Leistedt B Bovy J Hogg DW Oct 2017 MNRAS 471 722Iorio G Belokurov V Erkal D et al Feb 2018 MNRAS 474 2142Leistedt B Hogg DW Dec 2017a AJ 154 222Leistedt B Hogg DW Mar 2017b ArXiv e-printsLindegren L Madsen S Dravins D Apr 2000 AampA 356 1119Lindegren L Lammers U Hobbs D et al Feb 2012 AampA 538 A78Lindegren L Lammers U Bastian U et al Nov 2016 AampA 595 A4Lutz TE Kelker DH Oct 1973 PASP 85 573Malmquist GK Mar 1920 Meddelanden fran Lunds Astronomiska Observato-

rium Serie II 22 3Palmer M Arenou F Luri X Masana E Apr 2014 AampA 564 A49Robin C Luri X Reyleacute C et al 2012 AampA 543 A100Schoumlnrich R Aumer M Dec 2017 MNRAS 472 3979Schoumlnrich R Binney J Asplund M Feb 2012 MNRAS 420 1281Schroumlder SE Kaper L Lamers HJGLM Brown AGA Dec 2004 AampA

428 149Sesar B Fouesneau M Price-Whelan AM et al Apr 2017 ApJ 838 107Smith H Jr Eichhorn H Jul 1996 MNRAS 281 211Trumpler RJ Weaver HF 1953 Statistical astronomyTuron Lacarrieu C Creacutezeacute M Apr 1977 AampA 56 273Zinn JC Huber D Pinsonneault MH Stello D Aug 2017 ApJ 844 166

Article number page 19 of 20

AampA proofs manuscript no 32964_Arxiv

Appendix A Description of the simulated samplesused in this paper

The data used in this paper is from a Gaia Universe Model Snap-shot (GUMS) simulation (Robin C et al 2012) together withthe Gaia-like uncertainties and an estimation of the observabledata The uncertainties were computed from an implementationof the recipes described in the Gaia Science performance webpage4 provided by the python PyGaia toolkit5 which then haveto be re-scaled to fit the Gaia DR2 expectations

The simulation contains around 109 sources including onlysingle stars ie stars not belonging to multiple systems up toG lt 20 magnitude distributed as shown in Fig A1

0 5 10 15 20 25G magnitude [mag]

100

101

102

103

104

105

106

107

108

109

Num

ber

of

sourc

es

[log]

G magnitude histogram

Fig A1 Histogram of the starsrsquo G magnitude The total number ofsources is 1 069 138 714 which are distributed in bins of size ∆G = 02the sample is limited to G lt 20

To compute the Gaia potential observables standard uncer-tainties must be added to the simulation Because astrometricquantities (positions parallaxes and proper motions) are relateda single formalism to derive its standard errors is needed PyGaiaimplements a simple performance model depending on the VndashICcolour term and the G magnitude to estimate the end-of-missionerrors for the parallax uncertainty

σ$[microas] =radicminus1631 + 680766 z + 32732 z2 [0986+(1 minus 0986) (V-IC)]

(A1)

with

z = max(1004(1209minus15)1004(Gminus15)) (A2)

(see the Gaia science performance web for more details) It alsotakes into account the variation of the uncertainties over the skybecause of the scanning law tabulated as a function of the eclip-tic latitude6 β

However these are end-of-mission uncertainties so theyhave to be scaled by the fraction of mission time completed in or-der to get an estimation of them for Gaia DR2 In the case of the

parallax only the factorradic

5L being L the mission time (years)

included in Gaia DR2 need to be applied this error model is

4 httpswwwcosmosesaintwebgaiascience-performance5 httpspypipythonorgpypiPyGaia6 httpswwwcosmosesaintwebgaiatable-2-with-ascii

further described in Arenou et al (2017) In our case we havealso updated the calibration floor to take into account the prop-erties of the Gaia DR2 formal errors as shown in Fig A2 Thiscalibration floor introduces a minimum formal error stemmingfrom the fact that the calibrations used in the data processing(models and parameters) are at this stage still being refined Thisfloor affects mainly bright stars while for faint stars the photonnoise dominates Fig A2 shows the model used in the simula-tion for the Gaia DR2 parallax uncertainties as a function of theG magnitude

Fig A2 Average estimated errors as function of G The dash-dottedline represents the uncertainty at Gaia DR2 while the dashed linerepresents the end-of-mission uncertainties The solid line representsGaia DR2 errors including systematics

Article number page 20 of 20

  • 1 Introduction
  • 2 Gaia astrometric data
    • 21 Uncertainties
    • 22 Correlations
    • 23 Systematic errors
    • 24 Completeness
      • 3 Critical review of the traditional use of parallaxes
        • 31 Measurement of parallaxes
        • 32 Estimating distance by inverting the parallax
        • 33 Sample truncation
        • 34 Corrections and transformations
          • 341 The SmithndashEichhorn correction
          • 342 The LutzndashKelker correction
            • 35 Astrometry-based luminosity
              • 4 Recommendations for using astrometric data
                • 41 Using Gaia astrometric data how to proceed
                  • 411 Bayesian inference of distances from parallaxes
                  • 412 Bayesian inference of distances from parallaxes and complementary information
                    • 42 Absolute magnitudes tangential velocities and other derived quantities
                    • 43 Further recommendations
                      • 5 Using astrometric data practical examples
                        • 51 Comparison of distance and distance modulus estimators
                        • 52 Inferring the distance to a single source using just the parallax
                        • 53 Inferring the distance to and size of a cluster using just the parallaxes
                        • 54 Inferring the distance and velocity of a source using the parallax and proper motions
                        • 55 Luminosity calibration
                        • 56 Period-luminosity relation
                          • 6 Conclusions
                          • A Description of the simulated samples used in this paper
Page 12: Gaia Data Release 2: Using Gaia parallaxes · The Gaia Data Release 2 (Gaia DR2 ) Gaia Collaboration et al. (2018b) provides precise positions, proper motions, and paral-laxes for

AampA proofs manuscript no 32964_Arxiv

Mean

Cou

nts

minus8000 minus4000 0 2000

0e+

002e

minus04

4eminus

046e

minus04

8eminus

04

Mode

Cou

nts

minus8000 minus4000 0 2000

Median

Cou

nts

minus8000 minus4000 0 2000

0e+

002e

minus04

4eminus

046e

minus04

8eminus

04

Fig 12 Probability distribution of the residuals of the Bayesian estimate of the true distance in parsecs for 100000 simulated stars drawn from auniform density plus exponential decay distribution and σ$ = 3 middot 10minus1 mas (orange) 3 mas (blue) and 30 mas (grey)

Mean

Cou

nts

0 200 400 600 800 1000

000

00

005

001

00

015

002

0

Mode

Cou

nts

0 200 400 600 800 1000

Median

Cou

nts

0 200 400 600 800 1000

000

00

005

001

00

015

002

0

Fig 13 Distributions of means (left) modes (centre) and medians (right) for a series of 104 posteriors calculated for a star at a true distance of100 pc and f = 02 and observed parallaxes drawn at random from the corresponding Gaussian distribution Posteriors are inferred with a uniformdensity prior truncated at 1 kpc (grey) or with a uniform density with exponential decay prior and length scale 135 kpc (blue) The red vertical linemarks the true parallax the grey and blue lines correspond to the expected value (mean) of each distribution (same colours as for the histograms)

Bayesian model) is applied in Hawkins et al (2017) where theconstraint on the distances comes not from a period-luminosityrelation but from the relatively small dispersion of the abso-lute magnitudes and colour indices of red clump stars A lastexample of this methodology can be found in Leistedt amp Hogg(2017b) where the constraint comes from modelling the colour-magnitude diagram

Just as in the previous section where we aimed at estimat-ing distances from parallaxes alone the two key elements in thiscase are the definitions of a likelihood and a prior The likeli-hood represents the probability distribution of the observablesgiven the model parameters Typically the likelihood is basedon a generative or forward model for the data Such modelspredict the data from our assumptions about the physical pro-cess that generates the true values (ie the distribution of stars in

space) and our knowledge of the measurement process (eg jus-tifying the assumption of a normal distribution of the observedparallax around its true value) Forward models can be used togenerate arbitrarily large synthetic data sets for a given set ofthe parameters In this case however where we are concernedwith several types of measurements that depend on parametersother than the distance the likelihood term will be in generalmore complex than in Sect 411 and may include probabilisticdependencies between the parameters The term hierarchical ormulti-level model is often used to refer to this kind of model

Let us illustrate the concept of hierarchical models with asimple extension of the Bayesian model described in Sect 411where instead of assuming a fixed value of the prior length scaleL in Eq (18) we make it another parameter of the model and tryto infer it Let us further assume that we have a set of N parallax

Article number page 12 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Table 1 Average bias and standard deviation in three regimes of fapp for four distance estimators discussed in this paper (from left to right) themode of the posterior based on the exponentially decreasing space density (EDSD) prior the mode of the posterior of the uniform distance (UD)distribution the maximum likelihood estimate corrected according to Smith amp Eichhorn (1996) abbreviated as SE and the maximum likelihood(ML) estimate The wider ranges of fapp exclude narrower ranges shown in previous rows of the table

Summary fapp Range EDSD UD SE ML

Bias(-11) -02 97 342 -095(-55) -03 107 -034 -12(-5050) -03 162 -04 -38

Std Deviation(-11) 04 80 6858 05(-55) 04 84 05 195(-5050) 04 106 04 171

fapp

log 1

0(|B

ias|

)

minus4

minus2

02

4

Mode EDSDMode UDSmithminusEichhorn estimateML estimate

fapp

log 1

0(σ)

minus1

12

34

5

minus5 minus3 minus1 1 3 5

Mode EDSDMode UDSmithminusEichhorn estimateML estimate

Fig 14 Bias (top) and standard deviation (bottom) averaged over binsof the estimated fractional parallax uncertainty fapp for four estimatorsof the distance the maximum likelihood (ML) estimator rest = 1

$(or-

ange) the ML estimator corrected as described in Smith amp Eichhorn(1996) (light blue) the mode of the posterior obtained with an expo-nentially decreasing space density (EDSD) prior and L = 135 kpc (darkgreen) and the mode of the posterior obtained with a uniform distance(UD) prior truncated at 100 kpc (red)

measurements $k one for each of a sample of N stars In thiscase the likelihood can be written as

p($k|rk L) = p($k | rk) middot p(rk | L)

= p($k | rk) middotNprod

k=1

p(rk | L) (19)

where rk is the true unknown distance to the kth star We note thatvery often the measured parallaxes are assumed independentand thus p($k | rk) is written as the product

prodNk=1 p($k | rk)

This is incorrect in general for Gaia parallaxes because the par-allax measurements are not independent As described in GaiaCollaboration et al (2018c) and Sect 2 of this paper there areregional correlations amongst them (see Sect 43) but for thesake of simplicity let us assume the sample of N measurementsis spread all over the celestial sphere such that the correlationsaverage out and can be neglected Hence we write

p($k|rk L) =

Nprodk=1

p($k | rk) middot p(rk | L) (20)

Under the assumption of Gaussian uncertainties the first termin the product is given by Eq (6) while the second is given byEq (18)

This likelihood can be represented by a simple directed graph(see Fig 15) that provides information about the conditional de-pendencies amongst the parameters The shaded nodes representthe observations the open circles represent model parametersand the small black circles at the origin of the arrows representmodel constants The arrows denote conditional dependence re-lations and the plate notation indicates repetition over the mea-surements k

The next key element is as in Sect 411 the prior Accord-ing to Fig 15 the only parameter that needs a prior specificationis the one without a parent node L The rest of the arcs in thegraph are defined in the likelihood term (Eq (20)) If the sampleof N stars were representative of the inner Galactic halo for ex-ample we could use a Gaussian prior centred at asymp 30 kpc (seeeg Iorio et al 2018 and references therein) Such a hierarchicalmodel can potentially shrink the individual parallax uncertaintiesby incorporating the constraint on the distribution of distances

If we are only interested in the individual distances rk wecan consider L as a nuisance parameter

p($Truek | $k) =

intp($Truek L | $k) middot dL (21)

=

intp($Truek | $k L) middot p(L | $k) middot dL

This integral (known as the marginalisation integral) allows us towrite the posterior we are interested in without having to fix thevalue of L to any particular value Depending on the objective ofthe inference we could have alternatively determined the poste-rior distribution of L by marginalising the individual distanceswith an N-dimensional integral over the rk parameters

Article number page 13 of 20

AampA proofs manuscript no 32964_Arxiv

L

rk

$ σ$

microL

σL

k = 1 2 N

Fig 15 Directed acyclic graph that represents a hierarchical Bayesianmodel of a set of N parallax measurements characterised by uncertain-ties σ$ and true distances drawn from an exponentially decreasing den-sity distribution of distances (see Eq (18)) The scale length of the ex-ponential decrease L is itself a model parameter that we can infer fromthe sample Its prior is defined in this case for the sake of simplicity asa Gaussian distribution of mean microL and standard deviation σL

In parameter spaces of dimensionality greater than 3ndash4 thecomputation of the possibly marginalised posteriors andor ev-idence requires efficient sampling schemes like those inspiredin Markov chain Monte Carlo (MCMC) methods to avoid largenumbers of calculations in regions of parameter space with neg-ligible contributions to the posterior This adds to the highercomputational burden of the Bayesian inference method men-tioned in the previous section

The previous simple example can be extended to includemore levels in the hierarchy and more importantly more pa-rameters and measurement types Section 55 and 56 developin greater detail two examples of hierarchical models of directapplicability in the Gaia context

42 Absolute magnitudes tangential velocities and otherderived quantities

The approaches described in the previous sections can be appliedto any quantity we want to estimate using the parallax For exam-ple if we want to infer the absolute magnitude MG then giventhe measured apparent magnitude G and line-of-sight extinctionAG the true parallax $True is related to MG via the conservationof flux

5 log$True = MG + AG minusG minus 5 (22)

Assuming for simplicity that G and AG are known Bayesrsquotheorem gives the posterior on MG as

P(MG | $G AG) =1Z

P($ | MG AGG)P(MG) (23)

where the likelihood is still the usual Gaussian distribution forthe parallax (Eq (6)) in which the true parallax is given byEq (22) As this expression is non-linear we again obtain anasymmetric posterior PDF over MG the exact shape of whichalso depends on the prior

The inference of other quantities can be approached in thesame way In general we need to consider a multi-dimensionallikelihood to accommodate the measurement uncertainties (andcorrelations) in all observed quantities For instance the set ofparameters θ = r v φ (distance tangential speed and direc-tion measured anticlockwise from north) can be inferred fromthe Gaia astrometric measurements o = $ microαlowast microδ (where microαlowastand microδ are the measured proper motions) using the likelihood

p(o | θ) = N(θΣ) =1

(2π)32|Σ|12exp

(minus

12

(o minus x)T Σminus1(o minus x))

(24)

where N denotes the Gaussian distribution Σ is the full (non-diagonal) covariance matrix provided as part of the Gaia DataRelease and

x =

(1r

v sin(φ)r

v cos(φ)

r

)(25)

is the vector of model parameters geometrically transformed intothe space of observables in noise-free conditions Equation (24)assumes correlated Gaussian uncertainties in the measurements

The posterior distribution can then be obtained by multiply-ing the likelihood by a suitable prior The simplest assumptionwould be a separable prior such that p(θ) = p(r) middot p(v) middot p(φ)where p(v) and p(φ) should reflect our knowledge about thedynamical properties of the population from where the sourceor sources were drawn (eg thin disk thick disk bulge halo)Again hierarchical models can be used in the analysis of sam-ples in order to infer the population properties (prior hyper-parameters) themselves

Similar procedures can be followed to infer kinematic ener-gies or full 3D velocities when the forward model is extendedwith radial velocity measurements

43 Further recommendations

In this subsection we provide some further recommendationsand guidance in the context of the Bayesian approach outlinedabove Although powerful inference with Bayesian methodsusually comes at a large computational cost Some of the rec-ommendations below can also be seen in the light of taking dataanalysis approaches that approximate the Bayesian methodologyand can be much faster

Where possible formulate the problem in the data spaceThe problems caused by the ill-defined uncertainties on quan-tities derived from parallaxes can be avoided by carrying outthe analysis in the data space where the behaviour of the un-certainties is well understood This means that the quantities tobe inferred are treated as parameters in a forward or generativemodel that is used to predict the data Some adjustment processthen leads to estimates of the parameters A very simple forwardmodelling example can be found in Schroumlder et al (2004) whostudied the luminosity calibrations of O-stars by predicting the

Article number page 14 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

expected Hipparcos parallaxes from the assumed luminosity cal-ibration and comparing those to the measured parallaxes A morecomplex example can be found in Lindegren et al (2000) whopresent a kinematic model for clusters which describes the ve-locity field of the cluster members and predicts the proper mo-tions accounting for the astrometric uncertainties and covari-ances As shown in previous sections the Bayesian approachnaturally lends itself to (and in fact requires) forward modellingof the data

Forward modelling has the added advantage that it forces usto consider the proper formulation of the questions asked fromthe astrometric data This naturally leads to the insight that oftenthe explicit knowledge of the distances to sources is not of inter-est For example in the Schroumlder et al (2004) case an assumedluminosity of the O-stars and their known apparent magnitudeis sufficient to predict the observed parallaxes In more complexanalyses the distances to sources can often be treated as nuisanceparameters which in a Bayesian setting can be marginalised outof the posterior

Use all relevant information Although the parallax has a di-rect relation to the distance of a star it is not the only measure-ment that contains distance information The apparent magni-tude and the colour of a star carry significant information on theplausible distances at which it can be located as the colour pro-vides information on plausible absolute magnitude values for thestar This is used in two papers (Leistedt amp Hogg 2017a An-derson et al 2017) in which the information contained in thephotometry of stars is combined with Gaia DR1 parallax in-formation to arrive at more precise representations of the colour-magnitude diagram than can be obtained through the parallaxesalone Similarly proper motions contain distance informationwhich can be used to good effect to derive more precise distancesor luminosities for open cluster members (eg de Bruijne et al2001 Gaia Collaboration et al 2017b) The Bayesian approachnaturally allows for the combination of different types of data inthe modelling of the problem at hand It should be emphasisedhowever that adding additional data necessitates increasing themodel complexity (eg to include the relation among apparentmagnitude absolute magnitude extinction and parallax) whichleads to the need to make more assumptions

Incorporate a proper prior Bailer-Jones (2015) discussed thesimplest case of inferring the distance to a source from a sin-gle observed parallax He showed that if the minimal prior thatr should be positive is used the resulting posterior is not nor-malisable and hence has no mean variance other momentsor percentiles This behaviour of the posterior is not limitedto inference of distance Examples of other quantities that arenon-linearly related to parallax are absolute magnitude (M prop

log10 $) tangential velocity (vT prop 1$) kinetic energy and an-gular momentum (both proportional to 1$2 when determinedrelative to the observer) In all these cases it can be shown thatthe posterior for an improper uniform prior is not normalisableand has no moments The conclusion is that a proper prior onthe parameters to be estimated should be included to ensure thatthe posterior represents a normalised probability distribution Ingeneral using unconstrained non-informative priors in Bayesiandata analysis (such as the one on r above) is bad practice In-evitably there is always a mismatch between the prior and thetrue distribution (if there were not there would be no need todo the inference) This will unavoidably lead to some biases al-though the better the data the smaller these will be We cannot

expect to do much better than our prior knowledge in case weonly have access to poor data The Bayesian approach guaran-tees a graceful transition of the posterior into the prior as thedata quality degrades

The above discussion raises the question of what priors to in-clude for a specific inference problem Simple recipes cannot begiven as the prior to be used depends on the problem at hand andthe information already available When deciding on a prior weshould keep in mind that some information is always availableEven if a parallax is only available for a single star we knowthat it cannot be located at arbitrary distances It is unlikely tobe much closer that 1 pc (although we cannot fully exclude thepresence of faint stars closer than Proxima Centauri) and it mustbe located at a finite distance (we can observe the star) Thiswould suggest a non-informative uniform prior on r with lib-eral lower and upper bounds (such that the prior is normalised)However as pointed out in Bailer-Jones (2015) a uniform distri-bution in r implies a space density of stars that falls of as 1r2

from the position of the Sun which is of course physically im-plausible Hence one should assume a reasonable distribution ofstars in space and construct the prior on r accordingly Bailer-Jones (2015) presents a prior derived from a uniform space den-sity with an exponential cut-off which was used in Astraatmadjaamp Bailer-Jones (2016b) to derive distances to stars for which par-allaxes are listed in Gaia DR1 This prior should not be usedindiscriminately at the very least we should carefully considerthe choice of the scale length L (or leave that as a parameter tobe estimated as described in Sect 412) and in most cases amore tailored prior based on our broad knowledge of the distri-bution of a given stellar population in the Milky Way would bebetter The tutorial cases introduced in the next section containsome more examples of priors on distance and other astrophysi-cal parameters

The next two items discuss simplifications to the Bayesianapproach that nevertheless need to be justified carefully

Maximum likelihood modelling We have seen that priors arethe bridges that allow us to go from the probability of the obser-vations given the unknown parameters to the desired probabil-ity of the parameters given the observations It is only based onthis probability distribution that we can make statements aboutcredible intervals (uncertainties) for the inferred quantities orselect amongst competing models (a topic that is beyond thescope of this paper) However if making prior-free inferences ispreferred then maximising the likelihood is the only alternativeThe kinematic modelling presented in Lindegren et al (2000)is a non-trivial example of this A more complex example canbe found in Palmer et al (2014) The ML approach just as theBayesian framework described above allows the combinationof different types of data and accounts for selection functionsor missing data We have seen in Sect 411 that the maximumlikelihood estimate of the distance given a single parallax mea-surement is ρ = 1

$and this is a poor estimator of the distance

except for subsets of very accurate measurements In generalthe Bayesian and the maximum likelihood estimates coincide inthe limit of very small uncertainties or infinite numbers of mea-surements In such limits the maximum likelihood estimate issimpler to obtain although its computational cost may still belarge as the ML method is often equivalent to a complex optimi-sation problem involving a multi-dimensional function of manyparameters

Article number page 15 of 20

AampA proofs manuscript no 32964_Arxiv

Selecting the lsquobestrsquo data Analyses that use parallax data areoften restricted to positive parallaxes with relative uncertaintiesbelow some limit (typically 20 ) This allows working in aregime where the uncertainties of derived quantities such as dis-tance tangential velocities luminosity etc are thought to bemanageable which allows working in the space of astrophysi-cal variables rather than the data Truncation on relative parallaxerror might be justified in an exploratory phase of the data anal-ysis however there are a number of reasons why this approachis almost never advisable Even at relative uncertainties below02 the quantities calculated from the parallax can be biased andsuffer from a large variance (see Bailer-Jones 2015) More im-portantly however the selection of lsquogoodrsquo parallaxes will biasthe sample studied to nearby andor bright representatives of anystellar population and the selection may lead to discarding avery large fraction of the potential sample Hence any inferencesbased on such data will be severely biased This is clearly illus-trated in Fig 7 where for an even less strict truncation of starswith a relative uncertainties below 50 the distribution of dis-tances of the resulting sample is clearly biased with respect theoriginal sample

Accounting for data selection and incompleteness Althoughthe Gaia survey is designed such that the only selection ofsources is that they are brighter than the survey limit at G = 207the combination of the onboard detection algorithm the Gaiascanning law and the filtering of results of insufficient qualityprior to a data release lead to a complex selection function espe-cially in the early data releases This selection function should betaken into account in any data analysis and this is most naturallydone as part of a Bayesian analysis where the selection func-tion is part of the forward model that predicts the data given themodel parameters Precise prescriptions of the selection func-tions are not foreseen to be part of the data release documenta-tion Hence selection function parameters need to be includedas part of the parameters inferred by the Bayesian analysis or ifthis is not possible the selection functions have to be borne inmind when interpreting the results

Covariances in the uncertainties All the uncertainties on theastrometric data quoted in the Gaia catalogue are presented asfull covariance matrices where the diagonal elements representthe standard uncertainties on the astrometric parameters and theoff-diagonal elements the covariances or correlations betweenthe uncertainties This amounts to treating the astrometric datavector as having been drawn from a multivariate normal distri-bution with a covariance matrix as given in the Gaia catalogueThe covariances are most easily handled in the data space as partof the likelihood (see Lindegren et al 2000 for an example inthe context of kinematic modelling) If the covariances in theastrometric uncertainties are not accounted for we can easily bemisled for example by spurious features in the inferred velocityfield of an open cluster (Brown et al 1997)

The uncertainties are also correlated from one source tothe next especially over small distances on the sky A studyof the star-to-star correlations in the parallax uncertainties inGaia DR1 was done for the Kepler field (Zinn et al 2017)where independent and precise asteroseismic distances to thestars are available enabling the authors to derive an expressionfor the correlation strength and spatial scale This expressioncan be used for studies of the Kepler field but care should betaken when extrapolating to other fields on the sky The func-tional form for the star-to-star correlations used by Zinn et al

(2017) could be introduced as part of the forward model withthe Zinn et al (2017) parameters as a good first guess

For Gaia DR1 the length scale for the star-to-star corre-lations was estimated to vary from subdegree scales to tens ofdegrees on the sky where Zinn et al (2017) derived the corre-lation function over length scales of sim 02 to sim 10 degrees ForGaia DR2 Gaia Collaboration et al (2018c) estimate that thespatial correlations extend over scales of below 1 degree to 10ndash20 degrees

Accounting for non-Gaussian andor systematic uncertain-ties Although the bulk of the sources in the Gaia catalogue havenormally distributed uncertainties there is a significant fractionfor which the uncertainties exhibit non-Gaussian behaviour (egwhen uncertainties are over- or underestimated) This can be ac-counted for in the data analysis by including the uncertainties aspart of the forward model (eg Sesar et al 2017) or by explicitlymodelling an outlier population Similarly systematic uncertain-ties can be included as part of the forward model For exampleSesar et al (2017) include a global parallax zero-point as part oftheir probabilistic model used to analyse the period-luminosityrelation of RR Lyrae stars with Gaia DR1 data An alternativeapproach to the investigation of systematics in the parallaxes (ordistance estimates obtained from photometry or spectroscopyfor example) is presented in Schoumlnrich et al (2012) and is ap-plied to Gaia DR1 in Schoumlnrich amp Aumer (2017) In this casewe can consider that for samples covering a significant fractionof the sky any systematic error in the estimated distances to thestars will show up as correlations in their 3D velocity compo-nents The presence of such correlations can be used to make in-ferences about systematic errors for example in the parallaxes

Systematic uncertainties are more difficult to handle as theymay show variations as a function of source brightness or ce-lestial position they may be correlated across neighbouringsources and they may not be well understood for a given earlyGaia data release In general the information needed to accu-rately model systematic uncertainties or uncertainty correlationsbetween sources may not be readily available This informationcan be obtained from a comparison between Gaia and otherhigh-precision data (eg Zinn et al 2017 Arenou et al 2017Gaia Collaboration et al 2018a) or by examining for exampleplots of the parallax or proper motion averaged over sky regionsfor samples where the true parallax or proper motion values canbe assumed to be known such as zero parallax and proper mo-tion for quasars (see Gaia Collaboration et al 2018c for exam-ples)

Two special cases should be mentioned when the sampleis well distributed over the sky we can safely assume that thelocal systematics vanish and that only the global parallax zero-point need to be subtracted locally we may be interested not bythe absolute value of the parallaxes but by the relative ones inwhich case the difference between parallaxes and their averageremoves part of the systematics

There is no general recipe for dealing with non-Gaussian un-certainties or correlated systematic uncertainties The main ad-vice we can give here is to proceed with the analysis of the astro-metric data as they are but to keep in mind the systematics andcorrelations discussed in Sect 2 when interpreting the resultsAlternatively the forward model can be extended to include sys-tematic and correlation terms for which parameters are also tobe estimated Such models can be guided by the studies of sys-tematic uncertainties mentioned above

Article number page 16 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Testing with simulations Finally we strongly advise that theinference problem at hand should be investigated through simu-lated data and that the simulations performed should be as closeas possible to the real data (in particular correctly modelling theuncertainties and selection effects) The simulations allow theanalysis method to be developed and tested for accuracy How-ever the performance of the analysis method should be inter-preted strictly in terms of how well the assumed model explainsthe simulated observed data That is we should not fall into thetrap of trying to tune the analysis method to get an answer that isas close to the lsquotruthrsquo as possible In real problems we can onlyjudge the adequacy of a model and its parameter values by howwell they predict the observed data (to within the observationaluncertainties it should be stressed as we should avoid lsquoover-fittingrsquo the data)

5 Using astrometric data practical examples

We introduce here a few worked examples to illustrate thepoints that were made in the previous section These ex-amples are available in full detail as online tutorials in theform of source code accompanied by much more exten-sive explanation than can be provided here The source codeand corresponding Python and R Notebooks can be foundat the following URL httpsgithubcomagabrownastrometry-inference-tutorials In all cases the readeris strongly encouraged to download the source code and exper-iment with modifications of the simulated input data andor theparameter choices in the inference methods

51 Comparison of distance and distance modulusestimators

The use of the Bayesian inference with non-informative priorsdescribed in Sect 411 is illustrated and implemented inthe following tutorial httpsgithubcomagabrownastrometry-inference-tutorialstreemastersingle-sourceGraphicalUserInterface The tutorialcompares the performance of Bayesian distance estimationmethods with the SmithndashEichhorn transformation (Smith ampEichhorn 1996) (Sect 341) and the naive parallax inversion

The tutorial contains a Graphical User Interface that easilyvisualises and compares the behaviour of all these estimators fora given parallax and uncertainty For the Bayesian inference es-timations using the mode and the median are provided togetherwith a 90 confidence interval The tutorial also provides a li-brary pyrallaxes with the implementation of all these estima-tors The library can easily be customised to implement otherpriors for the Bayesian inference

Additionally an implementation of the Bayesian distanceestimator using the Exponentially Decreasing Space Densityprior introduced in Bailer-Jones (2015) will be available inTopCat (httpwwwstarlinkacuktopcat) and Stilts(httpwwwstarlinkacukstilts) from respectivelyversions 46 and 31-3 onwards

52 Inferring the distance to a single source using just theparallax

The issues surrounding the use of a parallax to infer a distancewere explored in Bailer-Jones (2015) and applied to simulatedGaia data in Astraatmadja amp Bailer-Jones (2016a) and to TGAS(Gaia DR1 ) in Astraatmadja amp Bailer-Jones (2016b) A

tutorial exploring this is provided at httpsgithubcomagabrownastrometry-inference-tutorialstreemastersingle-sourcetutorial This can be used toinvestigate how the posterior depends on the prior and the dataIt also includes a simple example of a hierarchical model toavoid specifying the exact length scale of a distance prior

53 Inferring the distance to and size of a cluster using justthe parallaxes

In many applications we are more interested in the averagedistance to a cluster (or a group of stars) rather than to theindividual stars In this case a basic mistake to be avoidedis estimating the individual distances (whatever the method)and then averaging these individual values A more correctapproach is to average the individual parallaxes and then toobtain a distance from this average value However a muchbetter solution is to set up a model for the cluster whichwould use as parameters the distance to its centre for exampleand some measure of its size and to infer its parametersThis is explored in the tutorial at httpsgithubcomagabrownastrometry-inference-tutorialstreemastermultiple-source This introduces the overall prob-lem and derives a general solution Code is implemented forthe specific case of a model which assumes a random isotropicdistribution of the true stars from the centre of the cluster Thismodel has two parameters the cluster distance and the clustersize The solution uses a small angle approximation to makethe problem simpler although it is easily extended to the caseof clusters with a significant angular extent It is applied to thePleiades selection from the Gaia DR1 main release paper(Gaia Collaboration et al 2016) The tutorial also considers theproblem of how to accommodate correlations in the measuredparallaxes of different stars Finally it also shows the resultsfrom a classical and a naive combination of stellar parallaxesto estimate the cluster distance The combination of parallaxesand proper motions of individual stars in a cluster into a singlesolution for the mean parallax and proper motion is treated asan iterative least squares problem in Gaia Collaboration et al(2017b) (see their Appendix A for details)

54 Inferring the distance and velocity of a source using theparallax and proper motions

The velocity (speed and direction) of a source in the planeof the sky can be inferred from measurements of its parallaxand two proper motions The uncertainties in all three affectthe inferred velocity and its uncertainty Moreover as theGaia parallaxes and proper motions generally have non-zerocorrelations these must also be taken into account Thiscan be done in a straightforward manner in the Bayesianapproach as is shown in the tutorial at httpsgithubcomagabrownastrometry-inference-tutorialstreemaster3d-distance This sets up a three-parametermodel (distance speed angle) for a source Using the threemeasurements (parallax two proper motions) in a multivariateGaussian likelihood and suitable priors on the parameters wecan compute the trivariate posterior This is sampled in theposterior using an MCMC algorithm for a set of stars

Article number page 17 of 20

AampA proofs manuscript no 32964_Arxiv

55 Luminosity calibration

In this tutorial (httpsgithubcomagabrownastrometry-inference-tutorialstreemasterluminosity-calibration) the problem of inferring (orcalibrating) the mean absolute magnitude of a specific classof stars is treated The measurements at hand are the parallaxand apparent magnitude for each of the stars in the sampleand the task is to infer their mean absolute magnitude microM andthe spread σM around this mean This is very similar to theproblem that Lutz amp Kelker (1973) and Turon Lacarrieu ampCreacutezeacute (1977) treated and a Bayesian approach to solving thisproblem was presented by Brown (2012) (albeit with the use ofimproper priors which we again note is bad practice) A morecomplex version of this problem (accounting for extinction anda contaminating population of stars) and its Bayesian solutionwas presented in Hawkins et al (2017) In this tutorial threeimportant points are illustrated

ndash Often the explicit computation of the distances to stars is notof interest In this example only the mean absolute magni-tude of the stars is to be estimated and the forward mod-elling approach as part of the Bayesian inference avoids theneed to calculate or estimate distances

ndash The data for all the stars carry information on the mean ab-solute magnitude including the negative parallaxes or par-allaxes with large relative errors This information can nat-urally be incorporated in a forward modelling approach (inthis example as part of a Bayesian inference method) thusavoiding the introduction of truncation biases caused by theselection of stars with lsquogoodrsquo parallaxes

ndash If the selection function is known (in this example the surveyis magnitude limited) it can and should be included in theforward modelling This accounts for sample selection biasesthat would otherwise occur

56 Period-luminosity relation

In this tutorial (httpsgithubcomagabrownastrometry-inference-tutorialstreemasterperiod-luminosity-relation) we include a hierarchicalmodel to infer period-luminosity-metallicity relations forclassical pulsating stars The full model can be applied tofundamental mode RR Lyrae stars and the abridged version(without the metallicity dependence) is suitable for samples ofclassical Cepheids We include the data set for the RR Lyraestars described and used for inference in Gaia Collaborationet al (2017a) and Delgado et al (2018) It contains a sampleof 200 stars (including fundamental radial pulsators but alsofundamentalised first overtone pulsators) with measured pe-riods apparent magnitudes in the K-band metallicities andparallaxes from the TGAS catalogue In the tutorial we describethe hierarchical model and discuss potential biases in the dataset Finally we analyse the sensitivity of the results to differentchoices of the priors and related parameters

6 Conclusions

Gaia data releases will provide a huge increase of astrometricdata available for the scientific community More than a billionparallaxes and proper motions allow new openings into many as-tronomical topics In most cases astronomers are exploiting theGaia catalogues to obtain physical quantities such as distanceand velocity Although it is easy to extract from the Gaia data it

is well known that direct inversion of parallax will lead to biaseswhich become more and more significant the larger the relativeparallax uncertainty While Gaia will provide high-quality astro-metric measurements hundreds of millions of stars have preci-sions which require proper statistical treatment in order to avoidbiased conclusions The aim of this paper is to guide the users ofGaia data to handle astrometric data correctly

In this study we summarise methods used to avoid bi-ases when converting astrometric data into physical quantitiesStarting from simple non-recommended sample truncation tomore complex methods the biases associated with the meth-ods are presented The basic recommendation is to treat deriva-tion of physical quantities from astrometric measurements asan inference problem which should be preferably handled withBayesian approach The recommended methods are described inSect 4 with a summary in Sect 43 To aid the users furtherSect 5 contains practical examples with links to Python and Rcode

Gaia will provide fundamental data for many fields of as-tronomy Further data releases will provide more data and moreprecise data Nevertheless for full use of the potential it will al-ways be necessary to pay careful attention to the statistical treat-ment of parallaxes and proper motions The purpose of this paperis to help astronomers find the correct approachAcknowledgements This work has made use of results from the European SpaceAgency (ESA) space mission Gaia the data from which were processed by theGaia Data Processing and Analysis Consortium (DPAC) Funding for the DPAChas been provided by national institutions in particular the institutions partici-pating in the Gaia Multilateral Agreement The Gaia mission website is httpwwwcosmosesaintgaia The authors are current or past members of theESA Gaia mission team and of the Gaia DPAC This work was supported by theMINECO (Spanish Ministry of Economy) through grant ESP2016-80079-C2-1-R (MINECOFEDER UE) and ESP2014-55996-C2-1-R (MINECOFEDERUE) and MDM-2014-0369 of ICCUB (Unidad de Excelencia lsquoMariacutea de Maeztursquo)and by the DLR (German space agency) via grant 50 QG 1403

ReferencesAnderson L Hogg DW Leistedt B Price-Whelan AM Bovy J Jun 2017

ArXiv e-printsArenou F Luri X 1999 In Egret D Heck A (eds) Harmonizing Cosmic

Distance Scales in a Post-HIPPARCOS Era vol 167 of Astronomical Societyof the Pacific Conference Series 13ndash32

Arenou F Luri X Babusiaux C et al Feb 2017 AampA 599 A50Astraatmadja TL Bailer-Jones CAL Dec 2016a ApJ 832 137Astraatmadja TL Bailer-Jones CAL Dec 2016b ApJ 833 119Bailer-Jones CAL Oct 2015 PASP 127 994Bovy J Sep 2017 MNRAS 470 1360Brown AGA Nov 2012 Statistical AstrometryBrown AGA Arenou F van Leeuwen F Lindegren L Luri X Aug 1997

In Bonnet RM Hoslashg E Bernacca PL et al (eds) Hipparcos - Venice rsquo97vol 402 of ESA Special Publication 63ndash68

de Bruijne JHJ Hoogerwerf R de Zeeuw PT Feb 2001 AampA 367 111Delgado HE Sarro LM Clementini G Muraveva T Garofalo A Mar 2018

ArXiv e-printsEddington AS Mar 1913 MNRAS 73 359ESA (ed) 1997 The HIPPARCOS and TYCHO catalogues Astrometric and

photometric star catalogues derived from the ESA HIPPARCOS Space As-trometry Mission vol 1200 of ESA Special Publication (available online athttpswwwcosmosesainthipparcos)

Gaia Collaboration Brown AGA Vallenari A et al Nov 2016 AampA 595A2

Gaia Collaboration Clementini G Eyer L et al Sep 2017a AampA 605 A79Gaia Collaboration van Leeuwen F Vallenari A et al May 2017b AampA 601

A19Gaia Collaboration Arenou F Luri X et al Apr 2018a AampA (special issue

for Gaia DR2)Gaia Collaboration Brown AGA Vallenari A Prusti T et al Apr 2018b

AampA (special issue for Gaia DR2)Gaia Collaboration Lindegren L Hernaacutendez J et al Apr 2018c AampA (special

issue for Gaia DR2)

Article number page 18 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Hawkins K Leistedt B Bovy J Hogg DW Oct 2017 MNRAS 471 722Iorio G Belokurov V Erkal D et al Feb 2018 MNRAS 474 2142Leistedt B Hogg DW Dec 2017a AJ 154 222Leistedt B Hogg DW Mar 2017b ArXiv e-printsLindegren L Madsen S Dravins D Apr 2000 AampA 356 1119Lindegren L Lammers U Hobbs D et al Feb 2012 AampA 538 A78Lindegren L Lammers U Bastian U et al Nov 2016 AampA 595 A4Lutz TE Kelker DH Oct 1973 PASP 85 573Malmquist GK Mar 1920 Meddelanden fran Lunds Astronomiska Observato-

rium Serie II 22 3Palmer M Arenou F Luri X Masana E Apr 2014 AampA 564 A49Robin C Luri X Reyleacute C et al 2012 AampA 543 A100Schoumlnrich R Aumer M Dec 2017 MNRAS 472 3979Schoumlnrich R Binney J Asplund M Feb 2012 MNRAS 420 1281Schroumlder SE Kaper L Lamers HJGLM Brown AGA Dec 2004 AampA

428 149Sesar B Fouesneau M Price-Whelan AM et al Apr 2017 ApJ 838 107Smith H Jr Eichhorn H Jul 1996 MNRAS 281 211Trumpler RJ Weaver HF 1953 Statistical astronomyTuron Lacarrieu C Creacutezeacute M Apr 1977 AampA 56 273Zinn JC Huber D Pinsonneault MH Stello D Aug 2017 ApJ 844 166

Article number page 19 of 20

AampA proofs manuscript no 32964_Arxiv

Appendix A Description of the simulated samplesused in this paper

The data used in this paper is from a Gaia Universe Model Snap-shot (GUMS) simulation (Robin C et al 2012) together withthe Gaia-like uncertainties and an estimation of the observabledata The uncertainties were computed from an implementationof the recipes described in the Gaia Science performance webpage4 provided by the python PyGaia toolkit5 which then haveto be re-scaled to fit the Gaia DR2 expectations

The simulation contains around 109 sources including onlysingle stars ie stars not belonging to multiple systems up toG lt 20 magnitude distributed as shown in Fig A1

0 5 10 15 20 25G magnitude [mag]

100

101

102

103

104

105

106

107

108

109

Num

ber

of

sourc

es

[log]

G magnitude histogram

Fig A1 Histogram of the starsrsquo G magnitude The total number ofsources is 1 069 138 714 which are distributed in bins of size ∆G = 02the sample is limited to G lt 20

To compute the Gaia potential observables standard uncer-tainties must be added to the simulation Because astrometricquantities (positions parallaxes and proper motions) are relateda single formalism to derive its standard errors is needed PyGaiaimplements a simple performance model depending on the VndashICcolour term and the G magnitude to estimate the end-of-missionerrors for the parallax uncertainty

σ$[microas] =radicminus1631 + 680766 z + 32732 z2 [0986+(1 minus 0986) (V-IC)]

(A1)

with

z = max(1004(1209minus15)1004(Gminus15)) (A2)

(see the Gaia science performance web for more details) It alsotakes into account the variation of the uncertainties over the skybecause of the scanning law tabulated as a function of the eclip-tic latitude6 β

However these are end-of-mission uncertainties so theyhave to be scaled by the fraction of mission time completed in or-der to get an estimation of them for Gaia DR2 In the case of the

parallax only the factorradic

5L being L the mission time (years)

included in Gaia DR2 need to be applied this error model is

4 httpswwwcosmosesaintwebgaiascience-performance5 httpspypipythonorgpypiPyGaia6 httpswwwcosmosesaintwebgaiatable-2-with-ascii

further described in Arenou et al (2017) In our case we havealso updated the calibration floor to take into account the prop-erties of the Gaia DR2 formal errors as shown in Fig A2 Thiscalibration floor introduces a minimum formal error stemmingfrom the fact that the calibrations used in the data processing(models and parameters) are at this stage still being refined Thisfloor affects mainly bright stars while for faint stars the photonnoise dominates Fig A2 shows the model used in the simula-tion for the Gaia DR2 parallax uncertainties as a function of theG magnitude

Fig A2 Average estimated errors as function of G The dash-dottedline represents the uncertainty at Gaia DR2 while the dashed linerepresents the end-of-mission uncertainties The solid line representsGaia DR2 errors including systematics

Article number page 20 of 20

  • 1 Introduction
  • 2 Gaia astrometric data
    • 21 Uncertainties
    • 22 Correlations
    • 23 Systematic errors
    • 24 Completeness
      • 3 Critical review of the traditional use of parallaxes
        • 31 Measurement of parallaxes
        • 32 Estimating distance by inverting the parallax
        • 33 Sample truncation
        • 34 Corrections and transformations
          • 341 The SmithndashEichhorn correction
          • 342 The LutzndashKelker correction
            • 35 Astrometry-based luminosity
              • 4 Recommendations for using astrometric data
                • 41 Using Gaia astrometric data how to proceed
                  • 411 Bayesian inference of distances from parallaxes
                  • 412 Bayesian inference of distances from parallaxes and complementary information
                    • 42 Absolute magnitudes tangential velocities and other derived quantities
                    • 43 Further recommendations
                      • 5 Using astrometric data practical examples
                        • 51 Comparison of distance and distance modulus estimators
                        • 52 Inferring the distance to a single source using just the parallax
                        • 53 Inferring the distance to and size of a cluster using just the parallaxes
                        • 54 Inferring the distance and velocity of a source using the parallax and proper motions
                        • 55 Luminosity calibration
                        • 56 Period-luminosity relation
                          • 6 Conclusions
                          • A Description of the simulated samples used in this paper
Page 13: Gaia Data Release 2: Using Gaia parallaxes · The Gaia Data Release 2 (Gaia DR2 ) Gaia Collaboration et al. (2018b) provides precise positions, proper motions, and paral-laxes for

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Table 1 Average bias and standard deviation in three regimes of fapp for four distance estimators discussed in this paper (from left to right) themode of the posterior based on the exponentially decreasing space density (EDSD) prior the mode of the posterior of the uniform distance (UD)distribution the maximum likelihood estimate corrected according to Smith amp Eichhorn (1996) abbreviated as SE and the maximum likelihood(ML) estimate The wider ranges of fapp exclude narrower ranges shown in previous rows of the table

Summary fapp Range EDSD UD SE ML

Bias(-11) -02 97 342 -095(-55) -03 107 -034 -12(-5050) -03 162 -04 -38

Std Deviation(-11) 04 80 6858 05(-55) 04 84 05 195(-5050) 04 106 04 171

fapp

log 1

0(|B

ias|

)

minus4

minus2

02

4

Mode EDSDMode UDSmithminusEichhorn estimateML estimate

fapp

log 1

0(σ)

minus1

12

34

5

minus5 minus3 minus1 1 3 5

Mode EDSDMode UDSmithminusEichhorn estimateML estimate

Fig 14 Bias (top) and standard deviation (bottom) averaged over binsof the estimated fractional parallax uncertainty fapp for four estimatorsof the distance the maximum likelihood (ML) estimator rest = 1

$(or-

ange) the ML estimator corrected as described in Smith amp Eichhorn(1996) (light blue) the mode of the posterior obtained with an expo-nentially decreasing space density (EDSD) prior and L = 135 kpc (darkgreen) and the mode of the posterior obtained with a uniform distance(UD) prior truncated at 100 kpc (red)

measurements $k one for each of a sample of N stars In thiscase the likelihood can be written as

p($k|rk L) = p($k | rk) middot p(rk | L)

= p($k | rk) middotNprod

k=1

p(rk | L) (19)

where rk is the true unknown distance to the kth star We note thatvery often the measured parallaxes are assumed independentand thus p($k | rk) is written as the product

prodNk=1 p($k | rk)

This is incorrect in general for Gaia parallaxes because the par-allax measurements are not independent As described in GaiaCollaboration et al (2018c) and Sect 2 of this paper there areregional correlations amongst them (see Sect 43) but for thesake of simplicity let us assume the sample of N measurementsis spread all over the celestial sphere such that the correlationsaverage out and can be neglected Hence we write

p($k|rk L) =

Nprodk=1

p($k | rk) middot p(rk | L) (20)

Under the assumption of Gaussian uncertainties the first termin the product is given by Eq (6) while the second is given byEq (18)

This likelihood can be represented by a simple directed graph(see Fig 15) that provides information about the conditional de-pendencies amongst the parameters The shaded nodes representthe observations the open circles represent model parametersand the small black circles at the origin of the arrows representmodel constants The arrows denote conditional dependence re-lations and the plate notation indicates repetition over the mea-surements k

The next key element is as in Sect 411 the prior Accord-ing to Fig 15 the only parameter that needs a prior specificationis the one without a parent node L The rest of the arcs in thegraph are defined in the likelihood term (Eq (20)) If the sampleof N stars were representative of the inner Galactic halo for ex-ample we could use a Gaussian prior centred at asymp 30 kpc (seeeg Iorio et al 2018 and references therein) Such a hierarchicalmodel can potentially shrink the individual parallax uncertaintiesby incorporating the constraint on the distribution of distances

If we are only interested in the individual distances rk wecan consider L as a nuisance parameter

p($Truek | $k) =

intp($Truek L | $k) middot dL (21)

=

intp($Truek | $k L) middot p(L | $k) middot dL

This integral (known as the marginalisation integral) allows us towrite the posterior we are interested in without having to fix thevalue of L to any particular value Depending on the objective ofthe inference we could have alternatively determined the poste-rior distribution of L by marginalising the individual distanceswith an N-dimensional integral over the rk parameters

Article number page 13 of 20

AampA proofs manuscript no 32964_Arxiv

L

rk

$ σ$

microL

σL

k = 1 2 N

Fig 15 Directed acyclic graph that represents a hierarchical Bayesianmodel of a set of N parallax measurements characterised by uncertain-ties σ$ and true distances drawn from an exponentially decreasing den-sity distribution of distances (see Eq (18)) The scale length of the ex-ponential decrease L is itself a model parameter that we can infer fromthe sample Its prior is defined in this case for the sake of simplicity asa Gaussian distribution of mean microL and standard deviation σL

In parameter spaces of dimensionality greater than 3ndash4 thecomputation of the possibly marginalised posteriors andor ev-idence requires efficient sampling schemes like those inspiredin Markov chain Monte Carlo (MCMC) methods to avoid largenumbers of calculations in regions of parameter space with neg-ligible contributions to the posterior This adds to the highercomputational burden of the Bayesian inference method men-tioned in the previous section

The previous simple example can be extended to includemore levels in the hierarchy and more importantly more pa-rameters and measurement types Section 55 and 56 developin greater detail two examples of hierarchical models of directapplicability in the Gaia context

42 Absolute magnitudes tangential velocities and otherderived quantities

The approaches described in the previous sections can be appliedto any quantity we want to estimate using the parallax For exam-ple if we want to infer the absolute magnitude MG then giventhe measured apparent magnitude G and line-of-sight extinctionAG the true parallax $True is related to MG via the conservationof flux

5 log$True = MG + AG minusG minus 5 (22)

Assuming for simplicity that G and AG are known Bayesrsquotheorem gives the posterior on MG as

P(MG | $G AG) =1Z

P($ | MG AGG)P(MG) (23)

where the likelihood is still the usual Gaussian distribution forthe parallax (Eq (6)) in which the true parallax is given byEq (22) As this expression is non-linear we again obtain anasymmetric posterior PDF over MG the exact shape of whichalso depends on the prior

The inference of other quantities can be approached in thesame way In general we need to consider a multi-dimensionallikelihood to accommodate the measurement uncertainties (andcorrelations) in all observed quantities For instance the set ofparameters θ = r v φ (distance tangential speed and direc-tion measured anticlockwise from north) can be inferred fromthe Gaia astrometric measurements o = $ microαlowast microδ (where microαlowastand microδ are the measured proper motions) using the likelihood

p(o | θ) = N(θΣ) =1

(2π)32|Σ|12exp

(minus

12

(o minus x)T Σminus1(o minus x))

(24)

where N denotes the Gaussian distribution Σ is the full (non-diagonal) covariance matrix provided as part of the Gaia DataRelease and

x =

(1r

v sin(φ)r

v cos(φ)

r

)(25)

is the vector of model parameters geometrically transformed intothe space of observables in noise-free conditions Equation (24)assumes correlated Gaussian uncertainties in the measurements

The posterior distribution can then be obtained by multiply-ing the likelihood by a suitable prior The simplest assumptionwould be a separable prior such that p(θ) = p(r) middot p(v) middot p(φ)where p(v) and p(φ) should reflect our knowledge about thedynamical properties of the population from where the sourceor sources were drawn (eg thin disk thick disk bulge halo)Again hierarchical models can be used in the analysis of sam-ples in order to infer the population properties (prior hyper-parameters) themselves

Similar procedures can be followed to infer kinematic ener-gies or full 3D velocities when the forward model is extendedwith radial velocity measurements

43 Further recommendations

In this subsection we provide some further recommendationsand guidance in the context of the Bayesian approach outlinedabove Although powerful inference with Bayesian methodsusually comes at a large computational cost Some of the rec-ommendations below can also be seen in the light of taking dataanalysis approaches that approximate the Bayesian methodologyand can be much faster

Where possible formulate the problem in the data spaceThe problems caused by the ill-defined uncertainties on quan-tities derived from parallaxes can be avoided by carrying outthe analysis in the data space where the behaviour of the un-certainties is well understood This means that the quantities tobe inferred are treated as parameters in a forward or generativemodel that is used to predict the data Some adjustment processthen leads to estimates of the parameters A very simple forwardmodelling example can be found in Schroumlder et al (2004) whostudied the luminosity calibrations of O-stars by predicting the

Article number page 14 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

expected Hipparcos parallaxes from the assumed luminosity cal-ibration and comparing those to the measured parallaxes A morecomplex example can be found in Lindegren et al (2000) whopresent a kinematic model for clusters which describes the ve-locity field of the cluster members and predicts the proper mo-tions accounting for the astrometric uncertainties and covari-ances As shown in previous sections the Bayesian approachnaturally lends itself to (and in fact requires) forward modellingof the data

Forward modelling has the added advantage that it forces usto consider the proper formulation of the questions asked fromthe astrometric data This naturally leads to the insight that oftenthe explicit knowledge of the distances to sources is not of inter-est For example in the Schroumlder et al (2004) case an assumedluminosity of the O-stars and their known apparent magnitudeis sufficient to predict the observed parallaxes In more complexanalyses the distances to sources can often be treated as nuisanceparameters which in a Bayesian setting can be marginalised outof the posterior

Use all relevant information Although the parallax has a di-rect relation to the distance of a star it is not the only measure-ment that contains distance information The apparent magni-tude and the colour of a star carry significant information on theplausible distances at which it can be located as the colour pro-vides information on plausible absolute magnitude values for thestar This is used in two papers (Leistedt amp Hogg 2017a An-derson et al 2017) in which the information contained in thephotometry of stars is combined with Gaia DR1 parallax in-formation to arrive at more precise representations of the colour-magnitude diagram than can be obtained through the parallaxesalone Similarly proper motions contain distance informationwhich can be used to good effect to derive more precise distancesor luminosities for open cluster members (eg de Bruijne et al2001 Gaia Collaboration et al 2017b) The Bayesian approachnaturally allows for the combination of different types of data inthe modelling of the problem at hand It should be emphasisedhowever that adding additional data necessitates increasing themodel complexity (eg to include the relation among apparentmagnitude absolute magnitude extinction and parallax) whichleads to the need to make more assumptions

Incorporate a proper prior Bailer-Jones (2015) discussed thesimplest case of inferring the distance to a source from a sin-gle observed parallax He showed that if the minimal prior thatr should be positive is used the resulting posterior is not nor-malisable and hence has no mean variance other momentsor percentiles This behaviour of the posterior is not limitedto inference of distance Examples of other quantities that arenon-linearly related to parallax are absolute magnitude (M prop

log10 $) tangential velocity (vT prop 1$) kinetic energy and an-gular momentum (both proportional to 1$2 when determinedrelative to the observer) In all these cases it can be shown thatthe posterior for an improper uniform prior is not normalisableand has no moments The conclusion is that a proper prior onthe parameters to be estimated should be included to ensure thatthe posterior represents a normalised probability distribution Ingeneral using unconstrained non-informative priors in Bayesiandata analysis (such as the one on r above) is bad practice In-evitably there is always a mismatch between the prior and thetrue distribution (if there were not there would be no need todo the inference) This will unavoidably lead to some biases al-though the better the data the smaller these will be We cannot

expect to do much better than our prior knowledge in case weonly have access to poor data The Bayesian approach guaran-tees a graceful transition of the posterior into the prior as thedata quality degrades

The above discussion raises the question of what priors to in-clude for a specific inference problem Simple recipes cannot begiven as the prior to be used depends on the problem at hand andthe information already available When deciding on a prior weshould keep in mind that some information is always availableEven if a parallax is only available for a single star we knowthat it cannot be located at arbitrary distances It is unlikely tobe much closer that 1 pc (although we cannot fully exclude thepresence of faint stars closer than Proxima Centauri) and it mustbe located at a finite distance (we can observe the star) Thiswould suggest a non-informative uniform prior on r with lib-eral lower and upper bounds (such that the prior is normalised)However as pointed out in Bailer-Jones (2015) a uniform distri-bution in r implies a space density of stars that falls of as 1r2

from the position of the Sun which is of course physically im-plausible Hence one should assume a reasonable distribution ofstars in space and construct the prior on r accordingly Bailer-Jones (2015) presents a prior derived from a uniform space den-sity with an exponential cut-off which was used in Astraatmadjaamp Bailer-Jones (2016b) to derive distances to stars for which par-allaxes are listed in Gaia DR1 This prior should not be usedindiscriminately at the very least we should carefully considerthe choice of the scale length L (or leave that as a parameter tobe estimated as described in Sect 412) and in most cases amore tailored prior based on our broad knowledge of the distri-bution of a given stellar population in the Milky Way would bebetter The tutorial cases introduced in the next section containsome more examples of priors on distance and other astrophysi-cal parameters

The next two items discuss simplifications to the Bayesianapproach that nevertheless need to be justified carefully

Maximum likelihood modelling We have seen that priors arethe bridges that allow us to go from the probability of the obser-vations given the unknown parameters to the desired probabil-ity of the parameters given the observations It is only based onthis probability distribution that we can make statements aboutcredible intervals (uncertainties) for the inferred quantities orselect amongst competing models (a topic that is beyond thescope of this paper) However if making prior-free inferences ispreferred then maximising the likelihood is the only alternativeThe kinematic modelling presented in Lindegren et al (2000)is a non-trivial example of this A more complex example canbe found in Palmer et al (2014) The ML approach just as theBayesian framework described above allows the combinationof different types of data and accounts for selection functionsor missing data We have seen in Sect 411 that the maximumlikelihood estimate of the distance given a single parallax mea-surement is ρ = 1

$and this is a poor estimator of the distance

except for subsets of very accurate measurements In generalthe Bayesian and the maximum likelihood estimates coincide inthe limit of very small uncertainties or infinite numbers of mea-surements In such limits the maximum likelihood estimate issimpler to obtain although its computational cost may still belarge as the ML method is often equivalent to a complex optimi-sation problem involving a multi-dimensional function of manyparameters

Article number page 15 of 20

AampA proofs manuscript no 32964_Arxiv

Selecting the lsquobestrsquo data Analyses that use parallax data areoften restricted to positive parallaxes with relative uncertaintiesbelow some limit (typically 20 ) This allows working in aregime where the uncertainties of derived quantities such as dis-tance tangential velocities luminosity etc are thought to bemanageable which allows working in the space of astrophysi-cal variables rather than the data Truncation on relative parallaxerror might be justified in an exploratory phase of the data anal-ysis however there are a number of reasons why this approachis almost never advisable Even at relative uncertainties below02 the quantities calculated from the parallax can be biased andsuffer from a large variance (see Bailer-Jones 2015) More im-portantly however the selection of lsquogoodrsquo parallaxes will biasthe sample studied to nearby andor bright representatives of anystellar population and the selection may lead to discarding avery large fraction of the potential sample Hence any inferencesbased on such data will be severely biased This is clearly illus-trated in Fig 7 where for an even less strict truncation of starswith a relative uncertainties below 50 the distribution of dis-tances of the resulting sample is clearly biased with respect theoriginal sample

Accounting for data selection and incompleteness Althoughthe Gaia survey is designed such that the only selection ofsources is that they are brighter than the survey limit at G = 207the combination of the onboard detection algorithm the Gaiascanning law and the filtering of results of insufficient qualityprior to a data release lead to a complex selection function espe-cially in the early data releases This selection function should betaken into account in any data analysis and this is most naturallydone as part of a Bayesian analysis where the selection func-tion is part of the forward model that predicts the data given themodel parameters Precise prescriptions of the selection func-tions are not foreseen to be part of the data release documenta-tion Hence selection function parameters need to be includedas part of the parameters inferred by the Bayesian analysis or ifthis is not possible the selection functions have to be borne inmind when interpreting the results

Covariances in the uncertainties All the uncertainties on theastrometric data quoted in the Gaia catalogue are presented asfull covariance matrices where the diagonal elements representthe standard uncertainties on the astrometric parameters and theoff-diagonal elements the covariances or correlations betweenthe uncertainties This amounts to treating the astrometric datavector as having been drawn from a multivariate normal distri-bution with a covariance matrix as given in the Gaia catalogueThe covariances are most easily handled in the data space as partof the likelihood (see Lindegren et al 2000 for an example inthe context of kinematic modelling) If the covariances in theastrometric uncertainties are not accounted for we can easily bemisled for example by spurious features in the inferred velocityfield of an open cluster (Brown et al 1997)

The uncertainties are also correlated from one source tothe next especially over small distances on the sky A studyof the star-to-star correlations in the parallax uncertainties inGaia DR1 was done for the Kepler field (Zinn et al 2017)where independent and precise asteroseismic distances to thestars are available enabling the authors to derive an expressionfor the correlation strength and spatial scale This expressioncan be used for studies of the Kepler field but care should betaken when extrapolating to other fields on the sky The func-tional form for the star-to-star correlations used by Zinn et al

(2017) could be introduced as part of the forward model withthe Zinn et al (2017) parameters as a good first guess

For Gaia DR1 the length scale for the star-to-star corre-lations was estimated to vary from subdegree scales to tens ofdegrees on the sky where Zinn et al (2017) derived the corre-lation function over length scales of sim 02 to sim 10 degrees ForGaia DR2 Gaia Collaboration et al (2018c) estimate that thespatial correlations extend over scales of below 1 degree to 10ndash20 degrees

Accounting for non-Gaussian andor systematic uncertain-ties Although the bulk of the sources in the Gaia catalogue havenormally distributed uncertainties there is a significant fractionfor which the uncertainties exhibit non-Gaussian behaviour (egwhen uncertainties are over- or underestimated) This can be ac-counted for in the data analysis by including the uncertainties aspart of the forward model (eg Sesar et al 2017) or by explicitlymodelling an outlier population Similarly systematic uncertain-ties can be included as part of the forward model For exampleSesar et al (2017) include a global parallax zero-point as part oftheir probabilistic model used to analyse the period-luminosityrelation of RR Lyrae stars with Gaia DR1 data An alternativeapproach to the investigation of systematics in the parallaxes (ordistance estimates obtained from photometry or spectroscopyfor example) is presented in Schoumlnrich et al (2012) and is ap-plied to Gaia DR1 in Schoumlnrich amp Aumer (2017) In this casewe can consider that for samples covering a significant fractionof the sky any systematic error in the estimated distances to thestars will show up as correlations in their 3D velocity compo-nents The presence of such correlations can be used to make in-ferences about systematic errors for example in the parallaxes

Systematic uncertainties are more difficult to handle as theymay show variations as a function of source brightness or ce-lestial position they may be correlated across neighbouringsources and they may not be well understood for a given earlyGaia data release In general the information needed to accu-rately model systematic uncertainties or uncertainty correlationsbetween sources may not be readily available This informationcan be obtained from a comparison between Gaia and otherhigh-precision data (eg Zinn et al 2017 Arenou et al 2017Gaia Collaboration et al 2018a) or by examining for exampleplots of the parallax or proper motion averaged over sky regionsfor samples where the true parallax or proper motion values canbe assumed to be known such as zero parallax and proper mo-tion for quasars (see Gaia Collaboration et al 2018c for exam-ples)

Two special cases should be mentioned when the sampleis well distributed over the sky we can safely assume that thelocal systematics vanish and that only the global parallax zero-point need to be subtracted locally we may be interested not bythe absolute value of the parallaxes but by the relative ones inwhich case the difference between parallaxes and their averageremoves part of the systematics

There is no general recipe for dealing with non-Gaussian un-certainties or correlated systematic uncertainties The main ad-vice we can give here is to proceed with the analysis of the astro-metric data as they are but to keep in mind the systematics andcorrelations discussed in Sect 2 when interpreting the resultsAlternatively the forward model can be extended to include sys-tematic and correlation terms for which parameters are also tobe estimated Such models can be guided by the studies of sys-tematic uncertainties mentioned above

Article number page 16 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Testing with simulations Finally we strongly advise that theinference problem at hand should be investigated through simu-lated data and that the simulations performed should be as closeas possible to the real data (in particular correctly modelling theuncertainties and selection effects) The simulations allow theanalysis method to be developed and tested for accuracy How-ever the performance of the analysis method should be inter-preted strictly in terms of how well the assumed model explainsthe simulated observed data That is we should not fall into thetrap of trying to tune the analysis method to get an answer that isas close to the lsquotruthrsquo as possible In real problems we can onlyjudge the adequacy of a model and its parameter values by howwell they predict the observed data (to within the observationaluncertainties it should be stressed as we should avoid lsquoover-fittingrsquo the data)

5 Using astrometric data practical examples

We introduce here a few worked examples to illustrate thepoints that were made in the previous section These ex-amples are available in full detail as online tutorials in theform of source code accompanied by much more exten-sive explanation than can be provided here The source codeand corresponding Python and R Notebooks can be foundat the following URL httpsgithubcomagabrownastrometry-inference-tutorials In all cases the readeris strongly encouraged to download the source code and exper-iment with modifications of the simulated input data andor theparameter choices in the inference methods

51 Comparison of distance and distance modulusestimators

The use of the Bayesian inference with non-informative priorsdescribed in Sect 411 is illustrated and implemented inthe following tutorial httpsgithubcomagabrownastrometry-inference-tutorialstreemastersingle-sourceGraphicalUserInterface The tutorialcompares the performance of Bayesian distance estimationmethods with the SmithndashEichhorn transformation (Smith ampEichhorn 1996) (Sect 341) and the naive parallax inversion

The tutorial contains a Graphical User Interface that easilyvisualises and compares the behaviour of all these estimators fora given parallax and uncertainty For the Bayesian inference es-timations using the mode and the median are provided togetherwith a 90 confidence interval The tutorial also provides a li-brary pyrallaxes with the implementation of all these estima-tors The library can easily be customised to implement otherpriors for the Bayesian inference

Additionally an implementation of the Bayesian distanceestimator using the Exponentially Decreasing Space Densityprior introduced in Bailer-Jones (2015) will be available inTopCat (httpwwwstarlinkacuktopcat) and Stilts(httpwwwstarlinkacukstilts) from respectivelyversions 46 and 31-3 onwards

52 Inferring the distance to a single source using just theparallax

The issues surrounding the use of a parallax to infer a distancewere explored in Bailer-Jones (2015) and applied to simulatedGaia data in Astraatmadja amp Bailer-Jones (2016a) and to TGAS(Gaia DR1 ) in Astraatmadja amp Bailer-Jones (2016b) A

tutorial exploring this is provided at httpsgithubcomagabrownastrometry-inference-tutorialstreemastersingle-sourcetutorial This can be used toinvestigate how the posterior depends on the prior and the dataIt also includes a simple example of a hierarchical model toavoid specifying the exact length scale of a distance prior

53 Inferring the distance to and size of a cluster using justthe parallaxes

In many applications we are more interested in the averagedistance to a cluster (or a group of stars) rather than to theindividual stars In this case a basic mistake to be avoidedis estimating the individual distances (whatever the method)and then averaging these individual values A more correctapproach is to average the individual parallaxes and then toobtain a distance from this average value However a muchbetter solution is to set up a model for the cluster whichwould use as parameters the distance to its centre for exampleand some measure of its size and to infer its parametersThis is explored in the tutorial at httpsgithubcomagabrownastrometry-inference-tutorialstreemastermultiple-source This introduces the overall prob-lem and derives a general solution Code is implemented forthe specific case of a model which assumes a random isotropicdistribution of the true stars from the centre of the cluster Thismodel has two parameters the cluster distance and the clustersize The solution uses a small angle approximation to makethe problem simpler although it is easily extended to the caseof clusters with a significant angular extent It is applied to thePleiades selection from the Gaia DR1 main release paper(Gaia Collaboration et al 2016) The tutorial also considers theproblem of how to accommodate correlations in the measuredparallaxes of different stars Finally it also shows the resultsfrom a classical and a naive combination of stellar parallaxesto estimate the cluster distance The combination of parallaxesand proper motions of individual stars in a cluster into a singlesolution for the mean parallax and proper motion is treated asan iterative least squares problem in Gaia Collaboration et al(2017b) (see their Appendix A for details)

54 Inferring the distance and velocity of a source using theparallax and proper motions

The velocity (speed and direction) of a source in the planeof the sky can be inferred from measurements of its parallaxand two proper motions The uncertainties in all three affectthe inferred velocity and its uncertainty Moreover as theGaia parallaxes and proper motions generally have non-zerocorrelations these must also be taken into account Thiscan be done in a straightforward manner in the Bayesianapproach as is shown in the tutorial at httpsgithubcomagabrownastrometry-inference-tutorialstreemaster3d-distance This sets up a three-parametermodel (distance speed angle) for a source Using the threemeasurements (parallax two proper motions) in a multivariateGaussian likelihood and suitable priors on the parameters wecan compute the trivariate posterior This is sampled in theposterior using an MCMC algorithm for a set of stars

Article number page 17 of 20

AampA proofs manuscript no 32964_Arxiv

55 Luminosity calibration

In this tutorial (httpsgithubcomagabrownastrometry-inference-tutorialstreemasterluminosity-calibration) the problem of inferring (orcalibrating) the mean absolute magnitude of a specific classof stars is treated The measurements at hand are the parallaxand apparent magnitude for each of the stars in the sampleand the task is to infer their mean absolute magnitude microM andthe spread σM around this mean This is very similar to theproblem that Lutz amp Kelker (1973) and Turon Lacarrieu ampCreacutezeacute (1977) treated and a Bayesian approach to solving thisproblem was presented by Brown (2012) (albeit with the use ofimproper priors which we again note is bad practice) A morecomplex version of this problem (accounting for extinction anda contaminating population of stars) and its Bayesian solutionwas presented in Hawkins et al (2017) In this tutorial threeimportant points are illustrated

ndash Often the explicit computation of the distances to stars is notof interest In this example only the mean absolute magni-tude of the stars is to be estimated and the forward mod-elling approach as part of the Bayesian inference avoids theneed to calculate or estimate distances

ndash The data for all the stars carry information on the mean ab-solute magnitude including the negative parallaxes or par-allaxes with large relative errors This information can nat-urally be incorporated in a forward modelling approach (inthis example as part of a Bayesian inference method) thusavoiding the introduction of truncation biases caused by theselection of stars with lsquogoodrsquo parallaxes

ndash If the selection function is known (in this example the surveyis magnitude limited) it can and should be included in theforward modelling This accounts for sample selection biasesthat would otherwise occur

56 Period-luminosity relation

In this tutorial (httpsgithubcomagabrownastrometry-inference-tutorialstreemasterperiod-luminosity-relation) we include a hierarchicalmodel to infer period-luminosity-metallicity relations forclassical pulsating stars The full model can be applied tofundamental mode RR Lyrae stars and the abridged version(without the metallicity dependence) is suitable for samples ofclassical Cepheids We include the data set for the RR Lyraestars described and used for inference in Gaia Collaborationet al (2017a) and Delgado et al (2018) It contains a sampleof 200 stars (including fundamental radial pulsators but alsofundamentalised first overtone pulsators) with measured pe-riods apparent magnitudes in the K-band metallicities andparallaxes from the TGAS catalogue In the tutorial we describethe hierarchical model and discuss potential biases in the dataset Finally we analyse the sensitivity of the results to differentchoices of the priors and related parameters

6 Conclusions

Gaia data releases will provide a huge increase of astrometricdata available for the scientific community More than a billionparallaxes and proper motions allow new openings into many as-tronomical topics In most cases astronomers are exploiting theGaia catalogues to obtain physical quantities such as distanceand velocity Although it is easy to extract from the Gaia data it

is well known that direct inversion of parallax will lead to biaseswhich become more and more significant the larger the relativeparallax uncertainty While Gaia will provide high-quality astro-metric measurements hundreds of millions of stars have preci-sions which require proper statistical treatment in order to avoidbiased conclusions The aim of this paper is to guide the users ofGaia data to handle astrometric data correctly

In this study we summarise methods used to avoid bi-ases when converting astrometric data into physical quantitiesStarting from simple non-recommended sample truncation tomore complex methods the biases associated with the meth-ods are presented The basic recommendation is to treat deriva-tion of physical quantities from astrometric measurements asan inference problem which should be preferably handled withBayesian approach The recommended methods are described inSect 4 with a summary in Sect 43 To aid the users furtherSect 5 contains practical examples with links to Python and Rcode

Gaia will provide fundamental data for many fields of as-tronomy Further data releases will provide more data and moreprecise data Nevertheless for full use of the potential it will al-ways be necessary to pay careful attention to the statistical treat-ment of parallaxes and proper motions The purpose of this paperis to help astronomers find the correct approachAcknowledgements This work has made use of results from the European SpaceAgency (ESA) space mission Gaia the data from which were processed by theGaia Data Processing and Analysis Consortium (DPAC) Funding for the DPAChas been provided by national institutions in particular the institutions partici-pating in the Gaia Multilateral Agreement The Gaia mission website is httpwwwcosmosesaintgaia The authors are current or past members of theESA Gaia mission team and of the Gaia DPAC This work was supported by theMINECO (Spanish Ministry of Economy) through grant ESP2016-80079-C2-1-R (MINECOFEDER UE) and ESP2014-55996-C2-1-R (MINECOFEDERUE) and MDM-2014-0369 of ICCUB (Unidad de Excelencia lsquoMariacutea de Maeztursquo)and by the DLR (German space agency) via grant 50 QG 1403

ReferencesAnderson L Hogg DW Leistedt B Price-Whelan AM Bovy J Jun 2017

ArXiv e-printsArenou F Luri X 1999 In Egret D Heck A (eds) Harmonizing Cosmic

Distance Scales in a Post-HIPPARCOS Era vol 167 of Astronomical Societyof the Pacific Conference Series 13ndash32

Arenou F Luri X Babusiaux C et al Feb 2017 AampA 599 A50Astraatmadja TL Bailer-Jones CAL Dec 2016a ApJ 832 137Astraatmadja TL Bailer-Jones CAL Dec 2016b ApJ 833 119Bailer-Jones CAL Oct 2015 PASP 127 994Bovy J Sep 2017 MNRAS 470 1360Brown AGA Nov 2012 Statistical AstrometryBrown AGA Arenou F van Leeuwen F Lindegren L Luri X Aug 1997

In Bonnet RM Hoslashg E Bernacca PL et al (eds) Hipparcos - Venice rsquo97vol 402 of ESA Special Publication 63ndash68

de Bruijne JHJ Hoogerwerf R de Zeeuw PT Feb 2001 AampA 367 111Delgado HE Sarro LM Clementini G Muraveva T Garofalo A Mar 2018

ArXiv e-printsEddington AS Mar 1913 MNRAS 73 359ESA (ed) 1997 The HIPPARCOS and TYCHO catalogues Astrometric and

photometric star catalogues derived from the ESA HIPPARCOS Space As-trometry Mission vol 1200 of ESA Special Publication (available online athttpswwwcosmosesainthipparcos)

Gaia Collaboration Brown AGA Vallenari A et al Nov 2016 AampA 595A2

Gaia Collaboration Clementini G Eyer L et al Sep 2017a AampA 605 A79Gaia Collaboration van Leeuwen F Vallenari A et al May 2017b AampA 601

A19Gaia Collaboration Arenou F Luri X et al Apr 2018a AampA (special issue

for Gaia DR2)Gaia Collaboration Brown AGA Vallenari A Prusti T et al Apr 2018b

AampA (special issue for Gaia DR2)Gaia Collaboration Lindegren L Hernaacutendez J et al Apr 2018c AampA (special

issue for Gaia DR2)

Article number page 18 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Hawkins K Leistedt B Bovy J Hogg DW Oct 2017 MNRAS 471 722Iorio G Belokurov V Erkal D et al Feb 2018 MNRAS 474 2142Leistedt B Hogg DW Dec 2017a AJ 154 222Leistedt B Hogg DW Mar 2017b ArXiv e-printsLindegren L Madsen S Dravins D Apr 2000 AampA 356 1119Lindegren L Lammers U Hobbs D et al Feb 2012 AampA 538 A78Lindegren L Lammers U Bastian U et al Nov 2016 AampA 595 A4Lutz TE Kelker DH Oct 1973 PASP 85 573Malmquist GK Mar 1920 Meddelanden fran Lunds Astronomiska Observato-

rium Serie II 22 3Palmer M Arenou F Luri X Masana E Apr 2014 AampA 564 A49Robin C Luri X Reyleacute C et al 2012 AampA 543 A100Schoumlnrich R Aumer M Dec 2017 MNRAS 472 3979Schoumlnrich R Binney J Asplund M Feb 2012 MNRAS 420 1281Schroumlder SE Kaper L Lamers HJGLM Brown AGA Dec 2004 AampA

428 149Sesar B Fouesneau M Price-Whelan AM et al Apr 2017 ApJ 838 107Smith H Jr Eichhorn H Jul 1996 MNRAS 281 211Trumpler RJ Weaver HF 1953 Statistical astronomyTuron Lacarrieu C Creacutezeacute M Apr 1977 AampA 56 273Zinn JC Huber D Pinsonneault MH Stello D Aug 2017 ApJ 844 166

Article number page 19 of 20

AampA proofs manuscript no 32964_Arxiv

Appendix A Description of the simulated samplesused in this paper

The data used in this paper is from a Gaia Universe Model Snap-shot (GUMS) simulation (Robin C et al 2012) together withthe Gaia-like uncertainties and an estimation of the observabledata The uncertainties were computed from an implementationof the recipes described in the Gaia Science performance webpage4 provided by the python PyGaia toolkit5 which then haveto be re-scaled to fit the Gaia DR2 expectations

The simulation contains around 109 sources including onlysingle stars ie stars not belonging to multiple systems up toG lt 20 magnitude distributed as shown in Fig A1

0 5 10 15 20 25G magnitude [mag]

100

101

102

103

104

105

106

107

108

109

Num

ber

of

sourc

es

[log]

G magnitude histogram

Fig A1 Histogram of the starsrsquo G magnitude The total number ofsources is 1 069 138 714 which are distributed in bins of size ∆G = 02the sample is limited to G lt 20

To compute the Gaia potential observables standard uncer-tainties must be added to the simulation Because astrometricquantities (positions parallaxes and proper motions) are relateda single formalism to derive its standard errors is needed PyGaiaimplements a simple performance model depending on the VndashICcolour term and the G magnitude to estimate the end-of-missionerrors for the parallax uncertainty

σ$[microas] =radicminus1631 + 680766 z + 32732 z2 [0986+(1 minus 0986) (V-IC)]

(A1)

with

z = max(1004(1209minus15)1004(Gminus15)) (A2)

(see the Gaia science performance web for more details) It alsotakes into account the variation of the uncertainties over the skybecause of the scanning law tabulated as a function of the eclip-tic latitude6 β

However these are end-of-mission uncertainties so theyhave to be scaled by the fraction of mission time completed in or-der to get an estimation of them for Gaia DR2 In the case of the

parallax only the factorradic

5L being L the mission time (years)

included in Gaia DR2 need to be applied this error model is

4 httpswwwcosmosesaintwebgaiascience-performance5 httpspypipythonorgpypiPyGaia6 httpswwwcosmosesaintwebgaiatable-2-with-ascii

further described in Arenou et al (2017) In our case we havealso updated the calibration floor to take into account the prop-erties of the Gaia DR2 formal errors as shown in Fig A2 Thiscalibration floor introduces a minimum formal error stemmingfrom the fact that the calibrations used in the data processing(models and parameters) are at this stage still being refined Thisfloor affects mainly bright stars while for faint stars the photonnoise dominates Fig A2 shows the model used in the simula-tion for the Gaia DR2 parallax uncertainties as a function of theG magnitude

Fig A2 Average estimated errors as function of G The dash-dottedline represents the uncertainty at Gaia DR2 while the dashed linerepresents the end-of-mission uncertainties The solid line representsGaia DR2 errors including systematics

Article number page 20 of 20

  • 1 Introduction
  • 2 Gaia astrometric data
    • 21 Uncertainties
    • 22 Correlations
    • 23 Systematic errors
    • 24 Completeness
      • 3 Critical review of the traditional use of parallaxes
        • 31 Measurement of parallaxes
        • 32 Estimating distance by inverting the parallax
        • 33 Sample truncation
        • 34 Corrections and transformations
          • 341 The SmithndashEichhorn correction
          • 342 The LutzndashKelker correction
            • 35 Astrometry-based luminosity
              • 4 Recommendations for using astrometric data
                • 41 Using Gaia astrometric data how to proceed
                  • 411 Bayesian inference of distances from parallaxes
                  • 412 Bayesian inference of distances from parallaxes and complementary information
                    • 42 Absolute magnitudes tangential velocities and other derived quantities
                    • 43 Further recommendations
                      • 5 Using astrometric data practical examples
                        • 51 Comparison of distance and distance modulus estimators
                        • 52 Inferring the distance to a single source using just the parallax
                        • 53 Inferring the distance to and size of a cluster using just the parallaxes
                        • 54 Inferring the distance and velocity of a source using the parallax and proper motions
                        • 55 Luminosity calibration
                        • 56 Period-luminosity relation
                          • 6 Conclusions
                          • A Description of the simulated samples used in this paper
Page 14: Gaia Data Release 2: Using Gaia parallaxes · The Gaia Data Release 2 (Gaia DR2 ) Gaia Collaboration et al. (2018b) provides precise positions, proper motions, and paral-laxes for

AampA proofs manuscript no 32964_Arxiv

L

rk

$ σ$

microL

σL

k = 1 2 N

Fig 15 Directed acyclic graph that represents a hierarchical Bayesianmodel of a set of N parallax measurements characterised by uncertain-ties σ$ and true distances drawn from an exponentially decreasing den-sity distribution of distances (see Eq (18)) The scale length of the ex-ponential decrease L is itself a model parameter that we can infer fromthe sample Its prior is defined in this case for the sake of simplicity asa Gaussian distribution of mean microL and standard deviation σL

In parameter spaces of dimensionality greater than 3ndash4 thecomputation of the possibly marginalised posteriors andor ev-idence requires efficient sampling schemes like those inspiredin Markov chain Monte Carlo (MCMC) methods to avoid largenumbers of calculations in regions of parameter space with neg-ligible contributions to the posterior This adds to the highercomputational burden of the Bayesian inference method men-tioned in the previous section

The previous simple example can be extended to includemore levels in the hierarchy and more importantly more pa-rameters and measurement types Section 55 and 56 developin greater detail two examples of hierarchical models of directapplicability in the Gaia context

42 Absolute magnitudes tangential velocities and otherderived quantities

The approaches described in the previous sections can be appliedto any quantity we want to estimate using the parallax For exam-ple if we want to infer the absolute magnitude MG then giventhe measured apparent magnitude G and line-of-sight extinctionAG the true parallax $True is related to MG via the conservationof flux

5 log$True = MG + AG minusG minus 5 (22)

Assuming for simplicity that G and AG are known Bayesrsquotheorem gives the posterior on MG as

P(MG | $G AG) =1Z

P($ | MG AGG)P(MG) (23)

where the likelihood is still the usual Gaussian distribution forthe parallax (Eq (6)) in which the true parallax is given byEq (22) As this expression is non-linear we again obtain anasymmetric posterior PDF over MG the exact shape of whichalso depends on the prior

The inference of other quantities can be approached in thesame way In general we need to consider a multi-dimensionallikelihood to accommodate the measurement uncertainties (andcorrelations) in all observed quantities For instance the set ofparameters θ = r v φ (distance tangential speed and direc-tion measured anticlockwise from north) can be inferred fromthe Gaia astrometric measurements o = $ microαlowast microδ (where microαlowastand microδ are the measured proper motions) using the likelihood

p(o | θ) = N(θΣ) =1

(2π)32|Σ|12exp

(minus

12

(o minus x)T Σminus1(o minus x))

(24)

where N denotes the Gaussian distribution Σ is the full (non-diagonal) covariance matrix provided as part of the Gaia DataRelease and

x =

(1r

v sin(φ)r

v cos(φ)

r

)(25)

is the vector of model parameters geometrically transformed intothe space of observables in noise-free conditions Equation (24)assumes correlated Gaussian uncertainties in the measurements

The posterior distribution can then be obtained by multiply-ing the likelihood by a suitable prior The simplest assumptionwould be a separable prior such that p(θ) = p(r) middot p(v) middot p(φ)where p(v) and p(φ) should reflect our knowledge about thedynamical properties of the population from where the sourceor sources were drawn (eg thin disk thick disk bulge halo)Again hierarchical models can be used in the analysis of sam-ples in order to infer the population properties (prior hyper-parameters) themselves

Similar procedures can be followed to infer kinematic ener-gies or full 3D velocities when the forward model is extendedwith radial velocity measurements

43 Further recommendations

In this subsection we provide some further recommendationsand guidance in the context of the Bayesian approach outlinedabove Although powerful inference with Bayesian methodsusually comes at a large computational cost Some of the rec-ommendations below can also be seen in the light of taking dataanalysis approaches that approximate the Bayesian methodologyand can be much faster

Where possible formulate the problem in the data spaceThe problems caused by the ill-defined uncertainties on quan-tities derived from parallaxes can be avoided by carrying outthe analysis in the data space where the behaviour of the un-certainties is well understood This means that the quantities tobe inferred are treated as parameters in a forward or generativemodel that is used to predict the data Some adjustment processthen leads to estimates of the parameters A very simple forwardmodelling example can be found in Schroumlder et al (2004) whostudied the luminosity calibrations of O-stars by predicting the

Article number page 14 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

expected Hipparcos parallaxes from the assumed luminosity cal-ibration and comparing those to the measured parallaxes A morecomplex example can be found in Lindegren et al (2000) whopresent a kinematic model for clusters which describes the ve-locity field of the cluster members and predicts the proper mo-tions accounting for the astrometric uncertainties and covari-ances As shown in previous sections the Bayesian approachnaturally lends itself to (and in fact requires) forward modellingof the data

Forward modelling has the added advantage that it forces usto consider the proper formulation of the questions asked fromthe astrometric data This naturally leads to the insight that oftenthe explicit knowledge of the distances to sources is not of inter-est For example in the Schroumlder et al (2004) case an assumedluminosity of the O-stars and their known apparent magnitudeis sufficient to predict the observed parallaxes In more complexanalyses the distances to sources can often be treated as nuisanceparameters which in a Bayesian setting can be marginalised outof the posterior

Use all relevant information Although the parallax has a di-rect relation to the distance of a star it is not the only measure-ment that contains distance information The apparent magni-tude and the colour of a star carry significant information on theplausible distances at which it can be located as the colour pro-vides information on plausible absolute magnitude values for thestar This is used in two papers (Leistedt amp Hogg 2017a An-derson et al 2017) in which the information contained in thephotometry of stars is combined with Gaia DR1 parallax in-formation to arrive at more precise representations of the colour-magnitude diagram than can be obtained through the parallaxesalone Similarly proper motions contain distance informationwhich can be used to good effect to derive more precise distancesor luminosities for open cluster members (eg de Bruijne et al2001 Gaia Collaboration et al 2017b) The Bayesian approachnaturally allows for the combination of different types of data inthe modelling of the problem at hand It should be emphasisedhowever that adding additional data necessitates increasing themodel complexity (eg to include the relation among apparentmagnitude absolute magnitude extinction and parallax) whichleads to the need to make more assumptions

Incorporate a proper prior Bailer-Jones (2015) discussed thesimplest case of inferring the distance to a source from a sin-gle observed parallax He showed that if the minimal prior thatr should be positive is used the resulting posterior is not nor-malisable and hence has no mean variance other momentsor percentiles This behaviour of the posterior is not limitedto inference of distance Examples of other quantities that arenon-linearly related to parallax are absolute magnitude (M prop

log10 $) tangential velocity (vT prop 1$) kinetic energy and an-gular momentum (both proportional to 1$2 when determinedrelative to the observer) In all these cases it can be shown thatthe posterior for an improper uniform prior is not normalisableand has no moments The conclusion is that a proper prior onthe parameters to be estimated should be included to ensure thatthe posterior represents a normalised probability distribution Ingeneral using unconstrained non-informative priors in Bayesiandata analysis (such as the one on r above) is bad practice In-evitably there is always a mismatch between the prior and thetrue distribution (if there were not there would be no need todo the inference) This will unavoidably lead to some biases al-though the better the data the smaller these will be We cannot

expect to do much better than our prior knowledge in case weonly have access to poor data The Bayesian approach guaran-tees a graceful transition of the posterior into the prior as thedata quality degrades

The above discussion raises the question of what priors to in-clude for a specific inference problem Simple recipes cannot begiven as the prior to be used depends on the problem at hand andthe information already available When deciding on a prior weshould keep in mind that some information is always availableEven if a parallax is only available for a single star we knowthat it cannot be located at arbitrary distances It is unlikely tobe much closer that 1 pc (although we cannot fully exclude thepresence of faint stars closer than Proxima Centauri) and it mustbe located at a finite distance (we can observe the star) Thiswould suggest a non-informative uniform prior on r with lib-eral lower and upper bounds (such that the prior is normalised)However as pointed out in Bailer-Jones (2015) a uniform distri-bution in r implies a space density of stars that falls of as 1r2

from the position of the Sun which is of course physically im-plausible Hence one should assume a reasonable distribution ofstars in space and construct the prior on r accordingly Bailer-Jones (2015) presents a prior derived from a uniform space den-sity with an exponential cut-off which was used in Astraatmadjaamp Bailer-Jones (2016b) to derive distances to stars for which par-allaxes are listed in Gaia DR1 This prior should not be usedindiscriminately at the very least we should carefully considerthe choice of the scale length L (or leave that as a parameter tobe estimated as described in Sect 412) and in most cases amore tailored prior based on our broad knowledge of the distri-bution of a given stellar population in the Milky Way would bebetter The tutorial cases introduced in the next section containsome more examples of priors on distance and other astrophysi-cal parameters

The next two items discuss simplifications to the Bayesianapproach that nevertheless need to be justified carefully

Maximum likelihood modelling We have seen that priors arethe bridges that allow us to go from the probability of the obser-vations given the unknown parameters to the desired probabil-ity of the parameters given the observations It is only based onthis probability distribution that we can make statements aboutcredible intervals (uncertainties) for the inferred quantities orselect amongst competing models (a topic that is beyond thescope of this paper) However if making prior-free inferences ispreferred then maximising the likelihood is the only alternativeThe kinematic modelling presented in Lindegren et al (2000)is a non-trivial example of this A more complex example canbe found in Palmer et al (2014) The ML approach just as theBayesian framework described above allows the combinationof different types of data and accounts for selection functionsor missing data We have seen in Sect 411 that the maximumlikelihood estimate of the distance given a single parallax mea-surement is ρ = 1

$and this is a poor estimator of the distance

except for subsets of very accurate measurements In generalthe Bayesian and the maximum likelihood estimates coincide inthe limit of very small uncertainties or infinite numbers of mea-surements In such limits the maximum likelihood estimate issimpler to obtain although its computational cost may still belarge as the ML method is often equivalent to a complex optimi-sation problem involving a multi-dimensional function of manyparameters

Article number page 15 of 20

AampA proofs manuscript no 32964_Arxiv

Selecting the lsquobestrsquo data Analyses that use parallax data areoften restricted to positive parallaxes with relative uncertaintiesbelow some limit (typically 20 ) This allows working in aregime where the uncertainties of derived quantities such as dis-tance tangential velocities luminosity etc are thought to bemanageable which allows working in the space of astrophysi-cal variables rather than the data Truncation on relative parallaxerror might be justified in an exploratory phase of the data anal-ysis however there are a number of reasons why this approachis almost never advisable Even at relative uncertainties below02 the quantities calculated from the parallax can be biased andsuffer from a large variance (see Bailer-Jones 2015) More im-portantly however the selection of lsquogoodrsquo parallaxes will biasthe sample studied to nearby andor bright representatives of anystellar population and the selection may lead to discarding avery large fraction of the potential sample Hence any inferencesbased on such data will be severely biased This is clearly illus-trated in Fig 7 where for an even less strict truncation of starswith a relative uncertainties below 50 the distribution of dis-tances of the resulting sample is clearly biased with respect theoriginal sample

Accounting for data selection and incompleteness Althoughthe Gaia survey is designed such that the only selection ofsources is that they are brighter than the survey limit at G = 207the combination of the onboard detection algorithm the Gaiascanning law and the filtering of results of insufficient qualityprior to a data release lead to a complex selection function espe-cially in the early data releases This selection function should betaken into account in any data analysis and this is most naturallydone as part of a Bayesian analysis where the selection func-tion is part of the forward model that predicts the data given themodel parameters Precise prescriptions of the selection func-tions are not foreseen to be part of the data release documenta-tion Hence selection function parameters need to be includedas part of the parameters inferred by the Bayesian analysis or ifthis is not possible the selection functions have to be borne inmind when interpreting the results

Covariances in the uncertainties All the uncertainties on theastrometric data quoted in the Gaia catalogue are presented asfull covariance matrices where the diagonal elements representthe standard uncertainties on the astrometric parameters and theoff-diagonal elements the covariances or correlations betweenthe uncertainties This amounts to treating the astrometric datavector as having been drawn from a multivariate normal distri-bution with a covariance matrix as given in the Gaia catalogueThe covariances are most easily handled in the data space as partof the likelihood (see Lindegren et al 2000 for an example inthe context of kinematic modelling) If the covariances in theastrometric uncertainties are not accounted for we can easily bemisled for example by spurious features in the inferred velocityfield of an open cluster (Brown et al 1997)

The uncertainties are also correlated from one source tothe next especially over small distances on the sky A studyof the star-to-star correlations in the parallax uncertainties inGaia DR1 was done for the Kepler field (Zinn et al 2017)where independent and precise asteroseismic distances to thestars are available enabling the authors to derive an expressionfor the correlation strength and spatial scale This expressioncan be used for studies of the Kepler field but care should betaken when extrapolating to other fields on the sky The func-tional form for the star-to-star correlations used by Zinn et al

(2017) could be introduced as part of the forward model withthe Zinn et al (2017) parameters as a good first guess

For Gaia DR1 the length scale for the star-to-star corre-lations was estimated to vary from subdegree scales to tens ofdegrees on the sky where Zinn et al (2017) derived the corre-lation function over length scales of sim 02 to sim 10 degrees ForGaia DR2 Gaia Collaboration et al (2018c) estimate that thespatial correlations extend over scales of below 1 degree to 10ndash20 degrees

Accounting for non-Gaussian andor systematic uncertain-ties Although the bulk of the sources in the Gaia catalogue havenormally distributed uncertainties there is a significant fractionfor which the uncertainties exhibit non-Gaussian behaviour (egwhen uncertainties are over- or underestimated) This can be ac-counted for in the data analysis by including the uncertainties aspart of the forward model (eg Sesar et al 2017) or by explicitlymodelling an outlier population Similarly systematic uncertain-ties can be included as part of the forward model For exampleSesar et al (2017) include a global parallax zero-point as part oftheir probabilistic model used to analyse the period-luminosityrelation of RR Lyrae stars with Gaia DR1 data An alternativeapproach to the investigation of systematics in the parallaxes (ordistance estimates obtained from photometry or spectroscopyfor example) is presented in Schoumlnrich et al (2012) and is ap-plied to Gaia DR1 in Schoumlnrich amp Aumer (2017) In this casewe can consider that for samples covering a significant fractionof the sky any systematic error in the estimated distances to thestars will show up as correlations in their 3D velocity compo-nents The presence of such correlations can be used to make in-ferences about systematic errors for example in the parallaxes

Systematic uncertainties are more difficult to handle as theymay show variations as a function of source brightness or ce-lestial position they may be correlated across neighbouringsources and they may not be well understood for a given earlyGaia data release In general the information needed to accu-rately model systematic uncertainties or uncertainty correlationsbetween sources may not be readily available This informationcan be obtained from a comparison between Gaia and otherhigh-precision data (eg Zinn et al 2017 Arenou et al 2017Gaia Collaboration et al 2018a) or by examining for exampleplots of the parallax or proper motion averaged over sky regionsfor samples where the true parallax or proper motion values canbe assumed to be known such as zero parallax and proper mo-tion for quasars (see Gaia Collaboration et al 2018c for exam-ples)

Two special cases should be mentioned when the sampleis well distributed over the sky we can safely assume that thelocal systematics vanish and that only the global parallax zero-point need to be subtracted locally we may be interested not bythe absolute value of the parallaxes but by the relative ones inwhich case the difference between parallaxes and their averageremoves part of the systematics

There is no general recipe for dealing with non-Gaussian un-certainties or correlated systematic uncertainties The main ad-vice we can give here is to proceed with the analysis of the astro-metric data as they are but to keep in mind the systematics andcorrelations discussed in Sect 2 when interpreting the resultsAlternatively the forward model can be extended to include sys-tematic and correlation terms for which parameters are also tobe estimated Such models can be guided by the studies of sys-tematic uncertainties mentioned above

Article number page 16 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Testing with simulations Finally we strongly advise that theinference problem at hand should be investigated through simu-lated data and that the simulations performed should be as closeas possible to the real data (in particular correctly modelling theuncertainties and selection effects) The simulations allow theanalysis method to be developed and tested for accuracy How-ever the performance of the analysis method should be inter-preted strictly in terms of how well the assumed model explainsthe simulated observed data That is we should not fall into thetrap of trying to tune the analysis method to get an answer that isas close to the lsquotruthrsquo as possible In real problems we can onlyjudge the adequacy of a model and its parameter values by howwell they predict the observed data (to within the observationaluncertainties it should be stressed as we should avoid lsquoover-fittingrsquo the data)

5 Using astrometric data practical examples

We introduce here a few worked examples to illustrate thepoints that were made in the previous section These ex-amples are available in full detail as online tutorials in theform of source code accompanied by much more exten-sive explanation than can be provided here The source codeand corresponding Python and R Notebooks can be foundat the following URL httpsgithubcomagabrownastrometry-inference-tutorials In all cases the readeris strongly encouraged to download the source code and exper-iment with modifications of the simulated input data andor theparameter choices in the inference methods

51 Comparison of distance and distance modulusestimators

The use of the Bayesian inference with non-informative priorsdescribed in Sect 411 is illustrated and implemented inthe following tutorial httpsgithubcomagabrownastrometry-inference-tutorialstreemastersingle-sourceGraphicalUserInterface The tutorialcompares the performance of Bayesian distance estimationmethods with the SmithndashEichhorn transformation (Smith ampEichhorn 1996) (Sect 341) and the naive parallax inversion

The tutorial contains a Graphical User Interface that easilyvisualises and compares the behaviour of all these estimators fora given parallax and uncertainty For the Bayesian inference es-timations using the mode and the median are provided togetherwith a 90 confidence interval The tutorial also provides a li-brary pyrallaxes with the implementation of all these estima-tors The library can easily be customised to implement otherpriors for the Bayesian inference

Additionally an implementation of the Bayesian distanceestimator using the Exponentially Decreasing Space Densityprior introduced in Bailer-Jones (2015) will be available inTopCat (httpwwwstarlinkacuktopcat) and Stilts(httpwwwstarlinkacukstilts) from respectivelyversions 46 and 31-3 onwards

52 Inferring the distance to a single source using just theparallax

The issues surrounding the use of a parallax to infer a distancewere explored in Bailer-Jones (2015) and applied to simulatedGaia data in Astraatmadja amp Bailer-Jones (2016a) and to TGAS(Gaia DR1 ) in Astraatmadja amp Bailer-Jones (2016b) A

tutorial exploring this is provided at httpsgithubcomagabrownastrometry-inference-tutorialstreemastersingle-sourcetutorial This can be used toinvestigate how the posterior depends on the prior and the dataIt also includes a simple example of a hierarchical model toavoid specifying the exact length scale of a distance prior

53 Inferring the distance to and size of a cluster using justthe parallaxes

In many applications we are more interested in the averagedistance to a cluster (or a group of stars) rather than to theindividual stars In this case a basic mistake to be avoidedis estimating the individual distances (whatever the method)and then averaging these individual values A more correctapproach is to average the individual parallaxes and then toobtain a distance from this average value However a muchbetter solution is to set up a model for the cluster whichwould use as parameters the distance to its centre for exampleand some measure of its size and to infer its parametersThis is explored in the tutorial at httpsgithubcomagabrownastrometry-inference-tutorialstreemastermultiple-source This introduces the overall prob-lem and derives a general solution Code is implemented forthe specific case of a model which assumes a random isotropicdistribution of the true stars from the centre of the cluster Thismodel has two parameters the cluster distance and the clustersize The solution uses a small angle approximation to makethe problem simpler although it is easily extended to the caseof clusters with a significant angular extent It is applied to thePleiades selection from the Gaia DR1 main release paper(Gaia Collaboration et al 2016) The tutorial also considers theproblem of how to accommodate correlations in the measuredparallaxes of different stars Finally it also shows the resultsfrom a classical and a naive combination of stellar parallaxesto estimate the cluster distance The combination of parallaxesand proper motions of individual stars in a cluster into a singlesolution for the mean parallax and proper motion is treated asan iterative least squares problem in Gaia Collaboration et al(2017b) (see their Appendix A for details)

54 Inferring the distance and velocity of a source using theparallax and proper motions

The velocity (speed and direction) of a source in the planeof the sky can be inferred from measurements of its parallaxand two proper motions The uncertainties in all three affectthe inferred velocity and its uncertainty Moreover as theGaia parallaxes and proper motions generally have non-zerocorrelations these must also be taken into account Thiscan be done in a straightforward manner in the Bayesianapproach as is shown in the tutorial at httpsgithubcomagabrownastrometry-inference-tutorialstreemaster3d-distance This sets up a three-parametermodel (distance speed angle) for a source Using the threemeasurements (parallax two proper motions) in a multivariateGaussian likelihood and suitable priors on the parameters wecan compute the trivariate posterior This is sampled in theposterior using an MCMC algorithm for a set of stars

Article number page 17 of 20

AampA proofs manuscript no 32964_Arxiv

55 Luminosity calibration

In this tutorial (httpsgithubcomagabrownastrometry-inference-tutorialstreemasterluminosity-calibration) the problem of inferring (orcalibrating) the mean absolute magnitude of a specific classof stars is treated The measurements at hand are the parallaxand apparent magnitude for each of the stars in the sampleand the task is to infer their mean absolute magnitude microM andthe spread σM around this mean This is very similar to theproblem that Lutz amp Kelker (1973) and Turon Lacarrieu ampCreacutezeacute (1977) treated and a Bayesian approach to solving thisproblem was presented by Brown (2012) (albeit with the use ofimproper priors which we again note is bad practice) A morecomplex version of this problem (accounting for extinction anda contaminating population of stars) and its Bayesian solutionwas presented in Hawkins et al (2017) In this tutorial threeimportant points are illustrated

ndash Often the explicit computation of the distances to stars is notof interest In this example only the mean absolute magni-tude of the stars is to be estimated and the forward mod-elling approach as part of the Bayesian inference avoids theneed to calculate or estimate distances

ndash The data for all the stars carry information on the mean ab-solute magnitude including the negative parallaxes or par-allaxes with large relative errors This information can nat-urally be incorporated in a forward modelling approach (inthis example as part of a Bayesian inference method) thusavoiding the introduction of truncation biases caused by theselection of stars with lsquogoodrsquo parallaxes

ndash If the selection function is known (in this example the surveyis magnitude limited) it can and should be included in theforward modelling This accounts for sample selection biasesthat would otherwise occur

56 Period-luminosity relation

In this tutorial (httpsgithubcomagabrownastrometry-inference-tutorialstreemasterperiod-luminosity-relation) we include a hierarchicalmodel to infer period-luminosity-metallicity relations forclassical pulsating stars The full model can be applied tofundamental mode RR Lyrae stars and the abridged version(without the metallicity dependence) is suitable for samples ofclassical Cepheids We include the data set for the RR Lyraestars described and used for inference in Gaia Collaborationet al (2017a) and Delgado et al (2018) It contains a sampleof 200 stars (including fundamental radial pulsators but alsofundamentalised first overtone pulsators) with measured pe-riods apparent magnitudes in the K-band metallicities andparallaxes from the TGAS catalogue In the tutorial we describethe hierarchical model and discuss potential biases in the dataset Finally we analyse the sensitivity of the results to differentchoices of the priors and related parameters

6 Conclusions

Gaia data releases will provide a huge increase of astrometricdata available for the scientific community More than a billionparallaxes and proper motions allow new openings into many as-tronomical topics In most cases astronomers are exploiting theGaia catalogues to obtain physical quantities such as distanceand velocity Although it is easy to extract from the Gaia data it

is well known that direct inversion of parallax will lead to biaseswhich become more and more significant the larger the relativeparallax uncertainty While Gaia will provide high-quality astro-metric measurements hundreds of millions of stars have preci-sions which require proper statistical treatment in order to avoidbiased conclusions The aim of this paper is to guide the users ofGaia data to handle astrometric data correctly

In this study we summarise methods used to avoid bi-ases when converting astrometric data into physical quantitiesStarting from simple non-recommended sample truncation tomore complex methods the biases associated with the meth-ods are presented The basic recommendation is to treat deriva-tion of physical quantities from astrometric measurements asan inference problem which should be preferably handled withBayesian approach The recommended methods are described inSect 4 with a summary in Sect 43 To aid the users furtherSect 5 contains practical examples with links to Python and Rcode

Gaia will provide fundamental data for many fields of as-tronomy Further data releases will provide more data and moreprecise data Nevertheless for full use of the potential it will al-ways be necessary to pay careful attention to the statistical treat-ment of parallaxes and proper motions The purpose of this paperis to help astronomers find the correct approachAcknowledgements This work has made use of results from the European SpaceAgency (ESA) space mission Gaia the data from which were processed by theGaia Data Processing and Analysis Consortium (DPAC) Funding for the DPAChas been provided by national institutions in particular the institutions partici-pating in the Gaia Multilateral Agreement The Gaia mission website is httpwwwcosmosesaintgaia The authors are current or past members of theESA Gaia mission team and of the Gaia DPAC This work was supported by theMINECO (Spanish Ministry of Economy) through grant ESP2016-80079-C2-1-R (MINECOFEDER UE) and ESP2014-55996-C2-1-R (MINECOFEDERUE) and MDM-2014-0369 of ICCUB (Unidad de Excelencia lsquoMariacutea de Maeztursquo)and by the DLR (German space agency) via grant 50 QG 1403

ReferencesAnderson L Hogg DW Leistedt B Price-Whelan AM Bovy J Jun 2017

ArXiv e-printsArenou F Luri X 1999 In Egret D Heck A (eds) Harmonizing Cosmic

Distance Scales in a Post-HIPPARCOS Era vol 167 of Astronomical Societyof the Pacific Conference Series 13ndash32

Arenou F Luri X Babusiaux C et al Feb 2017 AampA 599 A50Astraatmadja TL Bailer-Jones CAL Dec 2016a ApJ 832 137Astraatmadja TL Bailer-Jones CAL Dec 2016b ApJ 833 119Bailer-Jones CAL Oct 2015 PASP 127 994Bovy J Sep 2017 MNRAS 470 1360Brown AGA Nov 2012 Statistical AstrometryBrown AGA Arenou F van Leeuwen F Lindegren L Luri X Aug 1997

In Bonnet RM Hoslashg E Bernacca PL et al (eds) Hipparcos - Venice rsquo97vol 402 of ESA Special Publication 63ndash68

de Bruijne JHJ Hoogerwerf R de Zeeuw PT Feb 2001 AampA 367 111Delgado HE Sarro LM Clementini G Muraveva T Garofalo A Mar 2018

ArXiv e-printsEddington AS Mar 1913 MNRAS 73 359ESA (ed) 1997 The HIPPARCOS and TYCHO catalogues Astrometric and

photometric star catalogues derived from the ESA HIPPARCOS Space As-trometry Mission vol 1200 of ESA Special Publication (available online athttpswwwcosmosesainthipparcos)

Gaia Collaboration Brown AGA Vallenari A et al Nov 2016 AampA 595A2

Gaia Collaboration Clementini G Eyer L et al Sep 2017a AampA 605 A79Gaia Collaboration van Leeuwen F Vallenari A et al May 2017b AampA 601

A19Gaia Collaboration Arenou F Luri X et al Apr 2018a AampA (special issue

for Gaia DR2)Gaia Collaboration Brown AGA Vallenari A Prusti T et al Apr 2018b

AampA (special issue for Gaia DR2)Gaia Collaboration Lindegren L Hernaacutendez J et al Apr 2018c AampA (special

issue for Gaia DR2)

Article number page 18 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Hawkins K Leistedt B Bovy J Hogg DW Oct 2017 MNRAS 471 722Iorio G Belokurov V Erkal D et al Feb 2018 MNRAS 474 2142Leistedt B Hogg DW Dec 2017a AJ 154 222Leistedt B Hogg DW Mar 2017b ArXiv e-printsLindegren L Madsen S Dravins D Apr 2000 AampA 356 1119Lindegren L Lammers U Hobbs D et al Feb 2012 AampA 538 A78Lindegren L Lammers U Bastian U et al Nov 2016 AampA 595 A4Lutz TE Kelker DH Oct 1973 PASP 85 573Malmquist GK Mar 1920 Meddelanden fran Lunds Astronomiska Observato-

rium Serie II 22 3Palmer M Arenou F Luri X Masana E Apr 2014 AampA 564 A49Robin C Luri X Reyleacute C et al 2012 AampA 543 A100Schoumlnrich R Aumer M Dec 2017 MNRAS 472 3979Schoumlnrich R Binney J Asplund M Feb 2012 MNRAS 420 1281Schroumlder SE Kaper L Lamers HJGLM Brown AGA Dec 2004 AampA

428 149Sesar B Fouesneau M Price-Whelan AM et al Apr 2017 ApJ 838 107Smith H Jr Eichhorn H Jul 1996 MNRAS 281 211Trumpler RJ Weaver HF 1953 Statistical astronomyTuron Lacarrieu C Creacutezeacute M Apr 1977 AampA 56 273Zinn JC Huber D Pinsonneault MH Stello D Aug 2017 ApJ 844 166

Article number page 19 of 20

AampA proofs manuscript no 32964_Arxiv

Appendix A Description of the simulated samplesused in this paper

The data used in this paper is from a Gaia Universe Model Snap-shot (GUMS) simulation (Robin C et al 2012) together withthe Gaia-like uncertainties and an estimation of the observabledata The uncertainties were computed from an implementationof the recipes described in the Gaia Science performance webpage4 provided by the python PyGaia toolkit5 which then haveto be re-scaled to fit the Gaia DR2 expectations

The simulation contains around 109 sources including onlysingle stars ie stars not belonging to multiple systems up toG lt 20 magnitude distributed as shown in Fig A1

0 5 10 15 20 25G magnitude [mag]

100

101

102

103

104

105

106

107

108

109

Num

ber

of

sourc

es

[log]

G magnitude histogram

Fig A1 Histogram of the starsrsquo G magnitude The total number ofsources is 1 069 138 714 which are distributed in bins of size ∆G = 02the sample is limited to G lt 20

To compute the Gaia potential observables standard uncer-tainties must be added to the simulation Because astrometricquantities (positions parallaxes and proper motions) are relateda single formalism to derive its standard errors is needed PyGaiaimplements a simple performance model depending on the VndashICcolour term and the G magnitude to estimate the end-of-missionerrors for the parallax uncertainty

σ$[microas] =radicminus1631 + 680766 z + 32732 z2 [0986+(1 minus 0986) (V-IC)]

(A1)

with

z = max(1004(1209minus15)1004(Gminus15)) (A2)

(see the Gaia science performance web for more details) It alsotakes into account the variation of the uncertainties over the skybecause of the scanning law tabulated as a function of the eclip-tic latitude6 β

However these are end-of-mission uncertainties so theyhave to be scaled by the fraction of mission time completed in or-der to get an estimation of them for Gaia DR2 In the case of the

parallax only the factorradic

5L being L the mission time (years)

included in Gaia DR2 need to be applied this error model is

4 httpswwwcosmosesaintwebgaiascience-performance5 httpspypipythonorgpypiPyGaia6 httpswwwcosmosesaintwebgaiatable-2-with-ascii

further described in Arenou et al (2017) In our case we havealso updated the calibration floor to take into account the prop-erties of the Gaia DR2 formal errors as shown in Fig A2 Thiscalibration floor introduces a minimum formal error stemmingfrom the fact that the calibrations used in the data processing(models and parameters) are at this stage still being refined Thisfloor affects mainly bright stars while for faint stars the photonnoise dominates Fig A2 shows the model used in the simula-tion for the Gaia DR2 parallax uncertainties as a function of theG magnitude

Fig A2 Average estimated errors as function of G The dash-dottedline represents the uncertainty at Gaia DR2 while the dashed linerepresents the end-of-mission uncertainties The solid line representsGaia DR2 errors including systematics

Article number page 20 of 20

  • 1 Introduction
  • 2 Gaia astrometric data
    • 21 Uncertainties
    • 22 Correlations
    • 23 Systematic errors
    • 24 Completeness
      • 3 Critical review of the traditional use of parallaxes
        • 31 Measurement of parallaxes
        • 32 Estimating distance by inverting the parallax
        • 33 Sample truncation
        • 34 Corrections and transformations
          • 341 The SmithndashEichhorn correction
          • 342 The LutzndashKelker correction
            • 35 Astrometry-based luminosity
              • 4 Recommendations for using astrometric data
                • 41 Using Gaia astrometric data how to proceed
                  • 411 Bayesian inference of distances from parallaxes
                  • 412 Bayesian inference of distances from parallaxes and complementary information
                    • 42 Absolute magnitudes tangential velocities and other derived quantities
                    • 43 Further recommendations
                      • 5 Using astrometric data practical examples
                        • 51 Comparison of distance and distance modulus estimators
                        • 52 Inferring the distance to a single source using just the parallax
                        • 53 Inferring the distance to and size of a cluster using just the parallaxes
                        • 54 Inferring the distance and velocity of a source using the parallax and proper motions
                        • 55 Luminosity calibration
                        • 56 Period-luminosity relation
                          • 6 Conclusions
                          • A Description of the simulated samples used in this paper
Page 15: Gaia Data Release 2: Using Gaia parallaxes · The Gaia Data Release 2 (Gaia DR2 ) Gaia Collaboration et al. (2018b) provides precise positions, proper motions, and paral-laxes for

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

expected Hipparcos parallaxes from the assumed luminosity cal-ibration and comparing those to the measured parallaxes A morecomplex example can be found in Lindegren et al (2000) whopresent a kinematic model for clusters which describes the ve-locity field of the cluster members and predicts the proper mo-tions accounting for the astrometric uncertainties and covari-ances As shown in previous sections the Bayesian approachnaturally lends itself to (and in fact requires) forward modellingof the data

Forward modelling has the added advantage that it forces usto consider the proper formulation of the questions asked fromthe astrometric data This naturally leads to the insight that oftenthe explicit knowledge of the distances to sources is not of inter-est For example in the Schroumlder et al (2004) case an assumedluminosity of the O-stars and their known apparent magnitudeis sufficient to predict the observed parallaxes In more complexanalyses the distances to sources can often be treated as nuisanceparameters which in a Bayesian setting can be marginalised outof the posterior

Use all relevant information Although the parallax has a di-rect relation to the distance of a star it is not the only measure-ment that contains distance information The apparent magni-tude and the colour of a star carry significant information on theplausible distances at which it can be located as the colour pro-vides information on plausible absolute magnitude values for thestar This is used in two papers (Leistedt amp Hogg 2017a An-derson et al 2017) in which the information contained in thephotometry of stars is combined with Gaia DR1 parallax in-formation to arrive at more precise representations of the colour-magnitude diagram than can be obtained through the parallaxesalone Similarly proper motions contain distance informationwhich can be used to good effect to derive more precise distancesor luminosities for open cluster members (eg de Bruijne et al2001 Gaia Collaboration et al 2017b) The Bayesian approachnaturally allows for the combination of different types of data inthe modelling of the problem at hand It should be emphasisedhowever that adding additional data necessitates increasing themodel complexity (eg to include the relation among apparentmagnitude absolute magnitude extinction and parallax) whichleads to the need to make more assumptions

Incorporate a proper prior Bailer-Jones (2015) discussed thesimplest case of inferring the distance to a source from a sin-gle observed parallax He showed that if the minimal prior thatr should be positive is used the resulting posterior is not nor-malisable and hence has no mean variance other momentsor percentiles This behaviour of the posterior is not limitedto inference of distance Examples of other quantities that arenon-linearly related to parallax are absolute magnitude (M prop

log10 $) tangential velocity (vT prop 1$) kinetic energy and an-gular momentum (both proportional to 1$2 when determinedrelative to the observer) In all these cases it can be shown thatthe posterior for an improper uniform prior is not normalisableand has no moments The conclusion is that a proper prior onthe parameters to be estimated should be included to ensure thatthe posterior represents a normalised probability distribution Ingeneral using unconstrained non-informative priors in Bayesiandata analysis (such as the one on r above) is bad practice In-evitably there is always a mismatch between the prior and thetrue distribution (if there were not there would be no need todo the inference) This will unavoidably lead to some biases al-though the better the data the smaller these will be We cannot

expect to do much better than our prior knowledge in case weonly have access to poor data The Bayesian approach guaran-tees a graceful transition of the posterior into the prior as thedata quality degrades

The above discussion raises the question of what priors to in-clude for a specific inference problem Simple recipes cannot begiven as the prior to be used depends on the problem at hand andthe information already available When deciding on a prior weshould keep in mind that some information is always availableEven if a parallax is only available for a single star we knowthat it cannot be located at arbitrary distances It is unlikely tobe much closer that 1 pc (although we cannot fully exclude thepresence of faint stars closer than Proxima Centauri) and it mustbe located at a finite distance (we can observe the star) Thiswould suggest a non-informative uniform prior on r with lib-eral lower and upper bounds (such that the prior is normalised)However as pointed out in Bailer-Jones (2015) a uniform distri-bution in r implies a space density of stars that falls of as 1r2

from the position of the Sun which is of course physically im-plausible Hence one should assume a reasonable distribution ofstars in space and construct the prior on r accordingly Bailer-Jones (2015) presents a prior derived from a uniform space den-sity with an exponential cut-off which was used in Astraatmadjaamp Bailer-Jones (2016b) to derive distances to stars for which par-allaxes are listed in Gaia DR1 This prior should not be usedindiscriminately at the very least we should carefully considerthe choice of the scale length L (or leave that as a parameter tobe estimated as described in Sect 412) and in most cases amore tailored prior based on our broad knowledge of the distri-bution of a given stellar population in the Milky Way would bebetter The tutorial cases introduced in the next section containsome more examples of priors on distance and other astrophysi-cal parameters

The next two items discuss simplifications to the Bayesianapproach that nevertheless need to be justified carefully

Maximum likelihood modelling We have seen that priors arethe bridges that allow us to go from the probability of the obser-vations given the unknown parameters to the desired probabil-ity of the parameters given the observations It is only based onthis probability distribution that we can make statements aboutcredible intervals (uncertainties) for the inferred quantities orselect amongst competing models (a topic that is beyond thescope of this paper) However if making prior-free inferences ispreferred then maximising the likelihood is the only alternativeThe kinematic modelling presented in Lindegren et al (2000)is a non-trivial example of this A more complex example canbe found in Palmer et al (2014) The ML approach just as theBayesian framework described above allows the combinationof different types of data and accounts for selection functionsor missing data We have seen in Sect 411 that the maximumlikelihood estimate of the distance given a single parallax mea-surement is ρ = 1

$and this is a poor estimator of the distance

except for subsets of very accurate measurements In generalthe Bayesian and the maximum likelihood estimates coincide inthe limit of very small uncertainties or infinite numbers of mea-surements In such limits the maximum likelihood estimate issimpler to obtain although its computational cost may still belarge as the ML method is often equivalent to a complex optimi-sation problem involving a multi-dimensional function of manyparameters

Article number page 15 of 20

AampA proofs manuscript no 32964_Arxiv

Selecting the lsquobestrsquo data Analyses that use parallax data areoften restricted to positive parallaxes with relative uncertaintiesbelow some limit (typically 20 ) This allows working in aregime where the uncertainties of derived quantities such as dis-tance tangential velocities luminosity etc are thought to bemanageable which allows working in the space of astrophysi-cal variables rather than the data Truncation on relative parallaxerror might be justified in an exploratory phase of the data anal-ysis however there are a number of reasons why this approachis almost never advisable Even at relative uncertainties below02 the quantities calculated from the parallax can be biased andsuffer from a large variance (see Bailer-Jones 2015) More im-portantly however the selection of lsquogoodrsquo parallaxes will biasthe sample studied to nearby andor bright representatives of anystellar population and the selection may lead to discarding avery large fraction of the potential sample Hence any inferencesbased on such data will be severely biased This is clearly illus-trated in Fig 7 where for an even less strict truncation of starswith a relative uncertainties below 50 the distribution of dis-tances of the resulting sample is clearly biased with respect theoriginal sample

Accounting for data selection and incompleteness Althoughthe Gaia survey is designed such that the only selection ofsources is that they are brighter than the survey limit at G = 207the combination of the onboard detection algorithm the Gaiascanning law and the filtering of results of insufficient qualityprior to a data release lead to a complex selection function espe-cially in the early data releases This selection function should betaken into account in any data analysis and this is most naturallydone as part of a Bayesian analysis where the selection func-tion is part of the forward model that predicts the data given themodel parameters Precise prescriptions of the selection func-tions are not foreseen to be part of the data release documenta-tion Hence selection function parameters need to be includedas part of the parameters inferred by the Bayesian analysis or ifthis is not possible the selection functions have to be borne inmind when interpreting the results

Covariances in the uncertainties All the uncertainties on theastrometric data quoted in the Gaia catalogue are presented asfull covariance matrices where the diagonal elements representthe standard uncertainties on the astrometric parameters and theoff-diagonal elements the covariances or correlations betweenthe uncertainties This amounts to treating the astrometric datavector as having been drawn from a multivariate normal distri-bution with a covariance matrix as given in the Gaia catalogueThe covariances are most easily handled in the data space as partof the likelihood (see Lindegren et al 2000 for an example inthe context of kinematic modelling) If the covariances in theastrometric uncertainties are not accounted for we can easily bemisled for example by spurious features in the inferred velocityfield of an open cluster (Brown et al 1997)

The uncertainties are also correlated from one source tothe next especially over small distances on the sky A studyof the star-to-star correlations in the parallax uncertainties inGaia DR1 was done for the Kepler field (Zinn et al 2017)where independent and precise asteroseismic distances to thestars are available enabling the authors to derive an expressionfor the correlation strength and spatial scale This expressioncan be used for studies of the Kepler field but care should betaken when extrapolating to other fields on the sky The func-tional form for the star-to-star correlations used by Zinn et al

(2017) could be introduced as part of the forward model withthe Zinn et al (2017) parameters as a good first guess

For Gaia DR1 the length scale for the star-to-star corre-lations was estimated to vary from subdegree scales to tens ofdegrees on the sky where Zinn et al (2017) derived the corre-lation function over length scales of sim 02 to sim 10 degrees ForGaia DR2 Gaia Collaboration et al (2018c) estimate that thespatial correlations extend over scales of below 1 degree to 10ndash20 degrees

Accounting for non-Gaussian andor systematic uncertain-ties Although the bulk of the sources in the Gaia catalogue havenormally distributed uncertainties there is a significant fractionfor which the uncertainties exhibit non-Gaussian behaviour (egwhen uncertainties are over- or underestimated) This can be ac-counted for in the data analysis by including the uncertainties aspart of the forward model (eg Sesar et al 2017) or by explicitlymodelling an outlier population Similarly systematic uncertain-ties can be included as part of the forward model For exampleSesar et al (2017) include a global parallax zero-point as part oftheir probabilistic model used to analyse the period-luminosityrelation of RR Lyrae stars with Gaia DR1 data An alternativeapproach to the investigation of systematics in the parallaxes (ordistance estimates obtained from photometry or spectroscopyfor example) is presented in Schoumlnrich et al (2012) and is ap-plied to Gaia DR1 in Schoumlnrich amp Aumer (2017) In this casewe can consider that for samples covering a significant fractionof the sky any systematic error in the estimated distances to thestars will show up as correlations in their 3D velocity compo-nents The presence of such correlations can be used to make in-ferences about systematic errors for example in the parallaxes

Systematic uncertainties are more difficult to handle as theymay show variations as a function of source brightness or ce-lestial position they may be correlated across neighbouringsources and they may not be well understood for a given earlyGaia data release In general the information needed to accu-rately model systematic uncertainties or uncertainty correlationsbetween sources may not be readily available This informationcan be obtained from a comparison between Gaia and otherhigh-precision data (eg Zinn et al 2017 Arenou et al 2017Gaia Collaboration et al 2018a) or by examining for exampleplots of the parallax or proper motion averaged over sky regionsfor samples where the true parallax or proper motion values canbe assumed to be known such as zero parallax and proper mo-tion for quasars (see Gaia Collaboration et al 2018c for exam-ples)

Two special cases should be mentioned when the sampleis well distributed over the sky we can safely assume that thelocal systematics vanish and that only the global parallax zero-point need to be subtracted locally we may be interested not bythe absolute value of the parallaxes but by the relative ones inwhich case the difference between parallaxes and their averageremoves part of the systematics

There is no general recipe for dealing with non-Gaussian un-certainties or correlated systematic uncertainties The main ad-vice we can give here is to proceed with the analysis of the astro-metric data as they are but to keep in mind the systematics andcorrelations discussed in Sect 2 when interpreting the resultsAlternatively the forward model can be extended to include sys-tematic and correlation terms for which parameters are also tobe estimated Such models can be guided by the studies of sys-tematic uncertainties mentioned above

Article number page 16 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Testing with simulations Finally we strongly advise that theinference problem at hand should be investigated through simu-lated data and that the simulations performed should be as closeas possible to the real data (in particular correctly modelling theuncertainties and selection effects) The simulations allow theanalysis method to be developed and tested for accuracy How-ever the performance of the analysis method should be inter-preted strictly in terms of how well the assumed model explainsthe simulated observed data That is we should not fall into thetrap of trying to tune the analysis method to get an answer that isas close to the lsquotruthrsquo as possible In real problems we can onlyjudge the adequacy of a model and its parameter values by howwell they predict the observed data (to within the observationaluncertainties it should be stressed as we should avoid lsquoover-fittingrsquo the data)

5 Using astrometric data practical examples

We introduce here a few worked examples to illustrate thepoints that were made in the previous section These ex-amples are available in full detail as online tutorials in theform of source code accompanied by much more exten-sive explanation than can be provided here The source codeand corresponding Python and R Notebooks can be foundat the following URL httpsgithubcomagabrownastrometry-inference-tutorials In all cases the readeris strongly encouraged to download the source code and exper-iment with modifications of the simulated input data andor theparameter choices in the inference methods

51 Comparison of distance and distance modulusestimators

The use of the Bayesian inference with non-informative priorsdescribed in Sect 411 is illustrated and implemented inthe following tutorial httpsgithubcomagabrownastrometry-inference-tutorialstreemastersingle-sourceGraphicalUserInterface The tutorialcompares the performance of Bayesian distance estimationmethods with the SmithndashEichhorn transformation (Smith ampEichhorn 1996) (Sect 341) and the naive parallax inversion

The tutorial contains a Graphical User Interface that easilyvisualises and compares the behaviour of all these estimators fora given parallax and uncertainty For the Bayesian inference es-timations using the mode and the median are provided togetherwith a 90 confidence interval The tutorial also provides a li-brary pyrallaxes with the implementation of all these estima-tors The library can easily be customised to implement otherpriors for the Bayesian inference

Additionally an implementation of the Bayesian distanceestimator using the Exponentially Decreasing Space Densityprior introduced in Bailer-Jones (2015) will be available inTopCat (httpwwwstarlinkacuktopcat) and Stilts(httpwwwstarlinkacukstilts) from respectivelyversions 46 and 31-3 onwards

52 Inferring the distance to a single source using just theparallax

The issues surrounding the use of a parallax to infer a distancewere explored in Bailer-Jones (2015) and applied to simulatedGaia data in Astraatmadja amp Bailer-Jones (2016a) and to TGAS(Gaia DR1 ) in Astraatmadja amp Bailer-Jones (2016b) A

tutorial exploring this is provided at httpsgithubcomagabrownastrometry-inference-tutorialstreemastersingle-sourcetutorial This can be used toinvestigate how the posterior depends on the prior and the dataIt also includes a simple example of a hierarchical model toavoid specifying the exact length scale of a distance prior

53 Inferring the distance to and size of a cluster using justthe parallaxes

In many applications we are more interested in the averagedistance to a cluster (or a group of stars) rather than to theindividual stars In this case a basic mistake to be avoidedis estimating the individual distances (whatever the method)and then averaging these individual values A more correctapproach is to average the individual parallaxes and then toobtain a distance from this average value However a muchbetter solution is to set up a model for the cluster whichwould use as parameters the distance to its centre for exampleand some measure of its size and to infer its parametersThis is explored in the tutorial at httpsgithubcomagabrownastrometry-inference-tutorialstreemastermultiple-source This introduces the overall prob-lem and derives a general solution Code is implemented forthe specific case of a model which assumes a random isotropicdistribution of the true stars from the centre of the cluster Thismodel has two parameters the cluster distance and the clustersize The solution uses a small angle approximation to makethe problem simpler although it is easily extended to the caseof clusters with a significant angular extent It is applied to thePleiades selection from the Gaia DR1 main release paper(Gaia Collaboration et al 2016) The tutorial also considers theproblem of how to accommodate correlations in the measuredparallaxes of different stars Finally it also shows the resultsfrom a classical and a naive combination of stellar parallaxesto estimate the cluster distance The combination of parallaxesand proper motions of individual stars in a cluster into a singlesolution for the mean parallax and proper motion is treated asan iterative least squares problem in Gaia Collaboration et al(2017b) (see their Appendix A for details)

54 Inferring the distance and velocity of a source using theparallax and proper motions

The velocity (speed and direction) of a source in the planeof the sky can be inferred from measurements of its parallaxand two proper motions The uncertainties in all three affectthe inferred velocity and its uncertainty Moreover as theGaia parallaxes and proper motions generally have non-zerocorrelations these must also be taken into account Thiscan be done in a straightforward manner in the Bayesianapproach as is shown in the tutorial at httpsgithubcomagabrownastrometry-inference-tutorialstreemaster3d-distance This sets up a three-parametermodel (distance speed angle) for a source Using the threemeasurements (parallax two proper motions) in a multivariateGaussian likelihood and suitable priors on the parameters wecan compute the trivariate posterior This is sampled in theposterior using an MCMC algorithm for a set of stars

Article number page 17 of 20

AampA proofs manuscript no 32964_Arxiv

55 Luminosity calibration

In this tutorial (httpsgithubcomagabrownastrometry-inference-tutorialstreemasterluminosity-calibration) the problem of inferring (orcalibrating) the mean absolute magnitude of a specific classof stars is treated The measurements at hand are the parallaxand apparent magnitude for each of the stars in the sampleand the task is to infer their mean absolute magnitude microM andthe spread σM around this mean This is very similar to theproblem that Lutz amp Kelker (1973) and Turon Lacarrieu ampCreacutezeacute (1977) treated and a Bayesian approach to solving thisproblem was presented by Brown (2012) (albeit with the use ofimproper priors which we again note is bad practice) A morecomplex version of this problem (accounting for extinction anda contaminating population of stars) and its Bayesian solutionwas presented in Hawkins et al (2017) In this tutorial threeimportant points are illustrated

ndash Often the explicit computation of the distances to stars is notof interest In this example only the mean absolute magni-tude of the stars is to be estimated and the forward mod-elling approach as part of the Bayesian inference avoids theneed to calculate or estimate distances

ndash The data for all the stars carry information on the mean ab-solute magnitude including the negative parallaxes or par-allaxes with large relative errors This information can nat-urally be incorporated in a forward modelling approach (inthis example as part of a Bayesian inference method) thusavoiding the introduction of truncation biases caused by theselection of stars with lsquogoodrsquo parallaxes

ndash If the selection function is known (in this example the surveyis magnitude limited) it can and should be included in theforward modelling This accounts for sample selection biasesthat would otherwise occur

56 Period-luminosity relation

In this tutorial (httpsgithubcomagabrownastrometry-inference-tutorialstreemasterperiod-luminosity-relation) we include a hierarchicalmodel to infer period-luminosity-metallicity relations forclassical pulsating stars The full model can be applied tofundamental mode RR Lyrae stars and the abridged version(without the metallicity dependence) is suitable for samples ofclassical Cepheids We include the data set for the RR Lyraestars described and used for inference in Gaia Collaborationet al (2017a) and Delgado et al (2018) It contains a sampleof 200 stars (including fundamental radial pulsators but alsofundamentalised first overtone pulsators) with measured pe-riods apparent magnitudes in the K-band metallicities andparallaxes from the TGAS catalogue In the tutorial we describethe hierarchical model and discuss potential biases in the dataset Finally we analyse the sensitivity of the results to differentchoices of the priors and related parameters

6 Conclusions

Gaia data releases will provide a huge increase of astrometricdata available for the scientific community More than a billionparallaxes and proper motions allow new openings into many as-tronomical topics In most cases astronomers are exploiting theGaia catalogues to obtain physical quantities such as distanceand velocity Although it is easy to extract from the Gaia data it

is well known that direct inversion of parallax will lead to biaseswhich become more and more significant the larger the relativeparallax uncertainty While Gaia will provide high-quality astro-metric measurements hundreds of millions of stars have preci-sions which require proper statistical treatment in order to avoidbiased conclusions The aim of this paper is to guide the users ofGaia data to handle astrometric data correctly

In this study we summarise methods used to avoid bi-ases when converting astrometric data into physical quantitiesStarting from simple non-recommended sample truncation tomore complex methods the biases associated with the meth-ods are presented The basic recommendation is to treat deriva-tion of physical quantities from astrometric measurements asan inference problem which should be preferably handled withBayesian approach The recommended methods are described inSect 4 with a summary in Sect 43 To aid the users furtherSect 5 contains practical examples with links to Python and Rcode

Gaia will provide fundamental data for many fields of as-tronomy Further data releases will provide more data and moreprecise data Nevertheless for full use of the potential it will al-ways be necessary to pay careful attention to the statistical treat-ment of parallaxes and proper motions The purpose of this paperis to help astronomers find the correct approachAcknowledgements This work has made use of results from the European SpaceAgency (ESA) space mission Gaia the data from which were processed by theGaia Data Processing and Analysis Consortium (DPAC) Funding for the DPAChas been provided by national institutions in particular the institutions partici-pating in the Gaia Multilateral Agreement The Gaia mission website is httpwwwcosmosesaintgaia The authors are current or past members of theESA Gaia mission team and of the Gaia DPAC This work was supported by theMINECO (Spanish Ministry of Economy) through grant ESP2016-80079-C2-1-R (MINECOFEDER UE) and ESP2014-55996-C2-1-R (MINECOFEDERUE) and MDM-2014-0369 of ICCUB (Unidad de Excelencia lsquoMariacutea de Maeztursquo)and by the DLR (German space agency) via grant 50 QG 1403

ReferencesAnderson L Hogg DW Leistedt B Price-Whelan AM Bovy J Jun 2017

ArXiv e-printsArenou F Luri X 1999 In Egret D Heck A (eds) Harmonizing Cosmic

Distance Scales in a Post-HIPPARCOS Era vol 167 of Astronomical Societyof the Pacific Conference Series 13ndash32

Arenou F Luri X Babusiaux C et al Feb 2017 AampA 599 A50Astraatmadja TL Bailer-Jones CAL Dec 2016a ApJ 832 137Astraatmadja TL Bailer-Jones CAL Dec 2016b ApJ 833 119Bailer-Jones CAL Oct 2015 PASP 127 994Bovy J Sep 2017 MNRAS 470 1360Brown AGA Nov 2012 Statistical AstrometryBrown AGA Arenou F van Leeuwen F Lindegren L Luri X Aug 1997

In Bonnet RM Hoslashg E Bernacca PL et al (eds) Hipparcos - Venice rsquo97vol 402 of ESA Special Publication 63ndash68

de Bruijne JHJ Hoogerwerf R de Zeeuw PT Feb 2001 AampA 367 111Delgado HE Sarro LM Clementini G Muraveva T Garofalo A Mar 2018

ArXiv e-printsEddington AS Mar 1913 MNRAS 73 359ESA (ed) 1997 The HIPPARCOS and TYCHO catalogues Astrometric and

photometric star catalogues derived from the ESA HIPPARCOS Space As-trometry Mission vol 1200 of ESA Special Publication (available online athttpswwwcosmosesainthipparcos)

Gaia Collaboration Brown AGA Vallenari A et al Nov 2016 AampA 595A2

Gaia Collaboration Clementini G Eyer L et al Sep 2017a AampA 605 A79Gaia Collaboration van Leeuwen F Vallenari A et al May 2017b AampA 601

A19Gaia Collaboration Arenou F Luri X et al Apr 2018a AampA (special issue

for Gaia DR2)Gaia Collaboration Brown AGA Vallenari A Prusti T et al Apr 2018b

AampA (special issue for Gaia DR2)Gaia Collaboration Lindegren L Hernaacutendez J et al Apr 2018c AampA (special

issue for Gaia DR2)

Article number page 18 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Hawkins K Leistedt B Bovy J Hogg DW Oct 2017 MNRAS 471 722Iorio G Belokurov V Erkal D et al Feb 2018 MNRAS 474 2142Leistedt B Hogg DW Dec 2017a AJ 154 222Leistedt B Hogg DW Mar 2017b ArXiv e-printsLindegren L Madsen S Dravins D Apr 2000 AampA 356 1119Lindegren L Lammers U Hobbs D et al Feb 2012 AampA 538 A78Lindegren L Lammers U Bastian U et al Nov 2016 AampA 595 A4Lutz TE Kelker DH Oct 1973 PASP 85 573Malmquist GK Mar 1920 Meddelanden fran Lunds Astronomiska Observato-

rium Serie II 22 3Palmer M Arenou F Luri X Masana E Apr 2014 AampA 564 A49Robin C Luri X Reyleacute C et al 2012 AampA 543 A100Schoumlnrich R Aumer M Dec 2017 MNRAS 472 3979Schoumlnrich R Binney J Asplund M Feb 2012 MNRAS 420 1281Schroumlder SE Kaper L Lamers HJGLM Brown AGA Dec 2004 AampA

428 149Sesar B Fouesneau M Price-Whelan AM et al Apr 2017 ApJ 838 107Smith H Jr Eichhorn H Jul 1996 MNRAS 281 211Trumpler RJ Weaver HF 1953 Statistical astronomyTuron Lacarrieu C Creacutezeacute M Apr 1977 AampA 56 273Zinn JC Huber D Pinsonneault MH Stello D Aug 2017 ApJ 844 166

Article number page 19 of 20

AampA proofs manuscript no 32964_Arxiv

Appendix A Description of the simulated samplesused in this paper

The data used in this paper is from a Gaia Universe Model Snap-shot (GUMS) simulation (Robin C et al 2012) together withthe Gaia-like uncertainties and an estimation of the observabledata The uncertainties were computed from an implementationof the recipes described in the Gaia Science performance webpage4 provided by the python PyGaia toolkit5 which then haveto be re-scaled to fit the Gaia DR2 expectations

The simulation contains around 109 sources including onlysingle stars ie stars not belonging to multiple systems up toG lt 20 magnitude distributed as shown in Fig A1

0 5 10 15 20 25G magnitude [mag]

100

101

102

103

104

105

106

107

108

109

Num

ber

of

sourc

es

[log]

G magnitude histogram

Fig A1 Histogram of the starsrsquo G magnitude The total number ofsources is 1 069 138 714 which are distributed in bins of size ∆G = 02the sample is limited to G lt 20

To compute the Gaia potential observables standard uncer-tainties must be added to the simulation Because astrometricquantities (positions parallaxes and proper motions) are relateda single formalism to derive its standard errors is needed PyGaiaimplements a simple performance model depending on the VndashICcolour term and the G magnitude to estimate the end-of-missionerrors for the parallax uncertainty

σ$[microas] =radicminus1631 + 680766 z + 32732 z2 [0986+(1 minus 0986) (V-IC)]

(A1)

with

z = max(1004(1209minus15)1004(Gminus15)) (A2)

(see the Gaia science performance web for more details) It alsotakes into account the variation of the uncertainties over the skybecause of the scanning law tabulated as a function of the eclip-tic latitude6 β

However these are end-of-mission uncertainties so theyhave to be scaled by the fraction of mission time completed in or-der to get an estimation of them for Gaia DR2 In the case of the

parallax only the factorradic

5L being L the mission time (years)

included in Gaia DR2 need to be applied this error model is

4 httpswwwcosmosesaintwebgaiascience-performance5 httpspypipythonorgpypiPyGaia6 httpswwwcosmosesaintwebgaiatable-2-with-ascii

further described in Arenou et al (2017) In our case we havealso updated the calibration floor to take into account the prop-erties of the Gaia DR2 formal errors as shown in Fig A2 Thiscalibration floor introduces a minimum formal error stemmingfrom the fact that the calibrations used in the data processing(models and parameters) are at this stage still being refined Thisfloor affects mainly bright stars while for faint stars the photonnoise dominates Fig A2 shows the model used in the simula-tion for the Gaia DR2 parallax uncertainties as a function of theG magnitude

Fig A2 Average estimated errors as function of G The dash-dottedline represents the uncertainty at Gaia DR2 while the dashed linerepresents the end-of-mission uncertainties The solid line representsGaia DR2 errors including systematics

Article number page 20 of 20

  • 1 Introduction
  • 2 Gaia astrometric data
    • 21 Uncertainties
    • 22 Correlations
    • 23 Systematic errors
    • 24 Completeness
      • 3 Critical review of the traditional use of parallaxes
        • 31 Measurement of parallaxes
        • 32 Estimating distance by inverting the parallax
        • 33 Sample truncation
        • 34 Corrections and transformations
          • 341 The SmithndashEichhorn correction
          • 342 The LutzndashKelker correction
            • 35 Astrometry-based luminosity
              • 4 Recommendations for using astrometric data
                • 41 Using Gaia astrometric data how to proceed
                  • 411 Bayesian inference of distances from parallaxes
                  • 412 Bayesian inference of distances from parallaxes and complementary information
                    • 42 Absolute magnitudes tangential velocities and other derived quantities
                    • 43 Further recommendations
                      • 5 Using astrometric data practical examples
                        • 51 Comparison of distance and distance modulus estimators
                        • 52 Inferring the distance to a single source using just the parallax
                        • 53 Inferring the distance to and size of a cluster using just the parallaxes
                        • 54 Inferring the distance and velocity of a source using the parallax and proper motions
                        • 55 Luminosity calibration
                        • 56 Period-luminosity relation
                          • 6 Conclusions
                          • A Description of the simulated samples used in this paper
Page 16: Gaia Data Release 2: Using Gaia parallaxes · The Gaia Data Release 2 (Gaia DR2 ) Gaia Collaboration et al. (2018b) provides precise positions, proper motions, and paral-laxes for

AampA proofs manuscript no 32964_Arxiv

Selecting the lsquobestrsquo data Analyses that use parallax data areoften restricted to positive parallaxes with relative uncertaintiesbelow some limit (typically 20 ) This allows working in aregime where the uncertainties of derived quantities such as dis-tance tangential velocities luminosity etc are thought to bemanageable which allows working in the space of astrophysi-cal variables rather than the data Truncation on relative parallaxerror might be justified in an exploratory phase of the data anal-ysis however there are a number of reasons why this approachis almost never advisable Even at relative uncertainties below02 the quantities calculated from the parallax can be biased andsuffer from a large variance (see Bailer-Jones 2015) More im-portantly however the selection of lsquogoodrsquo parallaxes will biasthe sample studied to nearby andor bright representatives of anystellar population and the selection may lead to discarding avery large fraction of the potential sample Hence any inferencesbased on such data will be severely biased This is clearly illus-trated in Fig 7 where for an even less strict truncation of starswith a relative uncertainties below 50 the distribution of dis-tances of the resulting sample is clearly biased with respect theoriginal sample

Accounting for data selection and incompleteness Althoughthe Gaia survey is designed such that the only selection ofsources is that they are brighter than the survey limit at G = 207the combination of the onboard detection algorithm the Gaiascanning law and the filtering of results of insufficient qualityprior to a data release lead to a complex selection function espe-cially in the early data releases This selection function should betaken into account in any data analysis and this is most naturallydone as part of a Bayesian analysis where the selection func-tion is part of the forward model that predicts the data given themodel parameters Precise prescriptions of the selection func-tions are not foreseen to be part of the data release documenta-tion Hence selection function parameters need to be includedas part of the parameters inferred by the Bayesian analysis or ifthis is not possible the selection functions have to be borne inmind when interpreting the results

Covariances in the uncertainties All the uncertainties on theastrometric data quoted in the Gaia catalogue are presented asfull covariance matrices where the diagonal elements representthe standard uncertainties on the astrometric parameters and theoff-diagonal elements the covariances or correlations betweenthe uncertainties This amounts to treating the astrometric datavector as having been drawn from a multivariate normal distri-bution with a covariance matrix as given in the Gaia catalogueThe covariances are most easily handled in the data space as partof the likelihood (see Lindegren et al 2000 for an example inthe context of kinematic modelling) If the covariances in theastrometric uncertainties are not accounted for we can easily bemisled for example by spurious features in the inferred velocityfield of an open cluster (Brown et al 1997)

The uncertainties are also correlated from one source tothe next especially over small distances on the sky A studyof the star-to-star correlations in the parallax uncertainties inGaia DR1 was done for the Kepler field (Zinn et al 2017)where independent and precise asteroseismic distances to thestars are available enabling the authors to derive an expressionfor the correlation strength and spatial scale This expressioncan be used for studies of the Kepler field but care should betaken when extrapolating to other fields on the sky The func-tional form for the star-to-star correlations used by Zinn et al

(2017) could be introduced as part of the forward model withthe Zinn et al (2017) parameters as a good first guess

For Gaia DR1 the length scale for the star-to-star corre-lations was estimated to vary from subdegree scales to tens ofdegrees on the sky where Zinn et al (2017) derived the corre-lation function over length scales of sim 02 to sim 10 degrees ForGaia DR2 Gaia Collaboration et al (2018c) estimate that thespatial correlations extend over scales of below 1 degree to 10ndash20 degrees

Accounting for non-Gaussian andor systematic uncertain-ties Although the bulk of the sources in the Gaia catalogue havenormally distributed uncertainties there is a significant fractionfor which the uncertainties exhibit non-Gaussian behaviour (egwhen uncertainties are over- or underestimated) This can be ac-counted for in the data analysis by including the uncertainties aspart of the forward model (eg Sesar et al 2017) or by explicitlymodelling an outlier population Similarly systematic uncertain-ties can be included as part of the forward model For exampleSesar et al (2017) include a global parallax zero-point as part oftheir probabilistic model used to analyse the period-luminosityrelation of RR Lyrae stars with Gaia DR1 data An alternativeapproach to the investigation of systematics in the parallaxes (ordistance estimates obtained from photometry or spectroscopyfor example) is presented in Schoumlnrich et al (2012) and is ap-plied to Gaia DR1 in Schoumlnrich amp Aumer (2017) In this casewe can consider that for samples covering a significant fractionof the sky any systematic error in the estimated distances to thestars will show up as correlations in their 3D velocity compo-nents The presence of such correlations can be used to make in-ferences about systematic errors for example in the parallaxes

Systematic uncertainties are more difficult to handle as theymay show variations as a function of source brightness or ce-lestial position they may be correlated across neighbouringsources and they may not be well understood for a given earlyGaia data release In general the information needed to accu-rately model systematic uncertainties or uncertainty correlationsbetween sources may not be readily available This informationcan be obtained from a comparison between Gaia and otherhigh-precision data (eg Zinn et al 2017 Arenou et al 2017Gaia Collaboration et al 2018a) or by examining for exampleplots of the parallax or proper motion averaged over sky regionsfor samples where the true parallax or proper motion values canbe assumed to be known such as zero parallax and proper mo-tion for quasars (see Gaia Collaboration et al 2018c for exam-ples)

Two special cases should be mentioned when the sampleis well distributed over the sky we can safely assume that thelocal systematics vanish and that only the global parallax zero-point need to be subtracted locally we may be interested not bythe absolute value of the parallaxes but by the relative ones inwhich case the difference between parallaxes and their averageremoves part of the systematics

There is no general recipe for dealing with non-Gaussian un-certainties or correlated systematic uncertainties The main ad-vice we can give here is to proceed with the analysis of the astro-metric data as they are but to keep in mind the systematics andcorrelations discussed in Sect 2 when interpreting the resultsAlternatively the forward model can be extended to include sys-tematic and correlation terms for which parameters are also tobe estimated Such models can be guided by the studies of sys-tematic uncertainties mentioned above

Article number page 16 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Testing with simulations Finally we strongly advise that theinference problem at hand should be investigated through simu-lated data and that the simulations performed should be as closeas possible to the real data (in particular correctly modelling theuncertainties and selection effects) The simulations allow theanalysis method to be developed and tested for accuracy How-ever the performance of the analysis method should be inter-preted strictly in terms of how well the assumed model explainsthe simulated observed data That is we should not fall into thetrap of trying to tune the analysis method to get an answer that isas close to the lsquotruthrsquo as possible In real problems we can onlyjudge the adequacy of a model and its parameter values by howwell they predict the observed data (to within the observationaluncertainties it should be stressed as we should avoid lsquoover-fittingrsquo the data)

5 Using astrometric data practical examples

We introduce here a few worked examples to illustrate thepoints that were made in the previous section These ex-amples are available in full detail as online tutorials in theform of source code accompanied by much more exten-sive explanation than can be provided here The source codeand corresponding Python and R Notebooks can be foundat the following URL httpsgithubcomagabrownastrometry-inference-tutorials In all cases the readeris strongly encouraged to download the source code and exper-iment with modifications of the simulated input data andor theparameter choices in the inference methods

51 Comparison of distance and distance modulusestimators

The use of the Bayesian inference with non-informative priorsdescribed in Sect 411 is illustrated and implemented inthe following tutorial httpsgithubcomagabrownastrometry-inference-tutorialstreemastersingle-sourceGraphicalUserInterface The tutorialcompares the performance of Bayesian distance estimationmethods with the SmithndashEichhorn transformation (Smith ampEichhorn 1996) (Sect 341) and the naive parallax inversion

The tutorial contains a Graphical User Interface that easilyvisualises and compares the behaviour of all these estimators fora given parallax and uncertainty For the Bayesian inference es-timations using the mode and the median are provided togetherwith a 90 confidence interval The tutorial also provides a li-brary pyrallaxes with the implementation of all these estima-tors The library can easily be customised to implement otherpriors for the Bayesian inference

Additionally an implementation of the Bayesian distanceestimator using the Exponentially Decreasing Space Densityprior introduced in Bailer-Jones (2015) will be available inTopCat (httpwwwstarlinkacuktopcat) and Stilts(httpwwwstarlinkacukstilts) from respectivelyversions 46 and 31-3 onwards

52 Inferring the distance to a single source using just theparallax

The issues surrounding the use of a parallax to infer a distancewere explored in Bailer-Jones (2015) and applied to simulatedGaia data in Astraatmadja amp Bailer-Jones (2016a) and to TGAS(Gaia DR1 ) in Astraatmadja amp Bailer-Jones (2016b) A

tutorial exploring this is provided at httpsgithubcomagabrownastrometry-inference-tutorialstreemastersingle-sourcetutorial This can be used toinvestigate how the posterior depends on the prior and the dataIt also includes a simple example of a hierarchical model toavoid specifying the exact length scale of a distance prior

53 Inferring the distance to and size of a cluster using justthe parallaxes

In many applications we are more interested in the averagedistance to a cluster (or a group of stars) rather than to theindividual stars In this case a basic mistake to be avoidedis estimating the individual distances (whatever the method)and then averaging these individual values A more correctapproach is to average the individual parallaxes and then toobtain a distance from this average value However a muchbetter solution is to set up a model for the cluster whichwould use as parameters the distance to its centre for exampleand some measure of its size and to infer its parametersThis is explored in the tutorial at httpsgithubcomagabrownastrometry-inference-tutorialstreemastermultiple-source This introduces the overall prob-lem and derives a general solution Code is implemented forthe specific case of a model which assumes a random isotropicdistribution of the true stars from the centre of the cluster Thismodel has two parameters the cluster distance and the clustersize The solution uses a small angle approximation to makethe problem simpler although it is easily extended to the caseof clusters with a significant angular extent It is applied to thePleiades selection from the Gaia DR1 main release paper(Gaia Collaboration et al 2016) The tutorial also considers theproblem of how to accommodate correlations in the measuredparallaxes of different stars Finally it also shows the resultsfrom a classical and a naive combination of stellar parallaxesto estimate the cluster distance The combination of parallaxesand proper motions of individual stars in a cluster into a singlesolution for the mean parallax and proper motion is treated asan iterative least squares problem in Gaia Collaboration et al(2017b) (see their Appendix A for details)

54 Inferring the distance and velocity of a source using theparallax and proper motions

The velocity (speed and direction) of a source in the planeof the sky can be inferred from measurements of its parallaxand two proper motions The uncertainties in all three affectthe inferred velocity and its uncertainty Moreover as theGaia parallaxes and proper motions generally have non-zerocorrelations these must also be taken into account Thiscan be done in a straightforward manner in the Bayesianapproach as is shown in the tutorial at httpsgithubcomagabrownastrometry-inference-tutorialstreemaster3d-distance This sets up a three-parametermodel (distance speed angle) for a source Using the threemeasurements (parallax two proper motions) in a multivariateGaussian likelihood and suitable priors on the parameters wecan compute the trivariate posterior This is sampled in theposterior using an MCMC algorithm for a set of stars

Article number page 17 of 20

AampA proofs manuscript no 32964_Arxiv

55 Luminosity calibration

In this tutorial (httpsgithubcomagabrownastrometry-inference-tutorialstreemasterluminosity-calibration) the problem of inferring (orcalibrating) the mean absolute magnitude of a specific classof stars is treated The measurements at hand are the parallaxand apparent magnitude for each of the stars in the sampleand the task is to infer their mean absolute magnitude microM andthe spread σM around this mean This is very similar to theproblem that Lutz amp Kelker (1973) and Turon Lacarrieu ampCreacutezeacute (1977) treated and a Bayesian approach to solving thisproblem was presented by Brown (2012) (albeit with the use ofimproper priors which we again note is bad practice) A morecomplex version of this problem (accounting for extinction anda contaminating population of stars) and its Bayesian solutionwas presented in Hawkins et al (2017) In this tutorial threeimportant points are illustrated

ndash Often the explicit computation of the distances to stars is notof interest In this example only the mean absolute magni-tude of the stars is to be estimated and the forward mod-elling approach as part of the Bayesian inference avoids theneed to calculate or estimate distances

ndash The data for all the stars carry information on the mean ab-solute magnitude including the negative parallaxes or par-allaxes with large relative errors This information can nat-urally be incorporated in a forward modelling approach (inthis example as part of a Bayesian inference method) thusavoiding the introduction of truncation biases caused by theselection of stars with lsquogoodrsquo parallaxes

ndash If the selection function is known (in this example the surveyis magnitude limited) it can and should be included in theforward modelling This accounts for sample selection biasesthat would otherwise occur

56 Period-luminosity relation

In this tutorial (httpsgithubcomagabrownastrometry-inference-tutorialstreemasterperiod-luminosity-relation) we include a hierarchicalmodel to infer period-luminosity-metallicity relations forclassical pulsating stars The full model can be applied tofundamental mode RR Lyrae stars and the abridged version(without the metallicity dependence) is suitable for samples ofclassical Cepheids We include the data set for the RR Lyraestars described and used for inference in Gaia Collaborationet al (2017a) and Delgado et al (2018) It contains a sampleof 200 stars (including fundamental radial pulsators but alsofundamentalised first overtone pulsators) with measured pe-riods apparent magnitudes in the K-band metallicities andparallaxes from the TGAS catalogue In the tutorial we describethe hierarchical model and discuss potential biases in the dataset Finally we analyse the sensitivity of the results to differentchoices of the priors and related parameters

6 Conclusions

Gaia data releases will provide a huge increase of astrometricdata available for the scientific community More than a billionparallaxes and proper motions allow new openings into many as-tronomical topics In most cases astronomers are exploiting theGaia catalogues to obtain physical quantities such as distanceand velocity Although it is easy to extract from the Gaia data it

is well known that direct inversion of parallax will lead to biaseswhich become more and more significant the larger the relativeparallax uncertainty While Gaia will provide high-quality astro-metric measurements hundreds of millions of stars have preci-sions which require proper statistical treatment in order to avoidbiased conclusions The aim of this paper is to guide the users ofGaia data to handle astrometric data correctly

In this study we summarise methods used to avoid bi-ases when converting astrometric data into physical quantitiesStarting from simple non-recommended sample truncation tomore complex methods the biases associated with the meth-ods are presented The basic recommendation is to treat deriva-tion of physical quantities from astrometric measurements asan inference problem which should be preferably handled withBayesian approach The recommended methods are described inSect 4 with a summary in Sect 43 To aid the users furtherSect 5 contains practical examples with links to Python and Rcode

Gaia will provide fundamental data for many fields of as-tronomy Further data releases will provide more data and moreprecise data Nevertheless for full use of the potential it will al-ways be necessary to pay careful attention to the statistical treat-ment of parallaxes and proper motions The purpose of this paperis to help astronomers find the correct approachAcknowledgements This work has made use of results from the European SpaceAgency (ESA) space mission Gaia the data from which were processed by theGaia Data Processing and Analysis Consortium (DPAC) Funding for the DPAChas been provided by national institutions in particular the institutions partici-pating in the Gaia Multilateral Agreement The Gaia mission website is httpwwwcosmosesaintgaia The authors are current or past members of theESA Gaia mission team and of the Gaia DPAC This work was supported by theMINECO (Spanish Ministry of Economy) through grant ESP2016-80079-C2-1-R (MINECOFEDER UE) and ESP2014-55996-C2-1-R (MINECOFEDERUE) and MDM-2014-0369 of ICCUB (Unidad de Excelencia lsquoMariacutea de Maeztursquo)and by the DLR (German space agency) via grant 50 QG 1403

ReferencesAnderson L Hogg DW Leistedt B Price-Whelan AM Bovy J Jun 2017

ArXiv e-printsArenou F Luri X 1999 In Egret D Heck A (eds) Harmonizing Cosmic

Distance Scales in a Post-HIPPARCOS Era vol 167 of Astronomical Societyof the Pacific Conference Series 13ndash32

Arenou F Luri X Babusiaux C et al Feb 2017 AampA 599 A50Astraatmadja TL Bailer-Jones CAL Dec 2016a ApJ 832 137Astraatmadja TL Bailer-Jones CAL Dec 2016b ApJ 833 119Bailer-Jones CAL Oct 2015 PASP 127 994Bovy J Sep 2017 MNRAS 470 1360Brown AGA Nov 2012 Statistical AstrometryBrown AGA Arenou F van Leeuwen F Lindegren L Luri X Aug 1997

In Bonnet RM Hoslashg E Bernacca PL et al (eds) Hipparcos - Venice rsquo97vol 402 of ESA Special Publication 63ndash68

de Bruijne JHJ Hoogerwerf R de Zeeuw PT Feb 2001 AampA 367 111Delgado HE Sarro LM Clementini G Muraveva T Garofalo A Mar 2018

ArXiv e-printsEddington AS Mar 1913 MNRAS 73 359ESA (ed) 1997 The HIPPARCOS and TYCHO catalogues Astrometric and

photometric star catalogues derived from the ESA HIPPARCOS Space As-trometry Mission vol 1200 of ESA Special Publication (available online athttpswwwcosmosesainthipparcos)

Gaia Collaboration Brown AGA Vallenari A et al Nov 2016 AampA 595A2

Gaia Collaboration Clementini G Eyer L et al Sep 2017a AampA 605 A79Gaia Collaboration van Leeuwen F Vallenari A et al May 2017b AampA 601

A19Gaia Collaboration Arenou F Luri X et al Apr 2018a AampA (special issue

for Gaia DR2)Gaia Collaboration Brown AGA Vallenari A Prusti T et al Apr 2018b

AampA (special issue for Gaia DR2)Gaia Collaboration Lindegren L Hernaacutendez J et al Apr 2018c AampA (special

issue for Gaia DR2)

Article number page 18 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Hawkins K Leistedt B Bovy J Hogg DW Oct 2017 MNRAS 471 722Iorio G Belokurov V Erkal D et al Feb 2018 MNRAS 474 2142Leistedt B Hogg DW Dec 2017a AJ 154 222Leistedt B Hogg DW Mar 2017b ArXiv e-printsLindegren L Madsen S Dravins D Apr 2000 AampA 356 1119Lindegren L Lammers U Hobbs D et al Feb 2012 AampA 538 A78Lindegren L Lammers U Bastian U et al Nov 2016 AampA 595 A4Lutz TE Kelker DH Oct 1973 PASP 85 573Malmquist GK Mar 1920 Meddelanden fran Lunds Astronomiska Observato-

rium Serie II 22 3Palmer M Arenou F Luri X Masana E Apr 2014 AampA 564 A49Robin C Luri X Reyleacute C et al 2012 AampA 543 A100Schoumlnrich R Aumer M Dec 2017 MNRAS 472 3979Schoumlnrich R Binney J Asplund M Feb 2012 MNRAS 420 1281Schroumlder SE Kaper L Lamers HJGLM Brown AGA Dec 2004 AampA

428 149Sesar B Fouesneau M Price-Whelan AM et al Apr 2017 ApJ 838 107Smith H Jr Eichhorn H Jul 1996 MNRAS 281 211Trumpler RJ Weaver HF 1953 Statistical astronomyTuron Lacarrieu C Creacutezeacute M Apr 1977 AampA 56 273Zinn JC Huber D Pinsonneault MH Stello D Aug 2017 ApJ 844 166

Article number page 19 of 20

AampA proofs manuscript no 32964_Arxiv

Appendix A Description of the simulated samplesused in this paper

The data used in this paper is from a Gaia Universe Model Snap-shot (GUMS) simulation (Robin C et al 2012) together withthe Gaia-like uncertainties and an estimation of the observabledata The uncertainties were computed from an implementationof the recipes described in the Gaia Science performance webpage4 provided by the python PyGaia toolkit5 which then haveto be re-scaled to fit the Gaia DR2 expectations

The simulation contains around 109 sources including onlysingle stars ie stars not belonging to multiple systems up toG lt 20 magnitude distributed as shown in Fig A1

0 5 10 15 20 25G magnitude [mag]

100

101

102

103

104

105

106

107

108

109

Num

ber

of

sourc

es

[log]

G magnitude histogram

Fig A1 Histogram of the starsrsquo G magnitude The total number ofsources is 1 069 138 714 which are distributed in bins of size ∆G = 02the sample is limited to G lt 20

To compute the Gaia potential observables standard uncer-tainties must be added to the simulation Because astrometricquantities (positions parallaxes and proper motions) are relateda single formalism to derive its standard errors is needed PyGaiaimplements a simple performance model depending on the VndashICcolour term and the G magnitude to estimate the end-of-missionerrors for the parallax uncertainty

σ$[microas] =radicminus1631 + 680766 z + 32732 z2 [0986+(1 minus 0986) (V-IC)]

(A1)

with

z = max(1004(1209minus15)1004(Gminus15)) (A2)

(see the Gaia science performance web for more details) It alsotakes into account the variation of the uncertainties over the skybecause of the scanning law tabulated as a function of the eclip-tic latitude6 β

However these are end-of-mission uncertainties so theyhave to be scaled by the fraction of mission time completed in or-der to get an estimation of them for Gaia DR2 In the case of the

parallax only the factorradic

5L being L the mission time (years)

included in Gaia DR2 need to be applied this error model is

4 httpswwwcosmosesaintwebgaiascience-performance5 httpspypipythonorgpypiPyGaia6 httpswwwcosmosesaintwebgaiatable-2-with-ascii

further described in Arenou et al (2017) In our case we havealso updated the calibration floor to take into account the prop-erties of the Gaia DR2 formal errors as shown in Fig A2 Thiscalibration floor introduces a minimum formal error stemmingfrom the fact that the calibrations used in the data processing(models and parameters) are at this stage still being refined Thisfloor affects mainly bright stars while for faint stars the photonnoise dominates Fig A2 shows the model used in the simula-tion for the Gaia DR2 parallax uncertainties as a function of theG magnitude

Fig A2 Average estimated errors as function of G The dash-dottedline represents the uncertainty at Gaia DR2 while the dashed linerepresents the end-of-mission uncertainties The solid line representsGaia DR2 errors including systematics

Article number page 20 of 20

  • 1 Introduction
  • 2 Gaia astrometric data
    • 21 Uncertainties
    • 22 Correlations
    • 23 Systematic errors
    • 24 Completeness
      • 3 Critical review of the traditional use of parallaxes
        • 31 Measurement of parallaxes
        • 32 Estimating distance by inverting the parallax
        • 33 Sample truncation
        • 34 Corrections and transformations
          • 341 The SmithndashEichhorn correction
          • 342 The LutzndashKelker correction
            • 35 Astrometry-based luminosity
              • 4 Recommendations for using astrometric data
                • 41 Using Gaia astrometric data how to proceed
                  • 411 Bayesian inference of distances from parallaxes
                  • 412 Bayesian inference of distances from parallaxes and complementary information
                    • 42 Absolute magnitudes tangential velocities and other derived quantities
                    • 43 Further recommendations
                      • 5 Using astrometric data practical examples
                        • 51 Comparison of distance and distance modulus estimators
                        • 52 Inferring the distance to a single source using just the parallax
                        • 53 Inferring the distance to and size of a cluster using just the parallaxes
                        • 54 Inferring the distance and velocity of a source using the parallax and proper motions
                        • 55 Luminosity calibration
                        • 56 Period-luminosity relation
                          • 6 Conclusions
                          • A Description of the simulated samples used in this paper
Page 17: Gaia Data Release 2: Using Gaia parallaxes · The Gaia Data Release 2 (Gaia DR2 ) Gaia Collaboration et al. (2018b) provides precise positions, proper motions, and paral-laxes for

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Testing with simulations Finally we strongly advise that theinference problem at hand should be investigated through simu-lated data and that the simulations performed should be as closeas possible to the real data (in particular correctly modelling theuncertainties and selection effects) The simulations allow theanalysis method to be developed and tested for accuracy How-ever the performance of the analysis method should be inter-preted strictly in terms of how well the assumed model explainsthe simulated observed data That is we should not fall into thetrap of trying to tune the analysis method to get an answer that isas close to the lsquotruthrsquo as possible In real problems we can onlyjudge the adequacy of a model and its parameter values by howwell they predict the observed data (to within the observationaluncertainties it should be stressed as we should avoid lsquoover-fittingrsquo the data)

5 Using astrometric data practical examples

We introduce here a few worked examples to illustrate thepoints that were made in the previous section These ex-amples are available in full detail as online tutorials in theform of source code accompanied by much more exten-sive explanation than can be provided here The source codeand corresponding Python and R Notebooks can be foundat the following URL httpsgithubcomagabrownastrometry-inference-tutorials In all cases the readeris strongly encouraged to download the source code and exper-iment with modifications of the simulated input data andor theparameter choices in the inference methods

51 Comparison of distance and distance modulusestimators

The use of the Bayesian inference with non-informative priorsdescribed in Sect 411 is illustrated and implemented inthe following tutorial httpsgithubcomagabrownastrometry-inference-tutorialstreemastersingle-sourceGraphicalUserInterface The tutorialcompares the performance of Bayesian distance estimationmethods with the SmithndashEichhorn transformation (Smith ampEichhorn 1996) (Sect 341) and the naive parallax inversion

The tutorial contains a Graphical User Interface that easilyvisualises and compares the behaviour of all these estimators fora given parallax and uncertainty For the Bayesian inference es-timations using the mode and the median are provided togetherwith a 90 confidence interval The tutorial also provides a li-brary pyrallaxes with the implementation of all these estima-tors The library can easily be customised to implement otherpriors for the Bayesian inference

Additionally an implementation of the Bayesian distanceestimator using the Exponentially Decreasing Space Densityprior introduced in Bailer-Jones (2015) will be available inTopCat (httpwwwstarlinkacuktopcat) and Stilts(httpwwwstarlinkacukstilts) from respectivelyversions 46 and 31-3 onwards

52 Inferring the distance to a single source using just theparallax

The issues surrounding the use of a parallax to infer a distancewere explored in Bailer-Jones (2015) and applied to simulatedGaia data in Astraatmadja amp Bailer-Jones (2016a) and to TGAS(Gaia DR1 ) in Astraatmadja amp Bailer-Jones (2016b) A

tutorial exploring this is provided at httpsgithubcomagabrownastrometry-inference-tutorialstreemastersingle-sourcetutorial This can be used toinvestigate how the posterior depends on the prior and the dataIt also includes a simple example of a hierarchical model toavoid specifying the exact length scale of a distance prior

53 Inferring the distance to and size of a cluster using justthe parallaxes

In many applications we are more interested in the averagedistance to a cluster (or a group of stars) rather than to theindividual stars In this case a basic mistake to be avoidedis estimating the individual distances (whatever the method)and then averaging these individual values A more correctapproach is to average the individual parallaxes and then toobtain a distance from this average value However a muchbetter solution is to set up a model for the cluster whichwould use as parameters the distance to its centre for exampleand some measure of its size and to infer its parametersThis is explored in the tutorial at httpsgithubcomagabrownastrometry-inference-tutorialstreemastermultiple-source This introduces the overall prob-lem and derives a general solution Code is implemented forthe specific case of a model which assumes a random isotropicdistribution of the true stars from the centre of the cluster Thismodel has two parameters the cluster distance and the clustersize The solution uses a small angle approximation to makethe problem simpler although it is easily extended to the caseof clusters with a significant angular extent It is applied to thePleiades selection from the Gaia DR1 main release paper(Gaia Collaboration et al 2016) The tutorial also considers theproblem of how to accommodate correlations in the measuredparallaxes of different stars Finally it also shows the resultsfrom a classical and a naive combination of stellar parallaxesto estimate the cluster distance The combination of parallaxesand proper motions of individual stars in a cluster into a singlesolution for the mean parallax and proper motion is treated asan iterative least squares problem in Gaia Collaboration et al(2017b) (see their Appendix A for details)

54 Inferring the distance and velocity of a source using theparallax and proper motions

The velocity (speed and direction) of a source in the planeof the sky can be inferred from measurements of its parallaxand two proper motions The uncertainties in all three affectthe inferred velocity and its uncertainty Moreover as theGaia parallaxes and proper motions generally have non-zerocorrelations these must also be taken into account Thiscan be done in a straightforward manner in the Bayesianapproach as is shown in the tutorial at httpsgithubcomagabrownastrometry-inference-tutorialstreemaster3d-distance This sets up a three-parametermodel (distance speed angle) for a source Using the threemeasurements (parallax two proper motions) in a multivariateGaussian likelihood and suitable priors on the parameters wecan compute the trivariate posterior This is sampled in theposterior using an MCMC algorithm for a set of stars

Article number page 17 of 20

AampA proofs manuscript no 32964_Arxiv

55 Luminosity calibration

In this tutorial (httpsgithubcomagabrownastrometry-inference-tutorialstreemasterluminosity-calibration) the problem of inferring (orcalibrating) the mean absolute magnitude of a specific classof stars is treated The measurements at hand are the parallaxand apparent magnitude for each of the stars in the sampleand the task is to infer their mean absolute magnitude microM andthe spread σM around this mean This is very similar to theproblem that Lutz amp Kelker (1973) and Turon Lacarrieu ampCreacutezeacute (1977) treated and a Bayesian approach to solving thisproblem was presented by Brown (2012) (albeit with the use ofimproper priors which we again note is bad practice) A morecomplex version of this problem (accounting for extinction anda contaminating population of stars) and its Bayesian solutionwas presented in Hawkins et al (2017) In this tutorial threeimportant points are illustrated

ndash Often the explicit computation of the distances to stars is notof interest In this example only the mean absolute magni-tude of the stars is to be estimated and the forward mod-elling approach as part of the Bayesian inference avoids theneed to calculate or estimate distances

ndash The data for all the stars carry information on the mean ab-solute magnitude including the negative parallaxes or par-allaxes with large relative errors This information can nat-urally be incorporated in a forward modelling approach (inthis example as part of a Bayesian inference method) thusavoiding the introduction of truncation biases caused by theselection of stars with lsquogoodrsquo parallaxes

ndash If the selection function is known (in this example the surveyis magnitude limited) it can and should be included in theforward modelling This accounts for sample selection biasesthat would otherwise occur

56 Period-luminosity relation

In this tutorial (httpsgithubcomagabrownastrometry-inference-tutorialstreemasterperiod-luminosity-relation) we include a hierarchicalmodel to infer period-luminosity-metallicity relations forclassical pulsating stars The full model can be applied tofundamental mode RR Lyrae stars and the abridged version(without the metallicity dependence) is suitable for samples ofclassical Cepheids We include the data set for the RR Lyraestars described and used for inference in Gaia Collaborationet al (2017a) and Delgado et al (2018) It contains a sampleof 200 stars (including fundamental radial pulsators but alsofundamentalised first overtone pulsators) with measured pe-riods apparent magnitudes in the K-band metallicities andparallaxes from the TGAS catalogue In the tutorial we describethe hierarchical model and discuss potential biases in the dataset Finally we analyse the sensitivity of the results to differentchoices of the priors and related parameters

6 Conclusions

Gaia data releases will provide a huge increase of astrometricdata available for the scientific community More than a billionparallaxes and proper motions allow new openings into many as-tronomical topics In most cases astronomers are exploiting theGaia catalogues to obtain physical quantities such as distanceand velocity Although it is easy to extract from the Gaia data it

is well known that direct inversion of parallax will lead to biaseswhich become more and more significant the larger the relativeparallax uncertainty While Gaia will provide high-quality astro-metric measurements hundreds of millions of stars have preci-sions which require proper statistical treatment in order to avoidbiased conclusions The aim of this paper is to guide the users ofGaia data to handle astrometric data correctly

In this study we summarise methods used to avoid bi-ases when converting astrometric data into physical quantitiesStarting from simple non-recommended sample truncation tomore complex methods the biases associated with the meth-ods are presented The basic recommendation is to treat deriva-tion of physical quantities from astrometric measurements asan inference problem which should be preferably handled withBayesian approach The recommended methods are described inSect 4 with a summary in Sect 43 To aid the users furtherSect 5 contains practical examples with links to Python and Rcode

Gaia will provide fundamental data for many fields of as-tronomy Further data releases will provide more data and moreprecise data Nevertheless for full use of the potential it will al-ways be necessary to pay careful attention to the statistical treat-ment of parallaxes and proper motions The purpose of this paperis to help astronomers find the correct approachAcknowledgements This work has made use of results from the European SpaceAgency (ESA) space mission Gaia the data from which were processed by theGaia Data Processing and Analysis Consortium (DPAC) Funding for the DPAChas been provided by national institutions in particular the institutions partici-pating in the Gaia Multilateral Agreement The Gaia mission website is httpwwwcosmosesaintgaia The authors are current or past members of theESA Gaia mission team and of the Gaia DPAC This work was supported by theMINECO (Spanish Ministry of Economy) through grant ESP2016-80079-C2-1-R (MINECOFEDER UE) and ESP2014-55996-C2-1-R (MINECOFEDERUE) and MDM-2014-0369 of ICCUB (Unidad de Excelencia lsquoMariacutea de Maeztursquo)and by the DLR (German space agency) via grant 50 QG 1403

ReferencesAnderson L Hogg DW Leistedt B Price-Whelan AM Bovy J Jun 2017

ArXiv e-printsArenou F Luri X 1999 In Egret D Heck A (eds) Harmonizing Cosmic

Distance Scales in a Post-HIPPARCOS Era vol 167 of Astronomical Societyof the Pacific Conference Series 13ndash32

Arenou F Luri X Babusiaux C et al Feb 2017 AampA 599 A50Astraatmadja TL Bailer-Jones CAL Dec 2016a ApJ 832 137Astraatmadja TL Bailer-Jones CAL Dec 2016b ApJ 833 119Bailer-Jones CAL Oct 2015 PASP 127 994Bovy J Sep 2017 MNRAS 470 1360Brown AGA Nov 2012 Statistical AstrometryBrown AGA Arenou F van Leeuwen F Lindegren L Luri X Aug 1997

In Bonnet RM Hoslashg E Bernacca PL et al (eds) Hipparcos - Venice rsquo97vol 402 of ESA Special Publication 63ndash68

de Bruijne JHJ Hoogerwerf R de Zeeuw PT Feb 2001 AampA 367 111Delgado HE Sarro LM Clementini G Muraveva T Garofalo A Mar 2018

ArXiv e-printsEddington AS Mar 1913 MNRAS 73 359ESA (ed) 1997 The HIPPARCOS and TYCHO catalogues Astrometric and

photometric star catalogues derived from the ESA HIPPARCOS Space As-trometry Mission vol 1200 of ESA Special Publication (available online athttpswwwcosmosesainthipparcos)

Gaia Collaboration Brown AGA Vallenari A et al Nov 2016 AampA 595A2

Gaia Collaboration Clementini G Eyer L et al Sep 2017a AampA 605 A79Gaia Collaboration van Leeuwen F Vallenari A et al May 2017b AampA 601

A19Gaia Collaboration Arenou F Luri X et al Apr 2018a AampA (special issue

for Gaia DR2)Gaia Collaboration Brown AGA Vallenari A Prusti T et al Apr 2018b

AampA (special issue for Gaia DR2)Gaia Collaboration Lindegren L Hernaacutendez J et al Apr 2018c AampA (special

issue for Gaia DR2)

Article number page 18 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Hawkins K Leistedt B Bovy J Hogg DW Oct 2017 MNRAS 471 722Iorio G Belokurov V Erkal D et al Feb 2018 MNRAS 474 2142Leistedt B Hogg DW Dec 2017a AJ 154 222Leistedt B Hogg DW Mar 2017b ArXiv e-printsLindegren L Madsen S Dravins D Apr 2000 AampA 356 1119Lindegren L Lammers U Hobbs D et al Feb 2012 AampA 538 A78Lindegren L Lammers U Bastian U et al Nov 2016 AampA 595 A4Lutz TE Kelker DH Oct 1973 PASP 85 573Malmquist GK Mar 1920 Meddelanden fran Lunds Astronomiska Observato-

rium Serie II 22 3Palmer M Arenou F Luri X Masana E Apr 2014 AampA 564 A49Robin C Luri X Reyleacute C et al 2012 AampA 543 A100Schoumlnrich R Aumer M Dec 2017 MNRAS 472 3979Schoumlnrich R Binney J Asplund M Feb 2012 MNRAS 420 1281Schroumlder SE Kaper L Lamers HJGLM Brown AGA Dec 2004 AampA

428 149Sesar B Fouesneau M Price-Whelan AM et al Apr 2017 ApJ 838 107Smith H Jr Eichhorn H Jul 1996 MNRAS 281 211Trumpler RJ Weaver HF 1953 Statistical astronomyTuron Lacarrieu C Creacutezeacute M Apr 1977 AampA 56 273Zinn JC Huber D Pinsonneault MH Stello D Aug 2017 ApJ 844 166

Article number page 19 of 20

AampA proofs manuscript no 32964_Arxiv

Appendix A Description of the simulated samplesused in this paper

The data used in this paper is from a Gaia Universe Model Snap-shot (GUMS) simulation (Robin C et al 2012) together withthe Gaia-like uncertainties and an estimation of the observabledata The uncertainties were computed from an implementationof the recipes described in the Gaia Science performance webpage4 provided by the python PyGaia toolkit5 which then haveto be re-scaled to fit the Gaia DR2 expectations

The simulation contains around 109 sources including onlysingle stars ie stars not belonging to multiple systems up toG lt 20 magnitude distributed as shown in Fig A1

0 5 10 15 20 25G magnitude [mag]

100

101

102

103

104

105

106

107

108

109

Num

ber

of

sourc

es

[log]

G magnitude histogram

Fig A1 Histogram of the starsrsquo G magnitude The total number ofsources is 1 069 138 714 which are distributed in bins of size ∆G = 02the sample is limited to G lt 20

To compute the Gaia potential observables standard uncer-tainties must be added to the simulation Because astrometricquantities (positions parallaxes and proper motions) are relateda single formalism to derive its standard errors is needed PyGaiaimplements a simple performance model depending on the VndashICcolour term and the G magnitude to estimate the end-of-missionerrors for the parallax uncertainty

σ$[microas] =radicminus1631 + 680766 z + 32732 z2 [0986+(1 minus 0986) (V-IC)]

(A1)

with

z = max(1004(1209minus15)1004(Gminus15)) (A2)

(see the Gaia science performance web for more details) It alsotakes into account the variation of the uncertainties over the skybecause of the scanning law tabulated as a function of the eclip-tic latitude6 β

However these are end-of-mission uncertainties so theyhave to be scaled by the fraction of mission time completed in or-der to get an estimation of them for Gaia DR2 In the case of the

parallax only the factorradic

5L being L the mission time (years)

included in Gaia DR2 need to be applied this error model is

4 httpswwwcosmosesaintwebgaiascience-performance5 httpspypipythonorgpypiPyGaia6 httpswwwcosmosesaintwebgaiatable-2-with-ascii

further described in Arenou et al (2017) In our case we havealso updated the calibration floor to take into account the prop-erties of the Gaia DR2 formal errors as shown in Fig A2 Thiscalibration floor introduces a minimum formal error stemmingfrom the fact that the calibrations used in the data processing(models and parameters) are at this stage still being refined Thisfloor affects mainly bright stars while for faint stars the photonnoise dominates Fig A2 shows the model used in the simula-tion for the Gaia DR2 parallax uncertainties as a function of theG magnitude

Fig A2 Average estimated errors as function of G The dash-dottedline represents the uncertainty at Gaia DR2 while the dashed linerepresents the end-of-mission uncertainties The solid line representsGaia DR2 errors including systematics

Article number page 20 of 20

  • 1 Introduction
  • 2 Gaia astrometric data
    • 21 Uncertainties
    • 22 Correlations
    • 23 Systematic errors
    • 24 Completeness
      • 3 Critical review of the traditional use of parallaxes
        • 31 Measurement of parallaxes
        • 32 Estimating distance by inverting the parallax
        • 33 Sample truncation
        • 34 Corrections and transformations
          • 341 The SmithndashEichhorn correction
          • 342 The LutzndashKelker correction
            • 35 Astrometry-based luminosity
              • 4 Recommendations for using astrometric data
                • 41 Using Gaia astrometric data how to proceed
                  • 411 Bayesian inference of distances from parallaxes
                  • 412 Bayesian inference of distances from parallaxes and complementary information
                    • 42 Absolute magnitudes tangential velocities and other derived quantities
                    • 43 Further recommendations
                      • 5 Using astrometric data practical examples
                        • 51 Comparison of distance and distance modulus estimators
                        • 52 Inferring the distance to a single source using just the parallax
                        • 53 Inferring the distance to and size of a cluster using just the parallaxes
                        • 54 Inferring the distance and velocity of a source using the parallax and proper motions
                        • 55 Luminosity calibration
                        • 56 Period-luminosity relation
                          • 6 Conclusions
                          • A Description of the simulated samples used in this paper
Page 18: Gaia Data Release 2: Using Gaia parallaxes · The Gaia Data Release 2 (Gaia DR2 ) Gaia Collaboration et al. (2018b) provides precise positions, proper motions, and paral-laxes for

AampA proofs manuscript no 32964_Arxiv

55 Luminosity calibration

In this tutorial (httpsgithubcomagabrownastrometry-inference-tutorialstreemasterluminosity-calibration) the problem of inferring (orcalibrating) the mean absolute magnitude of a specific classof stars is treated The measurements at hand are the parallaxand apparent magnitude for each of the stars in the sampleand the task is to infer their mean absolute magnitude microM andthe spread σM around this mean This is very similar to theproblem that Lutz amp Kelker (1973) and Turon Lacarrieu ampCreacutezeacute (1977) treated and a Bayesian approach to solving thisproblem was presented by Brown (2012) (albeit with the use ofimproper priors which we again note is bad practice) A morecomplex version of this problem (accounting for extinction anda contaminating population of stars) and its Bayesian solutionwas presented in Hawkins et al (2017) In this tutorial threeimportant points are illustrated

ndash Often the explicit computation of the distances to stars is notof interest In this example only the mean absolute magni-tude of the stars is to be estimated and the forward mod-elling approach as part of the Bayesian inference avoids theneed to calculate or estimate distances

ndash The data for all the stars carry information on the mean ab-solute magnitude including the negative parallaxes or par-allaxes with large relative errors This information can nat-urally be incorporated in a forward modelling approach (inthis example as part of a Bayesian inference method) thusavoiding the introduction of truncation biases caused by theselection of stars with lsquogoodrsquo parallaxes

ndash If the selection function is known (in this example the surveyis magnitude limited) it can and should be included in theforward modelling This accounts for sample selection biasesthat would otherwise occur

56 Period-luminosity relation

In this tutorial (httpsgithubcomagabrownastrometry-inference-tutorialstreemasterperiod-luminosity-relation) we include a hierarchicalmodel to infer period-luminosity-metallicity relations forclassical pulsating stars The full model can be applied tofundamental mode RR Lyrae stars and the abridged version(without the metallicity dependence) is suitable for samples ofclassical Cepheids We include the data set for the RR Lyraestars described and used for inference in Gaia Collaborationet al (2017a) and Delgado et al (2018) It contains a sampleof 200 stars (including fundamental radial pulsators but alsofundamentalised first overtone pulsators) with measured pe-riods apparent magnitudes in the K-band metallicities andparallaxes from the TGAS catalogue In the tutorial we describethe hierarchical model and discuss potential biases in the dataset Finally we analyse the sensitivity of the results to differentchoices of the priors and related parameters

6 Conclusions

Gaia data releases will provide a huge increase of astrometricdata available for the scientific community More than a billionparallaxes and proper motions allow new openings into many as-tronomical topics In most cases astronomers are exploiting theGaia catalogues to obtain physical quantities such as distanceand velocity Although it is easy to extract from the Gaia data it

is well known that direct inversion of parallax will lead to biaseswhich become more and more significant the larger the relativeparallax uncertainty While Gaia will provide high-quality astro-metric measurements hundreds of millions of stars have preci-sions which require proper statistical treatment in order to avoidbiased conclusions The aim of this paper is to guide the users ofGaia data to handle astrometric data correctly

In this study we summarise methods used to avoid bi-ases when converting astrometric data into physical quantitiesStarting from simple non-recommended sample truncation tomore complex methods the biases associated with the meth-ods are presented The basic recommendation is to treat deriva-tion of physical quantities from astrometric measurements asan inference problem which should be preferably handled withBayesian approach The recommended methods are described inSect 4 with a summary in Sect 43 To aid the users furtherSect 5 contains practical examples with links to Python and Rcode

Gaia will provide fundamental data for many fields of as-tronomy Further data releases will provide more data and moreprecise data Nevertheless for full use of the potential it will al-ways be necessary to pay careful attention to the statistical treat-ment of parallaxes and proper motions The purpose of this paperis to help astronomers find the correct approachAcknowledgements This work has made use of results from the European SpaceAgency (ESA) space mission Gaia the data from which were processed by theGaia Data Processing and Analysis Consortium (DPAC) Funding for the DPAChas been provided by national institutions in particular the institutions partici-pating in the Gaia Multilateral Agreement The Gaia mission website is httpwwwcosmosesaintgaia The authors are current or past members of theESA Gaia mission team and of the Gaia DPAC This work was supported by theMINECO (Spanish Ministry of Economy) through grant ESP2016-80079-C2-1-R (MINECOFEDER UE) and ESP2014-55996-C2-1-R (MINECOFEDERUE) and MDM-2014-0369 of ICCUB (Unidad de Excelencia lsquoMariacutea de Maeztursquo)and by the DLR (German space agency) via grant 50 QG 1403

ReferencesAnderson L Hogg DW Leistedt B Price-Whelan AM Bovy J Jun 2017

ArXiv e-printsArenou F Luri X 1999 In Egret D Heck A (eds) Harmonizing Cosmic

Distance Scales in a Post-HIPPARCOS Era vol 167 of Astronomical Societyof the Pacific Conference Series 13ndash32

Arenou F Luri X Babusiaux C et al Feb 2017 AampA 599 A50Astraatmadja TL Bailer-Jones CAL Dec 2016a ApJ 832 137Astraatmadja TL Bailer-Jones CAL Dec 2016b ApJ 833 119Bailer-Jones CAL Oct 2015 PASP 127 994Bovy J Sep 2017 MNRAS 470 1360Brown AGA Nov 2012 Statistical AstrometryBrown AGA Arenou F van Leeuwen F Lindegren L Luri X Aug 1997

In Bonnet RM Hoslashg E Bernacca PL et al (eds) Hipparcos - Venice rsquo97vol 402 of ESA Special Publication 63ndash68

de Bruijne JHJ Hoogerwerf R de Zeeuw PT Feb 2001 AampA 367 111Delgado HE Sarro LM Clementini G Muraveva T Garofalo A Mar 2018

ArXiv e-printsEddington AS Mar 1913 MNRAS 73 359ESA (ed) 1997 The HIPPARCOS and TYCHO catalogues Astrometric and

photometric star catalogues derived from the ESA HIPPARCOS Space As-trometry Mission vol 1200 of ESA Special Publication (available online athttpswwwcosmosesainthipparcos)

Gaia Collaboration Brown AGA Vallenari A et al Nov 2016 AampA 595A2

Gaia Collaboration Clementini G Eyer L et al Sep 2017a AampA 605 A79Gaia Collaboration van Leeuwen F Vallenari A et al May 2017b AampA 601

A19Gaia Collaboration Arenou F Luri X et al Apr 2018a AampA (special issue

for Gaia DR2)Gaia Collaboration Brown AGA Vallenari A Prusti T et al Apr 2018b

AampA (special issue for Gaia DR2)Gaia Collaboration Lindegren L Hernaacutendez J et al Apr 2018c AampA (special

issue for Gaia DR2)

Article number page 18 of 20

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Hawkins K Leistedt B Bovy J Hogg DW Oct 2017 MNRAS 471 722Iorio G Belokurov V Erkal D et al Feb 2018 MNRAS 474 2142Leistedt B Hogg DW Dec 2017a AJ 154 222Leistedt B Hogg DW Mar 2017b ArXiv e-printsLindegren L Madsen S Dravins D Apr 2000 AampA 356 1119Lindegren L Lammers U Hobbs D et al Feb 2012 AampA 538 A78Lindegren L Lammers U Bastian U et al Nov 2016 AampA 595 A4Lutz TE Kelker DH Oct 1973 PASP 85 573Malmquist GK Mar 1920 Meddelanden fran Lunds Astronomiska Observato-

rium Serie II 22 3Palmer M Arenou F Luri X Masana E Apr 2014 AampA 564 A49Robin C Luri X Reyleacute C et al 2012 AampA 543 A100Schoumlnrich R Aumer M Dec 2017 MNRAS 472 3979Schoumlnrich R Binney J Asplund M Feb 2012 MNRAS 420 1281Schroumlder SE Kaper L Lamers HJGLM Brown AGA Dec 2004 AampA

428 149Sesar B Fouesneau M Price-Whelan AM et al Apr 2017 ApJ 838 107Smith H Jr Eichhorn H Jul 1996 MNRAS 281 211Trumpler RJ Weaver HF 1953 Statistical astronomyTuron Lacarrieu C Creacutezeacute M Apr 1977 AampA 56 273Zinn JC Huber D Pinsonneault MH Stello D Aug 2017 ApJ 844 166

Article number page 19 of 20

AampA proofs manuscript no 32964_Arxiv

Appendix A Description of the simulated samplesused in this paper

The data used in this paper is from a Gaia Universe Model Snap-shot (GUMS) simulation (Robin C et al 2012) together withthe Gaia-like uncertainties and an estimation of the observabledata The uncertainties were computed from an implementationof the recipes described in the Gaia Science performance webpage4 provided by the python PyGaia toolkit5 which then haveto be re-scaled to fit the Gaia DR2 expectations

The simulation contains around 109 sources including onlysingle stars ie stars not belonging to multiple systems up toG lt 20 magnitude distributed as shown in Fig A1

0 5 10 15 20 25G magnitude [mag]

100

101

102

103

104

105

106

107

108

109

Num

ber

of

sourc

es

[log]

G magnitude histogram

Fig A1 Histogram of the starsrsquo G magnitude The total number ofsources is 1 069 138 714 which are distributed in bins of size ∆G = 02the sample is limited to G lt 20

To compute the Gaia potential observables standard uncer-tainties must be added to the simulation Because astrometricquantities (positions parallaxes and proper motions) are relateda single formalism to derive its standard errors is needed PyGaiaimplements a simple performance model depending on the VndashICcolour term and the G magnitude to estimate the end-of-missionerrors for the parallax uncertainty

σ$[microas] =radicminus1631 + 680766 z + 32732 z2 [0986+(1 minus 0986) (V-IC)]

(A1)

with

z = max(1004(1209minus15)1004(Gminus15)) (A2)

(see the Gaia science performance web for more details) It alsotakes into account the variation of the uncertainties over the skybecause of the scanning law tabulated as a function of the eclip-tic latitude6 β

However these are end-of-mission uncertainties so theyhave to be scaled by the fraction of mission time completed in or-der to get an estimation of them for Gaia DR2 In the case of the

parallax only the factorradic

5L being L the mission time (years)

included in Gaia DR2 need to be applied this error model is

4 httpswwwcosmosesaintwebgaiascience-performance5 httpspypipythonorgpypiPyGaia6 httpswwwcosmosesaintwebgaiatable-2-with-ascii

further described in Arenou et al (2017) In our case we havealso updated the calibration floor to take into account the prop-erties of the Gaia DR2 formal errors as shown in Fig A2 Thiscalibration floor introduces a minimum formal error stemmingfrom the fact that the calibrations used in the data processing(models and parameters) are at this stage still being refined Thisfloor affects mainly bright stars while for faint stars the photonnoise dominates Fig A2 shows the model used in the simula-tion for the Gaia DR2 parallax uncertainties as a function of theG magnitude

Fig A2 Average estimated errors as function of G The dash-dottedline represents the uncertainty at Gaia DR2 while the dashed linerepresents the end-of-mission uncertainties The solid line representsGaia DR2 errors including systematics

Article number page 20 of 20

  • 1 Introduction
  • 2 Gaia astrometric data
    • 21 Uncertainties
    • 22 Correlations
    • 23 Systematic errors
    • 24 Completeness
      • 3 Critical review of the traditional use of parallaxes
        • 31 Measurement of parallaxes
        • 32 Estimating distance by inverting the parallax
        • 33 Sample truncation
        • 34 Corrections and transformations
          • 341 The SmithndashEichhorn correction
          • 342 The LutzndashKelker correction
            • 35 Astrometry-based luminosity
              • 4 Recommendations for using astrometric data
                • 41 Using Gaia astrometric data how to proceed
                  • 411 Bayesian inference of distances from parallaxes
                  • 412 Bayesian inference of distances from parallaxes and complementary information
                    • 42 Absolute magnitudes tangential velocities and other derived quantities
                    • 43 Further recommendations
                      • 5 Using astrometric data practical examples
                        • 51 Comparison of distance and distance modulus estimators
                        • 52 Inferring the distance to a single source using just the parallax
                        • 53 Inferring the distance to and size of a cluster using just the parallaxes
                        • 54 Inferring the distance and velocity of a source using the parallax and proper motions
                        • 55 Luminosity calibration
                        • 56 Period-luminosity relation
                          • 6 Conclusions
                          • A Description of the simulated samples used in this paper
Page 19: Gaia Data Release 2: Using Gaia parallaxes · The Gaia Data Release 2 (Gaia DR2 ) Gaia Collaboration et al. (2018b) provides precise positions, proper motions, and paral-laxes for

X Luri et al Gaia Data Release 2 Using Gaia parallaxes

Hawkins K Leistedt B Bovy J Hogg DW Oct 2017 MNRAS 471 722Iorio G Belokurov V Erkal D et al Feb 2018 MNRAS 474 2142Leistedt B Hogg DW Dec 2017a AJ 154 222Leistedt B Hogg DW Mar 2017b ArXiv e-printsLindegren L Madsen S Dravins D Apr 2000 AampA 356 1119Lindegren L Lammers U Hobbs D et al Feb 2012 AampA 538 A78Lindegren L Lammers U Bastian U et al Nov 2016 AampA 595 A4Lutz TE Kelker DH Oct 1973 PASP 85 573Malmquist GK Mar 1920 Meddelanden fran Lunds Astronomiska Observato-

rium Serie II 22 3Palmer M Arenou F Luri X Masana E Apr 2014 AampA 564 A49Robin C Luri X Reyleacute C et al 2012 AampA 543 A100Schoumlnrich R Aumer M Dec 2017 MNRAS 472 3979Schoumlnrich R Binney J Asplund M Feb 2012 MNRAS 420 1281Schroumlder SE Kaper L Lamers HJGLM Brown AGA Dec 2004 AampA

428 149Sesar B Fouesneau M Price-Whelan AM et al Apr 2017 ApJ 838 107Smith H Jr Eichhorn H Jul 1996 MNRAS 281 211Trumpler RJ Weaver HF 1953 Statistical astronomyTuron Lacarrieu C Creacutezeacute M Apr 1977 AampA 56 273Zinn JC Huber D Pinsonneault MH Stello D Aug 2017 ApJ 844 166

Article number page 19 of 20

AampA proofs manuscript no 32964_Arxiv

Appendix A Description of the simulated samplesused in this paper

The data used in this paper is from a Gaia Universe Model Snap-shot (GUMS) simulation (Robin C et al 2012) together withthe Gaia-like uncertainties and an estimation of the observabledata The uncertainties were computed from an implementationof the recipes described in the Gaia Science performance webpage4 provided by the python PyGaia toolkit5 which then haveto be re-scaled to fit the Gaia DR2 expectations

The simulation contains around 109 sources including onlysingle stars ie stars not belonging to multiple systems up toG lt 20 magnitude distributed as shown in Fig A1

0 5 10 15 20 25G magnitude [mag]

100

101

102

103

104

105

106

107

108

109

Num

ber

of

sourc

es

[log]

G magnitude histogram

Fig A1 Histogram of the starsrsquo G magnitude The total number ofsources is 1 069 138 714 which are distributed in bins of size ∆G = 02the sample is limited to G lt 20

To compute the Gaia potential observables standard uncer-tainties must be added to the simulation Because astrometricquantities (positions parallaxes and proper motions) are relateda single formalism to derive its standard errors is needed PyGaiaimplements a simple performance model depending on the VndashICcolour term and the G magnitude to estimate the end-of-missionerrors for the parallax uncertainty

σ$[microas] =radicminus1631 + 680766 z + 32732 z2 [0986+(1 minus 0986) (V-IC)]

(A1)

with

z = max(1004(1209minus15)1004(Gminus15)) (A2)

(see the Gaia science performance web for more details) It alsotakes into account the variation of the uncertainties over the skybecause of the scanning law tabulated as a function of the eclip-tic latitude6 β

However these are end-of-mission uncertainties so theyhave to be scaled by the fraction of mission time completed in or-der to get an estimation of them for Gaia DR2 In the case of the

parallax only the factorradic

5L being L the mission time (years)

included in Gaia DR2 need to be applied this error model is

4 httpswwwcosmosesaintwebgaiascience-performance5 httpspypipythonorgpypiPyGaia6 httpswwwcosmosesaintwebgaiatable-2-with-ascii

further described in Arenou et al (2017) In our case we havealso updated the calibration floor to take into account the prop-erties of the Gaia DR2 formal errors as shown in Fig A2 Thiscalibration floor introduces a minimum formal error stemmingfrom the fact that the calibrations used in the data processing(models and parameters) are at this stage still being refined Thisfloor affects mainly bright stars while for faint stars the photonnoise dominates Fig A2 shows the model used in the simula-tion for the Gaia DR2 parallax uncertainties as a function of theG magnitude

Fig A2 Average estimated errors as function of G The dash-dottedline represents the uncertainty at Gaia DR2 while the dashed linerepresents the end-of-mission uncertainties The solid line representsGaia DR2 errors including systematics

Article number page 20 of 20

  • 1 Introduction
  • 2 Gaia astrometric data
    • 21 Uncertainties
    • 22 Correlations
    • 23 Systematic errors
    • 24 Completeness
      • 3 Critical review of the traditional use of parallaxes
        • 31 Measurement of parallaxes
        • 32 Estimating distance by inverting the parallax
        • 33 Sample truncation
        • 34 Corrections and transformations
          • 341 The SmithndashEichhorn correction
          • 342 The LutzndashKelker correction
            • 35 Astrometry-based luminosity
              • 4 Recommendations for using astrometric data
                • 41 Using Gaia astrometric data how to proceed
                  • 411 Bayesian inference of distances from parallaxes
                  • 412 Bayesian inference of distances from parallaxes and complementary information
                    • 42 Absolute magnitudes tangential velocities and other derived quantities
                    • 43 Further recommendations
                      • 5 Using astrometric data practical examples
                        • 51 Comparison of distance and distance modulus estimators
                        • 52 Inferring the distance to a single source using just the parallax
                        • 53 Inferring the distance to and size of a cluster using just the parallaxes
                        • 54 Inferring the distance and velocity of a source using the parallax and proper motions
                        • 55 Luminosity calibration
                        • 56 Period-luminosity relation
                          • 6 Conclusions
                          • A Description of the simulated samples used in this paper
Page 20: Gaia Data Release 2: Using Gaia parallaxes · The Gaia Data Release 2 (Gaia DR2 ) Gaia Collaboration et al. (2018b) provides precise positions, proper motions, and paral-laxes for

AampA proofs manuscript no 32964_Arxiv

Appendix A Description of the simulated samplesused in this paper

The data used in this paper is from a Gaia Universe Model Snap-shot (GUMS) simulation (Robin C et al 2012) together withthe Gaia-like uncertainties and an estimation of the observabledata The uncertainties were computed from an implementationof the recipes described in the Gaia Science performance webpage4 provided by the python PyGaia toolkit5 which then haveto be re-scaled to fit the Gaia DR2 expectations

The simulation contains around 109 sources including onlysingle stars ie stars not belonging to multiple systems up toG lt 20 magnitude distributed as shown in Fig A1

0 5 10 15 20 25G magnitude [mag]

100

101

102

103

104

105

106

107

108

109

Num

ber

of

sourc

es

[log]

G magnitude histogram

Fig A1 Histogram of the starsrsquo G magnitude The total number ofsources is 1 069 138 714 which are distributed in bins of size ∆G = 02the sample is limited to G lt 20

To compute the Gaia potential observables standard uncer-tainties must be added to the simulation Because astrometricquantities (positions parallaxes and proper motions) are relateda single formalism to derive its standard errors is needed PyGaiaimplements a simple performance model depending on the VndashICcolour term and the G magnitude to estimate the end-of-missionerrors for the parallax uncertainty

σ$[microas] =radicminus1631 + 680766 z + 32732 z2 [0986+(1 minus 0986) (V-IC)]

(A1)

with

z = max(1004(1209minus15)1004(Gminus15)) (A2)

(see the Gaia science performance web for more details) It alsotakes into account the variation of the uncertainties over the skybecause of the scanning law tabulated as a function of the eclip-tic latitude6 β

However these are end-of-mission uncertainties so theyhave to be scaled by the fraction of mission time completed in or-der to get an estimation of them for Gaia DR2 In the case of the

parallax only the factorradic

5L being L the mission time (years)

included in Gaia DR2 need to be applied this error model is

4 httpswwwcosmosesaintwebgaiascience-performance5 httpspypipythonorgpypiPyGaia6 httpswwwcosmosesaintwebgaiatable-2-with-ascii

further described in Arenou et al (2017) In our case we havealso updated the calibration floor to take into account the prop-erties of the Gaia DR2 formal errors as shown in Fig A2 Thiscalibration floor introduces a minimum formal error stemmingfrom the fact that the calibrations used in the data processing(models and parameters) are at this stage still being refined Thisfloor affects mainly bright stars while for faint stars the photonnoise dominates Fig A2 shows the model used in the simula-tion for the Gaia DR2 parallax uncertainties as a function of theG magnitude

Fig A2 Average estimated errors as function of G The dash-dottedline represents the uncertainty at Gaia DR2 while the dashed linerepresents the end-of-mission uncertainties The solid line representsGaia DR2 errors including systematics

Article number page 20 of 20

  • 1 Introduction
  • 2 Gaia astrometric data
    • 21 Uncertainties
    • 22 Correlations
    • 23 Systematic errors
    • 24 Completeness
      • 3 Critical review of the traditional use of parallaxes
        • 31 Measurement of parallaxes
        • 32 Estimating distance by inverting the parallax
        • 33 Sample truncation
        • 34 Corrections and transformations
          • 341 The SmithndashEichhorn correction
          • 342 The LutzndashKelker correction
            • 35 Astrometry-based luminosity
              • 4 Recommendations for using astrometric data
                • 41 Using Gaia astrometric data how to proceed
                  • 411 Bayesian inference of distances from parallaxes
                  • 412 Bayesian inference of distances from parallaxes and complementary information
                    • 42 Absolute magnitudes tangential velocities and other derived quantities
                    • 43 Further recommendations
                      • 5 Using astrometric data practical examples
                        • 51 Comparison of distance and distance modulus estimators
                        • 52 Inferring the distance to a single source using just the parallax
                        • 53 Inferring the distance to and size of a cluster using just the parallaxes
                        • 54 Inferring the distance and velocity of a source using the parallax and proper motions
                        • 55 Luminosity calibration
                        • 56 Period-luminosity relation
                          • 6 Conclusions
                          • A Description of the simulated samples used in this paper