The Chandra–Gaia Catalog of Counterparts: Resolving Ambiguous Gaia Matches to X-ray Sources Using Machine Learning

V. Samuel Pérez-Díaz1,2,3,4,5, Vinay L. Kashyap2, Joshua D. Ingram2,6,7, David Fouhey1, Juan Rafael Martínez-Galarza2, Pavlos Protopapas3, Jeremy J. Drake8, Dong-Woo Kim2, Cecilia Garraffo2

1 Courant Institute, New York University    2 Center for Astrophysics | Harvard & Smithsonian    3 Harvard John A. Paulson School of Engineering and Applied Sciences    4 Universidad del Rosario    5 NSF AI Institute for Artificial Intelligence and Fundamental Interactions    6 Carnegie Mellon University    7 New College of Florida    8 Lockheed Martin Solar and Astrophysics Laboratory

Pipeline diagram: spatial cross-match with NWAY followed by LightGBM scoring to select Chandra–Gaia counterparts

Abstract

We present a framework to cross-match sources from the Chandra Source Catalog (CSC v2.1) with optical sources from Gaia Data Release 3. Unlike purely spatial approaches, we use source properties such as magnitudes, colors, and distances to identify true counterparts, detect chance coincidences, and resolve ambiguities when multiple plausible candidates exist. We define a training set of high-confidence matches using NWAY, a Bayesian cross-matching framework that accounts for positional errors and source densities.
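To see why source density matters for spotting chance coincidences, consider a rough sketch that is independent of NWAY's actual implementation. If unrelated optical sources are assumed to be scattered uniformly and independently (a Poisson field) with surface density ρ, the probability that at least one falls within a radius r of an X-ray position is 1 − exp(−ρπr²). The function name and the example densities below are illustrative assumptions, not values from the catalog.

```python
import math


def chance_match_probability(density_per_arcsec2: float, radius_arcsec: float) -> float:
    """Probability that at least one unrelated source lies within the radius.

    Assumes unrelated sources follow a uniform Poisson field; real fields
    (clusters, the Galactic plane) deviate from this idealization.
    """
    expected = density_per_arcsec2 * math.pi * radius_arcsec ** 2
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for small x.
    return -math.expm1(-expected)


# Hypothetical densities (sources per square arcsecond), for illustration only.
sparse_field = chance_match_probability(0.001, 1.0)
crowded_field = chance_match_probability(0.05, 1.0)
print(f"sparse: {sparse_field:.4f}, crowded: {crowded_field:.4f}")
```

For the same search radius, the crowded field yields a much higher chance-match probability, which is why separation alone cannot distinguish true counterparts from coincidences in dense regions.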

We train a gradient-boosted classifier (LightGBM) on a variety of features from both catalogs. Of the ≈254k unique X-ray sources, we find counterparts for ≈113k, of which ≈7k have multiple plausible counterparts. For ≈20k sources, separation-based cross-matching finds a match but our classifier does not; we attribute half of these to chance coincidences.
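The three outcomes described above (a single counterpart, multiple plausible counterparts, or no counterpart) can be illustrated with a minimal sketch of how per-pair classifier scores might be turned into a catalog. The abstract does not state the actual selection rule, so the threshold of 0.5, the function name `select_counterparts`, and the identifiers in the example are all assumptions made for illustration.

```python
from collections import defaultdict


def select_counterparts(scored_pairs, threshold=0.5):
    """Group scored (xray_id, gaia_id, score) pairs by X-ray source.

    The highest-scoring candidate at or above the (assumed) threshold becomes
    the best match; other candidates above it are kept as alternatives.
    """
    by_source = defaultdict(list)
    for xray_id, gaia_id, score in scored_pairs:
        by_source[xray_id].append((score, gaia_id))

    catalog = {}
    for xray_id, candidates in by_source.items():
        plausible = sorted((c for c in candidates if c[0] >= threshold), reverse=True)
        catalog[xray_id] = {
            "best": plausible[0][1] if plausible else None,
            "alternatives": [gaia_id for _, gaia_id in plausible[1:]],
        }
    return catalog


# Hypothetical identifiers and scores, for illustration only.
pairs = [
    ("X1", "G10", 0.92),               # single clear counterpart
    ("X2", "G20", 0.81), ("X2", "G21", 0.64),  # two plausible candidates
    ("X3", "G30", 0.12),               # nearby source rejected by the classifier
]
catalog = select_counterparts(pairs)
print(catalog)
```

Here `X3` has a spatially nearby candidate but receives no counterpart, mirroring the case where separation-based matching finds a match that the classifier rejects.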

We validate the pipeline on the Chandra Orion Ultradeep Project (COUP), where the machine-learning matches reproduce 95% of NWAY cross-matches without using any positional information. We release a catalog of the ≈113k Chandra–Gaia counterparts, together with ≈7k alternative matches and ≈20k ambiguous NWAY associations, supporting future population studies of sources detectable by both Chandra and Gaia. We discuss limitations and provide a generalization of the framework that is applicable in other cross-matching scenarios.