• DaCHS is now at Version 2.7

    Logo-ish 2.7 with a multi-array plot

    Last Friday, I have released Version 2.7 of GAVO's Virtual Observatory server package DaCHS. As is customary, I will give a brief overview of the more noteworthy changes in this blog post. This is probably only of interest to people running DaCHS-based data centres. What I discuss here is both a bit more verbose and a bit less extensive than what you find in the Changes file (when installed from package, you would read it by running /usr/share/doc/python3-gavo/changelog.gz).

    The highlight in this release from my view are simple, numpy-like vector operations in ADQL. Regular readers of this blog will already have seen an example for their use. This is altogether a prototype, which is why what specification is there is only on the IVOA wiki. It is thus likely some details of the vector math will change until they make it into any sort of standard (I am hoping for ADQL 2.2). This should not keep you from trying it out and telling your users about it.

    In that same vein, the FITS binary table grammar now copes with vectors, which makes it easier to populate tables that make these operations useful, and for the sort of large tables where the array magic has particularly much promise, it is now a lot simpler to feed array-valued columns with C boosters.

    Other ADQL work includes the addition of proper, standards-compliant epoch propagation (i.e., “application of proper motion and radial velocity“) in the form of the ivo_epoch_prop and ivo_epoch_prop_pos user defined functions. Regrettably, this will not immediately work for you, as it builds on a feature in pgsphere that upstream has not merged yet; comments on that PR will certainly help make that happen. Of course, if you want, you can just build the pgsphere branch containing the new feature yourself. To make up for this complication, DaCHS will no longer advertise UDFs that will not work given the database extensions present – which will help me be a bit more liberal in letting in UDFs wrapping functionality not in Postgres' default distribution in the future.

    If you run datalink services and have multiple items with the same semantics, you may be interested in using local_semantics in Datalink. The use case here is that clients like TOPCAT will remain on, say, light curves in a red filter when the user jumps between records rather than randomly switching between red and blue ones when both have #coderived semantics (Mark's proposal). If you have data of this kind: you can now pass a localSemantics parameter to the makeLink and makeLinkFromFile methods of datalink descriptors; what string you use is up to you, as long as it's the same between similar rows for different datasets.

    I tend to forget that surprisingly many people actually do something with the ADQL form you get on DaCHS' web interface rather than use a TAP client. Well, a DaCHS operator complained about really sub-standard table headings in the HTML tables coming out of this service. Looking again, I had to admit he was right. So, TAP columns now have more meaningful table headings; in particular, if you write expressions, up to a certain length these expressions will be used as table headings. At least in this respect the ADQL form now has an advantage over using a proper client.

    In case you have a processor doing astrometric calibration with astrometry.net (you probably don't because it would have been very hard to make that work on without a lot of hacks so far) – have another look at the documentation because I have had various reasons to change api.AnetHeaderProcessor's API in quite a number of ways. It's now a lot easier to use with astrometry.net and source-extractor as distributed by Debian, but I'd still not have broken the API so badly if I had suspected anyone but me had significant code against this.

    I should also warn you that DaCHS now uses astropy to format sexagesimal times and coordinates. This is probably welcome news to those who ever encountered one of DaCHS' 05:59:60 outputs (which happened due to the way it did its rounding). Still, if you have regression tests testing for strings like that, you will need to update them.

    From the many minor fixes I should probably mention that DaCHS is now ready for Postgres 15 (which will probably the Postgres version in the next Debian stable). This used to be broken on new installations because Postgres 15 no longer lets normal users write to the public schema. DaCHS needs a database role that can do this, though, because it defines public functions. Since version 2.7, it does the necessary setup to make this possible. If you make your public schema non-world-writable manually – Postgres upgrades will not do that for you, and I would say there is no strong reason to do so for databases backing DaCHS –, do not forget to GRANT ALL ON SCHEMA public TO gavoadmin.

    With this – don't wait, upgrade. If you have GAVO's repository enabled, apt update && apt upgrade would probably do the trick, though of course I recommend having a look at our upgrading guide for robustness and good housekeeping.

  • Another Virtual Interop

    A part of a presentation slide, containing the sentence “to select single/multiple rows from plot use Handles layer

    One thing you could learn at this interop: How to identify the source row of a line in the TOPCAT's XYArray plot. See the end of this post for where this comes from.

    It's Interop time again! That is, most of the people involved in developing the Virtual Observatory (or for it) will report on what they have been up to since the last Interop, and what they are planning for the near-ish future. It is again an online meeting, so if interested, you could still register and then attend a couple of our sessions.

    You will see me as a chair (but for the first time since I became chair there not as a speaker) in Semantics, and I'll have talks in Registry (obligatorily) and DAL 1, though regular readers of this blog will have a few déjà vus.

    I plan to update this post as the meeting progresses – so, perhaps check back a few times until thursday.

    Update 2022-10-18, 15:00 UTC: I was expecting the VO in the Cloud Plenary with quite a bit of anxiety, because “in the cloud“ these days tends to mean “stuff things into proprietary walled gardens“. The first input talk turned out to be quite a bit less scary: Data providers want to have links to commerical cloud providers in addition to http download links. That's reasonable given users may want to optimise accesses for large data sets, and seeing that most respondents pointed to Datalink as the way to do that (as I did) was nice. The devil is in the details, though: Making good concepts that let clients figure out what are, in a sense, “equivalent“ ways to obtain the data is probably hard. The one thing I'm sure about is that I don't want concepts like #aws-metadata in datalink/core.

    And the rest of the session was rather a “how VO standards are or may be useful to us“ rather than the “dump the old open rubbish and move on to walled gardens“ I was worrying about. So… excellent!

    Update 2022-10-18, 21:10: Sitting in the DAL 1 Session, I am seriously tempted to become a gardener while listening to Tom's talk on Firewalls against ADQL. I have to thank U Heidelberg for hosting our services without horrible “Web Application Firewalls“ or trying to hack into https connections to “sanitise“ requests. At STScI, it seems the density of snake oil “security appliances“ is so high that at least somewhat advanced network usage like TAP and ADQL becomes really shaky.

    Can we just genrally disarm and perhaps, if SQL injection really is a problem in individual cases, just hire programmers on permanent contracts (meaning: they'll aquire sufficient experience) and/or reviewers for the software we run facing the net? It's not like SQL injection is just bad luck. It's a bug in every single case, and a sort of bug that's relatively simple to avoid – simpler in any case than detecting SQL injection attempts with a reasonable false-positive rate.

    Update 2022-10-20, 5:00 UTC: Yesterday, I had reasons both for rejoicing and for wishing for a brown bag. The rejoicing part was (for instance) in the solar system session, where Steve Joy reported on getting PDS Planetary Plasma Interactions (PPI) data into the VO – that's a good thing no matter what, especially given that I have a very soft spot for solar system data anyway. As the main author of DaCHS, however, I was particularly happy to see PPI are using it to talk to the VO. DaCHS thus is now running in Los Angeles, too. Hollywood, practically.

    The brown bag moment came in the Registry session; while my talks I think went fine – one of them basically being the oral version of a post from this blog –, Tom's talk on pyvo.registry made me cringe because he pointed out a bad interoperability sin on my side. The problem was not that my code unconcernedly uses COALESCE. From private mails I had understood, perhaps somewhat over-optimistically, that RegTAP operators had greenlighted that after my DAL post from last December, and it's a really simple extension anyway. I give you, though, that I should have ensured that COALESCE really had arrived on the servers before pushing for merging the new regsearch code into pyVO.

    No, what's really embarrassing is the UNION business. You see, the regsearch keyword constraint looks for the words in multiple places, and so it does something conceptually like WHERE keyword matches table1.descripition OR keyword matches table2.subject. Such cross-table ORs are generally extremely hard to plan for the database server, and thus when I re-wrote query generation for the RegTAP keyword search I just put in UNION – queries are really two orders of magnitude faster on my server this way.

    However, UNION has not been part of ADQL 2.0, and although I've lobbied for the set operators for a about a decade now, they are not formally part of ADQL yet. They will be part of ADQL 2.1, but even then they will not be mandatory. Hence, I should not blindly have employed UNION in code supposed to be interoperable, even less so because I can actually programmatically figure out whether a service supports UNION (from the TAP capabilities) and hence could have put in a fallback for where it's unavailable. Aw, dang.

    Update 2022-10-20, 20:00 UTC Just two sessions to go – Radio and Closing, though that little rest will be a challenge, with the closing session ending at 1 am my time.

    Thus, in the midnight hour, for the Semantics working group I will report on our session, which had quite a bit of rather deep plumbing this time. For instance, for the update to our standard on unit syntax, Norman raised the question whether “%“ ought to be a legal unit, and if so, if there's any way to keep ppm, ppb, and ppt out (؉ or ‰, on the other hand, are easy to keep out: We're really stubbornly insisting on pure ASCII). This may border on bikeshedding, but it has very concrete consequences on clients (such as astropy's unit parser) and services (where, for instance, VizieR has to cope with submissions that have columns given in percent). Before the session, it looked like we'd just let in percent, and that only grudgingly. Now… it's likely we will have to be more liberal.

    Great news in the session was that there is now a prototype of a Rosetta Stone for facility names in Paris, that is, a service that lets you map between all the different names your typical observation facility has (for instance, the part of my institute that is up on the mountain could be known as Königstuhl Observatory, Landessternwarte Königstuhl, LSW, Zentrum für Astronomie Heidelberg, and much more). If you have never tried linking all these various names up, you will be surprised how hard that problem is. See Baptiste's slides for how they are tackling it and how they are applying hardcore Semantics tech – in particular, SPARQL – to do it. I liked it a lot.

    Another talk I would like to call out is Steve Crawford's from the session of the Data Curation and Preservation IG. His recommendation to go with CC0 for, well, licensing, is something I can only support exactly because it is not a licence at all, which relieves you of the troublesome problem of assinging copyright so someone. That triviality is only the first of several legal problems we have since we have put the IVOA documents under CC-BY. But since nobody is ever going to court about any of this, the legal trouble is perhaps not terribly worrying. What is nasty about CC-BY is that whatever is licensed CC-BY is (generally) incompatible with the GPL and many other software licenses, which means you will get in trouble if you try to package it with something destined for Debian. And Steve makes some excellent points why CCO is just fine for science data.

    Finally, if you liked the posts on array plotting in TOPCAT and usage in ADQL, you should definitely have a look at Mark's talk in this morning's Apps session, where he in particular shows how you can go from a line in the array plot back to the row that contains the array.

    And with that I've told you where the opening slide fragment came from. Good night!

  • A Proposed Vector Extension for ADQL

    When I showed off my rendering of the Gaia DR 3 XP spectra a month ago, I promised I would later show how my proposal for a Vector extension to ADQL would enable quite a bit of interesting functionality on that table. Let me make good on this promise with a little project to find candidates for Wolf-Rayet stars, more specifically, WC stars. I give you that's a bit cheap because they have very distinctive spectral features, but then this is supposed to be a quick educational posting, not a science paper.

    Before I start, I probably should stress that in this context I am using the word “vector” like a computer guy would. We are talking about one-dimensional arrays here, not about the vectors the mathematicians have given us (as in “elements of vector spaces“).

    Getting Spectra for Wolf-Rayet Stars

    Let us first produce a list of Wolf-Rayet stars (of any denomination) using SIMBAD. So, start TOPCAT, open the TAP dialog and find the SIMBAD TAP server. Run:

    SELECT main_id, ra, dec
    FROM basic
    WHERE otype='WR*'

    there[1]. In the next step, we will need the Gaia DR3 source_ids for these objects, and so it would be nice to pull them immediately from Simbad; you could in principle do that by running:

    SELECT main_id, ra, dec, id
    FROM basic
      JOIN ident
      ON (oid=oidref)
    WHERE otype='WR*'
      AND id LIKE 'Gaia DR3%'

    – but for one, fiddling out the actual source_ids from the strings you get in the id column is a bit tedious, and then quite a few of these objects don't have Gaia ids yet: The first query returns 1548 at the moment, the second 696.

    If we had the source_ids, we could immediately join with the gdr3spec.spectra table at the GAVO DC TAP (http://dc.g-vo.org/tap). As I mentioned a month ago, this is a physical table just consisting of the source id and arrays of flux and flux errors. There is also gdr3spec.withpos that has positions and the spectra; but that's a view, and for the time being that means that the planner will quite likely get confused when positional constraints come into play[2]. The result would be queries running for half an hour when a few seconds would do just as well.

    On services already supporting ADQL 2.1, you can usually work around problems of this kind by re-writing your query to use CTEs (“WITH”), because these often work as planner barriers. In the present case, we first get source_ids for our Simbad objects in a CTE and then use these to join with our spectra table, like this:

    WITH wrids AS (
            SELECT source_id, main_id
            FROM gaia.edr3lite AS l
        JOIN tap_upload.t1 AS u
            ON (DISTANCE(l.ra, l.dec, u.ra, u.dec)<0.001))
    SELECT main_id, source_id, flux, flux_error
    FROM gdr3spec.spectra
    JOIN wrids USING (source_id)

    As usual, you will probably have to adapt the number in what is tap_upload.t1 here to the table index you have for your SIMBAD result.

    This yields 574 spectra at the moment, within a few seconds. Spectra for about a third of our collection of objects: I'd say that's quite impressive.

    Investigating the spectra

    The September post already discussed a few aspects of array plotting. In short: try the plane plot and an XYArray control. Modern TOPCATs (and for what I'm doing here it's wise to use something newer than 4.8-7) will automatically figure out suitable columns for the x and y axes (except it forgets to label the spectral coordinate with its unit, nm):

    Plot: lots of wiggly lines

    There's a quite a bit of crowding here; finding global characteristics perhaps works better when you switch the ordinate to logarithmic, use transparent shading (in the Form tab) and raise the opaque limit a bit. This could give you something like:

    Plot: hazy structures in blue

    You can see that at least quite a few objects have nice and strong emission lines, as I had hoped for when choosing WC stars for this example. What if we could pick them out to build a template spectrum? Well, let's try. With the new vector math in ADQL, the database can normalise the spectra and compute a few statistics on them.

    But first: In order to get rid of the source_id-computing CTE above, let me obtain the source_ids I want to work with once and for all, as in:

    SELECT source_id, main_id
    FROM gaia.edr3lite AS l
      JOIN tap_upload.t1 AS u
            ON (DISTANCE(l.ra, l.dec, u.ra, u.dec)<0.001)

    Memorise the Table List index of the result. With that, you can directly work with the gdr3spec.spectra table and experiment a bit; for me, the index was 10, and hence I use tap_uploads.t10 below.

    Computing some statistics

    The Gaia XP spectra are flux calibrated, and hence I will have to normalise them if I want to compare them. Ignoring all errors and thus in particular the fact that some (few) components are negative, this normalisation is harmless: We just divide by the sum of all vector components. The net result is that, were our spectra continuous, the integral over them would be one. And let's then use the standard deviation and the value of the 19th array element as metrics:

    WITH normalised AS (
    SELECT source_id, main_id,
            flux/arr_sum(flux) as nflux, flux
    FROM gdr3spec.spectra
    JOIN tap_upload.t10
    USING (source_id))
      source_id, main_id,
      flux, nflux,
      arr_avg(nflux*nflux)-POWER(arr_avg(nflux),2) AS sd,
      nflux[19] as em
    FROM normalised

    You can see quite a bit of the vector extension here:

    • arr_sum and arr_avg: These work as the aggregate functions in SQL do, just not on tuples but on the components of the vectors.
    • Multiplication of vectors is element-wise, so nflux*nflux computes a vector of the squares of the components of nflux. That's also true for all other basic arithmetic operators. If you know numpy: same thing.
    • You fetch individual elements in the [] notation you probably know from Python or C. Contrary to Python and C, common SQL implementations count indexes from 1 by default, and we are keeping that here (for now). Fortran lovers rejoice!

    Why did I use nflux[19]? Well, there is a reasonably strong emission feature in many spectra at about 580 nm (it's three-time ionised carbon, or C IV in the rotten notation I usually apologise for when talking to non-astronomers), which is, as experts tell me (thanks, Andreas!), rather characteristic for the WC stars that I'm after (whereas the even stronger feature on the left, around 470 nm, can also be Helium).

    If you inspect the spectral coordinate that TOPCAT has on the abscissa (it's a param, so you'll have to go to ViewsTable Parameters to see it), you will see that each spectral bin is 10 nm wide. So, I will hopefully hit the 580 nm feature when fetching the 19th element of the spectral vector.

    If you plot em versus the sd obtained like that, you will see two reasonably distinct groups, where the ascending arm has relatively strong emission around 580 nm and the descending arm does not. I have used the Blob subset feature to select the upper arm into a subset ”upper” that is blue in the following plot. If you click around in the em-sd plot and show the Activated subset in an array plot, you can see things like:

    Screenshot: scatterplot and stacked spectra next to each other

    The activated spectrum (shown here in yellow) has a strong Hα but basically no C IV – and it's safely outside of our carbon subset. Click around a bit on the ascending arm, and you will see that all these spectra have a bump around array element 18 (in TOPCAT's count, which starts at 0).

    Computing a Template on the Server Side

    Whatever the subset of stars that we would like to use to define our group of interest, we would now like to create a template spectrum from them. A plausible way to do that is to sum them all up – that has the nice side effect that stronger sources (which hopefully are less noisy) have a larger weight.

    To compute the template, in the Views → Column Info for the sd/em table, unselect all columns but source_id (that way, you only upload what you absolutely must), and in the main window, select the upper subset in the Row Subset combo box. That way, only the rows in that subset will get uploaded in the following query:

    SELECT summed/arr_sum(summed) AS tpl
    FROM (
      SELECT SUM(flux) AS summed
      FROM gdr3spec.spectra
      JOIN tap_upload.t19
      USING (source_id)) AS q

    Again, you will have to adapt the t19 to where your manipulated sd/em table is. If you get an “ambiguous column flux” error (or so) here, you forgot to unselect all columns but source_id in the columns window.

    It pays to briefly appreciate what happens here: The SELECT SUM(flux) is an aggregate function over arrays, meaning that all the arrays are being summed over component-wise. Against that, the sum in summed/arr_sum(summed) is summing within the array. If it helps you, you could imagine having all the arrays in the table stacked. Then, SUM(arr) produces the vertical margin sum, and array_sum(arr) procudes the horizontal margin sum.

    Well, here's the template I got in this way:

    Plot: a single wiggly line

    I give you that in this particular case, you could easily have done the computation on the client side, because you already had the spectra in your table. But the technique also works when you don't, and it will scale to millions of arrays (although you will have to carefully think about numerics when doing such enormous sums).

    Also – I cannot lie – I simply had to have a pretext for showing you aggregate functions over arrays.

    Finding Similar Objects

    Now that we have the template, can we find objects that have similar properties? Sure: We upload the array and compute some metric, perhaps the (squared) euclidian distances to normalised spectra. If the template is in TOPCAT's table 25, you can write:

    SELECT TOP 200000
          power(x,2),tpl-flux/arr_sum(flux))) AS dist2,
      arr_sum(flux_error/flux) AS errs
    FROM gdr3spec.spectra, tap_upload.t25

    This will compute the distances between (conceptually randomly drawn) 200000 spectra and your template. I am also requesting the sum of the (relative) flux errors as a measure of how likely it is that wild wiggles actually are just artefacts.

    There is one array-related feature in that query I have not yet mentioned: arr_map. This applies an expression to all components to a vector, pretty much like python's map function, and my attempt to have some (perhaps somewhat lame) substitute for numpy's ufunctions.

    I am rather sure we really need something like this. SQL has no notion of defining functions in queries. That is usually welcome, as otherwise it would quicky become Turing-complete, which would be bad for what it is designed to do. Here, however, that is a problem, because we do not have a clean way to write the expression be be computed for each component. For now, I have decreed that the first argument of map is an expression over a formal x. This is ugly not only because it will be confusing when there's an actual x in a table or query. I suspect with a bit more thought and creativity, one can find a better solution that still does not require a re-write of half the SQL grammar. But then let's see – perhaps this makeshift hack proves to be less troublesome than I expect.

    Note that on my server, you will only get back 20000 matches by default; you would have to adapt Max Rows to actually retrieve 200'000, and then you also must switch to Asynchronous mode. This will then actually take a non-trivial amount of CPU and disk I/O; going through the entire set of 2e8 rows will be a matter of hours or so. Hence, I'm grateful if you do all-sky scans only after having tested your queries on much smaller subsets.

    I've done this for 500'000 rows (which took a few minutes), which might bring up a few C IV-strong WR stars (these beasts are rare, you know). The result in the dist2/errs plane is (logplot zoomed a bit):

    Plot: a gigantic point cloud with a few outliers.

    Well: at least there are a few promising cases. Which would conclude this little demo for the ADQL vector proposal. Looking at what we have found here is another story.

    Still, I could not resist having another look at what my box has found there. There is a rather clear cut in the plot at perhaps 0.009, and thus I created a subset interesting consisting of objects for which dist2<0.009 (which is 18 objects for me) and did the trick above, only uploading this interesting subset with only the source_id column to resolve to Gaia DR3 (lite) rows:

    SELECT *
    FROM gaia.dr3lite
    JOIN tap_upload.t33
    USING (source_id)

    And then I wondered whether any of these were known to SIMBAD and switched to their TAP service:

      tc.*, otype
    FROM basic AS db
     ON 1=CONTAINS(POINT('ICRS', db.ra, db.dec),
                   CIRCLE('ICRS', tc.ra, tc.dec, 1./3600.))

    Note the use of RIGHT OUTER JOIN to ensure we won't lose any matches on the way; if this weren't such a small table, you'd be better off just uploading the positions and then doing a local match to recover the rest of the table, by the way.

    As to what's coming back: Well, a bunch of white dwarf candidates, a “blue“ star, a few objects SIMBAD knows as quasars (that at least makes sense, because it's rather likely that some of them have lines redshifted into my C IV window), and a few unclassified objects. Whether SIMBAD is wrong on at least some of them, whether the positional crossmatch fetched unrelated objects, or whether I got it all wrong I will not decide here. Let me give you my candidates as a VOTable, though.

    You now know what you have to do to add nice, if low-resolution, spectra to them.


    A notable absence from the current vector extension is slicing. I think we should have it – in this example, this would be really useful when summing different spectral regions without having to write long sums (“synthetic broadband photometry“).

    I have not put it in yet mainly because I am not sure if Python-like syntax (nflux[4:7]) is a good idea when we have 1-based arrays. Also: Do we want to keep the upper index out? That's certainly the right thing for Python (where you want a[:3]+a[3:] == a), but is it here? Speaking of which: Should we require support of half-open slices? Should we rather have a function arr_slice? With what arguments?

    I'd be curious about other peoples' thoughts on slicing.


    In case you wonder how I came up with the WR*: You can simply run something like:

    SELECT otype, label
    FROM otypedef
    WHERE description LIKE '%Wolf%'

    Once Simbad upgrades to ADQL 2.1, you probably want to replace the LIKE with ILIKE for robustness.

    [2]For the incurably curious, you can learn more about the underlying problem at https://github.com/segasai/q3c/issues/30.
  • We are at the AG-Tagung in Bremen

    The bottom part of a towel with a Hertzsprung-Russell diagram printed on it

    Our puzzler prize for this year (well, its lower part): The Hertzsprung-Russell diagram according to Gaia on a wonderfully soft towel.

    After two years of “virtual” meetings, this year the venerable “Herbsttagung der Astronomischen Gesellschaft”, the meeting of Germany's Astronomical Society, is back. Almost as before Corona, it is bringing astronomers together, this year in Bremen (previously on this blog: 2018 in Stuttgart).

    Bowing to the “German“ in GAVO, this is an opportunity for us to connect to the (or, rather, our) community, both with a splinter meeting and with our traditional booth, at which you can pick up various edifying printed matter, a laminated ADQL reference card, and lots of VO wisdom (i.e., chat with our friendly booth staff).

    And you can solve our puzzler, a little problem that has an elegant VO-based solution (previous puzzlers). As is tradition, solving the puzzler will not only give you intellectual satisfaction and perhaps even insights into the VO, it will also give you a chance to win an item that is heavenly fluffy. The article photo shows this year's puzzler prize, and if this piques your desire, absolutely feel free to hand in solutions even if you are not in Bremen[1].

    Update 2022-09-15: This year's prize went to Bonn. So, there's no point to hand in solutions any more – rather, have a look at how we thought the problem should be approached.

    [1]We've had an actual award (the AG-RAS Carolin Herschel medal) being handed out virtually yesterday, so in case we really draw a remote entry, I am very confident that we can work something out for handing over the prize.
  • Gaia DR3 XP Spectra: All Sampled

    Lots of blue crosses and a few red squares plotted over a sky photograph of a star cluster

    Around this time of the year on the northern hemisphere, you can spot the h and χ Persei double star cluster with the naked eye. One part of it, NGC 884 is shown here with LAMOST DR6 low resolution spectra (red squares) and Gaia DR3 XP spectra (blue crosses) overplotted. Given that LAMOST has already been one of the largest collections of spectra on the planet, you can see that there is really a lot of those XP spectra.

    When Gaia DR3 was released in June, I was somewhat disappointed when I realised what it is that they delivered as the BP/RP (or XP for short) spectra. You see, I had expected to see something rather similar to what I have in DFBS: structurally, arrays of a few dozen spectral points, mapping wavelengths to some sort of measure of the flux.

    What really came were, mainly, “continuous spectra“, that is coefficients of Gauss-Hermite polynomials. You can fetch them from the gaiadr3.xp_continuous_mean_spectrum table at the ARI-Gaia TAP service; the blue part of the spectrum of the star DR3 4295806720 looks like this in there:

    102.93398893929992, -12.336921213781045, -2.668856168170544, -0.12631176306793765, -0.9347021092539146, 0.05636787290132809, [...]

    No common spectral client can plot this. The Gaia DPAC has helpfully provided a Python library called GaiaXPy to turn these into “proper” spectra. Shortly after the data release, my plan has thus been to turn all these spectra into their “sampled” form using GaiaXPy and then re-publish them, both through SSAP for ad-hoc discovery and through TAP for (potentially) global analysis.

    Alas, for objects too faint to make it into DR3's xp_sampled_mean_spectrum table (that's 35 million spectra already turned to wavelength-flux pairs by DPAC), the spectra generated in this way looked fairly awful, with lots of very artificial-looking wiggles (“ringing”, if you will). After a bit of deliberation, I realised that when the errors are given on the Hermite coefficients, once you compute the samples, these errors will be liberally distributed among the output samples. In other words, the error on the samples will be grossly correlated over arbitrary distances; at least I am fairly helpless when trying to separate signal from artefact in these beasts.

    Bummer. Well, fortunately, Rene Andrae from “up the mountain” (i.e., the MPI for Astronomy) has worked out a reasonably elegant way to get more conventional spectra understandable to mere humans. Basically, you compute n distinct “realisations” of the error model given by the table of the continuous spectra and average over them. The more samples you take, the less correlated your spectral points and their errors will be and the less confusing the signal will be. The service docs for gaia/s3 give the math.

    Doing this on more than 200 million spectra is quite an effort, though, and so after some experimentation I decided to settle on 10 realisations per spectrum and have relatively wide bins (10 nm) over just the optical part of the spectrum (400 through 800 nm). The BP and RP bandpaths are a bit wider, and there is probably signal blotted out by the wide bins; I will probably be addressing this for DR4, except if these spectra become the smash hit they deserve to be.

    The result of this procedure is now available through an SSAP service that should show up in the VO Registry by the time the first of you read this; the Aladin image above gives you an impression of the density of results here – and don't forget: the spectra with the blue crosses are all reasonably well flux-calibrated.

    The data is also available on the TAP service http://dc.g-vo.org/tap, which opens up many interesting possibilities. Let me mention two here.

    Comparison with LAMOST

    I was rather nervous whether what I had done resulted in anything that bore even a fleeting resemblance to reality, and so about the first thing I tried was to compare my new data with what LAMOST has.

    That is a nice exercise for TAP and ADQL. Let's first match spectra from the two surveys, which luckily are on the same server, saving us some cross-server uploads. I am selecting a minimum of data, just the position and the two access URLs, and I let DaCHS' MAXREC kick in so I'm just retrieving 20000 of the millions of result records:

    SELECT a.ssa_location, a.accref, b.accref
      gdr3spec.ssameta AS a
      JOIN lamost6.ssa_lrs AS b
      ON DISTANCE(a.ssa_location, b.ssa_location)<0.001

    (this is using the DISTANCE(.,.)<radius idiom that we will be migrating towards in ADQL 2.1 instead of the dreaded 1=CONTAINS(POINT, CIRCLE) thing everyone has loathed in ADQL 2.0).

    Using the nifty activation actions, you can now tell TOPCAT to open the two spectra next to each other when you click on a row or a point in a sky plot. To reproduce,

    1. Make a sky plot. TOPCAT doesn't yet pick up the POINT in ssa_location, so you have to configure the Lon and Lat fields yourself to ssa_location[0] and ssa_location[1].
    2. Open the activation actions, either from the button bar or from the Views menu.
    3. In there, select Plot Table, make sure it says accref in Table Location and then check Plot Table in the Actions pane. When you now click on a point in the sky plot, you should see a spectrum pop up, except it is plotted with dots, which most people consider inappropriate for spectra. Use the Form tab in the plot window to style it a bit more spectrum-like (I recommend looking into Line and XYError).
    4. But how do you now add the LAMOST plot? I don't think TOPCAT's activation actions let you plot right into the plane plot you just configured. But you can add a second Plot Table action from the Actions menu in the window with the activation actions. As before, configure this new item, except this one needs to plot accref_ (which is what DaCHS has called the access reference for LAMOST to keep the names unique).
    5. As for Gaia, configure to plot to look good as a spectrum. In order to make the two spectra optically comparable, under Axes set the range to 4000 to 8000 Angstrom manually here.

    You can now click on points in your sky plot and, after a second or so, see the corresponding spectra next to each other (if you place the two plot windows that way).

    If you try this, you will (hopefully) see that major features of spectra are nicely reproduced, such as with these, I guess, molecular bands:

    Two line plots next to each other, the right one showing more features.  the left one roughly follows the major wiggles, though.

    As you probably have guessed, the extremely low-resolution Gaia XP spectrum is left, LAMOST's (somewhat higher-resolution) low-resolution spectrum is right:

    This also works with absorption in the blue, as in this example:

    Two line plots next to each other, the right one showing a lot of relatively sharp absoprtion lines, which the left one does not have.  A few major bumps are present in both, and the general shape conincides nicely, expect perhaps at the blue edge.

    In case of doubt, I have to say I'd probably trust Gaia's calibration around 400 nm better than LAMOST's. But that's mere guesswork.

    For fainter objects, you will see remnants of the systematic wiggles from the Hermite polynomials:

    Two line plots next to each other.  Both are relatively noisy, in particular on the blue edge.  The left one also seems to have a rather regular oscillation at the blue edge.

    Anyway, if you keep an eye on the errors, you can probably even work with spectra from the fainter objects:

    Two line plots next to each other.  The left one has fairly strong ringing which is not present in the right one, but it mainly stays within the error bars.  The total flux of this star is at least a factor of 10 less than for the prettier examples above.

    Mass Retrieval of Spectra

    One nice thing about the short spectra is that you can fetch many of them in one go and in very little time. For instance, to retrieve particularly red objects from the Gaia catalogue of Nearby Stars (also on the GAVO server) with spectra, say:

      source_id, ra, dec, parallax, phot_g_mean_mag,
      phot_bp_mean_mag, phot_rp_mean_mag, ruwe, adoptedrv,
      flux, flux_error
    FROM gcns.main
    JOIN gdr3spec.spectra
    USING (source_id)
    WHERE phot_rp_mean_mag<phot_bp_mean_mag-4

    [in case you wonder how I quickly got the column names enumerated here: do control-clicks into the Columns pane in TOCPAT's TAP window and then use the Cols button]. For when you do not have Gaia DR3 source_id-s in your source table, there is also gdr3spec.withpos against which you can do more conventional positional crossmatches.

    Within a few seconds, you can retrieve more than 4000 spectra in this way. You can now do whatever analysis you want on these spectra. Or, well, just plot them. The following procedure for that later task uses TOPCAT features only available in the next release, due before mid-October[1].

    First, make a colour-magnitude diagram (CMD) from this table as usual (e.g., BP-RP vs G). Then, open another plane plot and

    1. LayersAdd XYArray Control
    2. Configure the XYArray to plot from the table you just fetched, have nothing in X Values[2] and flux in Y Values.
    3. Under Axes, configure Y Log in order to better show the 4253 spectra at one time.
    4. Throw away or at least uncheck all other layers in the plot.
    5. In order to let TOPCAT highlight the spectrum of the activated source, in the Subsets pane check the Activated subset (that's the bleeding-edge functionality you will not have in older TOPCATs) and give it a sufficiently bright colour.

    With that, you can now click around in your CMD and immediately see that source's spectrum in the context of all the others, like this:

    An animation of someone selecting various points in a CMD and have simulataneous spectra plotted.

    These spectra have also inspired me to design and implement a vector extension for ADQL, which lets you do even more interesting things with these spectra. More on this… soon.

    [1]The Activated subset is only available in TOPCAT versions later than 4.8-7 (released in October 2022).
    [2]These should be the spectral points; DaCHS does not deliver them with this query because I am a coward. I think I will find my courage relatively soon and then fix this. Once that has happened, you can select param$spectral as X values. [Update: Mark Taylor remarks that by writing sequence(41, 400, 10) in bleeding-edge TOPCATs and add(multiply(10,sequence(41)),400) before that, you can add a proper spectral axis until then]

« Page 3 / 18 »