Articles from Demo

  • A Proposed Vector Extension for ADQL

    When I showed off my rendering of the Gaia DR 3 XP spectra a month ago, I promised I would later show how my proposal for a Vector extension to ADQL would enable quite a bit of interesting functionality on that table. Let me make good on this promise with a little project to find candidates for Wolf-Rayet stars, more specifically, WC stars. I give you that's a bit cheap because they have very distinctive spectral features, but then this is supposed to be a quick educational posting, not a science paper.

    Before I start, I probably should stress that in this context I am using the word “vector” like a computer guy would. We are talking about one-dimensional arrays here, not about the vectors the mathematicians have given us (as in “elements of vector spaces“).

    Getting Spectra for Wolf-Rayet Stars

    Let us first produce a list of Wolf-Rayet stars (of any denomination) using SIMBAD. So, start TOPCAT, open the TAP dialog and find the SIMBAD TAP server. Run:

    SELECT main_id, ra, dec
    FROM basic
    WHERE otype='WR*'
    

    there[1]. In the next step, we will need the Gaia DR3 source_ids for these objects, and so it would be nice to pull them immediately from Simbad; you could in principle do that by running:

    SELECT main_id, ra, dec, id
    FROM basic
      JOIN ident
      ON (oid=oidref)
    WHERE otype='WR*'
      AND id LIKE 'Gaia DR3%'
    

    – but for one, fiddling out the actual source_ids from the strings you get in the id column is a bit tedious, and then quite a few of these objects don't have Gaia ids yet: The first query returns 1548 at the moment, the second 696.

    If we had the source_ids, we could immediately join with the gdr3spec.spectra table at the GAVO DC TAP (http://dc.g-vo.org/tap). As I mentioned a month ago, this is a physical table just consisting of the source id and arrays of flux and flux errors. There is also gdr3spec.withpos that has positions and the spectra; but that's a view, and for the time being that means that the planner will quite likely get confused when positional constraints come into play[2]. The result would be queries running for half an hour when a few seconds would do just as well.

    On services already supporting ADQL 2.1, you can usually work around problems of this kind by re-writing your query to use CTEs (“WITH”), because these often work as planner barriers. In the present case, we first get source_ids for our Simbad objects in a CTE and then use these to join with our spectra table, like this:

    WITH wrids AS (
            SELECT source_id, main_id
            FROM gaia.edr3lite AS l
        JOIN tap_upload.t1 AS u
            ON (DISTANCE(l.ra, l.dec, u.ra, u.dec)<0.001))
    SELECT main_id, source_id, flux, flux_error
    FROM gdr3spec.spectra
    JOIN wrids USING (source_id)
    

    As usual, you will probably have to adapt the number in what is tap_upload.t1 here to the table index you have for your SIMBAD result.

    This yields 574 spectra at the moment, within a few seconds. Spectra for about a third of our collection of objects: I'd say that's quite impressive.

    Investigating the spectra

    The September post already discussed a few aspects of array plotting. In short: try the plane plot and an XYArray control. Modern TOPCATs (and for what I'm doing here it's wise to use something newer than 4.8-7) will automatically figure out suitable columns for the x and y axes (except it forgets to label the spectral coordinate with its unit, nm):

    Plot: lots of wiggly lines

    There's a quite a bit of crowding here; finding global characteristics perhaps works better when you switch the ordinate to logarithmic, use transparent shading (in the Form tab) and raise the opaque limit a bit. This could give you something like:

    Plot: hazy structures in blue

    You can see that at least quite a few objects have nice and strong emission lines, as I had hoped for when choosing WC stars for this example. What if we could pick them out to build a template spectrum? Well, let's try. With the new vector math in ADQL, the database can normalise the spectra and compute a few statistics on them.

    But first: In order to get rid of the source_id-computing CTE above, let me obtain the source_ids I want to work with once and for all, as in:

    SELECT source_id, main_id
    FROM gaia.edr3lite AS l
      JOIN tap_upload.t1 AS u
            ON (DISTANCE(l.ra, l.dec, u.ra, u.dec)<0.001)
    

    Memorise the Table List index of the result. With that, you can directly work with the gdr3spec.spectra table and experiment a bit; for me, the index was 10, and hence I use tap_uploads.t10 below.

    Computing some statistics

    The Gaia XP spectra are flux calibrated, and hence I will have to normalise them if I want to compare them. Ignoring all errors and thus in particular the fact that some (few) components are negative, this normalisation is harmless: We just divide by the sum of all vector components. The net result is that, were our spectra continuous, the integral over them would be one. And let's then use the standard deviation and the value of the 19th array element as metrics:

    WITH normalised AS (
    SELECT source_id, main_id,
            flux/arr_sum(flux) as nflux, flux
    FROM gdr3spec.spectra
    JOIN tap_upload.t10
    USING (source_id))
    SELECT
      source_id, main_id,
      flux, nflux,
      arr_avg(nflux*nflux)-POWER(arr_avg(nflux),2) AS sd,
      nflux[19] as em
    FROM normalised
    

    You can see quite a bit of the vector extension here:

    • arr_sum and arr_avg: These work as the aggregate functions in SQL do, just not on tuples but on the components of the vectors.
    • Multiplication of vectors is element-wise, so nflux*nflux computes a vector of the squares of the components of nflux. That's also true for all other basic arithmetic operators. If you know numpy: same thing.
    • You fetch individual elements in the [] notation you probably know from Python or C. Contrary to Python and C, common SQL implementations count indexes from 1 by default, and we are keeping that here (for now). Fortran lovers rejoice!

    Why did I use nflux[19]? Well, there is a reasonably strong emission feature in many spectra at about 580 nm (it's three-time ionised carbon, or C IV in the rotten notation I usually apologise for when talking to non-astronomers), which is, as experts tell me (thanks, Andreas!), rather characteristic for the WC stars that I'm after (whereas the even stronger feature on the left, around 470 nm, can also be Helium).

    If you inspect the spectral coordinate that TOPCAT has on the abscissa (it's a param, so you'll have to go to ViewsTable Parameters to see it), you will see that each spectral bin is 10 nm wide. So, I will hopefully hit the 580 nm feature when fetching the 19th element of the spectral vector.

    If you plot em versus the sd obtained like that, you will see two reasonably distinct groups, where the ascending arm has relatively strong emission around 580 nm and the descending arm does not. I have used the Blob subset feature to select the upper arm into a subset ”upper” that is blue in the following plot. If you click around in the em-sd plot and show the Activated subset in an array plot, you can see things like:

    Screenshot: scatterplot and stacked spectra next to each other

    The activated spectrum (shown here in yellow) has a strong Hα but basically no C IV – and it's safely outside of our carbon subset. Click around a bit on the ascending arm, and you will see that all these spectra have a bump around array element 18 (in TOPCAT's count, which starts at 0).

    Computing a Template on the Server Side

    Whatever the subset of stars that we would like to use to define our group of interest, we would now like to create a template spectrum from them. A plausible way to do that is to sum them all up – that has the nice side effect that stronger sources (which hopefully are less noisy) have a larger weight.

    To compute the template, in the Views → Column Info for the sd/em table, unselect all columns but source_id (that way, you only upload what you absolutely must), and in the main window, select the upper subset in the Row Subset combo box. That way, only the rows in that subset will get uploaded in the following query:

    SELECT summed/arr_sum(summed) AS tpl
    FROM (
      SELECT SUM(flux) AS summed
      FROM gdr3spec.spectra
      JOIN tap_upload.t19
      USING (source_id)) AS q
    

    Again, you will have to adapt the t19 to where your manipulated sd/em table is. If you get an “ambiguous column flux” error (or so) here, you forgot to unselect all columns but source_id in the columns window.

    It pays to briefly appreciate what happens here: The SELECT SUM(flux) is an aggregate function over arrays, meaning that all the arrays are being summed over component-wise. Against that, the sum in summed/arr_sum(summed) is summing within the array. If it helps you, you could imagine having all the arrays in the table stacked. Then, SUM(arr) produces the vertical margin sum, and array_sum(arr) procudes the horizontal margin sum.

    Well, here's the template I got in this way:

    Plot: a single wiggly line

    I give you that in this particular case, you could easily have done the computation on the client side, because you already had the spectra in your table. But the technique also works when you don't, and it will scale to millions of arrays (although you will have to carefully think about numerics when doing such enormous sums).

    Also – I cannot lie – I simply had to have a pretext for showing you aggregate functions over arrays.

    Finding Similar Objects

    Now that we have the template, can we find objects that have similar properties? Sure: We upload the array and compute some metric, perhaps the (squared) euclidian distances to normalised spectra. If the template is in TOPCAT's table 25, you can write:

    SELECT TOP 200000
      source_id,
      arr_sum(
        arr_map(
          power(x,2),tpl-flux/arr_sum(flux))) AS dist2,
      arr_sum(flux_error/flux) AS errs
    FROM gdr3spec.spectra, tap_upload.t25
    

    This will compute the distances between (conceptually randomly drawn) 200000 spectra and your template. I am also requesting the sum of the (relative) flux errors as a measure of how likely it is that wild wiggles actually are just artefacts.

    There is one array-related feature in that query I have not yet mentioned: arr_map. This applies an expression to all components to a vector, pretty much like python's map function, and my attempt to have some (perhaps somewhat lame) substitute for numpy's ufunctions.

    I am rather sure we really need something like this. SQL has no notion of defining functions in queries. That is usually welcome, as otherwise it would quicky become Turing-complete, which would be bad for what it is designed to do. Here, however, that is a problem, because we do not have a clean way to write the expression be be computed for each component. For now, I have decreed that the first argument of map is an expression over a formal x. This is ugly not only because it will be confusing when there's an actual x in a table or query. I suspect with a bit more thought and creativity, one can find a better solution that still does not require a re-write of half the SQL grammar. But then let's see – perhaps this makeshift hack proves to be less troublesome than I expect.

    Note that on my server, you will only get back 20000 matches by default; you would have to adapt Max Rows to actually retrieve 200'000, and then you also must switch to Asynchronous mode. This will then actually take a non-trivial amount of CPU and disk I/O; going through the entire set of 2e8 rows will be a matter of hours or so. Hence, I'm grateful if you do all-sky scans only after having tested your queries on much smaller subsets.

    I've done this for 500'000 rows (which took a few minutes), which might bring up a few C IV-strong WR stars (these beasts are rare, you know). The result in the dist2/errs plane is (logplot zoomed a bit):

    Plot: a gigantic point cloud with a few outliers.

    Well: at least there are a few promising cases. Which would conclude this little demo for the ADQL vector proposal. Looking at what we have found here is another story.

    Still, I could not resist having another look at what my box has found there. There is a rather clear cut in the plot at perhaps 0.009, and thus I created a subset interesting consisting of objects for which dist2<0.009 (which is 18 objects for me) and did the trick above, only uploading this interesting subset with only the source_id column to resolve to Gaia DR3 (lite) rows:

    SELECT *
    FROM gaia.dr3lite
    JOIN tap_upload.t33
    USING (source_id)
    

    And then I wondered whether any of these were known to SIMBAD and switched to their TAP service:

    SELECT
      tc.*, otype
    FROM basic AS db
     RIGHT OUTER JOIN TAP_UPLOAD.t34 AS tc
     ON 1=CONTAINS(POINT('ICRS', db.ra, db.dec),
                   CIRCLE('ICRS', tc.ra, tc.dec, 1./3600.))
    

    Note the use of RIGHT OUTER JOIN to ensure we won't lose any matches on the way; if this weren't such a small table, you'd be better off just uploading the positions and then doing a local match to recover the rest of the table, by the way.

    As to what's coming back: Well, a bunch of white dwarf candidates, a “blue“ star, a few objects SIMBAD knows as quasars (that at least makes sense, because it's rather likely that some of them have lines redshifted into my C IV window), and a few unclassified objects. Whether SIMBAD is wrong on at least some of them, whether the positional crossmatch fetched unrelated objects, or whether I got it all wrong I will not decide here. Let me give you my candidates as a VOTable, though.

    You now know what you have to do to add nice, if low-resolution, spectra to them.

    Slices?

    A notable absence from the current vector extension is slicing. I think we should have it – in this example, this would be really useful when summing different spectral regions without having to write long sums (“synthetic broadband photometry“).

    I have not put it in yet mainly because I am not sure if Python-like syntax (nflux[4:7]) is a good idea when we have 1-based arrays. Also: Do we want to keep the upper index out? That's certainly the right thing for Python (where you want a[:3]+a[3:] == a), but is it here? Speaking of which: Should we require support of half-open slices? Should we rather have a function arr_slice? With what arguments?

    I'd be curious about other peoples' thoughts on slicing.

    [1]

    In case you wonder how I came up with the WR*: You can simply run something like:

    SELECT otype, label
    FROM otypedef
    WHERE description LIKE '%Wolf%'
    

    Once Simbad upgrades to ADQL 2.1, you probably want to replace the LIKE with ILIKE for robustness.

    [2]For the incurably curious, you can learn more about the underlying problem at https://github.com/segasai/q3c/issues/30.
  • Find a Dust-Free Window Using ADQL

    Five sky images, all of them showing star clusters

    Five of the seven patches of the sky that Bayestar 17 considers least obscured by dust in Aladin's WISE color HiPSes. There clearly is a pattern here. This post is about how you'll find these (and the credible ones, too).

    The upcoming AG-Tagung in Bremen will have another puzzler, and while concocting the problem I needed to find a spot on the sky where there is very little interstellar extinction. What looks like a quick query turned out to require a few ADQL tricks that I thought I might show in this little post; they will come in handy in many situations.

    First, I needed to find data on where on the sky there is dust. Had I not known about the extinction maps I've blogged about in 2018, I would probably have looked for extinction maps in the Registry, which might have led me to the Bayestar 17 map on my service eventually, too. The way it was, I immediately fired up TOPCAT and pointed it to the TAP service at http://dc.g-vo.org/tap (the “GAVO DC TAP“ of the TAP service list) and went to the column metadata of the prdust.map_union table.

    Browsing the descriptions, the relevant columns here are healpix (which will give me the position) and best_fit. That latter thing is an array of reddening E(B − V) (i.e., higher values mean more dust) per distance bin, where the bins are 0.5 mag of distance modulus wide. I decided I'd settle for bin 20, corresponding to a kiloparsec. Dust further away than that will not trouble me much in the puzzler.

    Finding the healpixes in the rows with the smallest best_fit[20] should be easy; it is a minor variant of a classic from the ADQL course:

    SELECT TOP 20 healpix
    FROM prdust.map_union
    ORDER BY best_fit[20] ASC
    

    Except that my box replies with an error message reading “Expected end of text, found '[' (at char 61), (line:3, col:18)”.

    Hu? Well… if you look, then the problem is where I ask to sort by an array element. And indeed, it turns out that DaCHS, the software driving this site, will not let you sort by array elements yet. This is arguably a bug, and in all likelihood I will have fixed it by the time your read this. But there is a technique to defeat this and similar cases that every astronomer should know about: subqueries, which turn any query into something you can work with as if it were a table. In this case:

    SELECT TOP 30 healpix, extinction
    FROM (
      SELECT healpix, best_fit[20] as extinction
      FROM prdust.map_union) AS q
    ORDER BY extinction ASC
    

    – the “AS q“ gives the name of the “virtual” table resulting from the query a name. It is mandatory here. Do not be tempted to leave out the “AS” – that that is even legal is one of the major blunders of the SQL standard.

    The result is looking good:

    # healpix extinction
    1021402 0.00479
    1021403 0.0068
    418619  0.00707
    ...
    

    – so, we have the healpixes for which the extinction works out to be minimal. It is also reassuring that the two healpixes with the clearest sky (by this metric) are next to each other – where there are clear skies, it's likely that there are more clear skies nearby.

    But then… where exactly are these patches? The column description says “The healpix (in galactic l, b) for which this data applies. This is of the order given in the hpx_order column”. Hm.

    To go from HEALPix to positions, there is the ivo_healpix_center user defined function (UDF) on many ADQL services; it is part of the IVOA's UDF catalogue, so whenever you see it, it will do the same thing. And where would you see it? Well, in TOPCAT, UDFs show up in the Service tab with a signature and a short description. In this case:

    ivo_healpix_center(hpxOrder INTEGER, hpxIndex BIGINT) -> POINT
    
      returns a POINT corresponding to the center of the healpix with the
      given index at the given order.
    

    With this, we can change our query to spit out positions rather than indices:

    SELECT TOP 30 ivo_healpix_center(hpx_order, healpix) AS pos, extinction
    FROM (
      SELECT healpix, best_fit[20] as extinction, hpx_order
      FROM prdust.map_union) AS q
    ORDER BY extinction ASC
    

    The result is:

    # pos                                    extinction
    "(42.27822580645164, 78.65148926014334)" 0.00479
    "(42.44939271255061, 78.6973986631694)"  0.0068
    "(58.97460937500027, 40.86635677386179)" 0.00707
    ...
    

    That's my positions all right, but they are still in galactic coordinates. That may be fine for many applications, but I'd like to have them in ICRS. Transforming them takes another UDF; this one is not yet standardised and hence has a gavo_ prefix (which means you will only find it on reasonably new services driven by DaCHS).

    On services that have that UDF (and the GAVO DC TAP certainly is one of them), you can write:

    SELECT TOP 30
      gavo_transform('GALACTIC', 'ICRS',
        ivo_healpix_center(hpx_order, healpix)) AS pos,
      extinction
    FROM (
      SELECT healpix, best_fit[20] as extinction, hpx_order
      FROM prdust.map_union) AS q
    ORDER BY extinction ASC
    

    That results in:

    # pos                                    extinction
    "(205.6104289782676, 28.392541949473785)" 0.00479
    "(205.55600830161907, 28.42330388161418)" 0.0068
    "(250.47595812552925, 36.43011215633786)" 0.00707
    "(166.10872483007287, 21.232866316024364)" 0.00714
    "(259.3314211312357, 43.09275090468469)" 0.00742
    "(114.66957763676628, 21.603135736808532)" 0.00787
    "(229.69174233173712, 2.0244022486718793)" 0.00793
    "(214.85349325052758, 33.6802370378023)" 0.00804
    "(204.8352084989552, 36.95716352922782)" 0.00806
    "(215.95667870050661, 36.559656879148044)" 0.00839
    "(229.66068062277128, 2.142516479012763)" 0.0084
    "(219.72263539838667, 58.371829835018424)" 0.00844
    ...
    

    If you have followed along, you now have a table of the 30 least reddened patches in the sky according Bayestar17. And you are probably as curious to see them as I was. That curiosity made me start Aladin and select WISE colour imagery, reckoning dust (at the right temperature) would be more conspicuous in WISE's wavelengths then in, say, DSS.

    I then did Views -> Activation Actions and wanted to check “Send Sky Coordinates“ to make Aladin show the sky at the position of my patches. This is usually preconfigured by TOPCAT to just work when tables have positions. Alas: at least in versions up to 4.8, TOPCAT does not know about points (in the ADQL sense) when making clever guesses there.

    But there is a workaround: Select “Send Sky Coordinates” in the Activation Actions window and then type pos[0] in “RA Column“, and pos[1] in “Dec Column” – this works because under the hood, VOTable points are just 2-arrays. That done, you can check the activation action.

    After these preparations, when you click through the first few results, you will find objects like those in the opending image (and also a few fairly empty fields). Stellar clusters are relatively rare on the sky, so their prevalence in these patches quite clearly shows that Bayestar's model has a bit of a fixation about them that's certainly not related to dust.

    Which goes to serve as another example of Demleitner's law 567: “In any table, the instances with the most extreme values are broken with a likelihood of 0.567”.

  • Towards Data Discovery in pyVO

    When I struggled with ways to properly integrate TAP services – which may have hundreds or thousands of different resources in one service – into the VO Registry without breaking what we already had, I realised that there are really two fundamentally different modes of using the VO Registry. In Discovering Data Collections's abstract I wrote:

    the Registry must support both VO-wide discovery of services by type ("service enumeration") and discovery by data collection ("data discovery").

    To illustrate the difference in a non-TAP case, suppose I have archived images of lensed quasars from Telescopes A, B, and C. All these image collections are resources in their own right and should be separately findable when people look for “resources with data from Telescope A“ or perhaps “images obtained between 2011-01-01 and 2011-12-31”.

    However, when a machine wants to find all images at a certain position, publishing the three resources through three different services would mean that that machine has to do three requests where one would work just as well. That is very relevant when you think about how the VO will evolve: At this point there are 342 SIAP services in the VO, and when you read this, that number may have grown further. Adding one service per collection will simply not scale when we want to keep the possibility of all-VO searches. Since I claim that is a very desirably thing, we need to enable collective services covering multiple subordinate resources.

    So, while in the first (“data discovery”) case one wants to query (or at least discover) the three resources separately, in the second case they should be ignored, and only a collective “images of lensed quasars” service should be queried.

    The technical solution to this requirement was creating “auxliary capabilities” as discribed in the endorsed note on discoving data collections cited above. But these of course need client support; VO clients up to now by and large do service enumeration, as that has been what we started with in the VO Registry. Client support would, roughly, mean that clients would present their users with data collections, and then offer the various ways to to access them.

    There are quite a number of technicalities involved in why that's not terribly straightforward for the “big” clients like TOPCAT and Aladin (though Aladin's discovery tree already comes rather close).

    Now that quite a number of people use pyVO interactively in jupyter notebooks, extending pyVO's registry interface to do data discovery in addition to the conventional service enumeration becomes an attractive target to have data discovery in practice.

    I have hence created pyVO PR #289. I think some the rough edges will need to be smoothed out before it can be merged, but meanwhile I'd be grateful if you could try it out already. To facilitate that, I have prepared a jupyter notebook that shows the basic ideas.

    Followup (2023-12-15)

    I have just prepared a slightly updated version of the notebook.

    To run it while the PR is not merged, you need to install the forked pyVO. In order to not clobber your main installation, you can install astropy using your package manager and then do the following (assuming your shell is bash or something suitably similar):

    virtualenv --system-site-packages try-discoverdata
    . try-discoverdata/bin/activate
    cd try-discoverdata
    git clone https://github.com/msdemlei/pyvo
    cd pyvo
    git checkout add-discoverdata
    python3 setup.py develop
    ipython3 notebook
    

    That should open a browser window in which you can open the notebook (you probably want to download it into the pyvo checkout in order to make the notebook selector see it). Enjoy!

  • Spectral Units in ADQL

    math formulae.

    In case you find the piece of Python given below too hard to read: It's just this table of conversion expressions between the different SI units we are dealing with here.

    Astronomers these days work all along the electromagnetic spectrum (and beyond, of course). Depending on where they observe, they will have very different instrumentation, and hence some see their messengers very naturally as waves, others quite as naturally as particles, others just as electrons flowing out of a CCD that is sitting behind a filter.

    In consequence, when people say where in the spectrum they are, they use very different notions. A radio astronomer will say “I'm observing at 21 cm” or “at 50 GHz“. There's an entire field named after a wavelength, “submillimeter“, and blueward of that people give their bands in micrometers. Optical astronomers can't be cured of their Ångström habit. Going still more high-energy, after an island of nanometers in the UV you end up in the realm of keV in X-ray, and then MeV, GeV, TeV and even EeV.

    However, there is just one VO (or at least that's where we want to go). Historically, the VO has had a slant towards optical astronomy, which gives us the legacy of having wavelengths in far too many places, including Obscore. Retrospectively, this was an unfortunate choice not only because it makes us look optical bigots, but in particular because in contrast to energy and, by ν = E/h, frequency, messenger wavelength depends on the medium you work in, and I shudder to think how many wavelengths in my data center actually are air wavelengths rather than vacuum wavelengths. Also, as you go beyond photons, energy really is the only thing that reasonably characterises all messengers alike (well, even that still isn't quite settled for gravitational waves as long as we're not done with a quantum theory of gravitation).

    Well – the wavelength milk is spilled. Still, the VO has been boldly expanding its reach beyond the optical and infrared windows (recently, with neutrinos and gravitational waves, not to mention EPN-TAP's in-situ measurements in the solar system, even beyond the electromagnetic spectrum). Which means we will have to accomodate the various customs regarding spectral units described above. Where there are “thick” user interfaces, these can care about that. For instance, my datalink XSLT and javascript lets people constrain spectral cutouts (along BAND) in a variety of units (Example).

    But what if the UI is as shallow as it is in ADQL, where you deal with whatever is in the underlying database tables? This has come up again at last week's EuroVO Technology Forum in virtual Strasbourg in the context of making Obscore more attractive to radio astronomers. And thus I've sat down and taught DaCHS a new user defined function to address just that.

    Up front: When you read this in 2022 or beyond and everything has panned out, the function might be called ivo_specconv already, and perhaps the arguments have changed slightly. I hope I'll remember to update this post accordingly. If not, please poke me to do so.

    The function I'm proposing is, mainly, gavo_specconv(expr, target_unit). All it does is convert the SQL expression expr to the (spectral) target_unit if it knows how to do that (i.e., if the expression's unit and the target unit are spectral units properly written in VOUnit) and raise an error otherwise.

    So, you can now post:

    SELECT TOP 5 gavo_specconv(em_min, 'GHz') AS nu
    FROM ivoa.obscore
    WHERE gavo_specconv((em_min+em_max)/2, 'GHz')
        BETWEEN 1 AND 2
      AND obs_collection='VLBA LH sources'
    

    to the TAP service at http://dc.g-vo.org/tap. You will get your result in GHz, and you write your constraint in GHz, too. Oh, and see below on the ugly constraint on obs_collection.

    Similarly, an X-ray astronomer would say, perhaps:

    SELECT TOP 5 access_url, gavo_specconv(em_min, 'keV') AS energy
    FROM ivoa.obscore
    WHERE gavo_specconv((em_min+em_max)/2, 'keV')
      BETWEEN 0.5 AND 2
      AND obs_collection='RASS'
    

    This works because the ADQL translator can figure out the unit of its first argument. But, perhaps regrettably, ADQL has no notion of literals with units, and so there is no way to meaningfully say the equivalent of gavo_specconv(656, 'Hz') to get Hα in Hz, and you will receive a (hopefully helpful) error message if you try that.

    However, this functionality is highly desirable not the least because the queries above are fairly inefficient. That's why I added the funny constraints on the collection: without them, the queries will take perhaps half a minute and thus require async operation on my box.

    The (fundamental) reason for that is that postgres is not smart enough to work out it could be using an index on em_min and em_max if it sees something like nu between 3e8/em_min and 3e7/em_max by re-writing the constraint into 3e8/nu between em_min and em_max (and think really hard about whether this is equivalent in the presence of NULLs). To be sure, I will not teach that to my translation layer either. Not using indexes, however, is a recipe for slow queries when the obscore table you query has about 85 million rows (hi there in 2050: yes, that was a sizable table in our day).

    To let users fix what's too hard for postgres (or, for that matter, the translation engine when it cannot figure out units), there is a second form of gavo_specconv that takes a third argument: gavo_specconv(expr, unit_of_expr, target_unit). With that, you can write queries like:

    SELECT TOP 5 gavo_specconv(em_min, 'Angstrom') AS nu
    FROM ivoa.obscore
    WHERE gavo_specconv(5000, 'Angstrom', 'm')
      BETWEEN em_min AND em_max
    

    and hope the planner will use indexes. Full disclosure: Right now, I don't have indexes on the spectral limits of all tables contributing to my obscore table, so this particular query only looks fast because it's easy to find five datasets covering 500 nm – but that's an oversight I'll fix soon.

    Of course, to make this functionality useful in practice, it needs to be available on all obscore services (say) – only then can people run all-VO obscore searches without the optical bias. The next step (before Bambi-eyeing the TAP implementors) therefore would be to get it into the catalogue of ADQL user defined functions.

    For this, one would need to specify a bit more carefully what units must minimally be supported. In DaCHS, I have built this on a full implementation of VOUnits, which means you can query using attoparsecs of wavelength and get your result in dekaerg (which is a microjoule: 1 daerg = 1 uJ in VOUnits – don't you just love this?):

    SELECT gavo_specconv(
      (spectral_start+spectral_end)/2, 'daerg')
      AS energy
    FROM rr.stc_spectral
    WHERE gavo_specconv(0.0002, 'apc', 'J')
      BETWEEN spectral_start AND spectral_end
    

    (stop computing: an attoparsec is about 3 cm). This, incidentally, queries the draft RegTAP extension for the VODataService 1.2 coverage in space, time, and spectrum, which is another reason I'm proposing this function: I'm not quite sure how well my rationale that using Joules of energy is equally inconvenient for all communities will be generally received. The real rationale – that Joule is the SI unit for energy – I don't dare bring forward in the first place.

    Playing with wavelengths in AU (you can do that, too; note, though, that VOUnit forbids prefixes on AU, so don't even try mAU) is perhaps entertaining in a slightly twisted way, but admittedly poses a bit of a challenge in implementation when one does not have full VOUnits available. I'm currently thinking that m, nm, Angstrom, MHz, GHz, keV and MeV (ach! No Joule! But no erg, either!) plus whatever spectral units are in use in the local tables would about cover our use cases. But I'd be curious what other people think.

    Since I found the implementation of this a bit more challenging than I had at first expected, let me say a few words on how the underlying code works; I guess you can stop reading here unless you are planning to implement something like this.

    The fundamental trouble is that spectral conversions are non-linear. That means that what I do for ADQL's IN_UNIT – just compute a conversion factor and then multiply that to whatever expression is in its first argument – will not work. Instead, one has to write a new expression. And building these expressions becomes involved because there are thousands of possible combinations of input and output units.

    What I ended up doing is adopting standard (i.e., SI) units for energy (J), wavelength (m), and frequency (Hz) as common bases, and then first convert the source and target units to the applicable standard unit. This entails trying to convert each input unit to each standard unit until a conversion actually works, which in DaCHS' Python looks like this:

    def toStdUnit(fromUnit):
        for stdUnit in ["J", "Hz", "m"]:
            try:
                 factor = base.computeConversionFactor(
                     fromUnit, stdUnit)
            except base.IncompatibleUnits:
                continue
            return stdUnit, factor
    
        raise common.UfuncError(
            f"specconv: {fromUnit} is not a spectral unit understood here")
    

    The VOUnits code is hidden away in base.computeConversionFactor, which raises an IncompatibleUnits when a conversion is impossible; hence, in the end, as a by-product this function also determines what kind of spectral value (energy, frequency, or wavelength) I am dealing with.

    That accomplished, all I need to do is look up the conversions between the basic units, which can be done in a single dictionary mapping pairs of standard units to the conversion expression templates. I have not tried to make these templates particularly pretty, but if you squint, you can still, I hope, figure out this is actually what the opening image shows:

    SPEC_CONVERSION = {
        ("J", "m"): "h*c/(({expr})*{f})",
        ("J", "Hz"): "({expr})*{f}/h",
        ("J", "J"): "({expr})*{f}",
        ("Hz", "m"): "c/({expr})/{f}",
        ("Hz", "Hz"): "{f}*({expr})",
        ("Hz", "J"): "h*{f}*({expr})",
        ("m", "m"): "{f}*({expr})",
        ("m", "Hz"): "c/({expr})/{f}",
        ("m", "J"): "h*c/({expr})/{f}",}
    

    expr is (conceptually) replaced by the first argument of the UDF, and f is the conversion factor between the input unit and the unit expr is in. Note that thankfully, no additive operators are involved and thus all this is numerically well-conditioned. Hence, I can afford not attempting to simplify any of the expressions involved.

    The rest is essentially book-keeping, where I'm using the ADQL parser to turn the expression into a tree fragment and then fiddling in the tree fragment for expr into that. The result then replaces the UDF function call in the syntax tree. You can review all this in context in DaCHS' ufunctions.py, starting at the definition of toStdUnit.

    Sure: this is no Turing award material. But perhaps these notes are useful when people want to put this kind of thing into their ADQL engines. Which I'd consider a Really Good Thing™.

  • Crazy Shapes in TAP

    OpenNGC shapes

    A complex shape from OpenNGC: MOCs need not be convex, or simply connected, or anything.

    So far when you did spherical geometry in ADQL, you had points, circles, and polygons as data types, and you could test for intersection and containment as operations. This feature set is a bit unsatisfying because there are no (algebraic) groups in this picture: When you join or intersect two circles, the result only is a circle if one contains the other. With non-intersecting polygons, you will again not have a (simply connected) spherical polygon in the end.

    Enter MOCs (which I've mentioned a few times before on this blog): these are essentially arbitrary shapes on the sky, in practice represented through lists of pixels, cleverly done so they can be sufficiently precise and rather compact at the same time. While MOCs are powerful and surprisingly simple in practice, ADQL doesn't know about them so far, which limits quite a bit what you can do with them. Well, DaCHS would serve them since about 1.3 if you managed to push them into the database, but there were no operations you could do on them.

    Thanks to work done by credativ (who were really nice to work with), funded with some money we had left from our previous e-inf-astro project (BMBF FKZ 05A17VH2) on the pgsphere database extension, this has now changed. At least on the GAVO data center, MOCs are now essentially first-class citizens that you can create, join, and intersect within ADQL, and you can retrieve the results. All operators of DaCHS services are just a few updates away from being able to offer the same.

    So, what can you do? To follow what's below, get a sufficiently new TOPCAT (4.7 will do) and open its TAP client on http://dc.g-vo.org/tap (a.k.a. GAVO DC TAP).

    Basic MOC Operations in TAP

    First, let's make sure you can plot MOCs; run

    SELECT name, deepest_shape
    FROM openngc.shapes
    

    Then do Graphics/Sky Plot, and in the window that pops up then, Layers/Add Area Control. Then select your new table in the Position tab, and finally choose deepest_shape as area (yeah, this could become a bit more automatic and probably will over time). You will then see the footprints of a few NGC objects (OpenNGC's author Mattia Verga hasn't done all yet; he certainly welcomes help on OpenNGC's version control repo), and you can move around in the plot, yielding perhaps something like Fig. 1.

    Now let's color these shapes by object class. If you look, openngc.data has an obj_type column – let's group on it:

    SELECT
      obj_type,
      shape,
      AREA(shape) AS ar
    FROM (
      SELECT obj_type, SUM(deepest_shape) AS shape
      FROM openngc.shapes
      NATURAL JOIN openngc.data
      GROUP BY obj_type) AS q
    

    (the extra subquery is a workaround necessary because the area function wants a geometry or a column reference, and ADQL doesn't allow aggregate functions – like sum – as either of these).

    In the result you will see that so far, contours for about 40 square degrees of star clusters with nebulae have been put in, but only 0.003 square degrees of stellar associations. And you can now plot by the areas covered by the various sorts of objects; in Fig. 2, I've used Subsets/Classify by Column in TOPCAT's Row Subsets to have colours indicate the different object types – a great workaround when one deals with categorial variables in TOPCAT.

    MOCs and JOINs

    Another table that already has MOCs in them is rr.stc_spatial, which has the coverage of VO resources (and is the deeper reason I've been pushing improved MOC support in pgsphere – background); this isn't available for all resources yet , but at least there are about 16000 in already. For instance, here's how to get the coverage of resources talking about planetary nebulae:

    SELECT ivoid, res_title, coverage
    FROM rr.subject_uat
      NATURAL JOIN rr.stc_spatial
      NATURAL JOIN rr.resource
    WHERE uat_concept='planetary-nebulae'
      AND AREA(coverage)<20
    

    (the rr.subject_uat table is a local extension to RegTAP that will be the subject of some future blog post; you could also use rr.res_subject, but because people still use wildly different keyword schemes – if any –, that wouldn't be as much fun). When plotted, that's the left side of Fig. 3. If you do that yourself, you will notice that the resolution here is about one degree, which is a special property of the sort of MOCs I am proposing for the Registry: They are of order 6. Resolution in MOC goes up with order, doubling with every step. Thus MOCs of order 7 have a resolution of about half a degree, MOCs of order 5 a resolution of about two degrees.

    One possible next step is fetch the intersection of each of these coverages with, say, the DFBS (cf. the post on Byurakan spectra). That would look like this:

    SELECT
      ivoid,
      res_title,
      gavo_mocintersect(coverage, dfbscoverage) as ovrlp
    FROM (
      SELECT ivoid, res_title, coverage
      FROM rr.subject_uat
      NATURAL JOIN rr.stc_spatial
      NATURAL JOIN rr.resource
      WHERE uat_concept='planetary-nebulae'
      AND AREA(coverage)<20) AS others
    CROSS JOIN (
      SELECT coverage AS dfbscoverage
      FROM rr.stc_spatial
      WHERE ivoid='ivo://org.gavo.dc/dfbsspec/q/spectra') AS dfbs
    

    (the DFBS' identifier I got with a quick query on WIRR). This uses the gavo_mocintersect user defined function (UDF), which takes two MOCs and returns a MOC of their common pixels. Which is another important part why MOCs are so cool: together with union and intersection, they form groups. It should not come as a surprise that there is also a gavo_mocunion UDF. The sum aggregate function we've used in our grouping above is (conceptually) built on that.

    Planetary Nebula footprint and plate matches

    Fig. 3: Left: The common footprint of VO resources declaring a subject of planetary-nebula (and declaring a footprint). Right bottom: Heidelberg plates intersecting this, and, in blue, level-6 intersections. Above this, an enlarged detail from this plot.

    You can also convert polygons and circles to MOCs using the (still DaCHS-only) MOC constructor. For instance, you could compute the coverage of all resources dealing with planetary nebulae, filtering against obviously over-eager ones by limiting the total area, and then match that against the coverages of images in, say, the Königstuhl plate achives HDAP. Watch this:

    SELECT
      im.*,
      gavo_mocintersect(MOC(6, im.coverage), pn_coverage) as ovrlp
    FROM (
      SELECT SUM(coverage) AS pn_coverage
      FROM rr.subject_uat
      NATURAL JOIN rr.stc_spatial
      WHERE uat_concept='planetary-nebulae'
      AND AREA(coverage)<20) AS c
    JOIN lsw.plates AS im
    ON 1=INTERSECTS(pn_coverage, MOC(6, coverage))
    

    – so, the MOC(order, geo) function should give you a MOC for other geometries. There are limits to this right now because of limitations of the underlying MOC library; in particular, non-convex polygons are not supported right now, and there are precision issue. We hope this will be rectified soon-ish when we base pgsphere's MOC operations on the CDS HEALPix library. Anyway, the result of this is plotted on the right of Fig. 3.

    Open Ends

    In case you have MOCs from the outside, you can also construct MOCs from literals, which happen to be the ASCII MOCs from the standard. This could look like this:

    SELECT TOP 1
      MOC('4/30-33 38 52 7/324-934') AS ar
    FROM tap_schema.tables
    

    For now, you cannot combine MOCs in CONTAINS and INTERSECTS expressions directly; this is mainly because in such an operation, the machine as to decide on the order of the MOC the other geometries are converted to (and computing the predicates between geometry and MOC directly is really painful). This means that if you have a local table with MOCs in a column cmoc that you want to compare against a polygon-valued column coverage in a remote table like this:

    SELECT db.* FROM
      lsw.plates AS db
      JOIN tap_upload.t6
    ON 1=CONTAINS(coverage, cmoc) -- fails!
    

    you will receive a rather scary message of the type “operator does not exist: spoly <@ smoc”. To fix it (until we've worked out how to reasonably let the computer do that), explicitly convert the polygon:

    SELECT db.* FROM
      lsw.plates AS db
      JOIN tap_upload.t6
    ON 1=CONTAINS(MOC(7, coverage), cmoc)
    

    (be stingy when choosing the order here – MOCs that already exist are fast, but making them at high order is expensive).

    Having said all that: what I've written here is bleeding-edge, and it is not standardised yet. I'd wager, though, that we will see MOCs in ADQL relatively soon, and that what we will see will not be too far from this experiment. Well: Some rough edges, I'd hope, will still be smoothed out.

    Getting This on Your Own DaCHS Installation

    If you are running a DaCHS installation, you can contribute to takeup (and if not, you can stop reading here). To do that, you need to upgrade to DaCHS's latest beta (anything newer than 2.1.4 will do) to have the ADQL extension, and, even more importantly, you need to install the postgresql-postgres package from our release repository (that's version 1.1.4 or newer; in a few weeks, getting it from Debian testing would work as well).

    You will probably not get that automatically, because if you followed our normal installation instructions, you will have a package called postgresql-11-pgsphere installed (apologies for this chaos; as ususal, every single step made sense). The upshot is that with our release repo added, sudo apt install postgresql-pgsphere should give you the new code.

    That's not quite enough, though, because you also need to acquaint the database with the new functions. This can only be done with database administrator privileges, which DaCHS by design does not possess. What DaCHS can do is figure out the commands to do that when it is called as dachs upgrade -e. Have a look at the output, and if you are satisfied it is about what to expect, just pipe it into psql as a superuser; in the default installation, dachsroot would be sufficiently privileged. That is:

    dachs upgrade -e | psql gavo   # as dachsroot
    

    If running:

    select top 1 gavo_mocunion(moc('1/3'), moc('2/9'))
    from tap_schema.tables
    

    through your TAP endpoint returns '1/3 2/9', then all is fine. For entertainment, you might also make sure that gavo_mocintersect(moc('1/3'), moc('2/13')) is 2/13 as expected, and that if you intersect with 2/3 you get back an empty string.

    So – let's bring MOCs to ADQL!

« Page 2 / 4 »