Articles from Demo

  • Updates to GAVO's Tutorials

    Over the years, GAVO has produced a number of VO tutorials, i.e., texts that introduce some technique related to using the Virtual Observatory, preferably within some halfway plausible scenario. In effect, they are software documentation, and as software itself, software documentation suffers from bit rot. To work against that, the tutorials have to be revised occasionally.

    My two student assistants Sonja Gabriel and Chuanming Mao have recently done some of that revising. Let me use this opportunity to show off some of these freshly polished tutorials.

    A classic one (that has, if I may say so myself, aged rather well), is Adding catalog data to object lists using the VO. This is a thinly disguised introduction to TAP uploads, arguably the most powerful of all the VO tech to date. If you have come to this place without ever having done a TAP upload, you owe it to yourself to at least skim the tutorial and quickly follow along the few steps to do positional crossmatches with just about any astronomical catalog and with just about any level of sophistication.

    part of a screenshot: a histogram, a sky photo with overplotted points

    Another classic – it has its roots in the original Italian VO Days[1] – is TOPCAT and Aladin working together. It is using SDSS data of some galaxy cluster to try and get you to to send around data and positions between different programs using SAMP. If you are reading VO blogs, it is not unlikely this kind of thing will make you yawn. But at VO Days, it's little things like this that usually most immediately appeal to students and researchers alike.

    part of a screenshot: a color-magnitude diagram is a very narrow main sequence, and a proper motion plot

    From a tech point of view, Explore the Pleiades with TOPCAT and Aladin also mainly looks at SAMP (perhaps even somewhat less convincingly), but it's such a striking demo of what an amazing instrument Gaia is, and it's a nice introduction to TOPCAT's VO interface and subsetting facility that it's definitely worth a look, in particular as a showcase of having instant results with the VO.

    circular cloud of red crosses and blue circles in a celestial coordinate system

    An entirely different topic (well: it also employs SAMP for a moment) is covered by Data Discovery Using the Virtual Observatory Registry. This is trying to motivate looking for data collections in the VO Registry (in the form of our Browser interface to it). This tutorial has grown quite a bit during the review and now includes two sections joining data from different resources for various purposes. One section illustrates how systematics of quasar redshifts might be looked into using different sources, the other investigates the Tully-Fisher relationship in different spectral bands.

    A TOPCAT-plotted histogram with a sharp peak around 39.5 AU and a much wider one around 44.

    The tutorial on Asteroids in the Solar System was entriely overhauled. It was (and still is) mainly intended to be used in schools, and thus it originally just built on things that ran in a web browser. As is typical of things in web browsers, they have long since vanished. Hence, a rather fundamental update was necessary anyway. While we were looking for interesting things to do – the plot above, by the way, is the distribution of major halfaxes in the Kuiper belt –, we ended up even includeding a brief bit on ADQL.

    Due to its school focus, we are also offering this particular text in German as well as in English. If you are an Astronomy teacher with particularly motivated pupils , we would like to hear from you…

    An aladin window showing two aligned photos of the ring nebula in Lyra

    The last revised tutorial I would like to mention also has a somewhat special (main) target audience: Astrometric Calibration using Aladin. Admittedly, automatic, or “blind” calibration has become really great, and I think getting their images located on the sky is not much of a problem even for amateurs any more, thanks in part to services like astrometry.net. But then – sometimes there is nothing like a good, old manual, ummm, “plate” solution. Aladin and the VO make that lot less tedious than it used to be.

    Of course, I cannot have a post on tutorials without mentioning the VO Text Treasures, a web page that shows the educational material currently registered in the VO Registry. This little page also accounts for bit rot: You can sort by the time last inspected there, and thanks to Sonja's and Chuanming's efforts, our tutorials look very good in that representation at the moment.

    In case you have some material suitable for WIRR yourself: Please register it, too. Send me a mail and I will lend you a hand (or, if you are a VO pro, directly read the pertinent standard).

    [1]That's block courses on VO matters lasting a day or two. If you are in Germany, you can book us for your very own one!
  • HEALPix Maps: In General and in Gaia

    blue and reddish pixels drawing a bar on the sky.

    A map of average Gaia colours in HEALPixes 2/83 and 2/86 (Orion south-east). This post tells you how to (relatively) quickly produce such maps.

    This year's puzzler for the AG Tagung turned out to be a valuable source of interesting ADQL queries. I have already written about finding dusty spots on the sky, and in the puzzler solution, I had promised some words on creating dust maps, or, more generally, HEALPix maps of any sort.

    Making HEALPix maps with Gaia source_ids

    The basic technique is explained in Mark Taylor's classical ADASS poster from 2016. On GAVO's TAP service (access URL http://dc.g-vo.org/tap), you will also find an example for that (in TOPCAT's TAP window, check the Service-provided section unter the Examples button for it). However, once you have Gaia source_ids, there is something a lot faster and arguably not much less convenient. Let me quote the footnote on source_id from my DR3 lite table:

    For the contents of Gaia DR3, the source ID consists of a 64-bit integer, least significant bit = 1 and most significant bit = 64, comprising:

    • a HEALPix index number (sky pixel) in bits 36 - 63; by definition the smallest HEALPix index number is zero.
    • […]

    This means that the HEALpix index level 12 of a given source is contained in the most significant bits. HEALpix index of 12 and lower levels can thus be retrieved as follows:

    • [...]
    • HEALpix [at] level n = (source_id)/(235⋅412 − n).

    That is: Once you have a Gaia source_id, you an compute HEALpix indexes on levels 12 or less by a simple integer division! I give you that the more-than-35-bit numbers you have to divide by do look a bit scary – but you can always come back here for cutting and pasting:

    HEALPix level Integer-divide source id by
    12 34359738368
    11 137438953472
    10 549755813888
    9 2199023255552
    8 8796093022208
    7 35184372088832
    6 140737488355328
    5 562949953421312
    4 2251799813685248
    3 9007199254740992
    2 36028797018963968

    If you know – and that is very valuable knowledge far beyond this particular application – that you can simply jump between HEALPix indexes of different levels by multiplying with or integer-dividing by four, the general formula in the footnote actually becomes rather memorisable. Let me illustrate that with an example in Python. HEALPix number 3145 on level 6 is:

    >>> 3145//4  # ...within this HEALPix on level 5...
    786
    >>> 3145*4, (3145+1)*4  # ..and covers these on level 7...
    (12580, 12584)
    

    Simple but ingenious.

    You can immediately exploit this to make HEALPix maps like the one in the puzzler. This piece of ADQL does the job within a few seconds on the GAVO DC TAP service[1]:

    SELECT source_id/8796093022208 AS pix,
      AVG(phot_bp_mean_mag-phot_rp_mean_mag) AS avgcol
    FROM gaia.edr3lite
    WHERE distance(ra, dec, 246.7, -24.5)<2
    GROUP by pix
    

    Using the table above, you see that the horrendous 8796093022208 is the code for HEALPix level 8. When you remember (and you should) that HEALPix level 6 corresponds to a linear dimension of about 1 degree and each level is a factor of two in linear dimension, you see that the map ought to have a resolution of about 1/8th of a degree.

    HEALPix to Screen Pixel

    How do you plot this? Well, in TOPCAT, do GraphicsSky Plot, and then in the plot window LayersAdd HEALPix control (there are icons for both of these, too). You then have to manually configure the plot for the table you just retrieved: Set the Level to 8, the index to pix and the Value to avgcol – we're working on making the annotation a bit richer so that TOPCAT has a chance to figure this out by itself.

    With a bit of extra configuration, you get the following map of average colours (really: dust concentration):

    Plot: Black and reddish pixels showing a bit of structure

    This is not totally ideal, as at the border of the cone, certain Healpixes are only partially covered, which makes statistics unnecessarily harder.

    Positional Constraints using source_ids

    Due to Gaia's brilliant numbering scheme, we can do analysis by HEALpix, too, circumventing (among other things) this problem. Say you are interested in the vicinity of the M42 and would like to investigate a patch of about 8 degrees. By our rule of thumb, 8 degrees is three levels up from the one-degree level 6. To find the corresponding HEALpix index, on DaCHS servers with their gavo_simbadpoint UDF you could say:

    SELECT TOP 1 ivo_healpix_index(3, gavo_simbadpoint('M42'))
    FROM tap_schema.tables
    

    Hu, you ask, what's tap_schema.tables to do with this? Well, nothing, really. It's just that ADQL's syntax requires selecting from a table, even if what we select is completely independent of any table, as for instance the index of M42's 3-HEALpix. The hack above picks in a table guaranteed to exist on all TAP services, and the TOP 1 makes sure we only compute the value once. In case you ever feel the need to abuse a TAP service as a calculator: Keep this trick in mind.

    The result, 334, you could also have found more graphically, as follows:

    1. Start Aladin
    2. Check OverlayHEALPix grid
    3. Enter M42 in Command
    4. Zoom out until you see HEALPix indexes of level 3 in the grid.

    An advantage you have with this method: You see that M42 happens to lie on a border of HEALPixes; perhaps you should include all of 334, 335, 356, and 357 if you were really interested in the Orion Nebula's vicinity.

    We, on the other hand, are just interested in instructive examples, and hence let's just repeat our colour mapping with all Gaia objects from HEALPix 3/334. How do you select these? Well, by source_id's construction, you know their source_ids will be between 334⋅9007199254740992 and (334 + 1)⋅9007199254740992 − 1:

    SELECT source_id/8796093022208 AS pix,
      AVG(phot_bp_mean_mag-phot_rp_mean_mag) AS avgcol
    FROM gaia.edr3lite
    WHERE source_id BETWEEN 334*9007199254740992 AND 335*9007199254740992-1
    GROUP by pix
    

    This is computationally cheap (though Postgres, not being a column store still has to do quite a bit of I/O; note how much faster this query is when you run it again and all the tuples are already in memory). Even going to HEALPix level 2 would in general still be within our sync time limit. The opening figure was produced with the constraint:

    source_id BETWEEN 83*36028797018963968 AND 84*36028797018963968-1
    OR source_id BETWEEN 86*36028797018963968 AND 87*36028797018963968-1
    

    – and with a sync query.

    Aggregating over a Non-HEALPix

    One last point: The constraints we have just been using are, in effect, positional constraints. You can also use them as quick and in some sense rather unbiased sampling tools.

    For instance, if you would like so see how the reddening in one of the “dense“ spots in the opening picture behaves with distance, you could first pick a point – α = 98, δ = 4, say –, then convert that to a level 7 healpix as above (that's/88974) and then write:

    SELECT ROUND(r_med_photogeo/200)*200 AS distbin, COUNT(*) as n,
        AVG(phot_bp_mean_mag-phot_rp_mean_mag) AS avgcol
    FROM gaia.dr3lite
    JOIN gedr3dist.main USING (source_id)
    WHERE source_id BETWEEN 88974*35184372088832 and 88975*35184372088832-1
    GROUP BY distbin
    

    This is creating 200 pc bins in distance based on the estimates in the gedr3dist.main table (note that this adds subtle correlations, because these estimates already contain Gaia colour information). Since quite a few of these bins will be very sparsely populated, I'm also fetching the number of objects contributing. And then I plot the whole thing, using the conventional (n) ⁄ n as a rough estimate for the relative error:

    Plot: A line that first slowly declines, then rises quite a bit, then flattens out and becomes crazy as errors start to dominate.

    This plot immediatly shows that colour systematics are not exclusively due to dust, as in that case things would only get redder all the time. The blueward trend up to 700 pc is reasonably well explained by the brighter, bluer upper main sequence becoming more dominant in the population sampled as red dwarfs become too faint for Gaia.

    The strong reddening setting in after that is rather certainly due to the Orion complex, though I would perhaps not have expected it to reach out to 2 kpc (the conventional distance to M42 is about 0.5 kpc); without having properly thought about it, I'll chalk it off as “the Orion arm“. And after that, it's again what I'd call Malmquist-blueing until the whole things dissolves into noise.

    In conclusion: Did you know you can group by both healpix and distbin at the same time? I am sure there are interesting structures to be found in what you will get from such a query…

    [1]You may be tempted to write source_id/(POWER(2, 35)*POWER(4, 3) here for clarity. Resist that temptation. POWER returns floating point numbers. If you have one float in a division, not even a ROUND will get you back into the integer division realm, and the whole trick implodes. No, you will need the integer literals for now.
  • Computing Residuals of an Astrometric Calibration

    Two plots, left a fairly good correlation, right a cloudy wave

    The kind of plot you can make following the recipe given here: Left, a comparison of the photometry, right, a positional residuals, not taking into account the SIP plate solution, when comparing the HDAP plate B3261a against Gaia DR3. Note that the cut-off a 4 arcsec is because of the match radius when obtaining the calibrator stars.

    I recently had to assess the quality of the astrometric calibration of a photographic plate. What I am going to show you in this post will of course work just as well for CCD frames, and if these have a sufficiently large field of view, this may be an issue for them as well. However, the sort of data that needs this assessment most typically are scans of plates, as these tend to have a “wobble”, systematic offsets in the scan direction resulting from imperfections in the mechanics.

    Prerequisites: An astronomical frame with a calibration in ICRS (or some frame not very far from it), called my-image.fits in the following, SExtractor (in Debian and derivatives: apt install source-extractor – long live Debian Astro; since it's called source-extractor in Debian, that's what I'll use here, too), and of course TOPCAT.

    Step 1: Extract Sources. Source extraction is of course a high science, and if you know better than me, by all means do it the way you think is appropriate. Meanwhile, the following might very well work for you sufficiently well.

    Create a working directory and enter it. Then, to create a file telling source-extractor what columns you would like to see, write the following to a file default.param:

    ALPHA_SKY
    DELTA_SKY
    X_IMAGE
    Y_IMAGE
    MAG_ISO
    FLUX_AUTO
    ELONGATION
    

    Next, give a few parameters to source-extractor; depending on the sort of image you have, you may want to play around with DETECT_MINAREA (how many pixels need to show a signal to register as a source) and DETECT_THRESH (how many sigmas a pixel has to be above the background to register as a candidate for belonging to a source). Meanwhile, write the following into a file default.control:

    CATALOG_TYPE     FITS_1.0
    CATALOG_NAME     img.axy
    PARAMETERS_NAME  default.param
    FILTER           N
    DETECT_MINAREA   30
    DETECT_THRESH    4
    SEEING_FWHM      1.2
    

    – but if the following call gives you a few hundred sources, that ought to work for the present purpose.

    Then run:

    source-extractor -c default.control my-image.fits
    

    This will give you a catalogue of extracted objects in the file img.axy.

    Step 2: Fix source-extractor's output. Load that img.axy into TOPCAT. Regrettably, source-extractor does not add any useful metadata to the columns of its output table. To add the absolute bare minimum, in TOPCAT go to ViewsColumn Info. In that window, check UCD in the Display menu, and then put pos.eq.ra and pos.eq.dec into the UCD fields of the ALPHA_SKY and DELTA_SKY columns, respectively; double click to change fields in TOPCAT.

    To see if you have done the annotation right, in TOPCAT's main window, click GraphicsSky Plot. If the objects show up, you have just provided enough annotation to let TOPCAT figure out the position for each row.

    Step 3: Get calibrators. We will now try to add counterparts for Gaia DR3 to the extracted sources. To do that, click VOTable Access Protocol, and in the window popping up double click the entry for the GAVO DC TAP.

    In the Find box, type dr3lite to look for this site's version of the Gaia DR3 source catalogue. Click on gaia.dr3lite to select that table, and then select the Columns pane. This should show some of the Gaia DR3 columns.

    Now ExamplesUpload Join will generate a query that will cross-match your extracted sources with the Gaia sources. You should edit it a bit, only selecting the columns you will actually need, removing the TOP 1000 (at least on large images with more than 1000 sources), and reducing the match radius a bit when the calibration is not actually completely off and your epoch is sufficiently close to J2000.

    Hint: you can control-click in the Columns pane and then use the Cols button to insert all the column names in one go[1]. For me, the resulting query would be:

    SELECT
       source_id, ra, dec, phot_bp_mean_mag,
       tc.*
       FROM gaia.dr3lite AS db
       JOIN TAP_UPLOAD.t1 AS tc
       ON 1=CONTAINS(POINT('ICRS', db.ra, db.dec),
                     CIRCLE('ICRS', tc.ALPHA_SKY, tc.DELTA_SKY, 4./3600.))
    

    This should result in about as many matches as your extraction had – a few more is ok, because you will have some spurious matches, a few less is ok, too, as there are always some outliers and artefacts, but you should clearly not pull a magnitude more or less objects here than you put in; fiddle with the match radius as necessary.

    See if there is a rough correlation between the Gaia calibrators and your extracted sources by plotting phot_bp_mean_mag against MAG_ISO. Absent more information, MAG_ISO, source-extractor's guess for the magnitude of the extracted object, will be just some crazy number, but it should have some discernable correlation with the actual magnitude. Do not expect too much here, in particular with old plates, for which good photometry is a science of their own.

    In my example, this looked like this:

    Plot: a rough correlation in red with a green tail

    The green points certainly are spurious matches; this observation did not reach beyond 14th magnitude or so, and there are many weak stars on the sky, so a few of them will show up in just about any cross match. See the opening picture for an example with a better correlation.

    Step 4: Do the correlation plot. Do GraphicsPlane Plot and then plot ra-alpha_sky or dec-delta_sky against X_IMAGE or Y_IMAGE. You could get something like this:

    Plot: A single wavy thing

    This rather certainly reflects some optical distortion; source-extractor regrettably does not take into account SIP corrections yet, so it is likely that a large part of this would be taken care of by the polynomials of the plate solution (the github issue I am linking to tells you how to be sure).

    But it can also look like this:

    Plot: Multiple wobbles

    This certainly is not the result of a lens or anything optical at all. It's the scanner's gears that you are looking at here. With an amplitude of perhaps three arcseconds this is rather excessive here; but something like this you will rather likely see even on good scanners – though it may essentially be invisible, as of the Heidelberg scanner we used for HDAP:

    Plot: A vertical cloud with no discernible structure.
    [1]I'm using the BP magnitude in the query below as most historical plates tend to be “blue sensitive“ (in some sense). Hence, BP magnitudes should be a bit closer to what source-extractor has extracted.
  • A Proposed Vector Extension for ADQL

    When I showed off my rendering of the Gaia DR 3 XP spectra a month ago, I promised I would later show how my proposal for a Vector extension to ADQL would enable quite a bit of interesting functionality on that table. Let me make good on this promise with a little project to find candidates for Wolf-Rayet stars, more specifically, WC stars. I give you that's a bit cheap because they have very distinctive spectral features, but then this is supposed to be a quick educational posting, not a science paper.

    Before I start, I probably should stress that in this context I am using the word “vector” like a computer guy would. We are talking about one-dimensional arrays here, not about the vectors the mathematicians have given us (as in “elements of vector spaces“).

    Getting Spectra for Wolf-Rayet Stars

    Let us first produce a list of Wolf-Rayet stars (of any denomination) using SIMBAD. So, start TOPCAT, open the TAP dialog and find the SIMBAD TAP server. Run:

    SELECT main_id, ra, dec
    FROM basic
    WHERE otype='WR*'
    

    there[1]. In the next step, we will need the Gaia DR3 source_ids for these objects, and so it would be nice to pull them immediately from Simbad; you could in principle do that by running:

    SELECT main_id, ra, dec, id
    FROM basic
      JOIN ident
      ON (oid=oidref)
    WHERE otype='WR*'
      AND id LIKE 'Gaia DR3%'
    

    – but for one, fiddling out the actual source_ids from the strings you get in the id column is a bit tedious, and then quite a few of these objects don't have Gaia ids yet: The first query returns 1548 at the moment, the second 696.

    If we had the source_ids, we could immediately join with the gdr3spec.spectra table at the GAVO DC TAP (http://dc.g-vo.org/tap). As I mentioned a month ago, this is a physical table just consisting of the source id and arrays of flux and flux errors. There is also gdr3spec.withpos that has positions and the spectra; but that's a view, and for the time being that means that the planner will quite likely get confused when positional constraints come into play[2]. The result would be queries running for half an hour when a few seconds would do just as well.

    On services already supporting ADQL 2.1, you can usually work around problems of this kind by re-writing your query to use CTEs (“WITH”), because these often work as planner barriers. In the present case, we first get source_ids for our Simbad objects in a CTE and then use these to join with our spectra table, like this:

    WITH wrids AS (
            SELECT source_id, main_id
            FROM gaia.edr3lite AS l
        JOIN tap_upload.t1 AS u
            ON (DISTANCE(l.ra, l.dec, u.ra, u.dec)<0.001))
    SELECT main_id, source_id, flux, flux_error
    FROM gdr3spec.spectra
    JOIN wrids USING (source_id)
    

    As usual, you will probably have to adapt the number in what is tap_upload.t1 here to the table index you have for your SIMBAD result.

    This yields 574 spectra at the moment, within a few seconds. Spectra for about a third of our collection of objects: I'd say that's quite impressive.

    Investigating the spectra

    The September post already discussed a few aspects of array plotting. In short: try the plane plot and an XYArray control. Modern TOPCATs (and for what I'm doing here it's wise to use something newer than 4.8-7) will automatically figure out suitable columns for the x and y axes (except it forgets to label the spectral coordinate with its unit, nm):

    Plot: lots of wiggly lines

    There's a quite a bit of crowding here; finding global characteristics perhaps works better when you switch the ordinate to logarithmic, use transparent shading (in the Form tab) and raise the opaque limit a bit. This could give you something like:

    Plot: hazy structures in blue

    You can see that at least quite a few objects have nice and strong emission lines, as I had hoped for when choosing WC stars for this example. What if we could pick them out to build a template spectrum? Well, let's try. With the new vector math in ADQL, the database can normalise the spectra and compute a few statistics on them.

    But first: In order to get rid of the source_id-computing CTE above, let me obtain the source_ids I want to work with once and for all, as in:

    SELECT source_id, main_id
    FROM gaia.edr3lite AS l
      JOIN tap_upload.t1 AS u
            ON (DISTANCE(l.ra, l.dec, u.ra, u.dec)<0.001)
    

    Memorise the Table List index of the result. With that, you can directly work with the gdr3spec.spectra table and experiment a bit; for me, the index was 10, and hence I use tap_uploads.t10 below.

    Computing some statistics

    The Gaia XP spectra are flux calibrated, and hence I will have to normalise them if I want to compare them. Ignoring all errors and thus in particular the fact that some (few) components are negative, this normalisation is harmless: We just divide by the sum of all vector components. The net result is that, were our spectra continuous, the integral over them would be one. And let's then use the standard deviation and the value of the 19th array element as metrics:

    WITH normalised AS (
    SELECT source_id, main_id,
            flux/arr_sum(flux) as nflux, flux
    FROM gdr3spec.spectra
    JOIN tap_upload.t10
    USING (source_id))
    SELECT
      source_id, main_id,
      flux, nflux,
      arr_avg(nflux*nflux)-POWER(arr_avg(nflux),2) AS sd,
      nflux[19] as em
    FROM normalised
    

    You can see quite a bit of the vector extension here:

    • arr_sum and arr_avg: These work as the aggregate functions in SQL do, just not on tuples but on the components of the vectors.
    • Multiplication of vectors is element-wise, so nflux*nflux computes a vector of the squares of the components of nflux. That's also true for all other basic arithmetic operators. If you know numpy: same thing.
    • You fetch individual elements in the [] notation you probably know from Python or C. Contrary to Python and C, common SQL implementations count indexes from 1 by default, and we are keeping that here (for now). Fortran lovers rejoice!

    Why did I use nflux[19]? Well, there is a reasonably strong emission feature in many spectra at about 580 nm (it's three-time ionised carbon, or C IV in the rotten notation I usually apologise for when talking to non-astronomers), which is, as experts tell me (thanks, Andreas!), rather characteristic for the WC stars that I'm after (whereas the even stronger feature on the left, around 470 nm, can also be Helium).

    If you inspect the spectral coordinate that TOPCAT has on the abscissa (it's a param, so you'll have to go to ViewsTable Parameters to see it), you will see that each spectral bin is 10 nm wide. So, I will hopefully hit the 580 nm feature when fetching the 19th element of the spectral vector.

    If you plot em versus the sd obtained like that, you will see two reasonably distinct groups, where the ascending arm has relatively strong emission around 580 nm and the descending arm does not. I have used the Blob subset feature to select the upper arm into a subset ”upper” that is blue in the following plot. If you click around in the em-sd plot and show the Activated subset in an array plot, you can see things like:

    Screenshot: scatterplot and stacked spectra next to each other

    The activated spectrum (shown here in yellow) has a strong Hα but basically no C IV – and it's safely outside of our carbon subset. Click around a bit on the ascending arm, and you will see that all these spectra have a bump around array element 18 (in TOPCAT's count, which starts at 0).

    Computing a Template on the Server Side

    Whatever the subset of stars that we would like to use to define our group of interest, we would now like to create a template spectrum from them. A plausible way to do that is to sum them all up – that has the nice side effect that stronger sources (which hopefully are less noisy) have a larger weight.

    To compute the template, in the Views → Column Info for the sd/em table, unselect all columns but source_id (that way, you only upload what you absolutely must), and in the main window, select the upper subset in the Row Subset combo box. That way, only the rows in that subset will get uploaded in the following query:

    SELECT summed/arr_sum(summed) AS tpl
    FROM (
      SELECT SUM(flux) AS summed
      FROM gdr3spec.spectra
      JOIN tap_upload.t19
      USING (source_id)) AS q
    

    Again, you will have to adapt the t19 to where your manipulated sd/em table is. If you get an “ambiguous column flux” error (or so) here, you forgot to unselect all columns but source_id in the columns window.

    It pays to briefly appreciate what happens here: The SELECT SUM(flux) is an aggregate function over arrays, meaning that all the arrays are being summed over component-wise. Against that, the sum in summed/arr_sum(summed) is summing within the array. If it helps you, you could imagine having all the arrays in the table stacked. Then, SUM(arr) produces the vertical margin sum, and array_sum(arr) procudes the horizontal margin sum.

    Well, here's the template I got in this way:

    Plot: a single wiggly line

    I give you that in this particular case, you could easily have done the computation on the client side, because you already had the spectra in your table. But the technique also works when you don't, and it will scale to millions of arrays (although you will have to carefully think about numerics when doing such enormous sums).

    Also – I cannot lie – I simply had to have a pretext for showing you aggregate functions over arrays.

    Finding Similar Objects

    Now that we have the template, can we find objects that have similar properties? Sure: We upload the array and compute some metric, perhaps the (squared) euclidian distances to normalised spectra. If the template is in TOPCAT's table 25, you can write:

    SELECT TOP 200000
      source_id,
      arr_sum(
        arr_map(
          power(x,2),tpl-flux/arr_sum(flux))) AS dist2,
      arr_sum(flux_error/flux) AS errs
    FROM gdr3spec.spectra, tap_upload.t25
    

    This will compute the distances between (conceptually randomly drawn) 200000 spectra and your template. I am also requesting the sum of the (relative) flux errors as a measure of how likely it is that wild wiggles actually are just artefacts.

    There is one array-related feature in that query I have not yet mentioned: arr_map. This applies an expression to all components to a vector, pretty much like python's map function, and my attempt to have some (perhaps somewhat lame) substitute for numpy's ufunctions.

    I am rather sure we really need something like this. SQL has no notion of defining functions in queries. That is usually welcome, as otherwise it would quicky become Turing-complete, which would be bad for what it is designed to do. Here, however, that is a problem, because we do not have a clean way to write the expression be be computed for each component. For now, I have decreed that the first argument of map is an expression over a formal x. This is ugly not only because it will be confusing when there's an actual x in a table or query. I suspect with a bit more thought and creativity, one can find a better solution that still does not require a re-write of half the SQL grammar. But then let's see – perhaps this makeshift hack proves to be less troublesome than I expect.

    Note that on my server, you will only get back 20000 matches by default; you would have to adapt Max Rows to actually retrieve 200'000, and then you also must switch to Asynchronous mode. This will then actually take a non-trivial amount of CPU and disk I/O; going through the entire set of 2e8 rows will be a matter of hours or so. Hence, I'm grateful if you do all-sky scans only after having tested your queries on much smaller subsets.

    I've done this for 500'000 rows (which took a few minutes), which might bring up a few C IV-strong WR stars (these beasts are rare, you know). The result in the dist2/errs plane is (logplot zoomed a bit):

    Plot: a gigantic point cloud with a few outliers.

    Well: at least there are a few promising cases. Which would conclude this little demo for the ADQL vector proposal. Looking at what we have found here is another story.

    Still, I could not resist having another look at what my box has found there. There is a rather clear cut in the plot at perhaps 0.009, and thus I created a subset interesting consisting of objects for which dist2<0.009 (which is 18 objects for me) and did the trick above, only uploading this interesting subset with only the source_id column to resolve to Gaia DR3 (lite) rows:

    SELECT *
    FROM gaia.dr3lite
    JOIN tap_upload.t33
    USING (source_id)
    

    And then I wondered whether any of these were known to SIMBAD and switched to their TAP service:

    SELECT
      tc.*, otype
    FROM basic AS db
     RIGHT OUTER JOIN TAP_UPLOAD.t34 AS tc
     ON 1=CONTAINS(POINT('ICRS', db.ra, db.dec),
                   CIRCLE('ICRS', tc.ra, tc.dec, 1./3600.))
    

    Note the use of RIGHT OUTER JOIN to ensure we won't lose any matches on the way; if this weren't such a small table, you'd be better off just uploading the positions and then doing a local match to recover the rest of the table, by the way.

    As to what's coming back: Well, a bunch of white dwarf candidates, a “blue“ star, a few objects SIMBAD knows as quasars (that at least makes sense, because it's rather likely that some of them have lines redshifted into my C IV window), and a few unclassified objects. Whether SIMBAD is wrong on at least some of them, whether the positional crossmatch fetched unrelated objects, or whether I got it all wrong I will not decide here. Let me give you my candidates as a VOTable, though.

    You now know what you have to do to add nice, if low-resolution, spectra to them.

    Slices?

    A notable absence from the current vector extension is slicing. I think we should have it – in this example, this would be really useful when summing different spectral regions without having to write long sums (“synthetic broadband photometry“).

    I have not put it in yet mainly because I am not sure if Python-like syntax (nflux[4:7]) is a good idea when we have 1-based arrays. Also: Do we want to keep the upper index out? That's certainly the right thing for Python (where you want a[:3]+a[3:] == a), but is it here? Speaking of which: Should we require support of half-open slices? Should we rather have a function arr_slice? With what arguments?

    I'd be curious about other peoples' thoughts on slicing.

    [1]

    In case you wonder how I came up with the WR*: You can simply run something like:

    SELECT otype, label
    FROM otypedef
    WHERE description LIKE '%Wolf%'
    

    Once Simbad upgrades to ADQL 2.1, you probably want to replace the LIKE with ILIKE for robustness.

    [2]For the incurably curious, you can learn more about the underlying problem at https://github.com/segasai/q3c/issues/30.
  • Find a Dust-Free Window Using ADQL

    Five sky images, all of them showing star clusters

    Five of the seven patches of the sky that Bayestar 17 considers least obscured by dust in Aladin's WISE color HiPSes. There clearly is a pattern here. This post is about how you'll find these (and the credible ones, too).

    The upcoming AG-Tagung in Bremen will have another puzzler, and while concocting the problem I needed to find a spot on the sky where there is very little interstellar extinction. What looks like a quick query turned out to require a few ADQL tricks that I thought I might show in this little post; they will come in handy in many situations.

    First, I needed to find data on where on the sky there is dust. Had I not known about the extinction maps I've blogged about in 2018, I would probably have looked for extinction maps in the Registry, which might have led me to the Bayestar 17 map on my service eventually, too. The way it was, I immediately fired up TOPCAT and pointed it to the TAP service at http://dc.g-vo.org/tap (the “GAVO DC TAP“ of the TAP service list) and went to the column metadata of the prdust.map_union table.

    Browsing the descriptions, the relevant columns here are healpix (which will give me the position) and best_fit. That latter thing is an array of reddening E(B − V) (i.e., higher values mean more dust) per distance bin, where the bins are 0.5 mag of distance modulus wide. I decided I'd settle for bin 20, corresponding to a kiloparsec. Dust further away than that will not trouble me much in the puzzler.

    Finding the healpixes in the rows with the smallest best_fit[20] should be easy; it is a minor variant of a classic from the ADQL course:

    SELECT TOP 20 healpix
    FROM prdust.map_union
    ORDER BY best_fit[20] ASC
    

    Except that my box replies with an error message reading “Expected end of text, found '[' (at char 61), (line:3, col:18)”.

    Hu? Well… if you look, then the problem is where I ask to sort by an array element. And indeed, it turns out that DaCHS, the software driving this site, will not let you sort by array elements yet. This is arguably a bug, and in all likelihood I will have fixed it by the time your read this. But there is a technique to defeat this and similar cases that every astronomer should know about: subqueries, which turn any query into something you can work with as if it were a table. In this case:

    SELECT TOP 30 healpix, extinction
    FROM (
      SELECT healpix, best_fit[20] as extinction
      FROM prdust.map_union) AS q
    ORDER BY extinction ASC
    

    – the “AS q“ gives the name of the “virtual” table resulting from the query a name. It is mandatory here. Do not be tempted to leave out the “AS” – that that is even legal is one of the major blunders of the SQL standard.

    The result is looking good:

    # healpix extinction
    1021402 0.00479
    1021403 0.0068
    418619  0.00707
    ...
    

    – so, we have the healpixes for which the extinction works out to be minimal. It is also reassuring that the two healpixes with the clearest sky (by this metric) are next to each other – where there are clear skies, it's likely that there are more clear skies nearby.

    But then… where exactly are these patches? The column description says “The healpix (in galactic l, b) for which this data applies. This is of the order given in the hpx_order column”. Hm.

    To go from HEALPix to positions, there is the ivo_healpix_center user defined function (UDF) on many ADQL services; it is part of the IVOA's UDF catalogue, so whenever you see it, it will do the same thing. And where would you see it? Well, in TOPCAT, UDFs show up in the Service tab with a signature and a short description. In this case:

    ivo_healpix_center(hpxOrder INTEGER, hpxIndex BIGINT) -> POINT
    
      returns a POINT corresponding to the center of the healpix with the
      given index at the given order.
    

    With this, we can change our query to spit out positions rather than indices:

    SELECT TOP 30 ivo_healpix_center(hpx_order, healpix) AS pos, extinction
    FROM (
      SELECT healpix, best_fit[20] as extinction, hpx_order
      FROM prdust.map_union) AS q
    ORDER BY extinction ASC
    

    The result is:

    # pos                                    extinction
    "(42.27822580645164, 78.65148926014334)" 0.00479
    "(42.44939271255061, 78.6973986631694)"  0.0068
    "(58.97460937500027, 40.86635677386179)" 0.00707
    ...
    

    That's my positions all right, but they are still in galactic coordinates. That may be fine for many applications, but I'd like to have them in ICRS. Transforming them takes another UDF; this one is not yet standardised and hence has a gavo_ prefix (which means you will only find it on reasonably new services driven by DaCHS).

    On services that have that UDF (and the GAVO DC TAP certainly is one of them), you can write:

    SELECT TOP 30
      gavo_transform('GALACTIC', 'ICRS',
        ivo_healpix_center(hpx_order, healpix)) AS pos,
      extinction
    FROM (
      SELECT healpix, best_fit[20] as extinction, hpx_order
      FROM prdust.map_union) AS q
    ORDER BY extinction ASC
    

    That results in:

    # pos                                    extinction
    "(205.6104289782676, 28.392541949473785)" 0.00479
    "(205.55600830161907, 28.42330388161418)" 0.0068
    "(250.47595812552925, 36.43011215633786)" 0.00707
    "(166.10872483007287, 21.232866316024364)" 0.00714
    "(259.3314211312357, 43.09275090468469)" 0.00742
    "(114.66957763676628, 21.603135736808532)" 0.00787
    "(229.69174233173712, 2.0244022486718793)" 0.00793
    "(214.85349325052758, 33.6802370378023)" 0.00804
    "(204.8352084989552, 36.95716352922782)" 0.00806
    "(215.95667870050661, 36.559656879148044)" 0.00839
    "(229.66068062277128, 2.142516479012763)" 0.0084
    "(219.72263539838667, 58.371829835018424)" 0.00844
    ...
    

    If you have followed along, you now have a table of the 30 least reddened patches in the sky according Bayestar17. And you are probably as curious to see them as I was. That curiosity made me start Aladin and select WISE colour imagery, reckoning dust (at the right temperature) would be more conspicuous in WISE's wavelengths then in, say, DSS.

    I then did Views -> Activation Actions and wanted to check “Send Sky Coordinates“ to make Aladin show the sky at the position of my patches. This is usually preconfigured by TOPCAT to just work when tables have positions. Alas: at least in versions up to 4.8, TOPCAT does not know about points (in the ADQL sense) when making clever guesses there.

    But there is a workaround: Select “Send Sky Coordinates” in the Activation Actions window and then type pos[0] in “RA Column“, and pos[1] in “Dec Column” – this works because under the hood, VOTable points are just 2-arrays. That done, you can check the activation action.

    After these preparations, when you click through the first few results, you will find objects like those in the opending image (and also a few fairly empty fields). Stellar clusters are relatively rare on the sky, so their prevalence in these patches quite clearly shows that Bayestar's model has a bit of a fixation about them that's certainly not related to dust.

    Which goes to serve as another example of Demleitner's law 567: “In any table, the instances with the most extreme values are broken with a likelihood of 0.567”.

Page 1 / 3 »