Articles from Demo

  • HEALPix Maps: In General and in Gaia

    blue and reddish pixels drawing a bar on the sky.

    A map of average Gaia colours in HEALPixes 2/83 and 2/86 (Orion south-east). This post tells you how to (relatively) quickly produce such maps.

    This year's puzzler for the AG Tagung turned out to be a valuable source of interesting ADQL queries. I have already written about finding dusty spots on the sky, and in the puzzler solution, I had promised some words on creating dust maps, or, more generally, HEALPix maps of any sort.

    Making HEALPix maps with Gaia source_ids

    The basic technique is explained in Mark Taylor's classical ADASS poster from 2016. On GAVO's TAP service (access URL http://dc.g-vo.org/tap), you will also find an example for that (in TOPCAT's TAP window, check the Service-provided section unter the Examples button for it). However, once you have Gaia source_ids, there is something a lot faster and arguably not much less convenient. Let me quote the footnote on source_id from my DR3 lite table:

    For the contents of Gaia DR3, the source ID consists of a 64-bit integer, least significant bit = 1 and most significant bit = 64, comprising:

    • a HEALPix index number (sky pixel) in bits 36 - 63; by definition the smallest HEALPix index number is zero.
    • […]

    This means that the HEALpix index level 12 of a given source is contained in the most significant bits. HEALpix index of 12 and lower levels can thus be retrieved as follows:

    • [...]
    • HEALpix [at] level n = (source_id)/(235⋅412 − n).

    That is: Once you have a Gaia source_id, you an compute HEALpix indexes on levels 12 or less by a simple integer division! I give you that the more-than-35-bit numbers you have to divide by do look a bit scary – but you can always come back here for cutting and pasting:

    HEALPix level Integer-divide source id by
    12 34359738368
    11 137438953472
    10 549755813888
    9 2199023255552
    8 8796093022208
    7 35184372088832
    6 140737488355328
    5 562949953421312
    4 2251799813685248
    3 9007199254740992
    2 36028797018963968

    If you know – and that is very valuable knowledge far beyond this particular application – that you can simply jump between HEALPix indexes of different levels by multiplying with or integer-dividing by four, the general formula in the footnote actually becomes rather memorisable. Let me illustrate that with an example in Python. HEALPix number 3145 on level 6 is:

    >>> 3145//4  # ...within this HEALPix on level 5...
    786
    >>> 3145*4, (3145+1)*4  # ..and covers these on level 7...
    (12580, 12584)
    

    Simple but ingenious.

    You can immediately exploit this to make HEALPix maps like the one in the puzzler. This piece of ADQL does the job within a few seconds on the GAVO DC TAP service[1]:

    SELECT source_id/8796093022208 AS pix,
      AVG(phot_bp_mean_mag-phot_rp_mean_mag) AS avgcol
    FROM gaia.edr3lite
    WHERE distance(ra, dec, 246.7, -24.5)<2
    GROUP by pix
    

    Using the table above, you see that the horrendous 8796093022208 is the code for HEALPix level 8. When you remember (and you should) that HEALPix level 6 corresponds to a linear dimension of about 1 degree and each level is a factor of two in linear dimension, you see that the map ought to have a resolution of about 1/8th of a degree.

    HEALPix to Screen Pixel

    How do you plot this? Well, in TOPCAT, do GraphicsSky Plot, and then in the plot window LayersAdd HEALPix control (there are icons for both of these, too). You then have to manually configure the plot for the table you just retrieved: Set the Level to 8, the index to pix and the Value to avgcol – we're working on making the annotation a bit richer so that TOPCAT has a chance to figure this out by itself.

    With a bit of extra configuration, you get the following map of average colours (really: dust concentration):

    Plot: Black and reddish pixels showing a bit of structure

    This is not totally ideal, as at the border of the cone, certain Healpixes are only partially covered, which makes statistics unnecessarily harder.

    Positional Constraints using source_ids

    Due to Gaia's brilliant numbering scheme, we can do analysis by HEALpix, too, circumventing (among other things) this problem. Say you are interested in the vicinity of the M42 and would like to investigate a patch of about 8 degrees. By our rule of thumb, 8 degrees is three levels up from the one-degree level 6. To find the corresponding HEALpix index, on DaCHS servers with their gavo_simbadpoint UDF you could say:

    SELECT TOP 1 ivo_healpix_index(3, gavo_simbadpoint('M42'))
    FROM tap_schema.tables
    

    Hu, you ask, what's tap_schema.tables to do with this? Well, nothing, really. It's just that ADQL's syntax requires selecting from a table, even if what we select is completely independent of any table, as for instance the index of M42's 3-HEALpix. The hack above picks in a table guaranteed to exist on all TAP services, and the TOP 1 makes sure we only compute the value once. In case you ever feel the need to abuse a TAP service as a calculator: Keep this trick in mind.

    The result, 334, you could also have found more graphically, as follows:

    1. Start Aladin
    2. Check OverlayHEALPix grid
    3. Enter M42 in Command
    4. Zoom out until you see HEALPix indexes of level 3 in the grid.

    An advantage you have with this method: You see that M42 happens to lie on a border of HEALPixes; perhaps you should include all of 334, 335, 356, and 357 if you were really interested in the Orion Nebula's vicinity.

    We, on the other hand, are just interested in instructive examples, and hence let's just repeat our colour mapping with all Gaia objects from HEALPix 3/334. How do you select these? Well, by source_id's construction, you know their source_ids will be between 334⋅9007199254740992 and (334 + 1)⋅9007199254740992 − 1:

    SELECT source_id/8796093022208 AS pix,
      AVG(phot_bp_mean_mag-phot_rp_mean_mag) AS avgcol
    FROM gaia.edr3lite
    WHERE source_id BETWEEN 334*9007199254740992 AND 335*9007199254740992-1
    GROUP by pix
    

    This is computationally cheap (though Postgres, not being a column store still has to do quite a bit of I/O; note how much faster this query is when you run it again and all the tuples are already in memory). Even going to HEALPix level 2 would in general still be within our sync time limit. The opening figure was produced with the constraint:

    source_id BETWEEN 83*36028797018963968 AND 84*36028797018963968-1
    OR source_id BETWEEN 86*36028797018963968 AND 87*36028797018963968-1
    

    – and with a sync query.

    Aggregating over a Non-HEALPix

    One last point: The constraints we have just been using are, in effect, positional constraints. You can also use them as quick and in some sense rather unbiased sampling tools.

    For instance, if you would like so see how the reddening in one of the “dense“ spots in the opening picture behaves with distance, you could first pick a point – α = 98, δ = 4, say –, then convert that to a level 7 healpix as above (that's/88974) and then write:

    SELECT ROUND(r_med_photogeo/200)*200 AS distbin, COUNT(*) as n,
        AVG(phot_bp_mean_mag-phot_rp_mean_mag) AS avgcol
    FROM gaia.dr3lite
    JOIN gedr3dist.main USING (source_id)
    WHERE source_id BETWEEN 88974*35184372088832 and 88975*35184372088832-1
    GROUP BY distbin
    

    This is creating 200 pc bins in distance based on the estimates in the gedr3dist.main table (note that this adds subtle correlations, because these estimates already contain Gaia colour information). Since quite a few of these bins will be very sparsely populated, I'm also fetching the number of objects contributing. And then I plot the whole thing, using the conventional (n) ⁄ n as a rough estimate for the relative error:

    Plot: A line that first slowly declines, then rises quite a bit, then flattens out and becomes crazy as errors start to dominate.

    This plot immediatly shows that colour systematics are not exclusively due to dust, as in that case things would only get redder all the time. The blueward trend up to 700 pc is reasonably well explained by the brighter, bluer upper main sequence becoming more dominant in the population sampled as red dwarfs become too faint for Gaia.

    The strong reddening setting in after that is rather certainly due to the Orion complex, though I would perhaps not have expected it to reach out to 2 kpc (the conventional distance to M42 is about 0.5 kpc); without having properly thought about it, I'll chalk it off as “the Orion arm“. And after that, it's again what I'd call Malmquist-blueing until the whole things dissolves into noise.

    In conclusion: Did you know you can group by both healpix and distbin at the same time? I am sure there are interesting structures to be found in what you will get from such a query…

    [1]You may be tempted to write source_id/(POWER(2, 35)*POWER(4, 3) here for clarity. Resist that temptation. POWER returns floating point numbers. If you have one float in a division, not even a ROUND will get you back into the integer division realm, and the whole trick implodes. No, you will need the integer literals for now.
  • Computing Residuals of an Astrometric Calibration

    Two plots, left a fairly good correlation, right a cloudy wave

    The kind of plot you can make following the recipe given here: Left, a comparison of the photometry, right, a positional residuals, not taking into account the SIP plate solution, when comparing the HDAP plate B3261a against Gaia DR3. Note that the cut-off a 4 arcsec is because of the match radius when obtaining the calibrator stars.

    I recently had to assess the quality of the astrometric calibration of a photographic plate. What I am going to show you in this post will of course work just as well for CCD frames, and if these have a sufficiently large field of view, this may be an issue for them as well. However, the sort of data that needs this assessment most typically are scans of plates, as these tend to have a “wobble”, systematic offsets in the scan direction resulting from imperfections in the mechanics.

    Prerequisites: An astronomical frame with a calibration in ICRS (or some frame not very far from it), called my-image.fits in the following, SExtractor (in Debian and derivatives: apt install source-extractor – long live Debian Astro; since it's called source-extractor in Debian, that's what I'll use here, too), and of course TOPCAT.

    Step 1: Extract Sources. Source extraction is of course a high science, and if you know better than me, by all means do it the way you think is appropriate. Meanwhile, the following might very well work for you sufficiently well.

    Create a working directory and enter it. Then, to create a file telling source-extractor what columns you would like to see, write the following to a file default.param:

    ALPHA_SKY
    DELTA_SKY
    X_IMAGE
    Y_IMAGE
    MAG_ISO
    FLUX_AUTO
    ELONGATION
    

    Next, give a few parameters to source-extractor; depending on the sort of image you have, you may want to play around with DETECT_MINAREA (how many pixels need to show a signal to register as a source) and DETECT_THRESH (how many sigmas a pixel has to be above the background to register as a candidate for belonging to a source). Meanwhile, write the following into a file default.control:

    CATALOG_TYPE     FITS_1.0
    CATALOG_NAME     img.axy
    PARAMETERS_NAME  default.param
    FILTER           N
    DETECT_MINAREA   30
    DETECT_THRESH    4
    SEEING_FWHM      1.2
    

    – but if the following call gives you a few hundred sources, that ought to work for the present purpose.

    Then run:

    source-extractor -c default.control my-image.fits
    

    This will give you a catalogue of extracted objects in the file img.axy.

    Step 2: Fix source-extractor's output. Load that img.axy into TOPCAT. Regrettably, source-extractor does not add any useful metadata to the columns of its output table. To add the absolute bare minimum, in TOPCAT go to ViewsColumn Info. In that window, check UCD in the Display menu, and then put pos.eq.ra and pos.eq.dec into the UCD fields of the ALPHA_SKY and DELTA_SKY columns, respectively; double click to change fields in TOPCAT.

    To see if you have done the annotation right, in TOPCAT's main window, click GraphicsSky Plot. If the objects show up, you have just provided enough annotation to let TOPCAT figure out the position for each row.

    Step 3: Get calibrators. We will now try to add counterparts for Gaia DR3 to the extracted sources. To do that, click VOTable Access Protocol, and in the window popping up double click the entry for the GAVO DC TAP.

    In the Find box, type dr3lite to look for this site's version of the Gaia DR3 source catalogue. Click on gaia.dr3lite to select that table, and then select the Columns pane. This should show some of the Gaia DR3 columns.

    Now ExamplesUpload Join will generate a query that will cross-match your extracted sources with the Gaia sources. You should edit it a bit, only selecting the columns you will actually need, removing the TOP 1000 (at least on large images with more than 1000 sources), and reducing the match radius a bit when the calibration is not actually completely off and your epoch is sufficiently close to J2000.

    Hint: you can control-click in the Columns pane and then use the Cols button to insert all the column names in one go[1]. For me, the resulting query would be:

    SELECT
       source_id, ra, dec, phot_bp_mean_mag,
       tc.*
       FROM gaia.dr3lite AS db
       JOIN TAP_UPLOAD.t1 AS tc
       ON 1=CONTAINS(POINT('ICRS', db.ra, db.dec),
                     CIRCLE('ICRS', tc.ALPHA_SKY, tc.DELTA_SKY, 4./3600.))
    

    This should result in about as many matches as your extraction had – a few more is ok, because you will have some spurious matches, a few less is ok, too, as there are always some outliers and artefacts, but you should clearly not pull a magnitude more or less objects here than you put in; fiddle with the match radius as necessary.

    See if there is a rough correlation between the Gaia calibrators and your extracted sources by plotting phot_bp_mean_mag against MAG_ISO. Absent more information, MAG_ISO, source-extractor's guess for the magnitude of the extracted object, will be just some crazy number, but it should have some discernable correlation with the actual magnitude. Do not expect too much here, in particular with old plates, for which good photometry is a science of their own.

    In my example, this looked like this:

    Plot: a rough correlation in red with a green tail

    The green points certainly are spurious matches; this observation did not reach beyond 14th magnitude or so, and there are many weak stars on the sky, so a few of them will show up in just about any cross match. See the opening picture for an example with a better correlation.

    Step 4: Do the correlation plot. Do GraphicsPlane Plot and then plot ra-alpha_sky or dec-delta_sky against X_IMAGE or Y_IMAGE. You could get something like this:

    Plot: A single wavy thing

    This rather certainly reflects some optical distortion; source-extractor regrettably does not take into account SIP corrections yet, so it is likely that a large part of this would be taken care of by the polynomials of the plate solution (the github issue I am linking to tells you how to be sure).

    But it can also look like this:

    Plot: Multiple wobbles

    This certainly is not the result of a lens or anything optical at all. It's the scanner's gears that you are looking at here. With an amplitude of perhaps three arcseconds this is rather excessive here; but something like this you will rather likely see even on good scanners – though it may essentially be invisible, as of the Heidelberg scanner we used for HDAP:

    Plot: A vertical cloud with no discernible structure.
    [1]I'm using the BP magnitude in the query below as most historical plates tend to be “blue sensitive“ (in some sense). Hence, BP magnitudes should be a bit closer to what source-extractor has extracted.
  • A Proposed Vector Extension for ADQL

    When I showed off my rendering of the Gaia DR 3 XP spectra a month ago, I promised I would later show how my proposal for a Vector extension to ADQL would enable quite a bit of interesting functionality on that table. Let me make good on this promise with a little project to find candidates for Wolf-Rayet stars, more specifically, WC stars. I give you that's a bit cheap because they have very distinctive spectral features, but then this is supposed to be a quick educational posting, not a science paper.

    Before I start, I probably should stress that in this context I am using the word “vector“ like a computer guy would. We are talking about one-dimensional arrays here, not about the vectors the mathematicians have given us (as in “elements of vector spaces“).

    Getting Spectra for Wolf-Rayet Stars

    Let us first produce a list of Wolf-Rayet stars (of any denomination) using SIMBAD. So, start TOPCAT, open the TAP dialog and find the SIMBAD TAP server. Run:

    SELECT main_id, ra, dec
    FROM basic
    WHERE otype='WR*'
    

    there[1]. In the next step, we will need the Gaia DR3 source_ids for these objects, and so it would be nice to pull them immediately from Simbad; you could in principle do that by running:

    SELECT main_id, ra, dec, id
    FROM basic
      JOIN ident
      ON (oid=oidref)
    WHERE otype='WR*'
      AND id LIKE 'Gaia DR3%'
    

    – but for one, fiddling out the actual source_ids from the strings you get in the id column is a bit tedious, and then quite a few of these objects don't have Gaia ids yet: The first query returns 1548 at the moment, the second 696.

    If we had the source_ids, we could immediately join with the gdr3spec.spectra table at the GAVO DC TAP (http://dc.g-vo.org/tap). As I mentioned a month ago, this is a physical table just consisting of the source id and arrays of flux and flux errors. There is also gdr3spec.withpos that has positions and the spectra; but that's a view, and for the time being that means that the planner will quite likely get confused when positional constraints come into play[2]. The result would be queries running for half an hour when a few seconds would do just as well.

    On services already supporting ADQL 2.1, you can usually work around problems of this kind by re-writing your query to use CTEs (“WITH”), because these often work as planner barriers. In the present case, we first get source_ids for our Simbad objects in a CTE and then use these to join with our spectra table, like this:

    WITH wrids AS (
            SELECT source_id, main_id
            FROM gaia.edr3lite AS l
        JOIN tap_upload.t1 AS u
            ON (DISTANCE(l.ra, l.dec, u.ra, u.dec)<0.001))
    SELECT main_id, source_id, flux, flux_error
    FROM gdr3spec.spectra
    JOIN wrids USING (source_id)
    

    As usual, you will probably have to adapt the number in what is tap_upload.t1 here to the table index you have for your SIMBAD result.

    This yields 574 spectra at the moment, within a few seconds. Spectra for about a third of our collection of objects: I'd say that's quite impressive.

    Investigating the spectra

    The September post already discussed a few aspects of array plotting. In short: try the plane plot and an XYArray control. Modern TOPCATs (and for what I'm doing here it's wise to use something newer than 4.8-7) will automatically figure out suitable columns for the x and y axes (except it forgets to label the spectral coordinate with its unit, nm):

    Plot: lots of wiggly lines

    There's a quite a bit of crowding here; finding global characteristics perhaps works better when you switch the ordinate to logarithmic, use transparent shading (in the Form tab) and raise the opaque limit a bit. This could give you something like:

    Plot: hazy structures in blue

    You can see that at least quite a few objects have nice and strong emission lines, as I had hoped for when choosing WC stars for this example. What if we could pick them out to build a template spectrum? Well, let's try. With the new vector math in ADQL, the database can normalise the spectra and compute a few statistics on them.

    But first: In order to get rid of the source_id-computing CTE above, let me obtain the source_ids I want to work with once and for all, as in:

    SELECT source_id, main_id
    FROM gaia.edr3lite AS l
      JOIN tap_upload.t1 AS u
            ON (DISTANCE(l.ra, l.dec, u.ra, u.dec)<0.001)
    

    Memorise the Table List index of the result. With that, you can directly work with the gdr3spec.spectra table and experiment a bit; for me, the index was 10, and hence I use tap_uploads.t10 below.

    Computing some statistics

    The Gaia XP spectra are flux calibrated, and hence I will have to normalise them if I want to compare them. Ignoring all errors and thus in particular the fact that some (few) components are negative, this normalisation is harmless: We just divide by the sum of all vector components. The net result is that, were our spectra continuous, the integral over them would be one. And let's then use the standard deviation and the value of the 19th array element as metrics:

    WITH normalised AS (
    SELECT source_id, main_id,
            flux/arr_sum(flux) as nflux, flux
    FROM gdr3spec.spectra
    JOIN tap_upload.t10
    USING (source_id))
    SELECT
      source_id, main_id,
      flux, nflux,
      arr_avg(nflux*nflux)-POWER(arr_avg(nflux),2) AS sd,
      nflux[19] as em
    FROM normalised
    

    You can see quite a bit of the vector extension here:

    • arr_sum and arr_avg: These work as the aggregate functions in SQL do, just not on tuples but on the components of the vectors.
    • Multiplication of vectors is element-wise, so nflux*nflux computes a vector of the squares of the components of nflux. That's also true for all other basic arithmetic operators. If you know numpy: same thing.
    • You fetch individual elements in the [] notation you probably know from Python or C. Contrary to Python and C, common SQL implementations count indexes from 1 by default, and we are keeping that here (for now). Fortran lovers rejoice!

    Why did I use nflux[19]? Well, there is a reasonably strong emission feature in many spectra at about 580 nm (it's three-time ionised carbon, or C IV in the rotten notation I usually apologise for when talking to non-astronomers), which is, as experts tell me (thanks, Andreas!), rather characteristic for the WC stars that I'm after (whereas the even stronger feature on the left, around 470 nm, can also be Helium).

    If you inspect the spectral coordinate that TOPCAT has on the abscissa (it's a param, so you'll have to go to ViewsTable Parameters to see it), you will see that each spectral bin is 10 nm wide. So, I will hopefully hit the 580 nm feature when fetching the 19th element of the spectral vector.

    If you plot em versus the sd obtained like that, you will see two reasonably distinct groups, where the ascending arm has relatively strong emission around 580 nm and the descending arm does not. I have used the Blob subset feature to select the upper arm into a subset ”upper” that is blue in the following plot. If you click around in the em-sd plot and show the Activated subset in an array plot, you can see things like:

    Screenshot: scatterplot and stacked spectra next to each other

    The activated spectrum (shown here in yellow) has a strong Hα but basically no C IV – and it's safely outside of our carbon subset. Click around a bit on the ascending arm, and you will see that all these spectra have a bump around array element 18 (in TOPCAT's count, which starts at 0).

    Computing a Template on the Server Side

    Whatever the subset of stars that we would like to use to define our group of interest, we would now like to create a template spectrum from them. A plausible way to do that is to sum them all up – that has the nice side effect that stronger sources (which hopefully are less noisy) have a larger weight.

    To compute the template, in the Views → Column Info for the sd/em table, unselect all columns but source_id (that way, you only upload what you absolutely must), and in the main window, select the upper subset in the Row Subset combo box. That way, only the rows in that subset will get uploaded in the following query:

    SELECT summed/arr_sum(summed) AS tpl
    FROM (
      SELECT SUM(flux) AS summed
      FROM gdr3spec.spectra
      JOIN tap_upload.t19
      USING (source_id)) AS q
    

    Again, you will have to adapt the t19 to where your manipulated sd/em table is. If you get an “ambiguous column flux” error (or so) here, you forgot to unselect all columns but source_id in the columns window.

    It pays to briefly appreciate what happens here: The SELECT SUM(flux) is an aggregate function over arrays, meaning that all the arrays are being summed over component-wise. Against that, the sum in summed/arr_sum(summed) is summing within the array. If it helps you, you could imagine having all the arrays in the table stacked. Then, SUM(arr) produces the vertical margin sum, and array_sum(arr) procudes the horizontal margin sum.

    Well, here's the template I got in this way:

    Plot: a single wiggly line

    I give you that in this particular case, you could easily have done the computation on the client side, because you already had the spectra in your table. But the technique also works when you don't, and it will scale to millions of arrays (although you will have to carefully think about numerics when doing such enormous sums).

    Also – I cannot lie – I simply had to have a pretext for showing you aggregate functions over arrays.

    Finding Similar Objects

    Now that we have the template, can we find objects that have similar properties? Sure: We upload the array and compute some metric, perhaps the (squared) euclidian distances to normalised spectra. If the template is in TOPCAT's table 25, you can write:

    SELECT TOP 200000
      source_id,
      arr_sum(
        arr_map(
          power(x,2),tpl-flux/arr_sum(flux))) AS dist2,
      arr_sum(flux_error/flux) AS errs
    FROM gdr3spec.spectra, tap_upload.t25
    

    This will compute the distances between (conceptually randomly drawn) 200000 spectra and your template. I am also requesting the sum of the (relative) flux errors as a measure of how likely it is that wild wiggles actually are just artefacts.

    There is one array-related feature in that query I have not yet mentioned: arr_map. This applies an expression to all components to a vector, pretty much like python's map function, and my attempt to have some (perhaps somewhat lame) substitute for numpy's ufunctions.

    I am rather sure we really need something like this. SQL has no notion of defining functions in queries. That is usually welcome, as otherwise it would quicky become Turing-complete, which would be bad for what it is designed to do. Here, however, that is a problem, because we do not have a clean way to write the expression be be computed for each component. For now, I have decreed that the first argument of map is an expression over a formal x. This is ugly not only because it will be confusing when there's an actual x in a table or query. I suspect with a bit more thought and creativity, one can find a better solution that still does not require a re-write of half the SQL grammar. But then let's see – perhaps this makeshift hack proves to be less troublesome than I expect.

    Note that on my server, you will only get back 20000 matches by default; you would have to adapt Max Rows to actually retrieve 200'000, and then you also must switch to Asynchronous mode. This will then actually take a non-trivial amount of CPU and disk I/O; going through the entire set of 2e8 rows will be a matter of hours or so. Hence, I'm grateful if you do all-sky scans only after having tested your queries on much smaller subsets.

    I've done this for 500'000 rows (which took a few minutes), which might bring up a few C IV-strong WR stars (these beasts are rare, you know). The result in the dist2/errs plane is (logplot zoomed a bit):

    Plot: a gigantic point cloud with a few outliers.

    Well: at least there are a few promising cases. Which would conclude this little demo for the ADQL vector proposal. Looking at what we have found here is another story.

    Still, I could not resist having another look at what my box has found there. There is a rather clear cut in the plot at perhaps 0.009, and thus I created a subset interesting consisting of objects for which dist2<0.009 (which is 18 objects for me) and did the trick above, only uploading this interesting subset with only the source_id column to resolve to Gaia DR3 (lite) rows:

    SELECT *
    FROM gaia.dr3lite
    JOIN tap_upload.t33
    USING (source_id)
    

    And then I wondered whether any of these were known to SIMBAD and switched to their TAP service:

    SELECT
      tc.*, otype
    FROM basic AS db
     RIGHT OUTER JOIN TAP_UPLOAD.t34 AS tc
     ON 1=CONTAINS(POINT('ICRS', db.ra, db.dec),
                   CIRCLE('ICRS', tc.ra, tc.dec, 1./3600.))
    

    Note the use of RIGHT OUTER JOIN to ensure we won't lose any matches on the way; if this weren't such a small table, you'd be better off just uploading the positions and then doing a local match to recover the rest of the table, by the way.

    As to what's coming back: Well, a bunch of white dwarf candidates, a “blue“ star, a few objects SIMBAD knows as quasars (that at least makes sense, because it's rather likely that some of them have lines redshifted into my C IV window), and a few unclassified objects. Whether SIMBAD is wrong on at least some of them, whether the positional crossmatch fetched unrelated objects, or whether I got it all wrong I will not decide here. Let me give you my candidates as a VOTable, though.

    You now know what you have to do to add nice, if low-resolution, spectra to them.

    Slices?

    A notable absence from the current vector extension is slicing. I think we should have it – in this example, this would be really useful when summing different spectral regions without having to write long sums (“synthetic broadband photometry“).

    I have not put it in yet mainly because I am not sure if Python-like syntax (nflux[4:7]) is a good idea when we have 1-based arrays. Also: Do we want to keep the upper index out? That's certainly the right thing for Python (where you want a[:3]+a[3:] == a), but is it here? Speaking of which: Should we require support of half-open slices? Should we rather have a function arr_slice? With what arguments?

    I'd be curious about other peoples' thoughts on slicing.

    [1]

    In case you wonder how I came up with the WR*: You can simply run something like:

    SELECT otype, label
    FROM otypedef
    WHERE description LIKE '%Wolf%'
    

    Once Simbad upgrades to ADQL 2.1, you probably want to replace the LIKE with ILIKE for robustness.

    [2]For the incurably curious, you can learn more about the underlying problem at https://github.com/segasai/q3c/issues/30.
  • Find a Dust-Free Window Using ADQL

    Five sky images, all of them showing star clusters

    Five of the seven patches of the sky that Bayestar 17 considers least obscured by dust in Aladin's WISE color HiPSes. There clearly is a pattern here. This post is about how you'll find these (and the credible ones, too).

    The upcoming AG-Tagung in Bremen will have another puzzler, and while concocting the problem I needed to find a spot on the sky where there is very little interstellar extinction. What looks like a quick query turned out to require a few ADQL tricks that I thought I might show in this little post; they will come in handy in many situations.

    First, I needed to find data on where on the sky there is dust. Had I not known about the extinction maps I've blogged about in 2018, I would probably have looked for extinction maps in the Registry, which might have led me to the Bayestar 17 map on my service eventually, too. The way it was, I immediately fired up TOPCAT and pointed it to the TAP service at http://dc.g-vo.org/tap (the “GAVO DC TAP“ of the TAP service list) and went to the column metadata of the prdust.map_union table.

    Browsing the descriptions, the relevant columns here are healpix (which will give me the position) and best_fit. That latter thing is an array of reddening E(B − V) (i.e., higher values mean more dust) per distance bin, where the bins are 0.5 mag of distance modulus wide. I decided I'd settle for bin 20, corresponding to a kiloparsec. Dust further away than that will not trouble me much in the puzzler.

    Finding the healpixes in the rows with the smallest best_fit[20] should be easy; it is a minor variant of a classic from the ADQL course:

    SELECT TOP 20 healpix
    FROM prdust.map_union
    ORDER BY best_fit[20] ASC
    

    Except that my box replies with an error message reading “Expected end of text, found '[' (at char 61), (line:3, col:18)”.

    Hu? Well… if you look, then the problem is where I ask to sort by an array element. And indeed, it turns out that DaCHS, the software driving this site, will not let you sort by array elements yet. This is arguably a bug, and in all likelihood I will have fixed it by the time your read this. But there is a technique to defeat this and similar cases that every astronomer should know about: subqueries, which turn any query into something you can work with as if it were a table. In this case:

    SELECT TOP 30 healpix, extinction
    FROM (
      SELECT healpix, best_fit[20] as extinction
      FROM prdust.map_union) AS q
    ORDER BY extinction ASC
    

    – the “AS q“ gives the name of the “virtual” table resulting from the query a name. It is mandatory here. Do not be tempted to leave out the “AS” – that that is even legal is one of the major blunders of the SQL standard.

    The result is looking good:

    # healpix extinction
    1021402 0.00479
    1021403 0.0068
    418619  0.00707
    ...
    

    – so, we have the healpixes for which the extinction works out to be minimal. It is also reassuring that the two healpixes with the clearest sky (by this metric) are next to each other – where there are clear skies, it's likely that there are more clear skies nearby.

    But then… where exactly are these patches? The column description says “The healpix (in galactic l, b) for which this data applies. This is of the order given in the hpx_order column”. Hm.

    To go from HEALPix to positions, there is the ivo_healpix_center user defined function (UDF) on many ADQL services; it is part of the IVOA's UDF catalogue, so whenever you see it, it will do the same thing. And where would you see it? Well, in TOPCAT, UDFs show up in the Service tab with a signature and a short description. In this case:

    ivo_healpix_center(hpxOrder INTEGER, hpxIndex BIGINT) -> POINT
    
      returns a POINT corresponding to the center of the healpix with the
      given index at the given order.
    

    With this, we can change our query to spit out positions rather than indices:

    SELECT TOP 30 ivo_healpix_center(hpx_order, healpix) AS pos, extinction
    FROM (
      SELECT healpix, best_fit[20] as extinction, hpx_order
      FROM prdust.map_union) AS q
    ORDER BY extinction ASC
    

    The result is:

    # pos                                    extinction
    "(42.27822580645164, 78.65148926014334)" 0.00479
    "(42.44939271255061, 78.6973986631694)"  0.0068
    "(58.97460937500027, 40.86635677386179)" 0.00707
    ...
    

    That's my positions all right, but they are still in galactic coordinates. That may be fine for many applications, but I'd like to have them in ICRS. Transforming them takes another UDF; this one is not yet standardised and hence has a gavo_ prefix (which means you will only find it on reasonably new services driven by DaCHS).

    On services that have that UDF (and the GAVO DC TAP certainly is one of them), you can write:

    SELECT TOP 30
      gavo_transform('GALACTIC', 'ICRS',
        ivo_healpix_center(hpx_order, healpix)) AS pos,
      extinction
    FROM (
      SELECT healpix, best_fit[20] as extinction, hpx_order
      FROM prdust.map_union) AS q
    ORDER BY extinction ASC
    

    That results in:

    # pos                                    extinction
    "(205.6104289782676, 28.392541949473785)" 0.00479
    "(205.55600830161907, 28.42330388161418)" 0.0068
    "(250.47595812552925, 36.43011215633786)" 0.00707
    "(166.10872483007287, 21.232866316024364)" 0.00714
    "(259.3314211312357, 43.09275090468469)" 0.00742
    "(114.66957763676628, 21.603135736808532)" 0.00787
    "(229.69174233173712, 2.0244022486718793)" 0.00793
    "(214.85349325052758, 33.6802370378023)" 0.00804
    "(204.8352084989552, 36.95716352922782)" 0.00806
    "(215.95667870050661, 36.559656879148044)" 0.00839
    "(229.66068062277128, 2.142516479012763)" 0.0084
    "(219.72263539838667, 58.371829835018424)" 0.00844
    ...
    

    If you have followed along, you now have a table of the 30 least reddened patches in the sky according Bayestar17. And you are probably as curious to see them as I was. That curiosity made me start Aladin and select WISE colour imagery, reckoning dust (at the right temperature) would be more conspicuous in WISE's wavelengths then in, say, DSS.

    I then did Views -> Activation Actions and wanted to check “Send Sky Coordinates“ to make Aladin show the sky at the position of my patches. This is usually preconfigured by TOPCAT to just work when tables have positions. Alas: at least in versions up to 4.8, TOPCAT does not know about points (in the ADQL sense) when making clever guesses there.

    But there is a workaround: Select “Send Sky Coordinates” in the Activation Actions window and then type pos[0] in “RA Column“, and pos[1] in “Dec Column” – this works because under the hood, VOTable points are just 2-arrays. That done, you can check the activation action.

    After these preparations, when you click through the first few results, you will find objects like those in the opending image (and also a few fairly empty fields). Stellar clusters are relatively rare on the sky, so their prevalence in these patches quite clearly shows that Bayestar's model has a bit of a fixation about them that's certainly not related to dust.

    Which goes to serve as another example of Demleitner's law 567: “In any table, the instances with the most extreme values are broken with a likelihood of 0.567”.

  • Towards Data Discovery in pyVO

    When I struggled with ways to properly integrate TAP services – which may have hundreds or thousands of different resources in one service – into the VO Registry without breaking what we already had, I realised that there are really two fundamentally different modes of using the VO Registry. In Discovering Data Collections's abstract I wrote:

    the Registry must support both VO-wide discovery of services by type ("service enumeration") and discovery by data collection ("data discovery").

    To illustrate the difference in a non-TAP case, suppose I have archived images of lensed quasars from Telescopes A, B, and C. All these image collections are resources in their own right and should be separately findable when people look for “resources with data from Telescope A“ or perhaps “images obtained between 2011-01-01 and 2011-12-31”.

    However, when a machine wants to find all images at a certain position, publishing the three resources through three different services would mean that that machine has to do three requests where one would work just as well. That is very relevant when you think about how the VO will evolve: At this point there are 342 SIAP services in the VO, and when you read this, that number may have grown further. Adding one service per collection will simply not scale when we want to keep the possibility of all-VO searches. Since I claim that is a very desirably thing, we need to enable collective services covering multiple subordinate resources.

    So, while in the first (“data discovery”) case one wants to query (or at least discover) the three resources separately, in the second case they should be ignored, and only a collective “images of lensed quasars” service should be queried.

    The technical solution to this requirement was creating “auxliary capabilities” as discribed in the endorsed note on discoving data collections cited above. But these of course need client support; VO clients up to now by and large do service enumeration, as that has been what we started with in the VO Registry. Client support would, roughly, mean that clients would present their users with data collections, and then offer the various ways to to access them.

    There are quite a number of technicalities involved in why that's not terribly straightforward for the “big” clients like TOPCAT and Aladin (though Aladin's discovery tree already comes rather close).

    Now that quite a number of people use pyVO interactively in jupyter notebooks, extending pyVO's registry interface to do data discovery in addition to the conventional service enumeration becomes an attractive target to have data discovery in practice.

    I have hence created pyVO PR #289. I think some the rough edges will need to be smoothed out before it can be merged, but meanwhile I'd be grateful if you could try it out already. To facilitate that, I have prepared a jupyter notebook that shows the basic ideas.

    To run it while the PR is not merged, you need to install the forked pyVO. In order to not clobber your main installation, you can install astropy using your package manager and then do the following (assuming your shell is bash or something suitably similar):

    virtualenv --system-site-packages try-discoverdata
    . try-discoverdata/bin/activate
    cd try-discoverdata
    git clone https://github.com/msdemlei/pyvo
    cd pyvo
    git checkout add-discoverdata
    python3 setup.py develop
    ipython3 notebook
    

    That should open a browser window in which you can open the notebook (you probably want to download it into the pyvo checkout in order to make the notebook selector see it). Enjoy!

Page 1 / 3 »