Posts with the Tag User Defined Functions:

  • A New Constraint Class in PyVO's Registry API: UAT

    A scan of a book page: lots of astronomy-relevant topics ranging from "Cronometrie" to "Kosmologie, Relativitätstheorie". Overlaid is a title page stating "Astronomischer Jahresbericht. Die Literatur des Jahres 1967".

    This was how they did what I am talking about here almost 60 years ago: a page of the table of contents of the “Astronomischer Jahresbericht” for 1967, the last volume before it was turned into the English-language Astronomy and Astrophysics Abstracts, which were the main tool for literature work in astronomy until the ADS came along in the late 1990s.

    I have recently created a pull request against pyVO to furnish the library with a new constraint to search for data and services: search by a concept drawn from the Unified Astronomy Thesaurus (UAT). This is not entirely different from the classical search by subject keywords that everyone did before we had the ADS, which is what I am trying to illustrate above. But it has some twists that, I would argue, still make it valuable even in the age of full-text indexes.

    To make my argument, let me first set the stage.

    Thesauri and the UAT

    (Disclaimer: I am currently a member of the UAT steering committee and therefore cannot claim neutrality. However, I would not claim neutrality otherwise, either: the UAT is not perfect, but it's already great.)

    Librarians (and I am one at heart) love thesauri. Or taxonomies. Or perhaps even ontologies. What may sound like things out of a Harry Potter novel are actually ways to organise a part of the world (a “domain”) into “concepts”. If you are suitably minded, you can think of a “concept” as a subset of the domain; “suitably minded” here means that you consider the world as a large set of things and a domain a subset of this world. The IVOA Vocabularies specification contains some additional philosophical background on this way of thinking in sect. 5.2.4.

    On the other hand, if you are not suitably minded, a “concept” is not much different from a topic.

    There are differences in how each of thesaurus, taxonomy, and ontology does that organising (and people don't always agree on the differences). Ontologies, for instance, let you link concepts in every way, as in “a (bicycle) (is steered) (using) a (handle bar) (made of) ((steel) or (aluminum))”; every parenthesised phrase would be a node (which is a better term in ontologies than “concept”) in a suitably general ontology, and connecting these nodes creates a fine-grained representation of knowledge about the world.

    That is potentially extremely powerful, but also almost too hard for humans. Check out WordNet for how far one can take ontologies if very many very smart people spend very many years.

    Thesauri, on the other hand, are not as powerful, but they are simpler and within reach for mere humans: there, concepts are simply organised into something like a tree, perhaps (and that is what many people would call a taxonomy) using is-a relationships: A human is a primate is a mammal is a vertebrate is an animal. The UAT actually is using somewhat vaguer notions called “narrower” and “wider”. This lets you state useful if somewhat loose relationships like “asteroid-rotation is narrower than asteroid-dynamics”. For experts: The UAT is using a formalism called SKOS; but don't worry if you can't seem to care.

    The UAT is standing on the shoulders of giants: Before it, there was the IAU thesaurus of 1993, and an astronomy thesaurus was also produced under the auspices of the IVOA. And then there were (and to some extent still are) the numerous keyword schemes designed by journal publishers that would also count as some sort of taxonomy of astronomy.

    “Numerous” is not good when people have to assign keywords to their journal articles: If A&A use something drastically or only subtly different from ApJ, and MNRAS still something else, people submitting to multiple journals will quite likely lose their patience and diligence with the keywords. For reasons I will discuss in a second, that is a shame.

    Therefore, at least the big American journals have now all switched to using UAT keywords, and I sincerely hope that their international counterparts will follow their example where that has not already happened.

    Why Keywords?

    Of course, you can argue that when you can do full-text searches, why would you even bother with controlled keyword lists? Against that, I would first argue that it is extremely useful to have a clear idea of what a thing is called: For example, is it delta Cephei stars, Cepheids, δ Cep stars or still something else? Full text search would need to be rather smart to be able to sort out terminological turmoil of this kind for you.

    And then you would still not know if W Virginis stars (or should you say “Type II Cepheids”? You see how useful proper terminology is) are included in whatever your author called Cepheids (or whatever they called it). Defining concepts as precisely as possible thus is already great.

    The keyword system becomes even more useful when the hierarchy we see in the Cepheid example becomes visible to computers. If a computer knows that there is some relationship between W Virginis stars and classical Cepheids, it can, for instance, expand or refine your queries (“give me data for all kinds of Cepheids”) as necessary. To give you an idea of how this looks in practice, here is how SemBaReBro displays the Cepheid area in the UAT:

    Arrows between texts like "Type II Cepheid variable stars", "Cepheid variable stars", and "Young disk Cepheid variable stars"

    In that image, only concepts associated with resources in the Registry have a spiffy IVOA logo; that so few VO resources claim to deal with Cepheids tells you that our data providers can probably improve their annotations quite a bit. But that is for another day; the hope is that as more people search using UAT concepts, the data providers will see a larger benefit in choosing them wisely[1].

    By the way, if you are a regular around here, you will have seen images like that before; I have talked about SemBaReBro in 2021 already, and that post contains more reasons for having and maintaining vocabularies.

    Oh, and for the definitions of the concepts, you can (in general; in the UAT, there are still a few concepts without definitions) dereference the concept URI, which in the VO is always of the form <vocabulary uri>#<term identifier>, where the vocabulary URI starts with http://www.ivoa.net/rdf, after which there is the vocabulary name.

    Thus, if you point your web browser to https://www.ivoa.net/rdf/uat#cepheid-variable-stars[2], you will learn that a Cepheid is:

    A class of luminous, yellow supergiants that are pulsating variables and whose period of variation is a function of their luminosity. These stars expand and contract at extremely regular periods, in the range 1-50 days [...]
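
    Incidentally, you can do this lookup from Python as well. Here is a small sketch; I am assuming that the vocabularies helper the constraint code below uses lives at pyvo.utils.vocabularies, and that the desise term records carry the label, the description quoted above, and the wider/narrower links:

    from pyvo.utils import vocabularies

    # fetch (and cache) the desise version of the UAT and look up one concept
    uat = vocabularies.get_vocabulary("uat")
    print(uat["terms"]["cepheid-variable-stars"])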

    The UAT constraint

    Remember? This was supposed to be a blog post about a new search constraint in pyVO. Well, after all the preliminaries I can finally reveal that once pyVO PR #649 is merged, you can search by UAT concepts:

    >>> from pyvo import registry
    >>> print(registry.search(registry.UAT("variable-stars")))
    <DALResultsTable length=2010>
                  ivoid               ...
                                      ...
                  object              ...
    --------------------------------- ...
             ivo://cds.vizier/b/corot ...
              ivo://cds.vizier/b/gcvs ...
               ivo://cds.vizier/b/vsx ...
              ivo://cds.vizier/i/280b ...
               ivo://cds.vizier/i/345 ...
               ivo://cds.vizier/i/350 ...
                                  ... ...
                ivo://cds.vizier/v/97 ...
             ivo://cds.vizier/vii/293 ...
       ivo://org.gavo.dc/apass/q/cone ...
    ivo://org.gavo.dc/bgds/l/meanphot ...
         ivo://org.gavo.dc/bgds/l/ssa ...
         ivo://org.gavo.dc/bgds/q/sia ...
    

    In case you have never used pyVO's Registry API before, you may want to skim my post on that topic before continuing.

    Since the default keyword search also queries RegTAP's res_subject table (which is what this constraint is based on), this is perhaps not too exciting. At least there is a built-in protection against typos:

    >>> print(registry.search(registry.UAT("varialbe-stars")))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/msdemlei/gavo/src/pyvo/pyvo/registry/rtcons.py", line 713, in __init__
        raise dalq.DALQueryError(
    pyvo.dal.exceptions.DALQueryError: varialbe-stars does not identify an IVOA uat concept (see http://www.ivoa.net/rdf/uat).
    

    It becomes more exciting when you start exploiting the intrinsic hierarchy; the constraint constructor supports optional keyword arguments expand_up and expand_down, giving the number of levels of parent and child concepts to include. For instance, to discover resources talking about any sort of supernova, you would say:

    >>> print(registry.search(registry.UAT("supernovae", expand_down=10)))
    <DALResultsTable length=593>
                     ivoid                   ...
                                             ...
                     object                  ...
    ---------------------------------------- ...
                       ivo://cds.vizier/b/sn ...
                     ivo://cds.vizier/ii/159 ...
                     ivo://cds.vizier/ii/189 ...
                     ivo://cds.vizier/ii/205 ...
                    ivo://cds.vizier/ii/214a ...
                     ivo://cds.vizier/ii/218 ...
                                         ... ...
               ivo://cds.vizier/j/pasp/122/1 ...
           ivo://cds.vizier/j/pasp/131/a4002 ...
               ivo://cds.vizier/j/pazh/30/37 ...
              ivo://cds.vizier/j/pazh/37/837 ...
    ivo://edu.gavo.org/eurovo/aida_snconfirm ...
                    ivo://mast.stsci/candels ...
    

    There is no overwhelming magic in this, as you can see when you tell pyVO to show you the query it actually runs:

    >>> print(registry.get_RegTAP_query(registry.UAT("supernovae", expand_down=10)))
    SELECT
      [crazy stuff elided]
    WHERE
    (ivoid IN (SELECT DISTINCT ivoid FROM rr.res_subject WHERE res_subject in (
      'core-collapse-supernovae', 'hypernovae', 'supernovae',
      'type-ia-supernovae', 'type-ib-supernovae', 'type-ic-supernovae',
      'type-ii-supernovae')))
    GROUP BY [whatever]
    

    Incidentally, some services have an ADQL extension (a “user defined function“ or UDF) that lets you do these kinds of things on the server side; that is particularly nice when you do not have the power of Python at your fingertips, as for instance interactively in TOPCAT. This UDF is:

    gavo_vocmatch(vocname STRING, term STRING, matchagainst STRING) -> INTEGER
    

    (documentation at the GAVO data centre). There are technical differences, some of which I try to explain in a moment. But if you run something like:

    SELECT ivoid FROM rr.res_subject
    WHERE 1=gavo_vocmatch('uat', 'supernovae', res_subject)
    

    on the TAP service at http://dc.g-vo.org/tap, you will get what you would get with registry.UAT("supernovae", expand_down=1). That UDF also works with other vocabularies. I particularly like the combination of product-type, obscore, and gavo_vocmatch.
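
    If you would like to try the gavo_vocmatch query from Python rather than TOPCAT, here is a sketch; it is just the query above wrapped into a pyVO TAP call against the service mentioned in the text:

    import pyvo

    svc = pyvo.dal.TAPService("http://dc.g-vo.org/tap")
    # resources whose subject is "supernovae" or one narrower step below it
    result = svc.run_sync("""
        SELECT ivoid FROM rr.res_subject
        WHERE 1=gavo_vocmatch('uat', 'supernovae', res_subject)""")
    print(result.to_table())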

    If you wonder why gavo_vocmatch does not go on expanding towards narrower concepts as far as it can go: That is because what pyVO does is semantically somewhat questionable.

    You see, SKOS' notions of what is wider and narrower are not transitive. This means that just because A is wider than B and B is wider than C it is not certain that A is wider than C. In the UAT, this sometimes leads to odd results when you follow a branch of concepts toward narrower concepts, mostly because narrower sometimes means part-of (“Meronymy”) and sometimes is-a (“Hyponymy“). Here is an example discovered by my colleague Adrian Lucy:

    interstellar-medium wider nebulae wider emission-nebulae wider planetary-nebulae wider planetary-nebulae-nuclei

    Certainly, nobody would argue that the central stars of planetary nebulae somehow are a sort of or are part of the interstellar medium, although each individual relationship in that chain makes sense as such.

    Since SKOS relationships are not transitive, gavo_vocmatch, being a general tool, has to stop at one level of expansion. By the way, it will not do that for the other flavours of IVOA vocabularies, which have other (transitive) notions of narrower-ness. With the UAT constraint, I have fewer scruples, in particular since the expansion depth is under user control.
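
    By the way, the expansion works in the other direction, too. For instance, to go one level towards wider concepts from a rather narrow one, you could write something like this (a quick sketch; what comes back of course depends on the Registry's content at the time you run it):

    >>> from pyvo import registry
    >>> print(registry.search(registry.UAT("type-ia-supernovae", expand_up=1)))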

    Implementation

    Talking about technicalities, let me use this opportunity to invite you to contribute your own Registry constraints to pyVO. They are not particularly hard to write if you know both ADQL and Python. You will find several examples – between trivial and service-sensing complex – in pyvo.registry.rtcons. The code for UAT looks like this (documentation removed for clarity[3]):

    class UAT(SubqueriedConstraint):
        _keyword = "uat"
        _subquery_table = "rr.res_subject"
        _condition = "res_subject in {query_terms}"
        _uat = None
    
        @classmethod
        def _expand(cls, term, level, direction):
            result = {term}
            new_concepts = cls._uat[term][direction]
            if level:
                for concept in new_concepts:
                    result |= cls._expand(concept, level-1, direction)
            return result
    
        def __init__(self, uat_keyword, *, expand_up=0, expand_down=0):
            if self.__class__._uat is None:
                self.__class__._uat = vocabularies.get_vocabulary("uat")["terms"]
    
            if uat_keyword not in self._uat:
                raise dalq.DALQueryError(
                    f"{uat_keyword} does not identify an IVOA uat"
                    " concept (see http://www.ivoa.net/rdf/uat).")
    
            query_terms = {uat_keyword}
            if expand_up:
                query_terms |= self._expand(uat_keyword, expand_up, "wider")
            if expand_down:
                query_terms |= self._expand(uat_keyword, expand_down, "narrower")
    
            self._fillers = {"query_terms": query_terms}
    

    Let me briefly describe what is going on here. First, we inherit from the base class SubqueriedConstraint. This is a class that takes care that your constraints are nicely encapsulated in a subquery, which generally is what you want in pyVO. Calmly adding natural joins as recommended by the RegTAP specification is a dangerous thing for pyVO because as soon as a resource matches your constraint more than once (think “columns with a given UCD”), the RegistryResult lists in pyVO will turn funny.

    To make a concrete SubqueriedConstraint, you have to fill out:

    • the table it will operate on, which is in the _subquery_table class attribute;
    • an expression suitable for a WHERE clause in the _condition attribute, which is a template for str.format. This is often computed in the constructor, but here it is just a constant expression and thus works fine as a class attribute;
    • a mapping _fillers mapping the substitutions in the _condition string template to Python values. PyVO's RegTAP machinery will worry about making SQL literals out of these, so feel free to just dump Python values in there. See make_SQL_literal for what kinds of types it understands and expand it as necessary. A minimal sketch putting these three pieces together follows right after this list.
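
    Here is the promised minimal sketch of a constraint built from just these three pieces. It is hypothetical (pyVO's built-in Author constraint is similar in spirit): it matches publisher names in RegTAP's rr.res_role table using the standard ivo_nocasematch UDF, and I am assuming SubqueriedConstraint can be imported from pyvo.registry.rtcons:

    from pyvo.registry.rtcons import SubqueriedConstraint

    class Publisher(SubqueriedConstraint):
        """A hypothetical constraint: select resources whose publisher
        matches an SQL pattern (use % as the wildcard)."""
        _keyword = "publisher"
        _subquery_table = "rr.res_role"
        _condition = ("base_role='publisher'"
            " AND 1=ivo_nocasematch(role_name, {pattern})")

        def __init__(self, name_pattern):
            self._fillers = {"pattern": name_pattern}

    Passing an instance of this directly, as in registry.search(Publisher("%gavo%")), ought to work; the keyword shortcut discussed next may need the class to be made known to pyVO's keyword machinery.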

    There is an extra class attribute called _keyword. This is used by the pyvo.regtap machinery to let users say, for instance, registry.search(uat="foo.bar") instead of registry.search(registry.UAT("foo.bar")). This is a fairly popular shortcut when your constraints can be expressed as simple strings, but in the case of the UAT constraint you would be missing out on all the interesting functionality (viz., the query expansion that is only available through optional arguments to its constructor).

    This particular class has some extra logic. For one, we cache a copy of the UAT terms on first use at the class level. That is not critical for performance because caching already happens at the level of get_vocabulary; but it is convenient when we want query expansion in a class method, which in turn to me feels right because the expansion does not depend on the instance. If you don't grok the __class__ magic, don't worry. It's a nerd thing.

    More interesting is what happens in the _expand class method. This takes the term to expand, the number of levels to go, and whether to go up or down in the concept trees (which are of the computer science sort, i.e., with the root at the top) in the direction argument, which can be wider or narrower, following the names of properties in Desise, the format we get our vocabulary in. To learn more about Desise, see section 3.2 of Vocabularies in the VO 2.

    At each level, the method now collects the wider or narrower terms, and if there are still levels to include, calls itself on each new term, just with level reduced by one. I consider this a particularly natural application of recursion. Finally, everything coming back is merged into a set, which then is the return value.

    And that's really it. Come on: write your own RegTAP constraints, and also have fun with vocabularies. As you see here, it's really not that magic.

    [1]Also, just so you don't leave with the impression I don't believe in AI tech at all, something like SciX's KAILAS might also help improve Registry subject keywords.
    [2]Yes, in a little sleight of hand, I've switched the URI scheme to https here. That's not really right, because the term URIs are supposed to be opaque, but some browsers currently forget the fragment identifiers when the IVOA web server redirects them to https, and so https is safer for this demonstration. This is a good example of why the web would be a better place if http had been evolved to support transparent, client-controlled encryption (rather than inventing https).
    [3]I've always wanted to write this.
  • Find a Dust-Free Window Using ADQL

    Five sky images, all of them showing star clusters

    Five of the seven patches of the sky that Bayestar 17 considers least obscured by dust in Aladin's WISE color HiPSes. There clearly is a pattern here. This post is about how you'll find these (and the credible ones, too).

    The upcoming AG-Tagung in Bremen will have another puzzler, and while concocting the problem I needed to find a spot on the sky where there is very little interstellar extinction. What looks like a quick query turned out to require a few ADQL tricks that I thought I might show in this little post; they will come in handy in many situations.

    First, I needed to find data on where on the sky there is dust. Had I not known about the extinction maps I've blogged about in 2018, I would probably have looked for extinction maps in the Registry, which might have led me to the Bayestar 17 map on my service eventually, too. The way it was, I immediately fired up TOPCAT and pointed it to the TAP service at http://dc.g-vo.org/tap (the “GAVO DC TAP“ of the TAP service list) and went to the column metadata of the prdust.map_union table.

    Browsing the descriptions, the relevant columns here are healpix (which will give me the position) and best_fit. That latter thing is an array of reddening E(B − V) (i.e., higher values mean more dust) per distance bin, where the bins are 0.5 mag of distance modulus wide. I decided I'd settle for bin 20, corresponding to a kiloparsec. Dust further away than that will not trouble me much in the puzzler.

    Finding the healpixes in the rows with the smallest best_fit[20] should be easy; it is a minor variant of a classic from the ADQL course:

    SELECT TOP 20 healpix
    FROM prdust.map_union
    ORDER BY best_fit[20] ASC
    

    Except that my box replies with an error message reading “Expected end of text, found '[' (at char 61), (line:3, col:18)”.

    Huh? Well… if you look, then the problem is where I ask to sort by an array element. And indeed, it turns out that DaCHS, the software driving this site, will not let you sort by array elements yet. This is arguably a bug, and in all likelihood I will have fixed it by the time you read this. But there is a technique to defeat this and similar cases that every astronomer should know about: subqueries, which turn any query into something you can work with as if it were a table. In this case:

    SELECT TOP 30 healpix, extinction
    FROM (
      SELECT healpix, best_fit[20] as extinction
      FROM prdust.map_union) AS q
    ORDER BY extinction ASC
    

    – the “AS q“ gives the name of the “virtual” table resulting from the query a name. It is mandatory here. Do not be tempted to leave out the “AS” – that that is even legal is one of the major blunders of the SQL standard.

    The result is looking good:

    # healpix extinction
    1021402 0.00479
    1021403 0.0068
    418619  0.00707
    ...
    

    – so, we have the healpixes for which the extinction works out to be minimal. It is also reassuring that the two healpixes with the clearest sky (by this metric) are next to each other – where there are clear skies, it's likely that there are more clear skies nearby.

    But then… where exactly are these patches? The column description says “The healpix (in galactic l, b) for which this data applies. This is of the order given in the hpx_order column”. Hm.

    To go from HEALPix to positions, there is the ivo_healpix_center user defined function (UDF) on many ADQL services; it is part of the IVOA's UDF catalogue, so whenever you see it, it will do the same thing. And where would you see it? Well, in TOPCAT, UDFs show up in the Service tab with a signature and a short description. In this case:

    ivo_healpix_center(hpxOrder INTEGER, hpxIndex BIGINT) -> POINT
    
      returns a POINT corresponding to the center of the healpix with the
      given index at the given order.
    

    With this, we can change our query to spit out positions rather than indices:

    SELECT TOP 30 ivo_healpix_center(hpx_order, healpix) AS pos, extinction
    FROM (
      SELECT healpix, best_fit[20] as extinction, hpx_order
      FROM prdust.map_union) AS q
    ORDER BY extinction ASC
    

    The result is:

    # pos                                    extinction
    "(42.27822580645164, 78.65148926014334)" 0.00479
    "(42.44939271255061, 78.6973986631694)"  0.0068
    "(58.97460937500027, 40.86635677386179)" 0.00707
    ...
    

    That's my positions all right, but they are still in galactic coordinates. That may be fine for many applications, but I'd like to have them in ICRS. Transforming them takes another UDF; this one is not yet standardised and hence has a gavo_ prefix (which means you will only find it on reasonably new services driven by DaCHS).

    On services that have that UDF (and the GAVO DC TAP certainly is one of them), you can write:

    SELECT TOP 30
      gavo_transform('GALACTIC', 'ICRS',
        ivo_healpix_center(hpx_order, healpix)) AS pos,
      extinction
    FROM (
      SELECT healpix, best_fit[20] as extinction, hpx_order
      FROM prdust.map_union) AS q
    ORDER BY extinction ASC
    

    That results in:

    # pos                                    extinction
    "(205.6104289782676, 28.392541949473785)" 0.00479
    "(205.55600830161907, 28.42330388161418)" 0.0068
    "(250.47595812552925, 36.43011215633786)" 0.00707
    "(166.10872483007287, 21.232866316024364)" 0.00714
    "(259.3314211312357, 43.09275090468469)" 0.00742
    "(114.66957763676628, 21.603135736808532)" 0.00787
    "(229.69174233173712, 2.0244022486718793)" 0.00793
    "(214.85349325052758, 33.6802370378023)" 0.00804
    "(204.8352084989552, 36.95716352922782)" 0.00806
    "(215.95667870050661, 36.559656879148044)" 0.00839
    "(229.66068062277128, 2.142516479012763)" 0.0084
    "(219.72263539838667, 58.371829835018424)" 0.00844
    ...
    

    If you have followed along, you now have a table of the 30 least reddened patches in the sky according to Bayestar17. And you are probably as curious to see them as I was. That curiosity made me start Aladin and select WISE colour imagery, reckoning dust (at the right temperature) would be more conspicuous in WISE's wavelengths than in, say, DSS.

    I then did Views -> Activation Actions and wanted to check “Send Sky Coordinates“ to make Aladin show the sky at the position of my patches. This is usually preconfigured by TOPCAT to just work when tables have positions. Alas: at least in versions up to 4.8, TOPCAT does not know about points (in the ADQL sense) when making clever guesses there.

    But there is a workaround: Select “Send Sky Coordinates” in the Activation Actions window and then type pos[0] in “RA Column“, and pos[1] in “Dec Column” – this works because under the hood, VOTable points are just 2-arrays. That done, you can check the activation action.
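
    If you would rather do this last step in Python than in TOPCAT, here is a rough equivalent. The query is the one from above; as just said, the pos column arrives as a 2-array per row, so its components can simply be pulled apart:

    import pyvo

    QUERY = """
    SELECT TOP 30
      gavo_transform('GALACTIC', 'ICRS',
        ivo_healpix_center(hpx_order, healpix)) AS pos,
      extinction
    FROM (
      SELECT healpix, best_fit[20] as extinction, hpx_order
      FROM prdust.map_union) AS q
    ORDER BY extinction ASC"""

    svc = pyvo.dal.TAPService("http://dc.g-vo.org/tap")
    table = svc.run_sync(QUERY).to_table()

    # split the 2-array point into plain RA and Dec columns
    table["ra"] = [p[0] for p in table["pos"]]
    table["dec"] = [p[1] for p in table["pos"]]
    print(table["ra", "dec", "extinction"])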

    After these preparations, when you click through the first few results, you will find objects like those in the opening image (and also a few fairly empty fields). Stellar clusters are relatively rare on the sky, so their prevalence in these patches quite clearly shows that Bayestar's model has a bit of a fixation about them that's certainly not related to dust.

    Which goes to serve as another example of Demleitner's law 567: “In any table, the instances with the most extreme values are broken with a likelihood of 0.567”.

  • Spectral Units in ADQL

    math formulae.

    In case you find the piece of Python given below too hard to read: It's just this table of conversion expressions between the different SI units we are dealing with here.

    Astronomers these days work all along the electromagnetic spectrum (and beyond, of course). Depending on where they observe, they will have very different instrumentation, and hence some see their messengers very naturally as waves, others quite as naturally as particles, others just as electrons flowing out of a CCD that is sitting behind a filter.

    In consequence, when people say where in the spectrum they are, they use very different notions. A radio astronomer will say “I'm observing at 21 cm” or “at 50 GHz“. There's an entire field named after a wavelength, “submillimeter“, and blueward of that people give their bands in micrometers. Optical astronomers can't be cured of their Ångström habit. Going still more high-energy, after an island of nanometers in the UV you end up in the realm of keV in X-ray, and then MeV, GeV, TeV and even EeV.

    However, there is just one VO (or at least that's where we want to go). Historically, the VO has had a slant towards optical astronomy, which gives us the legacy of having wavelengths in far too many places, including Obscore. Retrospectively, this was an unfortunate choice not only because it makes us look like optical bigots, but in particular because in contrast to energy and, by ν = E/h, frequency, messenger wavelength depends on the medium you work in, and I shudder to think how many wavelengths in my data center actually are air wavelengths rather than vacuum wavelengths. Also, as you go beyond photons, energy really is the only thing that reasonably characterises all messengers alike (well, even that still isn't quite settled for gravitational waves as long as we're not done with a quantum theory of gravitation).

    Well – the wavelength milk is spilled. Still, the VO has been boldly expanding its reach beyond the optical and infrared windows (recently, with neutrinos and gravitational waves, not to mention EPN-TAP's in-situ measurements in the solar system, even beyond the electromagnetic spectrum). Which means we will have to accommodate the various customs regarding spectral units described above. Where there are “thick” user interfaces, these can care about that. For instance, my datalink XSLT and javascript lets people constrain spectral cutouts (along BAND) in a variety of units (Example).

    But what if the UI is as shallow as it is in ADQL, where you deal with whatever is in the underlying database tables? This has come up again at last week's EuroVO Technology Forum in virtual Strasbourg in the context of making Obscore more attractive to radio astronomers. And thus I've sat down and taught DaCHS a new user defined function to address just that.

    Up front: When you read this in 2022 or beyond and everything has panned out, the function might be called ivo_specconv already, and perhaps the arguments have changed slightly. I hope I'll remember to update this post accordingly. If not, please poke me to do so.

    The function I'm proposing is, mainly, gavo_specconv(expr, target_unit). All it does is convert the SQL expression expr to the (spectral) target_unit if it knows how to do that (i.e., if the expression's unit and the target unit are spectral units properly written in VOUnit) and raise an error otherwise.

    So, you can now post:

    SELECT TOP 5 gavo_specconv(em_min, 'GHz') AS nu
    FROM ivoa.obscore
    WHERE gavo_specconv((em_min+em_max)/2, 'GHz')
        BETWEEN 1 AND 2
      AND obs_collection='VLBA LH sources'
    

    to the TAP service at http://dc.g-vo.org/tap. You will get your result in GHz, and you write your constraint in GHz, too. Oh, and see below on the ugly constraint on obs_collection.

    Similarly, an X-ray astronomer would say, perhaps:

    SELECT TOP 5 access_url, gavo_specconv(em_min, 'keV') AS energy
    FROM ivoa.obscore
    WHERE gavo_specconv((em_min+em_max)/2, 'keV')
      BETWEEN 0.5 AND 2
      AND obs_collection='RASS'
    

    This works because the ADQL translator can figure out the unit of its first argument. But, perhaps regrettably, ADQL has no notion of literals with units, and so there is no way to meaningfully say the equivalent of gavo_specconv(656, 'Hz') to get Hα in Hz, and you will receive a (hopefully helpful) error message if you try that.

    However, this functionality is highly desirable not least because the queries above are fairly inefficient. That's why I added the funny constraints on the collection: without them, the queries will take perhaps half a minute and thus require async operation on my box.

    The (fundamental) reason for that is that postgres is not smart enough to work out it could be using an index on em_min and em_max if it sees something like nu between 3e8/em_min and 3e8/em_max by re-writing the constraint into 3e8/nu between em_min and em_max (and think really hard about whether this is equivalent in the presence of NULLs). To be sure, I will not teach that to my translation layer either. Not using indexes, however, is a recipe for slow queries when the obscore table you query has about 85 million rows (hi there in 2050: yes, that was a sizable table in our day).

    To let users fix what's too hard for postgres (or, for that matter, the translation engine when it cannot figure out units), there is a second form of gavo_specconv that takes a third argument: gavo_specconv(expr, unit_of_expr, target_unit). With that, you can write queries like:

    SELECT TOP 5 gavo_specconv(em_min, 'Angstrom') AS nu
    FROM ivoa.obscore
    WHERE gavo_specconv(5000, 'Angstrom', 'm')
      BETWEEN em_min AND em_max
    

    and hope the planner will use indexes. Full disclosure: Right now, I don't have indexes on the spectral limits of all tables contributing to my obscore table, so this particular query only looks fast because it's easy to find five datasets covering 500 nm – but that's an oversight I'll fix soon.

    Of course, to make this functionality useful in practice, it needs to be available on all obscore services (say) – only then can people run all-VO obscore searches without the optical bias. The next step (before Bambi-eyeing the TAP implementors) therefore would be to get it into the catalogue of ADQL user defined functions.

    For this, one would need to specify a bit more carefully what units must minimally be supported. In DaCHS, I have built this on a full implementation of VOUnits, which means you can query using attoparsecs of wavelength and get your result in dekaerg (which is a microjoule: 1 daerg = 1 uJ in VOUnits – don't you just love this?):

    SELECT gavo_specconv(
      (spectral_start+spectral_end)/2, 'daerg')
      AS energy
    FROM rr.stc_spectral
    WHERE gavo_specconv(0.0002, 'apc', 'J')
      BETWEEN spectral_start AND spectral_end
    

    (stop computing: an attoparsec is about 3 cm). This, incidentally, queries the draft RegTAP extension for the VODataService 1.2 coverage in space, time, and spectrum, which is another reason I'm proposing this function: I'm not quite sure how well my rationale that using Joules of energy is equally inconvenient for all communities will be generally received. The real rationale – that Joule is the SI unit for energy – I don't dare bring forward in the first place.

    Playing with wavelengths in AU (you can do that, too; note, though, that VOUnit forbids prefixes on AU, so don't even try mAU) is perhaps entertaining in a slightly twisted way, but admittedly poses a bit of a challenge in implementation when one does not have full VOUnits available. I'm currently thinking that m, nm, Angstrom, MHz, GHz, keV and MeV (ach! No Joule! But no erg, either!) plus whatever spectral units are in use in the local tables would about cover our use cases. But I'd be curious what other people think.

    Since I found the implementation of this a bit more challenging than I had at first expected, let me say a few words on how the underlying code works; I guess you can stop reading here unless you are planning to implement something like this.

    The fundamental trouble is that spectral conversions are non-linear. That means that what I do for ADQL's IN_UNIT – just compute a conversion factor and then multiply that to whatever expression is in its first argument – will not work. Instead, one has to write a new expression. And building these expressions becomes involved because there are thousands of possible combinations of input and output units.
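
    If you want a quick numerical reminder of that: doubling a wavelength halves the corresponding frequency rather than doubling it, so no constant factor can map one quantity onto the other.

    c = 299792458.  # speed of light in m/s
    print(c/500e-9)   # about 6.0e14 Hz
    print(c/1000e-9)  # about 3.0e14 Hz: twice the wavelength, half the frequency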

    What I ended up doing is adopting standard (i.e., SI) units for energy (J), wavelength (m), and frequency (Hz) as common bases, and then first convert the source and target units to the applicable standard unit. This entails trying to convert each input unit to each standard unit until a conversion actually works, which in DaCHS' Python looks like this:

    def toStdUnit(fromUnit):
        for stdUnit in ["J", "Hz", "m"]:
            try:
                 factor = base.computeConversionFactor(
                     fromUnit, stdUnit)
            except base.IncompatibleUnits:
                continue
            return stdUnit, factor
    
        raise common.UfuncError(
            f"specconv: {fromUnit} is not a spectral unit understood here")
    

    The VOUnits code is hidden away in base.computeConversionFactor, which raises an IncompatibleUnits when a conversion is impossible; hence, in the end, as a by-product this function also determines what kind of spectral value (energy, frequency, or wavelength) I am dealing with.

    That accomplished, all I need to do is look up the conversions between the basic units, which can be done in a single dictionary mapping pairs of standard units to the conversion expression templates. I have not tried to make these templates particularly pretty, but if you squint, you can still, I hope, figure out this is actually what the opening image shows:

    SPEC_CONVERSION = {
        ("J", "m"): "h*c/(({expr})*{f})",
        ("J", "Hz"): "({expr})*{f}/h",
        ("J", "J"): "({expr})*{f}",
        ("Hz", "m"): "c/({expr})/{f}",
        ("Hz", "Hz"): "{f}*({expr})",
        ("Hz", "J"): "h*{f}*({expr})",
        ("m", "m"): "{f}*({expr})",
        ("m", "Hz"): "c/({expr})/{f}",
        ("m", "J"): "h*c/({expr})/{f}",}
    

    expr is (conceptually) replaced by the first argument of the UDF, and f is the conversion factor between the input unit and the unit expr is in. Note that thankfully, no additive operators are involved and thus all this is numerically well-conditioned. Hence, I can afford not attempting to simplify any of the expressions involved.
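
    To see how these pieces fit together, here is a toy re-enactment of the expression building; this is explicitly not DaCHS code. I am using astropy to obtain the unit factors, the final division by the target unit's factor is my guess at how the scaling to the requested unit happens, and in the real ADQL the symbols h and c presumably get replaced by numeric constants:

    from astropy import units as u

    SPEC_CONVERSION = {
        ("J", "m"): "h*c/(({expr})*{f})",
        ("J", "Hz"): "({expr})*{f}/h",
        ("J", "J"): "({expr})*{f}",
        ("Hz", "m"): "c/({expr})/{f}",
        ("Hz", "Hz"): "{f}*({expr})",
        ("Hz", "J"): "h*{f}*({expr})",
        ("m", "m"): "{f}*({expr})",
        ("m", "Hz"): "c/({expr})/{f}",
        ("m", "J"): "h*c/({expr})/{f}"}

    STD_UNITS = {"J": u.J, "Hz": u.Hz, "m": u.m}

    def to_std_unit(unit_string):
        # return (std unit name, factor) so that value*factor is in std units
        unit = u.Unit(unit_string)
        for name, std in STD_UNITS.items():
            try:
                return name, unit.to(std)
            except u.UnitConversionError:
                continue
        raise ValueError(f"{unit_string} is not a spectral unit")

    def specconv_expression(expr, from_unit, to_unit):
        src_std, src_factor = to_std_unit(from_unit)
        dst_std, dst_factor = to_std_unit(to_unit)
        in_dst_std = SPEC_CONVERSION[(src_std, dst_std)].format(
            expr=expr, f=src_factor)
        # scale from the standard unit to the requested target unit
        # (my guess at how the real code does this last step)
        return f"({in_dst_std})/{dst_factor}"

    print(specconv_expression("em_min", "Angstrom", "keV"))
    # -> (h*c/(em_min)/1e-10)/1.602176634e-16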

    The rest is essentially book-keeping, where I'm using the ADQL parser to turn the expression into a tree fragment and then fiddling the tree fragment for expr into that. The result then replaces the UDF function call in the syntax tree. You can review all this in context in DaCHS' ufunctions.py, starting at the definition of toStdUnit.

    Sure: this is no Turing award material. But perhaps these notes are useful when people want to put this kind of thing into their ADQL engines. Which I'd consider a Really Good Thing™.

  • Histograms and Hidden Open Clusters

    image: reddish pattern

    Colour-coded histograms for distances of stars in the direction of some NGC open clusters -- one cluster per line, so you're looking at a couple of gigabytes of data here. If you want this a bit more precise: Read the article and generate your own image.

    I have spent a bit of time last week polishing up what will (hopefully) be the definitive source of common ADQL User Defined Functions (UDFs) for IVOA review. What's a UDF, you ask? Well, it is an extension to ADQL where service operators can invent new functionality. If you have been following this blog for a while, you will probably remember the ivo_healpix_index function from our dereddening exercise (and some earlier postings): That was a UDF, too.

    This polishing work reminded me of a UDF I've wanted to blog about for quite a while, available in DaCHS (and thus on our Heidelberg Data Center) since mid-2018: gavo_histogram. This, I claim, is a powerful tool for analyses over large amounts of data with rather moderate local means.

    For instance, consider this classic paper on the nature of NGC 2451: What if you were to look for more cases like this, i.e., (indulging in a bit of poetic liberty) open clusters hidden “behind” other open clusters?

    Somewhat more technically this would mean figuring out whether there are “interesting” patterns in the distance and proper motion histograms towards known open clusters. Now, retrieving the dozens of millions of stars that, say, Gaia, has in the direction of open clusters to just build histograms – making each row count for a lot less than one bit – simply is wasteful. This kind of counting and summing is much better done server-side.

    On the other hand, SQL's usual histogram maker, GROUP BY, is a bit unwieldy here, because you have lots of clusters, and you will not see anything if you munge all the histograms together. You could, of course, create a bin index from the distance and then group by this bin and the object name, somewhat like ...ROUND(r_est/20) as bin GROUP by name, bin – but that takes quite a bit of mangling before it can conveniently be used, in particular when you take independent distributions over multiple variables (“naive Bayesian”; but then it's the way to go if you want to capture dependencies between the variables).

    So, gavo_histogram to the rescue. Here's what the server-provided documentation has to say (if you use TOPCAT, you will find this in the “Service” tab in the TAP window's “Use Service” tab):

    gavo_histogram(val REAL, lower REAL, upper REAL, nbins INTEGER) -> INTEGER[]
    
    The aggregate function returns a histogram of val with
    nbins+2 elements. Assuming 0-based arrays, result[0] contains
    the number of underflows (i.e., val<lower), result[nbins+1]
    the number of overflows. Elements 1..nbins are the counts in
    nbins bins of width (upper-lower)/nbins. Clients will have to
    convert back to physical units using some external communication,
    there currently is no (meta-) data as to what lower and upper was in
    the TAP response.
    

    This may sound a bit complicated, but the gist really is: type gavo_histogram(r_est, 0, 2000, 20) as hist, and you will get back an array with 20 bins, roughly 0..100, 100..200, and so on, and two extra bins for under- and overflows.
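
    The documentation's warning about converting back to physical units simply means you have to reconstruct the bin positions from the lower, upper, and nbins you passed in yourself. A little sketch of how one might do that (remember that the first and last array elements are the under- and overflow bins):

    import numpy

    def bin_centres(lower, upper, nbins):
        # centres of the nbins regular bins of gavo_histogram(val, lower, upper, nbins);
        # strip hist[0] and hist[nbins+1] (under- and overflow) before using this
        edges = numpy.linspace(lower, upper, nbins+1)
        return (edges[:-1]+edges[1:])/2

    print(bin_centres(0, 2000, 20))   # 50, 150, ... 1950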

    Let's try this for our open cluster example. The obvious starting point is selecting the candidate clusters; we are only interested in famous clusters, so we take them from the NGC (if that's too boring for you: with TAP uploads you could take the clusters from Simbad, too), which conveniently sits in my data center as openngc.data:

    select name, raj2000, dej2000, maj_ax_deg
    from openngc.data
    where obj_type='OCl'
    

    Then, we need to add the stars in their rough directions. That's a classic crossmatch, and of course these days we use Gaia as the star catalogue:

    select name, source_id
    from openngc.data
    join gaia.dr2light
    on (
      1=contains(
        point(ra,dec),
        circle(raj2000, dej2000, maj_ax_deg)))
    where obj_type='OCl'
    

    This is now a table of cluster names and Gaia source ids of the candidate stars. To add distances, you could fiddle around with Gaia parallaxes, but because there is a 1/x involved deriving distances, the error model is complicated, and it is much easier and safer to adopt Bailer-Jones et al's pre-computed distances and join them in through source_id.

    And that distance estimation, r_est, is exactly what we want to take our histograms over – which means we have to group by name and use gavo_histogram as an aggregate function:

    with ocl as (
      select name, raj2000, dej2000, maj_ax_deg, source_id
      from openngc.data
      join gaia.dr2light
      on (
        1=contains(
          point(ra,dec),
          circle(raj2000, dej2000, maj_ax_deg)))
      where obj_type='OCl')
    
    select
      name,
      gavo_histogram(r_est, 0, 4000, 200) as hist
    from
      gdr2dist.main
      join ocl
      using (source_id)
    where r_est!='NaN'
    group by name
    

    That's it! This query will give you (admittedly somewhat raw, since we're ignoring the confidence intervals) histograms of the distances of stars in the direction of all NGC open clusters. Of course, it will run a while, as many millions of stars are processed, but TAP async mode easily takes care of that.

    Oh, one odd thing is left to discuss (ignore this paragraph if you don't know what I'm talking about): r_est!='NaN'. That's not quite ADQL, but it happens to do the job of normal programming languages' isnan, at least when the backend is Postgres: comparing against the string 'NaN' is true exactly when computations failed and there is an actual NaN in the column, so with != we throw those rows out. Actual NaNs are uncommon in SQL databases, and normal NULLs wouldn't hurt gavo_histogram. In our distance table, however, some NaNs slipped through, and they would poison our histograms. So, ADQL wizards probably should know that this is what you do for isnan, and that the usual isnan test val!=val doesn't work in SQL (or at least not with Postgres).

    So, fire up your TOPCAT and run this on the TAP server http://dc.g-vo.org/tap.

    You will get a table with 618 (or so) histograms. At this point, TOPCAT can't do a lot with them. So, let's emigrate to pyVO and save this table in a file ocl.vot.
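
    In pyVO, that step might look like this (a sketch; the query is the one from above, run asynchronously as recommended, and the result is written out with astropy):

    import pyvo

    QUERY = """
    with ocl as (
      select name, raj2000, dej2000, maj_ax_deg, source_id
      from openngc.data
      join gaia.dr2light
      on (
        1=contains(
          point(ra,dec),
          circle(raj2000, dej2000, maj_ax_deg)))
      where obj_type='OCl')

    select
      name,
      gavo_histogram(r_est, 0, 4000, 200) as hist
    from
      gdr2dist.main
      join ocl
      using (source_id)
    where r_est!='NaN'
    group by name
    """

    svc = pyvo.dal.TAPService("http://dc.g-vo.org/tap")
    # run_async submits the job, waits for it to finish, and fetches the result
    result = svc.run_async(QUERY)
    result.to_table().write("ocl.vot", format="votable")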

    My visualisation proposition would be: Let's subtract a “background” from the histograms (I'm using splines to model that background) and then plot them row by row; multi-peaked rows in the resulting image would be suspicious.

    This is exactly what the programme below does, and the image for this article is a cutout of what the code produces. Set GALLERY = True to see what the histograms and background fits look like (hit 'q' to get to the next one).

    In the resulting image, any two yellow dots in one line are at least suspicious; I've spotted a few, but they are so conspicuous that others must have noticed. Or have they? If you'd like to check a few of them out, feel free to let me know – I think I have a few ideas how to pull some VO tricks to see if these things are real – and if they've been spotted before.

    So, here's the yellow spot programme:

    from astropy.table import Table
    import matplotlib.pyplot as plt
    import numpy
    from scipy.interpolate import UnivariateSpline
    
    GALLERY = False
    
    def substract_background(arr):
        x = range(len(arr))
        mean = sum(arr)/len(arr)
        arr = arr/mean
        background = UnivariateSpline(x, arr, s=100)
        cleaned = arr-background(x)
    
        if GALLERY:
            plt.plot(x, arr)
            plt.plot(x, background(x))
            plt.show()
    
        return cleaned
    
    
    def main():
        tab = Table.read("ocl.vot")
        hist = numpy.array([substract_background(r["hist"][1:-1])
          for r in tab])
        plt.matshow(hist, cmap='gist_heat')
        plt.show()
    
    
    if __name__=="__main__":
        main()
    
  • Deredden using TAP

    An animated color-magnitude diagram

    Raw and dereddened CMD for a region in Cygnus.

    Today I published a nice new set of tables on our TAP service: The Bayestar17 3D dust map derived from Pan-STARRS 1 by Greg Green et al. I mention in passing that this was made particularly enjoyable because Greg and friends put an explicit license on their data (in this case, CC-BY-SA).

    This dust map is probably a fascinating resource by itself, but the really nifty thing is that you can use it to correct all kinds of photometric data for extinction – at least to some extent. On the Bayestar web page, the authors give some examples for usage – and with our new service, you can use TAP as well to correct photometry for extinction.

    To see how, first have a look at the table metadata for the prdust.map_union table; this is what casual users probably should look at. More specifically, at the coverage, best_fit, and grdiagnostic columns.

    coverage here is an interval of 10-healpixes. It has to be an interval because the original data comes on wildly different levels; depending on the density of stars, sometimes it takes the area of a 6-healpix (about a square degree) to get enough signal, whereas in the galactic plane a 10-healpix (a thousandth of a square degree) already has enough stars. To make the whole thing conveniently queriable without exploding a 6-healpix row into 1000 identical rows, larger healpixes translate into intervals of 10-healpixes. Don't panic, though, I'll show how to conveniently query this below.

    best_fit and grdiagnostic are arrays (remember the light curves in Gaia DR2?), given in bins of 0.5 in distance modulus (which is, in case you feel a bit uncertain as to the algebraic signs, 5 log10(dist)-5 for a distance in parsec), starting with a distance modulus of 4 and ending with 19. This means that for a distance modulus of 4.2 you should check the array index 0, whereas 4.3 already would be covered by array index 1. With this, best_fit[ind] gives E(B-V) = (B-V) - (B-V)₀ in the direction of coverage in a distance modulus bin around ind/2 + 4. For each best_fit[ind], grdiagnostic[ind] contains a quality measure for that value. You probably shouldn't touch the E(B-V) if that measure is larger than 1.2.
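
    To make the indexing concrete, here is the little bit of arithmetic as Python; this assumes 0-based indices as in the paragraph above (the ADQL query below adds one, its arrays being 1-based):

    import math

    def bestfit_index(dist_pc):
        # 0-based index into best_fit for a distance in parsec, following
        # the binning described above (0.5 mag bins centred on 4, 4.5, ...)
        dist_mod = 5*math.log10(dist_pc) - 5
        return round((dist_mod - 4)*2)

    print(bestfit_index(1200))   # a star at 1.2 kpc: distance modulus 10.4, index 13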

    So, how does one use this?

    To try things, let's pull some Gaia data with distances; in order to have interesting extinctions, I'm using a patch in Cygnus (RA 288.5, Dec 2.3). If you live on the northern hemisphere and step out tonight, you could see dust clouds there with the naked eye (provided electricity fails all around, that is). Full disclosure: I tried the Coal Sack first but after checking the coverage of the dataset – which essentially is the sky north of -30 degrees – I noticed that wouldn't fly. But stories like these are one reason why I'm making such a fuss about having standard STC coverage representations.

    We want distances, and to dodge all the intricacies involved when naively turning parallaxes to distances discussed at length in a paper by Xavier Luri et al (and elsewhere), I'm using precomputed distances from Bailer-Jones et al. (2018AJ....156...58B); you'll find them on the "ARI Gaia" service; in TOPCAT's TAP dialog simply search for “Gaia” – that'll also turn up the GAVO DC TAP service, which we'll need in a second.

    The pre-computed distances are in the gaiadr2_complements.geometric_distance table, which can be joined to the main Gaia object catalog using the source_id column. So, here's a query to produce a little photometric catalog around our spot in Cygnus (we're discarding objects with excessive parallax errors while we're at it):

    SELECT
      r_est, 5*log10(r_est)-5 AS dist_mod,
      phot_g_mean_mag, phot_bp_mean_mag, phot_rp_mean_mag,
      ra, dec
    FROM gaiadr2.gaia_source
    JOIN gaiadr2_complements.geometric_distance
      USING (source_id)
    WHERE parallax_over_error>1
      AND 1=CONTAINS(POINT('ICRS', ra, dec), CIRCLE('ICRS', 288.5, 2.3, 0.5))
    

    The color-magnitude diagram resulting from this is the red point cloud in the animated GIF at the top. To reproduce it, just plot phot_bp_mean_mag-phot_rp_mean_mag against phot_g_mean_mag-dist_mod (and invert the y axis).
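
    If you would rather draw that with matplotlib than in TOPCAT: a quick sketch, assuming you have saved the query result to a file called cygnus.vot (the file name is made up):

    from astropy.table import Table
    import matplotlib.pyplot as plt

    tab = Table.read("cygnus.vot", format="votable")
    plt.scatter(tab["phot_bp_mean_mag"]-tab["phot_rp_mean_mag"],
        tab["phot_g_mean_mag"]-tab["dist_mod"], s=2, color="red")
    plt.gca().invert_yaxis()   # bright is up in a CMD
    plt.xlabel("BP - RP")
    plt.ylabel("G - distance modulus")
    plt.show()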

    De-reddening this needs a few minor technicalities. The most important one is how to match against the odd intervals of healpixes in the prdust.map_union table. A secondary one is that we have only pulled equatorial coordinates, and the healpixes in prdust are in galactic coordinates.

    Computing the healpix requires the ivo_healpix_index ADQL user defined function (UDF) that you may have met before, and since we have to go from ICRS to Galactic it requires a fairly new UDF I've recently defined to finally get the discussion on having a “standard library” of astrometric functions in ADQL going: gavo_transform. Here's how to get a 10-healpix as required for map_union from ra and dec:

    CAST(ivo_healpix_index(10,
      gavo_transform('ICRS', 'GALACTIC', POINT(ra, dec))) AS INTEGER)
    

    The CAST call is a pure technicality – ivo_healpix_index returns a 64-bit integer, which I can't use in my interval logic.

    The comparison against the intervals you could do yourself, but as argued in the Registry-STC article this is one of the trivial things that are easy to get wrong. So, let's use the ivo_interval_overlaps UDF; it goes in the join condition to properly match prdust healpixes to catalog positions. Then our total query – that, I hope, should be reasonably easy to adapt to similar problems – is:

    WITH sources AS (
      SELECT phot_g_mean_mag,
        phot_bp_mean_mag,
        phot_rp_mean_mag,
        dist_mod,
        CAST(ivo_healpix_index(10,
          gavo_transform('ICRS', 'GALACTIC', POINT(ra, dec))) AS INTEGER) AS hpx,
        ROUND((dist_mod-4)*2)+1 AS dist_mod_bin
      FROM TAP_UPLOAD.T1)
    
    SELECT
      phot_bp_mean_mag-phot_rp_mean_mag-dust.best_fit[dist_mod_bin] AS color,
      phot_g_mean_mag-dist_mod+
        dust.best_fit[dist_mod_bin]*3.384 AS abs_mag,
      dust.grdiagnostic[dist_mod_bin] as qual
    FROM sources
    JOIN prdust.map_union AS dust
    ON (1=ivo_interval_has(hpx, coverage))
    

    (If you're following along: you have to switch to the GAVO DC TAP to run this, and you will probably have to change the index after TAP_UPLOAD).

    Ok, in the photometry department there's a bit of cheating going on here – I'm correcting Gaia B-R with B-V, and I'm using the factor for Johnson V to estimate the extinction in Gaia G (if you're curious where that comes from: See the footnote on best_fit and the MC extinction service docs should get you started), so this is far from physically correct. But, as you can see from the green cloud in the plot above, it already helps a bit. And if you find out better factors, by all means let me know so I can add an update... right here:

    Update (2018-09-11): The original data creator, Gregory Green points out that the thing with having a better factor for Gaia G isn't that simple, because, as he says “Gaia G is very broad, [and] the extinction coefficients are much more dependent on stellar type, and extinction is also more nonlinear with dust column (extinction is only linear with dust column and independent of stellar type for an infinitely narrow passband)”. So – when de-reddening, prefer narrow passbands. But whether narrow or wide: TAP helps you.
