GAVO Blog: Virtual Observatory Matters: Categories

Articles from Demo

Histograms and Hidden Open Clusters

2020-08-13 Markus Demleitner

Colour-coded histograms for distances of stars in the direction of some NGC open clusters -- one cluster per line, so you're looking a a couple of Gigabytes of data here. If you want this a bit more precise: Read the article and generate your own image.

I have spent a bit of time last week polishing up what will (hopefully) be the definitive source of common ADQL User Defined Functions (UDFs) for IVOA review. What's a UDF, you ask? Well, it is an extension to ADQL where service operators can invent new functionality. If you have been following this blog for a while, you will probably remember the ivo_healpix_index function from our dereddening exercise (and some earlier postings): That was an UDF, too.

This polishing work reminded me of a UDF I've wanted to blog about for a quite a while, available in DaCHS (and thus on our Heidelberg Data Center) since mid-2018: gavo_histogram. This, I claim, is a powerful tool for analyses over large amounts of data with rather moderate local means.

For instance, consider this classic paper on the nature of NGC 2451: What if you were to look for more cases like this, i.e., (indulging in a bit of poetic liberty) open clusters hidden “behind” other open clusters?

Somewhat more technically this would mean figuring out whether there are “interesting” patterns in the distance and proper motion histograms towards known open clusters. Now, retrieving the dozens of millions of stars that, say, Gaia, has in the direction of open clusters to just build histograms – making each row count for a lot less than one bit – simply is wasteful. This kind of counting and summing is much better done server-side.

On the other hand, SQL's usual histogram maker, GROUP BY, is a bit unwieldy here, because you have lots of clusters, and you will not see anything if you munge all the histograms together. You could, of course, create a bin index from the distance and then group by this bin and the object name, somewhat like ...ROUND(r_est/20) as bin GROUP by name, bin – but that takes quite a bit of mangling before it can conveniently be used, in particular when you take independent distributions over multiple variables (“naive Bayesian”; but then it's the way to go if you want to capture dependencies between the variables).

So, gavo_histogram to the rescue. Here's what the server-provided documentation has to say (if you use TOPCAT, you will find this in the ”Service” tab in the TAP windows' ”Use Service” tab):
```
gavo_histogram(val REAL, lower REAL, upper REAL, nbins INTEGER) -> INTEGER[]

The aggregate function returns a histogram of val with
nbins+2 elements. Assuming 0-based arrays, result[0] contains
the number of underflows (i.e., val<lower), result[nbins+1]
the number of overflows. Elements 1..nbins are the counts in
nbins bins of width (upper-lower)/nbins. Clients will have to
convert back to physical units using some external communication,
there currently is no (meta-) data as to what lower and upper was in
the TAP response.
```
This may sound a bit complicated, but the gist really is: type gavo_histogram(r_est, 0, 2000, 20) as hist, and you will get back an array with 20 bins, roughly 0..100, 100..200, and so on, and two extra bins for under- and overflows.

Let's try this for our open cluster example. The obvious starting point is selecting the candidate clusters; we are only interested in famous clusters, so we take them from the NGC (if that's too boring for you: with TAP uploads you could take the clusters from Simbad, too), which conveniently sits in my data center as openngc.data:
```
select name, raj2000, dej2000, maj_ax_deg
from openngc.data
where obj_type='OCl'
```
Then, we need to add the stars in their rough directions. That's a classic crossmatch, and of course these days we use Gaia as the star catalogue:
```
select name, source_id
from openngc.data
join gaia.dr2light
on (
  1=contains(
    point(ra,dec),
    circle(raj2000, dej2000, maj_ax_deg)))
where obj_type='OCl')
```
This is now a table of cluster names and Gaia source ids of the candidate stars. To add distances, you could fiddle around with Gaia parallaxes, but because there is a 1/x involved deriving distances, the error model is complicated, and it is much easier and safer to adopt Bailer-Jones et al's pre-computed distances and join them in through source_id.

And that distance estimation, r_est, is exactly what we want to take our histograms over – which means we have to group by name and use gavo_histogram as an aggregate function:
```
with ocl as (
  select name, raj2000, dej2000, maj_ax_deg, source_id
  from openngc.data
  join gaia.dr2light
  on (
    1=contains(
      point(ra,dec),
      circle(raj2000, dej2000, maj_ax_deg)))
  where obj_type='OCl')

select
  name,
  gavo_histogram(r_est, 0, 4000, 200) as hist
from
  gdr2dist.main
  join ocl
  using (source_id)
where r_est!='NaN'
group by name
```
That's it! This query will give you (admittedly somewhat raw, since we're ignoring the confidence intervals) histograms of the distances of stars in the direction of all NGC open clusters. Of course, it will run a while, as many millions of stars are processed, but TAP async mode easily takes care of that.

Oh, one odd thing is left to discuss (ignore this paragraph if you don't know what I'm talking about): r_est!='NaN'. That's not quite ADQL but happens to do the isnan of normal programming languages at least when the backend is Postgres: It is true if computations failed and there is an actual NaN in the column. This is uncommon in SQL databases, and normal NULLs wouldn't hurt gavo_histogram. In our distance table, some NaNs slipped through, and they would poison our histograms. So, ADQL wizards probably should know that this is what you do for isnan, and that the usual isnan test val!=val doesn't work in SQL (or at least not with Postgres).

So, fire up your TOPCAT and run this on the TAP server http://dc.g-vo.org/tap.

You will get a table with 618 (or so) histograms. At this point, TOPCAT can't do a lot with them. So, let's emigrate to pyVO and save this table in a file ocl.vot

My visualisation proposition would be: Let's substract a “background” from the histograms (I'm using splines to model that background) and then plot them row by row; multi-peaked rows in the resulting image would be suspicious.

This is exactly what the programme below does, and the image for this article is a cutout of what the code produces. Set GALLERY = True to see how the histograms and background fits look like (hit 'q' to get to the next one).

In the resulting image, any two yellow dots in one line are at least suspicious; I've spotted a few, but they are so consipicuous that others must have noticed. Or have they? If you'd like to check a few of them out, feel free to let me know – I think I have a few ideas how to pull some VO tricks to see if these things are real – and if they've been spotted before.

So, here's the yellow spot programme:
```
from astropy.table import Table
import matplotlib.pyplot as plt
import numpy
from scipy.interpolate import UnivariateSpline

GALLERY = False

def substract_background(arr):
    x = range(len(arr))
    mean = sum(arr)/len(arr)
    arr = arr/mean
    background = UnivariateSpline(x, arr, s=100)
    cleaned = arr-background(x)

    if GALLERY:
        plt.plot(x, arr)
        plt.plot(x, background(x))
        plt.show()

    return cleaned


def main():
    tab = Table.read("ocl.vot")
    hist = numpy.array([substract_background(r["hist"][1:-1])
      for r in tab])
    plt.matshow(hist, cmap='gist_heat')
    plt.show()


if __name__=="__main__":
    main()
```
Category: Demo
ADQL Traps #1: NULL

2019-07-03 Markus Demleitner

NULL is a difficult concept. Not only in SQL

I recently got embarrassed by ADQL NULLs, i.e., the magic value indicating that a value in a given column is missing. And since that's a common source of errors when writing ADQL queries, I'll take this as a cue for a blog post.

The concrete background is fairly technical and registry-ish; suffice it to say that some data providers who implemented interfaces conforming to some standard didn't properly say so in their registry records. Back in RegTAP 1.0 (that's the standard that says how a client like TOPCAT talks to the VO Registry), I decided to work around that by fudging the pattern for how to discover those interfaces so they'd still be found.

In RegTAP 1.1, which is now under review by the VO community, I wanted to do away with that workaround. But would that break anything? This question translates to “are there vs:ParamHTTP interfaces that don't have a role attribute of std”. Whatever “ParamHTTP” and “role attribute” actually mean, just appreciate that it looks like it might translate into SQL like:
```
select * from rr.interface
where
  intf_type='vr:paramhttp'
  and not intf_role='std'
```
I ran that query, rejoiced because it didn't return anything, removed the workarund from the standard, and then was shot down when I read Mark's mail (politely) saying I'm wrong and there are services still requiring the workaround. As usual: If a query returns what you expect, be double careful.

What went wrong? Well, NULL semantics. You see, in SQL NULL is never equal to anything, not even itself (it's like NaN in IEEE floats in that: try n = float('nan');print(n==n) in Python and look again if you're cool about it). It's also not unequal. Don't take my word for it. Try:
```
select * from tap_schema.schemas where NULL=NULL
```
and:
```
select * from tap_schema.schemas where NULL!=NULL
```
– you'll get empty results in both cases.

What does that mean for science queries? Well, whenever there's NULLs in columns (and the only safe assumption for now is that they may hide in there; we should probably add nun-null as a column property in the tap schema and in VODataService some day), you need to be careful in particular with inverted logic.

Here's an example: Suppose you want to investigate NGC objects brighter than 10 mag in B in one bin in everything else in another. The ones brighter are simple:
```
select count(*) from openngc.data where mag_b<10
```
(try it on the TAP server at http://dc.g-vo.org/tap, it's 383 in the current release). It becomes difficult for “the rest”. If you write:
```
select count(*) from openngc.data where not mag_b<10
```
or, equivalently:
```
select count(*) from openngc.data where mag_b>=10
```
you'll get (for the current release) 10887. However, the whole catalogue has 13954 entries, so there's 13954-10887-383=2684 rows missing. Your “rest” has missed everything for which mag_b isn't given. Sure enough:
```
select count(*) from openngc.data where mag_b is null
```
(and this is the only good way to compare against null) gives 2684.

The right way to say “anything for which mag_b is not smaller than 10” thus is:
```
select count(*) from openngc.data
where
  not mag_b<10
  or mag_b is null
```
Morale: Unless you're sure there are no missing values (i.e., NULLs) in a column you're looking at, think about what these mean to your research (or other) question: Should these rows just vanish? Then you usually don't need to do anything and the SQL semantics magically do the right thing (which is why things are defined as they are). If, however, the corresponding rows would mean something to your question, you need to be explicit, and you must have some condition involving IS NULL or IS NOT NULL.

The trouble, of course, is that just knowing this still isn't enough. You need to remember it in the right moment. Or you'll share my fate of suffering some public embarrassement.

Category: Demo
Say hello to RegTAP

2018-01-19 Markus Demleitner

GAVO's WIRR registry interface in action to find resources with radio parallaxes.

RegTAP is one of those standards that a scientist will normally not see – it works in the background and makes, for instance, TOPCAT display the Cone Search services matching some key words. And it's behind the services like WIRR, our Web Interface to the Relational Registry (“Relational Registry” being the official name for RegTAP) that lets you do some interesting data discovery beyond what current clients support. In the screenshot above, for instance (try it yourself), I'm looking for cone search services having parallaxes presumably from radio observations. You could now transmit the services you've found to, say, TOPCAT or your own pyvo-based program to start querying them.

The key point this query is the use of UCDs – these let services declare fairly unambiguously what kind of physics (if you take that word with a grain of salt) they are talking about. In the example, pos.parallax means, well, a parallax, and the percent character is a wildcard (coming not from UCDs, but from ADQL). That wildcard is a good idea here because without it we might miss things like pos.parallax;obs and pos.parallax;stat.fit that people might have used to distinguish “raw” and ”processed” estimates.

UCDs are great for data discovery. Really.

Sometimes, however, clicking around in menus just isn't good enough. That's when you want the full power of RegTAP and write your very own queries. The good news: If you know ADQL (and you should!), you're halfway there already.

Here's one example of direct RegTAP use I came up with the other day. The use case was discovering data collections that give the effective temperatures of components of binary star systems.

If you check the UCD list, that “physics” translates into data that has columns with UCDs of phys.temperature and meta.code.multip at the same time. To translate that into a RegTAP query, have a look at the tables that make up a RegTAP service: its ”schema”. Section 8 of the standard lists all the tables there are, and there's an ADASS poster that has an image of the schema with the more common columns illustrated. Oh, and if you're new to RegTAP, you're probably better off briefly studying the examples first to get a feeling for how RegTAP is supposed to work.

You will find that a pair of ivoid – the VO's global resource identifier – and a per-resource table index uniquely identify a table within the entire registry. So, an ADQL query to pick out all tables containing temperatures and component identifiers would look like this:
```
SELECT DISTINCT ivoid, table_index
FROM
rr.table_column AS t1
JOIN rr.table_column AS t2
USING (ivoid, table_index)
WHERE t1.ucd='phys.temperature'
AND t2.ucd='meta.code.multip'
```
– the DISTINCT makes it so even tables that have lots of temperatures or codes only turn up once in our result set, and the somewhat odd self-join of the rr.table_column table with itself lets us say “make sure the two columns are actually in the same table”. Note that you could catch multi-table resources that define the components in one table and the temperatures in another by just joining on ivoid rather than ivoid and table_index.

You can run this query on any RegTAP endpoint: GAVO operates a small network of mirrors behind http://reg.g-vo.org/tap, there's the ESAC one at http://registry.euro-vo.org/regtap/tap, and STScI runs one at http://vao.stsci.edu/RegTAP/TapService.aspx. Just use your usual TAP client.

But granted, the result isn't terribly user-friendly: just identifiers and number. We'd at least like to see the names and descriptions of the tables so we know if the data is somehow relevant.

RegTAP is designed so you can locate the columns you would like to retrieve or constrain and then just NATURAL JOIN everything together. The table_description and table_name columns are in rr.res_table, so all it takes to see them is to take the query above and join its result like this:
```
SELECT table_name, table_description
FROM rr.res_table
NATURAL JOIN (
  SELECT DISTINCT ivoid, table_index
  FROM
  rr.table_column AS t1
  JOIN rr.table_column AS t2
  USING (ivoid, table_index)
  WHERE t1.ucd='phys.temperature'
  AND t2.ucd='meta.code.multip') as q
```
If you try this, you'll see that we'd like to get the descriptions of the resources embedding the tables, too in order to get an idea what we can expect from a given data collection. And if we later want to find services exposing the tables (WIRR is nice for that – try the ivoid constraint –, but for this example all resources currently come from VizieR, so you can directly use VizieR's TAP service to interact with the tables), you want the ivoids. Easy: Just join rr.resource and pick columns from there:
```
SELECT table_name, table_description, res_description, ivoid
FROM rr.res_table
NATURAL JOIN rr.resource
NATURAL JOIN (
  SELECT DISTINCT ivoid, table_index
  FROM
  rr.table_column AS t1
  JOIN rr.table_column AS t2
  USING (ivoid, table_index)
  WHERE t1.ucd='phys.temperature'
  AND t2.ucd='meta.code.multip') as q
```
If you've made it this far and know a bit of ADQL, you probably have all it really takes to solve really challenging data discovery problems – as far as Registry metadata reaches, that is, which currently does not include space-time coverage. But stay tuned, more on this soon.

In case you're looking for a more systematic introduction into the world of the Registry and RegTAP, there are two... ouch. Can I really link to Elsevier papers? Well, here goes: 2014A&C.....7..101D (a.k.a. arXiv:1502.01186 on the Registry as such and 2015A%26C....11...91D (a.k.a. arXiv:1407.3083) mainly on RegTAP.

Category: Demo
ADQL tricks at MPIA

2017-06-29 Markus Demleitner

The 2017-06-29 ADQL talk (red circle) from 30000 ft

Today I was up on Heidelberg's signature mountain, Königstuhl, at the Max-Planck-Institute for Astronomy for a little talk on what I'd provisionally call “intermediate ADQL” – discussing some aspects of ADQL and some TAP techniques that may not be immediately obvious but still generally and straightforwardly applicable to everyday problems. Since I suspect the lecture notes for that talk may be of interest to some readers of this blog, I thought I should share them here.

What this also contains is a very quick piece of pyVO-based python (which needs both this helper and a recent pyVO) for a use case that comes up fairly often: “Give me all proper motions (radio fluxes, distances, radial velocities, whatever) for object in this region.”

This uses a discovery case I've been after for quite a while now: Find services by the UCDs of tables within them. And while that's been possible for quite a while on GAVO's Registry UI WIRR, there's still too many services that don't declare their tables to the Registry, and when talking about TAP, the situation is still a bit worse (as has been mentioned in my account of the last interop). So – enjoy the code, but very frankly, you'll still see wires sticking out for a several months yet.

And if you run a TAP service yourself, please have a look at how to enable table discovery over on the IVOA wiki so we can finally get those pesky wires out of our users' eyes.

Category: Demo
Updated Proper Motion Tutorial

2017-03-24 Markus Demleitner

At the risk of turning this into a blog on nice TAP tricks (which it's not supposed to be): Our classic short tutorial on adding proper motions to almost arbitrary object lists has just gotten a facelift today.

And there's new content, too – I now show what to do when you don't even have positions but just object names. In order to keep this sufficiently geeky, here's the query as a spoiler:
```
SELECT col1, ra, dec
FROM TAP_UPLOAD.t1
LEFT OUTER JOIN ident
ON (id=normId(col1))
LEFT OUTER JOIN basic
ON (oidref=oid)
```
But to close on a non-TAP topic: Registry! There's an experimental facility to have this kind of thing in the Registry; the PM tutorial is in, for instance, with the ivoid ivo://edu.gavo.org/hd/gavo_addpms). One thing you can do with this is generate a list of registred documents that essentially updates itself from the registry.

Another is figure out where the source code of the document is (if the authors choose to share it, which is of course a very smart thing to do); in our example it's in Volute, the IVOA's semi-official version control system. So, if you find a bug (defined as “superset of typo”) in the linked document, you're most welcome to supply patches as diffs or just directly fix things if you have commit privileges in Volute.

Category: Demo

« Page 3 / 4 »