Artikel mit Tag ADQL:

  • Histograms and Hidden Open Clusters

    image: reddish pattern

    Colour-coded histograms for distances of stars in the direction of some NGC open clusters -- one cluster per line, so you're looking a a couple of Gigabytes of data here. If you want this a bit more precise: Read the article and generate your own image.

    I have spent a bit of time last week polishing up what will (hopefully) be the definitive source of common ADQL User Defined Functions (UDFs) for IVOA review. What's a UDF, you ask? Well, it is an extension to ADQL where service operators can invent new functionality. If you have been following this blog for a while, you will probably remember the ivo_healpix_index function from our dereddening exercise (and some earlier postings): That was an UDF, too.

    This polishing work reminded me of a UDF I've wanted to blog about for a quite a while, available in DaCHS (and thus on our Heidelberg Data Center) since mid-2018: gavo_histogram. This, I claim, is a powerful tool for analyses over large amounts of data with rather moderate local means.

    For instance, consider this classic paper on the nature of NGC 2451: What if you were to look for more cases like this, i.e., (indulging in a bit of poetic liberty) open clusters hidden “behind” other open clusters?

    Somewhat more technically this would mean figuring out whether there are “interesting” patterns in the distance and proper motion histograms towards known open clusters. Now, retrieving the dozens of millions of stars that, say, Gaia, has in the direction of open clusters to just build histograms – making each row count for a lot less than one bit – simply is wasteful. This kind of counting and summing is much better done server-side.

    On the other hand, SQL's usual histogram maker, GROUP BY, is a bit unwieldy here, because you have lots of clusters, and you will not see anything if you munge all the histograms together. You could, of course, create a bin index from the distance and then group by this bin and the object name, somewhat like ...ROUND(r_est/20) as bin GROUP by name, bin – but that takes quite a bit of mangling before it can conveniently be used, in particular when you take independent distributions over multiple variables (“naive Bayesian”; but then it's the way to go if you want to capture dependencies between the variables).

    So, gavo_histogram to the rescue. Here's what the server-provided documentation has to say (if you use TOPCAT, you will find this in the ”Service” tab in the TAP windows' ”Use Service” tab):

    gavo_histogram(val REAL, lower REAL, upper REAL, nbins INTEGER) -> INTEGER[]
    
    The aggregate function returns a histogram of val with
    nbins+2 elements. Assuming 0-based arrays, result[0] contains
    the number of underflows (i.e., val<lower), result[nbins+1]
    the number of overflows. Elements 1..nbins are the counts in
    nbins bins of width (upper-lower)/nbins. Clients will have to
    convert back to physical units using some external communication,
    there currently is no (meta-) data as to what lower and upper was in
    the TAP response.
    

    This may sound a bit complicated, but the gist really is: type gavo_histogram(r_est, 0, 2000, 20) as hist, and you will get back an array with 20 bins, roughly 0..100, 100..200, and so on, and two extra bins for under- and overflows.

    Let's try this for our open cluster example. The obvious starting point is selecting the candidate clusters; we are only interested in famous clusters, so we take them from the NGC (if that's too boring for you: with TAP uploads you could take the clusters from Simbad, too), which conveniently sits in my data center as openngc.data:

    select name, raj2000, dej2000, maj_ax_deg
    from openngc.data
    where obj_type='OCl'
    

    Then, we need to add the stars in their rough directions. That's a classic crossmatch, and of course these days we use Gaia as the star catalogue:

    select name, source_id
    from openngc.data
    join gaia.dr2light
    on (
      1=contains(
        point(ra,dec),
        circle(raj2000, dej2000, maj_ax_deg)))
    where obj_type='OCl')
    

    This is now a table of cluster names and Gaia source ids of the candidate stars. To add distances, you could fiddle around with Gaia parallaxes, but because there is a 1/x involved deriving distances, the error model is complicated, and it is much easier and safer to adopt Bailer-Jones et al's pre-computed distances and join them in through source_id.

    And that distance estimation, r_est, is exactly what we want to take our histograms over – which means we have to group by name and use gavo_histogram as an aggregate function:

    with ocl as (
      select name, raj2000, dej2000, maj_ax_deg, source_id
      from openngc.data
      join gaia.dr2light
      on (
        1=contains(
          point(ra,dec),
          circle(raj2000, dej2000, maj_ax_deg)))
      where obj_type='OCl')
    
    select
      name,
      gavo_histogram(r_est, 0, 4000, 200) as hist
    from
      gdr2dist.main
      join ocl
      using (source_id)
    where r_est!='NaN'
    group by name
    

    That's it! This query will give you (admittedly somewhat raw, since we're ignoring the confidence intervals) histograms of the distances of stars in the direction of all NGC open clusters. Of course, it will run a while, as many millions of stars are processed, but TAP async mode easily takes care of that.

    Oh, one odd thing is left to discuss (ignore this paragraph if you don't know what I'm talking about): r_est!='NaN'. That's not quite ADQL but happens to do the isnan of normal programming languages at least when the backend is Postgres: It is true if computations failed and there is an actual NaN in the column. This is uncommon in SQL databases, and normal NULLs wouldn't hurt gavo_histogram. In our distance table, some NaNs slipped through, and they would poison our histograms. So, ADQL wizards probably should know that this is what you do for isnan, and that the usual isnan test val!=val doesn't work in SQL (or at least not with Postgres).

    So, fire up your TOPCAT and run this on the TAP server http://dc.g-vo.org/tap.

    You will get a table with 618 (or so) histograms. At this point, TOPCAT can't do a lot with them. So, let's emigrate to pyVO and save this table in a file ocl.vot

    My visualisation proposition would be: Let's substract a “background” from the histograms (I'm using splines to model that background) and then plot them row by row; multi-peaked rows in the resulting image would be suspicious.

    This is exactly what the programme below does, and the image for this article is a cutout of what the code produces. Set GALLERY = True to see how the histograms and background fits look like (hit 'q' to get to the next one).

    In the resulting image, any two yellow dots in one line are at least suspicious; I've spotted a few, but they are so consipicuous that others must have noticed. Or have they? If you'd like to check a few of them out, feel free to let me know – I think I have a few ideas how to pull some VO tricks to see if these things are real – and if they've been spotted before.

    So, here's the yellow spot programme:

    from astropy.table import Table
    import matplotlib.pyplot as plt
    import numpy
    from scipy.interpolate import UnivariateSpline
    
    GALLERY = False
    
    def substract_background(arr):
        x = range(len(arr))
        mean = sum(arr)/len(arr)
        arr = arr/mean
        background = UnivariateSpline(x, arr, s=100)
        cleaned = arr-background(x)
    
        if GALLERY:
            plt.plot(x, arr)
            plt.plot(x, background(x))
            plt.show()
    
        return cleaned
    
    
    def main():
        tab = Table.read("ocl.vot")
        hist = numpy.array([substract_background(r["hist"][1:-1])
          for r in tab])
        plt.matshow(hist, cmap='gist_heat')
        plt.show()
    
    
    if __name__=="__main__":
        main()
    
  • The Bochum Galactic Disk Survey

    Patches of higher perceived variability on the Sky

    Fig 1: How our haphazard variability ratio varies over the sky (galactic coordinates). And yes, it's clear that this isn't dominated by physical variability.

    About a year ago, I reported on a workshop on “Large Surveys with Small Telescopes” in Bamberg; at around the same time, I've published an example for those, the Bochum Galactic Disk Survey BGDS, which used a twin 15 cm robotic telescope in some no longer forsaken place in the Andes mountains to monitor the brighter stars in the southern Milky Way. While some tables from an early phase of the survey have been on VizieR for a while, we now publish the source images (also in SIAP and Obscore), the mean photometry (via SCS and TAP) and, perhaps potentially most fun of all, the the lightcurves (via SSAP and TAP) – a whopping 35 million of the latter.

    This means that in tools like Aladin, you can now find such light curves (and images in two bands from a lot of epochs) when you are in the survey's coverage, and you can run TAP queries on GAVO's http://dc.g-vo.org/tap server against the full photometry table and the time series.

    Regular readers of this blog will not be surprised to see me use this as an excuse to show off a bit of ADQL trickery.

    If you have a look at the bgds.phot_all table in your favourite TAP client, you'll see that it has a column amp, giving the difference between the highest and lowest magnitude. The trouble is that amp for almost all objects just reflects the measurement error rather than any intrinsic variability. To get an idea what's “normal” (based on the fact that essentially all stars have essentially constant luminosity on the range and resolution scales considered here), run a query like:

    SELECT ROUND(amp/err_mag*10)/10 AS bin, COUNT(*) AS n
    FROM bgds.phot_all
    WHERE nobs>10
    GROUP BY bin
    

    As this scans the entire 75 million rows of the table, you will probably have to use async mode to run this.

    distribution of amplitude/mag errors

    Figure 2: The distribution of amplitude over magnitude error for all BGDS objects with nobs>10 (blue) and the subset with a mean magnitude brighter than 15 (blue).

    When it comes back, you will have, for objects where any sort of statistics make sense at all (hence nobs>10), a histogram (of sorts) of the amplitude in units of upstream's magnitude error estimation. If you log-log-plot this, you'll see something like Figure 2. The curve at least tells you that the magnitude error estimate is not very far off – the peak at about 3 “sigma” is not unreasonable since about half of the objects have nobs of the order of a hundred and thus would likely contain outliers that far out assuming roughly Gaussian errors.

    And if you're doing a rough cutoff at amp/magerr>10, you will get perhaps not necessarily true variables, but, at least potentially interesting objects.

    Let's use this insight to see if we spot any pattern in the distribution of these interesting objects. We'll use the HEALPix technique I've discussed three years ago in this blog, but with a little twist from ADQL 2.1: The Common Table Expressions or CTEs I have already mentioned in my blog post on ADQL 2.1 and then advertised in the piece on the Henry Draper catalogue. The brief idea, again, is that you can write queries and give them a name that you can use elsewhere in the query as if it were an actual table. It's not much different from normal subqueries, but you can re-use CTEs in multiple places in the query (hence the “common”), and it's usually more readable.

    Here, we first create a version of the photometry table that contains HEALPixes and our variability measure, use that to compute two unsophisticated per-HEALPix statistics and eventually join these two to our observable, the ratio of suspected variables to all stars observed (the multiplication with 1.0 is a cheap way to make a float out of a value, which is necessary here because a/b does integer division in ADQL if a and b are both integers):

    WITH photpoints AS (
      SELECT
        amp/err_mag AS redamp,
        amp,
        ivo_healpix_index(5, ra, dec) AS hpx
      FROM bgds.phot_all
      WHERE
        nobs>10
        AND band_name='SDSS i'
        AND mean_mag<16),
    all_objs AS (
      SELECT count(*) AS ct,
        hpx
        FROM photpoints GROUP BY hpx),
    strong_var AS (
      SELECT COUNT(*) AS ct,
        hpx
        FROM photpoints
        WHERE redamp>4 AND amp>1 GROUP BY hpx)
    SELECT
      strong_var.ct/(1.0*all_objs.ct) AS obs,
      all_objs.ct AS n,
      hpx
    FROM strong_var JOIN all_objs USING (hpx)
    WHERE all_objs.ct>20
    

    If you plot this using TOPCAT's HEALPix thingy and ask it to use Galactic coordinates, you'll end up with something like Figure 1.

    There clearly is some structure, but given that the variables ratio reaches up to 0.2, this is still reflecting instrumental or pipeline effects and thus earthly rather than Astrophysics. And that's going beyond what I'd like to talk about on a VO blog, although I'l take any bet that you will see significant structure in the spatial distribution of the variability ratio at about any magnitude cutoff, since there are a lot of different population mixtures in the survey's footprint.

    Be that as it may, let's have a quick look at the time series. As with the short spectra from Byurakan use case, we've stored the actual time series as arrays in the database (the mjd and mags columns in bgds.ssa_time_series. Unfortunately, since they are a lot less array-like than homogeneous spectra, it's also a lot harder to do interesting things with them without downloading them (I'm grateful for ideas for ADQL functions that will let you do in-DB analysis for such things). Still, you can at least easily download them in bulk and then process them in, say, python to your heart's content. The Byurakan use case should give you a head start there.

    For a quick demo, I couldn't resist checking out objects that Simbad classifies as possible long-period variables (you see, as I write this, the public bohei over Betelgeuse's brief waning is just dying down), and so I queried Simbad for:

    SELECT ra, dec, main_id
    FROM basic
    WHERE
      otype='LP?'
      AND 1=CONTAINS(
         POINT('', ra, dec),
         POLYGON('', 127, -30, 112, -30, 272, -30, 258, -30))
    

    (as of this writing, Simbad still needs the ADQL 2.0-compliant first arguments to POINT and POLYGON), where the POLYGON is intended to give the survey's footprint. I obtained that by reading off the coordinates of the corners in my Figure 1 while it was still in TOPCAT. Oh, and I had to shrink it a bit because Simbad (well, the underlying Postgres server, and, more precisely, its pg_sphere extension) doesn't want polygons with edges longer than π. This will soon become less pedestrian: MOCs in relational databases are coming; more on this soon.

    TOPCAT action shot with a light curve display

    Fig 3: V566 Pup's BGDS lightcuve in a TOPCAT configured to auto-plot the light curves associated with a row from the bgds.ssa_time_series table on the GAVO DC TAP service.

    If you now do the usual spiel with an upload crossmatch to the bgds.ssa_time_series table and check “Plot Table” in Views/Activation Action, you can quickly page through the light curves (TOPCAT will keep the plot style as you go from dataset to dataset, so it's worth configuring the lines and the error bars). Which could bring you to something like Fig. 3; and that would suggest that V* V566 Pup isn't really long-period unless the errors are grossly off.

  • Parallel Queries

    Image: Plot of run times

    An experiment with parallel querying of PPMX, going from single-threaded execution to using seven workers.

    Let me start this post with a TL;DR for

    scientists:Large analysis queries (like those that contain a GROUP BY clause) profit a lot from parallel execution, and you needn't do a thing for that.
    DaCHS operators:
     When you have large tables, Postgres 11 together with the next DaCHS release may speed up your responses quite dramatically in some cases.

    So, here's the story –

    I've finally overcome my stretch trauma and upgraded the Heidelberg data center's database server to Debian buster. With that, I got Postgres 11, and I finally bothered to look into what it takes to enable parallel execution of database queries.

    Turns out: My Postgres started to do parallel execution right away, but just in case, I went for the following lines in postgresql.conf:

    max_parallel_workers_per_gather = 4
    max_worker_processes = 10
    max_parallel_workers = 10
    

    Don't quote me on this – I frankly admit I haven't really developed a feeling for the consequences of max_parallel_workers_per_gather and instead just did some experiments while the box was loaded otherwise, determining where raising that number has a diminishing return (see below for more on this).

    The max_worker_processes thing, on the other hand, is an educated guess: on my data center, there's essentially never more than one person at a time who's running “interesting”, long-running queries (i.e., async), and that person should get the majority of the execution units (the box has 8 physical CPUs that look like 16 cores due to hyperthreading) because all other operations are just peanuts in comparison. I'll gladly accept advice to the effect that that guess isn't that educated after all.

    Of course, that wasn't nearly enough. You see, since TAP queries can return rather large result sets – on the GAVO data center, the match limit is 16 million rows, which for a moderate row size of 2 kB already translates to 32 GB of memory use if pulled in at once, half the physical memory of that box –, DaCHS uses cursors (if you're a psycopg2 person: named cursors) to stream results and write them out to disk as they come in.

    Sadly, postgres won't do parallel plans if it thinks people will discard a large part of the result anyway, and it thinks that if you're coming through a cursor. So, in SVN revision 7370 of DaCHS (and I'm not sure if I'll release that in this form), I'm introducing a horrible hack that, right now, just checks if there's a literal “group” in the query and doesn't use a cursor if so. The logic is, roughly: With GROUP, the result set probably isn't all that large, so streaming isn't that important. At the same time, this type of query is probably going to profit from parallel execution much more than your boring sequential scan.

    This gives rather impressive speed gains. Consider this example (of course, it's selected to be extreme):

    import contextlib
    import pyvo
    import time
    
    @contextlib.contextmanager
    def timeit(activity):
      start_time = time.time()
      yield
      end_time = time.time()
      print("Time spent on {}: {} s".format(activity, end_time-start_time))
    
    
    svc = pyvo.tap.TAPService("http://dc.g-vo.org/tap")
    with timeit("Cold (?) run"):
      svc.run_sync("select round(Rmag) as bin, count(*) as n"
        " from ppmx.data group by bin")
    with timeit("Warm run"):
      svc.run_sync("select round(Rmag) as bin, count(*) as n"
        " from ppmx.data group by bin")
    

    (if you run it yourself and you get warnings about VOTable versions from astropy, ignore them; I'm right and astropy is wrong).

    Before enabling parallel execution, this was 14.5 seconds on a warm run, after, it was 2.5 seconds. That's an almost than a 6-fold speedup. Nice!

    Indeed, that holds beyond toy examples. The showcase Gaia density plot:

    SELECT
            count(*) AS obs,
            source_id/140737488355328 AS hpx
    FROM gaia.dr2light
    GROUP BY hpx
    

    (the long odd number is 235416-6, which turns source_ids into level 6-HEALPixes as per Gaia footnote id; please note that Postgres right now isn't smart enough to parallelise ivo_healpix), which traditionally ran for about an hour is now done in less than 10 minutes.

    In case you'd like to try things out on your postgres, here's what I've done to establish the max_parallel_workers_per_gather value above.

    1. Find a table with a few 1e7 rows. Think of a query that will return a small result set in order to not confuse the measurements by excessive client I/O. In my case, that's a magnitude histogram, and the query would be:

      select round(Rmag) as bin, count(*) as n from ppmx.data group by bin;

      Run this query once so the data is in the disk cache (the query is “warm”).

    2. Establish a non-parallel baseline. That's easy to do:

      set max_parallel_workers_per_gather=0;
      
    3. Then run:

      explain analyze select round(Rmag) as bin, count(*) as n from ppmx.data group by bin;
      

      You should see a simple query plan with the runtime for the non-parallel execution – in my case, a bit more than 12 seconds.

    4. Then raise the number of max_parallel_workers_per_gatherer successively. Make sure the query plan has lines of the form “Workers Planned” or so. You should see that the execution time falls with the number of workers you give it, up to the value of max_worker_processes – or until postgres decides your table is too small to warrant further parallelisation, which for my settings happened at 7.

    Note, though, that in realistic, more complex queries, there will probably be multiple operations that will profit from parallelisation in a single query. So, if in this trivial example you can go to 15 gatherers and still see an improvement, this could actually make things slower for complex queries. But as I said above: I have no instinct yet for how things will actually work out. If you have experiences to share: I'm sure I'm not the only person on dachs-users who't be interested.

  • LAMOST5 meets Datalink

    One of the busiest spectral survey instruments operated right now is the Large Sky Area Multi-Object Fiber Spectrograph Telescope (LAMOST). And its data in the VO, more or less: DR2 and DR3 have been brought into the VO by our Czech colleagues, but since they currently lack resources to update their services to the latest releases, they have kindly given me their DaCHS resource descriptor, and so I had a head start for publishing DR5 in Heidelberg.

    With some minor updates, here it is now: Over nine million medium-resolution spectra covering large parts of the northen sky – the spatial coverage is like this:

    Coverage Healpix map

    There's lots of fun to be had with this; of course, there's an SSA service, so when you point Aladin or Splat at some part of the covered sky and look for spectra, chances are you'll see LAMOST spectra, and when working on some of our tutorials (this one, for example), it happened that LAMOST actually had what I was looking for when writing them.

    But I'd like to use the opportunity to mention two other modes of accessing the data.

    Stacked spectra

    Tablesample and TOPCAT's Plot Table activation action

    Say you'd like to look at spectra of M stars and would like to have some sample from across the sky, fire up TOPCAT, point its TAP client the GAVO DC TAP service (http://dc.g-vo.org/tap) and run something like:

    select
      ssa_pubDID, accref, raj2000, dej2000, ssa_targsubclass
    from lamost5.data tablesample(1)
    where
      ssa_targsubclass like 'M%'
    

    This is using the TABLESAMPLE modifier in the from clause, which isn't standard ADQL yet. As mentioned in the DaCHS 1.4 announcement, DaCHS has a prototype implementation of what's been discussed on the IVOA's DAL mailing list: pick a part of a table rather than the full one. It takes a percentage as an argument, and tells the server to choose about this percentage of the table's records using a reasonable and fast heuristic. Note that this won't give you perfect statistical sampling, but if it's not “good enough” for some purpose, I'd like to learn about that purpose.

    Drawing a proper statistical sample, on the other hand, would take minutes on the GAVO database server – with tablesample, I had the roughly 6000 spectra the above query returns essentially instantaneously, and from eyeballing a sky plot of them, I'd say their distribution is close enough to that of the full DR5. So: tablesample is your friend.

    For a quick look at the spectra themselves, in TOPCAT click Views/Activation Actions, check “Plot Table” and make sure TOPCAT proposes the accref column as “Table Location” (if you don't see these items, update your TOPCAT – it's worth it). Now click on a row or perhaps a dot on a plot and behold an M spectrum.

  • DaCHS 1.4 is out

    Dachs logo with "version 1.4" superposed

    Since the Groningen Interop is over, it's time for a DaCHS release, and so, roughly half a year after the release of DaCHS 1.3, today I've pushed DaCHS 1.4 into our Debian repository.

    As usual, you should upgrade as soon as you find time to do so, because upgrades become more difficult if they span large version gaps; and one of these days you will need some new feature or run into one of the odd bugs. Upgrading is a good opportunity to also get your DaCHS ready for buster by adding the repos mentioned there.

    The list of new features is rather short this time around. Here are some noteworthy ones:

    • There's now an XML grammar that can be used when you have to parse smallish snippets of XML as, for instance, in VOEvent.
    • You can now use TABLESAMPLE(1) after a table specification in DaCHS' ADQL to tell the database engine to just use 1% of a table for a query. While this isn't a precise way to sample tables, it's great when developing queries.
    • Also among new features I'd like to see in ADQL and have therefore put into DaCHS is GENERATE_SERIES(a,b), which is what is known as table-generating function in SQL . If you know SDSS CasJobs, you'll have seen lots of those already. GENERATE_SERIES, however, is really plain: it just spits out a table with a column with integers between a and b. For an example of why one might what to have that, check out the poster I'm linking to in my ADASS report.
    • If you have an updating data descriptor (usually, because you keep feeding data into a data collection), DaCHS will no longer automatically re-make its dependencies (like, say, views). That's because that's not necessary in general, and it's a pain if every update on an obscore-published table tears down and rebuilds the obscore view. For the rare cases when you do need to rebuild dependencies, there's now a remakeOnDataChange attribute on data.
    • At the interop, I've mentioned a few use cases for knowing which server software you're talking to, and I've said that people should set their server headers to informative values. DaCHS does that now.

    To conclude on a low note: This is probably going to be the last release of DaCHS for python 2. Even though we will have to shed a dependency or two that simply will not be ported to python 3, and even though I'm rather unhappy with a few properties of the python 3 port of twisted, there's probably no way to escape this, given that Debian is purging out python 2 packages quickly already.

    So, when we meet again for the next release, you'll probably be looking at DaCHS 2.0, and where you have custom code in your RDs, it's rather likely that you'll see a minor amount of breakage. I promise I'll do everything I can to make the migration easy for deployers, but I can't do higher magic, so: If there's ever been a time to add regression tests to your RDs, it's now.

« Page 2 / 5 »