Articles from Software

  • Global Dataset Discovery in PyVO

    A Tkinter user interface with inputs for Space, Spectrum, and Time, a checkbox marked "inclusive", and buttons Run, Stop, Broadcast, Save, and Quit.

    Admittedly somewhat old-style: As part of teaching global dataset discovery to pyVO, I have also come up with a Tkinter GUI for it. See A UI for more on this.

    One of the more exciting promises of the Virtual Observatory was global dataset discovery: You say “Give me all spectra of object X that there are”, and the computer relays that request to all the services that might have applicable data. Once the results come in, they are merged into some uniformly browsable form.

    In the early VO, there were a few applications that let you do this; I fondly remember VODesktop. As the VO grew and diversified, however, this became harder and harder, partly because there were more and more services, partly because there were more protocols through which to publish data. Thus, for all I can see, there is, at this point, no software that can actually query all services plausibly serving, say, images or spectra in the VO.

    I have to say that writing such a thing is not for the faint-hearted, either. I probably wouldn't have tackled it myself unless the pyVO maintainers had made it an effective precondition for cleaning up the pyVO Servicetype constraint.

    But they did, and hence as a model I finally wrote some code to do all-VO image searches using all of SIA1, SIA2, and obscore, i.e., the two major versions of the Simple Image Access Protocol plus Obscore tables published through TAP services. I actually have already reported in Tucson on some preparatory work I did last summer and named a few problems:

    • There are too many services to query on a regular basis, but filtering them would require them to declare their coverage; far too many still don't.
    • With the current way of registering obscore tables, there is no way to know their coverage.
    • One dataset may be available through up to three protocols on a single host.
    • SIA1 does not even let you constrain time and spectrum.

    Some of these problems I can work around, others I can try to fix. Read on to find out how I fared so far.

    The pyVO API

    Currently, the development happens in pyVO PR #470. While it is still a PR, let me point you to temporary pyVO docs on the proposed pyvo.discover module – of course, all of this is for review and probably not in the shape it will remain in[1].

    To quote from there, the basic usage would be something like:

    from pyvo import discover
    from astropy import units as u
    from astropy import time
    
    datasets, log = discover.images_globally(
      space=(274.6880, -13.7920, 0.1),
      spectrum=500.7*u.nm,
      time=(time.Time('1995-01-01'), time.Time('1995-12-31')))
    

    At this point, only a cone is supported as a space constraint, and only a single point in spectrum. It would certainly be desirable to be more flexible with the space constraint, but given the capabilities of the various protocols, that is hard to do. Actually, even with the plain cone, Obscore (i.e., ironically, the most powerful of the discovery protocols covered here) currently results in an implementation that makes me unhappy: ugly, slow, and wrong. This requires a longer discussion; see Appendix: Optionality Considered Harmful.

    datasets at this point is a list of, conceptually, Obscore records. Technically, the list contains instances of a custom class ImageFound, which have attributes named after the Obscore columns. In case you have doubts about the semantics of any column, the Obscore specification is there to help. And yes, you can argue we should create a single astropy table from that list. You are probably right.
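
    As a minimal sketch of what working with such a list could look like, assuming only that each record exposes Obscore column names as attributes: the Record class below is a hypothetical stand-in for ImageFound (it is not pyVO code), and the records are made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical stand-in for ImageFound with just a few Obscore columns."""
    obs_publisher_did: str
    access_url: str
    s_ra: float
    s_dec: float
    origin_service: str


def group_by_origin(records):
    """Returns a dict mapping each origin_service IVOID to its records."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec.origin_service].append(rec)
    return dict(grouped)


# Made-up records, for illustration only.
records = [
    Record("ivo://example/a?1", "http://example.org/a1.fits",
           274.69, -13.79, "ivo://org.gavo.dc/__system__/obscore/obscore"),
    Record("ivo://example/a?2", "http://example.org/a2.fits",
           274.70, -13.80, "ivo://org.gavo.dc/__system__/obscore/obscore"),
    Record("ivo://example/b?1", "http://example.org/b1.fits",
           274.68, -13.78, "ivo://example/other/sia2"),
]

by_origin = group_by_origin(records)
for ivoid, recs in by_origin.items():
    print(ivoid, len(recs))
```

    Grouping by origin_service is only one of many things one might do here; the point is that plain attribute access is all that is needed.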

    PyVO adds an extra column over the mandatory obscore set, origin_service. This contains the IVOA identifier (IVOID) of the service at which the dataset was found. You have probably seen IVOIDs before: they are URIs with a scheme of ivo:. What you may not know: these things actually resolve, specifically to registry resource records. You can do this resolution in a web browser: Just prepend https://dc.g-vo.org/I/ to an IVOID and paste the result into the address bar. For instance, my Obscore table has the IVOID ivo://org.gavo.dc/__system__/obscore/obscore; the link below the IVOID leads you to an information page, which happens to be the resource's Registry record formatted with a bit of XSLT. A somewhat more readable but less informative rendering is available when you prepend https://dc.g-vo.org/LP/ (“landing page”).
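
    The prefixing described above is easy to script. The following is a small sketch; the function name resolver_url is made up for this illustration, while the two prefixes and the example IVOID are the ones given in the text.

```python
RESOLVER_PREFIX = "https://dc.g-vo.org/I/"
LANDING_PAGE_PREFIX = "https://dc.g-vo.org/LP/"


def resolver_url(ivoid, prefix=RESOLVER_PREFIX):
    """Returns a browser-friendly URL for an IVOID by prepending prefix."""
    if not ivoid.lower().startswith("ivo://"):
        raise ValueError(f"Not an IVOID: {ivoid!r}")
    return prefix + ivoid


obscore_ivoid = "ivo://org.gavo.dc/__system__/obscore/obscore"
print(resolver_url(obscore_ivoid))
print(resolver_url(obscore_ivoid, LANDING_PAGE_PREFIX))
```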

    The second value returned from discover.images_globally is a list of strings with information on how the global discovery progressed. For now, this is not intended to be machine-readable. Humans can figure out which resources were skipped because other services already cover their data, which services yielded how many records, and which services failed, for instance:

    Skipping ivo://org.gavo.dc/lswscans/res/positions/siap because it is served by ivo://org.gavo.dc/__system__/obscore/obscore
    Skipping ivo://org.gavo.dc/rosat/q/im because it is served by ivo://org.gavo.dc/__system__/obscore/obscore
    Obscore GAVO Data Center Obscore Table: 2 records
    SIA2 The VO @ ASTRON SIAP Version 2 Service: 0 records
    SIA2 ivo://au.csiro/casda/sia2 skipped: ReadTimeout: HTTPSConnectionPool(host='casda.csiro.au', port=443): Read timed out. (read timeout=20)
    SIA2 CADC Image Search (SIA): 0 records
    SIA2 European HST Archive SIAP service: 0 records
    ...
    

    (On the skipping, see Relationships below). I consider this crucial provenance, as that lets you assess later what you may have missed. When you save the results, be sure to save these, too.
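
    Although the log is not meant to be machine-readable, a rough tally can still be handy when looking at a saved result later. The sketch below guesses at the line shapes from the sample above; the regular expressions and the function name are assumptions of this example, and they will break whenever the log format changes.

```python
import re

# Sample lines copied from the log shown above.
SAMPLE_LOG = [
    "Skipping ivo://org.gavo.dc/lswscans/res/positions/siap because it is"
    " served by ivo://org.gavo.dc/__system__/obscore/obscore",
    "Obscore GAVO Data Center Obscore Table: 2 records",
    "SIA2 The VO @ ASTRON SIAP Version 2 Service: 0 records",
    "SIA2 ivo://au.csiro/casda/sia2 skipped: ReadTimeout: HTTPSConnectionPool("
    "host='casda.csiro.au', port=443): Read timed out. (read timeout=20)",
]

# Assumed line shapes; nothing guarantees these stay stable.
RECORDS_RE = re.compile(r"^(\S+) (.+): (\d+) records$")
FAILED_RE = re.compile(r"^(\S+) (\S+) skipped: (.+)$")


def summarise_log(lines):
    """Sorts log lines into skipped resources, record counts, and failures."""
    summary = {"skipped": [], "records": {}, "failed": {}}
    for line in lines:
        if line.startswith("Skipping "):
            # The second word is the IVOID of the skipped resource.
            summary["skipped"].append(line.split()[1])
            continue
        mat = RECORDS_RE.match(line)
        if mat:
            summary["records"][mat.group(2)] = int(mat.group(3))
            continue
        mat = FAILED_RE.match(line)
        if mat:
            summary["failed"][mat.group(2)] = mat.group(3)
    return summary


summary = summarise_log(SAMPLE_LOG)
print(summary["records"])
```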

    A feature that will presumably (see Inclusivity for the reasons for this expectation) be important at least for a few years is that you can pass the result of a Registry query, and pyVO will try to find services suitable for image discovery on that set of resources.

    A relatively straightforward use case for that is global obscore discovery. This would look like this:

    from pyvo import discover
    from pyvo import registry
    from astropy import units as u
    from astropy import time
    
    def say(s):
      print(s)
    
    datasets, log = discover.images_globally(
      space=(274.6880, -13.7920, 1),
      time=(time.Time('1995-01-01'), time.Time('1995-12-31')),
      services=registry.search(registry.Datamodel("obscore")),
      watcher=say)
    

    (the watcher thing lets you, well, watch the progress of the discovery).

    A UI

    To get an idea whether this API might one day work for the average astronomer, I have written a Tkinter-based GUI to global image discovery as it is now: tkdiscover (only available from github at this point). This is what a session with it might look like:

    Lots of TOPCAT windows with various graphs and tables, an x-ray image of the sky with overplotted points, and a plain gray window offering the specification of space, spectrum, and time constraints.

    The actual UI is in the top right: A plain window in which you can configure a global discovery query by straightforward serialisations of discover.images_globally's arguments:

    • Space (currently, a cone in RA, Dec, and search radius, separated by whitespace or commas)
    • Spectrum (currently, a single point as a wavelength in metres)
    • Time (currently, either a single point in time – which probably is rarely useful – or an interval, to be entered as civil DALI dates)
    • Inclusivity.
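
    To illustrate what “separated by whitespace or commas” means for the Space field, here is a sketch of how such a string could be turned into the tuple that discover.images_globally expects. This is not tkdiscover's actual parsing code; parse_cone and its range checks are assumptions made for this example.

```python
import re


def parse_cone(text):
    """Parses 'RA Dec radius' (in degrees) separated by whitespace or commas.

    Returns a tuple (ra, dec, radius) of floats.
    """
    parts = [p for p in re.split(r"[\s,]+", text.strip()) if p]
    if len(parts) != 3:
        raise ValueError(f"Expected RA, Dec, and radius, got {text!r}")
    ra, dec, radius = (float(p) for p in parts)
    if not 0 <= ra <= 360:
        raise ValueError(f"RA out of range: {ra}")
    if not -90 <= dec <= 90:
        raise ValueError(f"Dec out of range: {dec}")
    if radius <= 0:
        raise ValueError(f"Radius must be positive: {radius}")
    return ra, dec, radius


print(parse_cone("274.6880, -13.7920 0.1"))
```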

    When you run this, this basically calls discover.images_globally and lets you know how it is progressing. You can click Broadcast (which sends the current result to all VOTable clients on the SAMP bus) or Save at any time and inspect how discovery is progressing. I predict you will want to do that, because querying dozens of services will take time.

    There is also a Stop button that aborts the dataset search (you will still have the records already found). Note that the Stop button will not interrupt running network operations, because the network library underneath pyVO, requests, is not designed for being interrupted. Hence, be patient when you hit stop; this may take as long as the configured timeout (currently 20 seconds) if the service hangs or has to do a lot of work. You can see that tkdiscover has noticed your stop request because the service counter will show a leading zero.

    Service counter? Oh, that's what is at the bottom right of the window. Once service discovery is done, that contains three numbers: The number of services to query, the number of services queried already, and the number of services that failed.

    The table contains the obscore records described above, and the log lines are in the discovery_log INFO. I will give you that this is extremely unreadable in particular in TOPCAT, which normalises the line separators to plain whitespace. Perhaps some other representation of these log lines would be preferable: A PARAM with a char[][] (but VOTable still is terrible with arrays of variable-length strings)? Or a separate table with char[*] entries?

    Inclusivity

    I have promised above I'd explain the “Inclusive” part in both the pyVO API and the Tk UI. Well, this is a bit of a sad story.

    All-VO-queries take time. Thus, in pyVO we try to only query services that we expect serve data of interest. How do we arrive at expectations like that? Well, quite a few records in the Registry by now declare their coverage in space and time (cf. my 2018 post for details).

    The trouble is: Most still don't. The checkmark at inclusive decides whether or not to query these “undecidable” services. Which makes a huge difference in runtime and effort. With the pre-configured constraints in the current prototype (X-Ray images a degree around 274.6880, -13.7920 from the year 1995), we currently discover three services (of which only one actually needs to be queried) when inclusive is off. When it is on, pyVO will query a whopping 323 services (today).
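
    The decision the checkmark controls can be sketched as follows. This is not pyVO's implementation: it only considers time coverage, represents it as a list of MJD intervals, and uses made-up resource identifiers, all of which are simplifications for this illustration.

```python
def select_services(resources, mjd, inclusive=False):
    """Returns the IVOIDs of resources worth querying for the epoch mjd.

    resources is a sequence of (ivoid, time_coverage) pairs, where
    time_coverage is a list of (start, end) MJD intervals, or None if
    the resource does not declare its coverage ("undecidable").
    """
    selected = []
    for ivoid, time_coverage in resources:
        if time_coverage is None:
            # Undecidable: only query it if the user asked for inclusivity.
            if inclusive:
                selected.append(ivoid)
        elif any(start <= mjd <= end for start, end in time_coverage):
            selected.append(ivoid)
    return selected


# Made-up resources, for illustration only.
RESOURCES = [
    ("ivo://example/covers-1995", [(49000.0, 50500.0)]),
    ("ivo://example/too-late", [(55000.0, 56000.0)]),
    ("ivo://example/undeclared", None),
]

# MJD 49900 is in mid-1995.
strict = select_services(RESOURCES, 49900.0)
generous = select_services(RESOURCES, 49900.0, inclusive=True)
print(strict, generous)
```

    With real Registry content, the second list is the one that balloons to hundreds of entries.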

    The inclusivity crisis is particularly bad with Obscore tables because of their broken registration pattern; I can say that so bluntly because I am the author of the standard at fault, TAPRegExt. I am preparing a note with a longer explanation and proposals for fixing matters – <cough> follow me on github –, but in all brevity: Obscore data is discovered using something like a flag on TAP services. That is bad because the TAP services usually have entirely different metadata from their Obscore table; think, in particular, of the physical coverage that is relevant here.

    It will be quite a bit of effort to get the data providers to do the Registry work required to improve this situation. Until that is done, you will miss Obscore tables when you don't check inclusive (or override automatic resource selection as above) – and if you do check inclusive, your discovery runs will take something like a quarter of an hour.

    Relationships

    In general, the sheer number of services to query is the Achilles' heel in the whole plan. There is nothing wrong with having a machine query 20 services, but querying 200 is starting to become an effort.

    With multi-data collection services like Obscore (or collective SIA2 services), getting down to a few dozen services globally for a well-constrained search is actually not unrealistic; once all resources properly declare their coverage, it is not very likely that more than 20 institutions worldwide will have data in a credibly small region of space, time, and spectrum. If all these run collective services and properly declare the datasets to be served by them, that's our 20-services global query right there.

    However, pyVO has to know when data contained in a resource is actually queriable by a collective service. Fortunately, this problem has already been addressed in the 2019 endorsed note on Discovering Data Collections Within Services: Basically, the individual resource declares an IsServedBy relationship to the collective service. PyVO global discovery already looks at these. That is how it could figure out these two things in the sample log given above:

    Skipping ivo://org.gavo.dc/lswscans/res/positions/siap because it is served by ivo://org.gavo.dc/__system__/obscore/obscore
    Skipping ivo://org.gavo.dc/rosat/q/im because it is served by ivo://org.gavo.dc/__system__/obscore/obscore
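
    In essence, the skipping amounts to the following. This is a sketch rather than pyVO's code; in particular, plan_queries is a made-up name, and the rule that a resource is only skipped when its collective service is itself among the candidates is an assumption of this example.

```python
def plan_queries(candidates, served_by):
    """Decides which candidate resources actually need to be queried.

    candidates is a list of IVOIDs; served_by maps the IVOID of an
    individual resource to the IVOID of the collective service it has
    declared an IsServedBy relationship to.

    Returns a pair (ivoids_to_query, log_lines).
    """
    candidate_set = set(candidates)
    to_query, log = [], []
    for ivoid in candidates:
        collective = served_by.get(ivoid)
        if collective is not None and collective in candidate_set:
            log.append(
                f"Skipping {ivoid} because it is served by {collective}")
        else:
            to_query.append(ivoid)
    return to_query, log


OBSCORE = "ivo://org.gavo.dc/__system__/obscore/obscore"
candidates = [
    "ivo://org.gavo.dc/lswscans/res/positions/siap",
    "ivo://org.gavo.dc/rosat/q/im",
    OBSCORE,
]
served_by = {
    "ivo://org.gavo.dc/lswscans/res/positions/siap": OBSCORE,
    "ivo://org.gavo.dc/rosat/q/im": OBSCORE,
}

to_query, log = plan_queries(candidates, served_by)
print(to_query)
print("\n".join(log))
```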
    

    But of course the individual services have to declare these relationships. Surprisingly many already do, as you can observe yourself when you run:

    select ivoid, related_id from
    rr.relationship
    natural join rr.capability
    where
    standard_id like 'ivo://ivoa.net/std/sia%'
    and relationship_type='isservedby'
    

    on your favourite RegTAP endpoint (if you have no preferences, use mine: http://dc.g-vo.org/tap). If you have collective services and run individual SIA services, too, please run that query, see if you are in there, and if not, please declare the necessary relationships. In case you are unsure as to what to do, feel free to contact me.

    Future Directions

    At this point, this is a rather rough prototype that needs a lot of fleshing out. I am posting this in part to invite the more adventurous to try (and break) global discovery and develop further ideas.

    Some extensions I am already envisaging include:

    • Write a similar module for spectra based on SSAP and Obscore. That would then probably also work for time series and similar 1D data.

    • Do all the Registry work I was just talking about.

    • Allow interval-valued spectral constraints. That's pretty straightforward; if you are looking for some place to contribute code, this is what I'd point you to.

    • Track overflow conditions. That should also be simple, probably just a matter of perusing the pyVO docs or source code and then conditionally produce a log entry.

    • Make an obscore s_region out of the SIA1 WCS information. This should also be easy – perhaps someone already has code for that that's tested around the poles and across the stitching line? Contributions are welcome.

    • Allow more complex geometries to define the spatial region of interest. To keep SIA1 viable in that scenario it would be conceivable to compute a bounding box for SIA1 POS/SIZE and do “exact” matching locally on the coarser SIA1 result.

    • Enable multi-position or multi-interval constraints. This pretty certainly would exclude SIA1, and, realistically, I'd probably only enable Obscore services with TAP uploads with this. With those constraints, it would be rather straightforward.

    • Add SODA support: It would be cool if my ImageFound had a way to say “retrieve data for my RoI only”. This would use SODA and datalink to do server-side cutouts where available and do the cut-out locally otherwise. If this sounds like rocket science: No, the standards for that are actually in place, and pyVO also has the necessary support code. But still the plumbing is somewhat tricky, partly also because pyVO's datalink API still is a bit clunky.

    • Going async? Right now, we civilly query one service after the other, waiting for each result before proceeding to the next service. This is rather in line with how pyVO is written so far.

      However, on the network side for many years asynchronous programming has been a very successful paradigm – for instance, our DaCHS package has been based on an async framework from the start, and Python itself has growing in-language support for async, too.

      Async allows you to fire off a network request and forget about it until the results come back (yes, it's the principle of async TAP, too). That would let people run many queries in parallel, which in turn would result in dramatically reduced waiting times, while we can rather easily ensure that a single client will not overflow any server. Still, it would be handing a fairly powerful tool into possibly inexperienced hands… Well: for now there is no need to decide on this, as pyVO would need rather substantial upgrades to support async.
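
      To give an idea of the async pattern in question, here is a sketch using Python's asyncio with a semaphore that caps the number of requests in flight. Nothing in it touches pyVO or the network: the services are faked with asyncio.sleep, all names are made up, and the single global cap is a simplification (a real client would probably rather limit requests per server).

```python
import asyncio


async def query_service(name, delay, limiter, stats):
    """Pretends to query a service; returns (name, number_of_records)."""
    async with limiter:
        stats["running"] += 1
        stats["peak"] = max(stats["peak"], stats["running"])
        # Stands in for the network round trip.
        await asyncio.sleep(delay)
        stats["running"] -= 1
        return name, 0


async def discover_concurrently(services, max_parallel=3):
    """Queries all services with at most max_parallel requests in flight.

    Returns (results, peak), where results is in the order of services
    and peak is the largest number of simultaneous requests observed.
    """
    limiter = asyncio.Semaphore(max_parallel)
    stats = {"running": 0, "peak": 0}
    results = await asyncio.gather(
        *(query_service(name, delay, limiter, stats)
          for name, delay in services))
    return results, stats["peak"]


# Ten fake services, each "answering" after 10 ms.
services = [(f"service-{i}", 0.01) for i in range(10)]
results, peak = asyncio.run(discover_concurrently(services))
print(len(results), "services queried, at most", peak, "at a time")
```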

    Appendix: Optionality Considered Harmful

    The trouble with obscore and cones is a good illustration of the traps of attempting to fix problems by adding optional features. I currently translate the cone constraint on Obscore using:

    "(distance(s_ra, s_dec, {}, {}) < {}".format(
      self.center[0], self.center[1], self.radius)
    +" or 1=intersects(circle({}, {}, {}), s_region))".format(
      self.center[0], self.center[1], self.radius))
    

    which is all of ugly, presumably slow, and wrong.
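
    To see what that fragment produces, here it is again as a standalone function; the function name cone_condition and the use of plain arguments instead of self are the only changes, made so that the example can be run on its own.

```python
def cone_condition(ra, dec, radius):
    """Returns the ADQL condition built by the fragment above."""
    return (
        "(distance(s_ra, s_dec, {}, {}) < {}".format(ra, dec, radius)
        + " or 1=intersects(circle({}, {}, {}), s_region))".format(
            ra, dec, radius))


print(cone_condition(274.6880, -13.7920, 0.1))
```

    For the sample cone, this yields (distance(s_ra, s_dec, 274.688, -13.792) < 0.1 or 1=intersects(circle(274.688, -13.792, 0.1), s_region)).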

    To appreciate what is going on, you need to know that Obscore has two ways to define the spatial coverage of an observation. You can give its “center” (s_ra, s_dec) and something like a rough radius (s_fov), or you can give some sort of geometry (e.g., a polygon: s_region). When the standard was written, the authors wanted to enable Obscore services even on databases that do not know about spherical geometry, and hence s_region is considered rather optional. In consequence, it is missing in many services. And even the s_ra, s_dec, s_fov combo is not mandatory non-null, so you are perfectly entitled to only give s_region.

    That is why there are the two conditions or-ed together (ugly) in the code fragment above. 1=intersects(circle(.), s_region) is the correct part; this is basically how the cone is interpreted in SIA1, too. But because s_region may be NULL even when s_ra and s_dec are given, we also need to do a test based on the center position and the field of view. That rather likely makes things slower, possibly quite a bit.

    Even worse, the distance-based condition actually is wrong. What I really ought to take into account is s_fov and then do something like distance(.) < {self.radius}+s_fov, that is, the dataset position need only be closer than the cone radius plus the dataset's FoV (“intersects”). But that would again produce a lot of false negatives because s_fov may be NULL, too, and often is, after which the whole condition would be false.

    On top of that, it is virtually impossible that such an expression would be evaluated using an index, and hence with this code in place, we would likely be seqscanning the entire obscore table almost every time – which really hurts when you have about 85 million records in your Obscore table (as I do).

    The standard could immediately have sanitised all this by saying: when you have s_ra and s_dec, you must also give a non-empty s_fov and s_region. This is a classic case where a MUST would have been necessary to produce something that is usable without jumping through hoops. See my post on Requirements and Validators on this blog for a longer exposition on this whole matter.

    I'm not sure if there is a better solution than the current “if the operators didn't bother with s_region, the dataset's FoV will be ignored”. If you have good ideas, by all means let me know.

    [1]

    If you want to try this (in particular without clobbering your “normal” pyVO), do something like this:

    virtualenv --system-site-packages global-datasets
    . global-datasets/bin/activate
    cd global-datasets
    git clone https://github.com/msdemlei/pyvo
    cd pyvo
    git checkout global-datasets
    pip install .
    
  • DaCHS 2.9 is out

    Our VO server package DaCHS almost always sees two releases per year, each time roughly after the Interops[1]. So, with the Tucson Interop over, it's time for DaCHS 2.9, and this is the traditional what's new post.

    Data Origin – the big headline for this release could be “curation”, in that three upcoming standardoid entities in that field are prototyped in 2.9. One is Data Origin, which is a note on how to embed some very basic provenance information into VOTables.

    This is going to help your users figure out how they came up with a VOTable when the referee has clever questions about the paper they submitted half a year earlier. The good news is: if you defined your metadata in your RD with sufficient care, with DaCHS 2.9 you will automatically do Data Origin.

    Feed your D links – another curation-related new thing in DaCHS is an implementation of what will hopefully be known as BibVO in the future. At this point, it is an unpublished note on Github. In essence, the purpose is to feed bibliographic services – and in particular the ADS – “D links”, i.e., links from publications to data. A part of this works automatically (the source metadatum), but the more advanced biblinks need a bit of manual intervention.

    If you even have, say, an observatory bibliography consisting of pairs of papers and data used by these papers, you will probably have to write a bit of code. See biblinks in the reference documentation for details if any of this sounds as if it could apply to you. In this context, I have also enabled passing multiple accrefs to the /get endpoint. Users will then receive a tar file of the referenced data products.

    altIdentifiers in relationships – still in the bibliographic realm, VOResource 1.2 will (almost certainly) let you set altIdentifiers, in particular DOIs, when you declare relationships to other resources. That is probably of interest in particular when you want to declare relationships to things outside of the VO to services like b2find that themselves do not understand ivoids. In that situation, you would write something like:

    Cites: Some external thing
    Cites.altIdentifier: doi:10.fake/123412349876
    

    in a <meta> tag in your RD.

    json columns – PostgreSQL has the very tempting and apparently all-powerful json type; it lets you stick complex structures into database columns and thus apparently relieves you of all the tedious tasks of designing database tables and documenting metadata.

    Written like this, you probably notice it's a slippery slope at best. Still, there are some non-hazardous uses for such columns, and thus you can now say type="json" or (probably preferably) type="jsonb" in column definitions. You can feed these columns with dicts, lists or JSON literals in strings. Clients will receive both of them as JSON string literals in char[*] FIELDs with an xtype of json. Neither astropy nor TOPCAT do anything with that xtype yet, but I expect that will change soon.
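
    Until the clients learn about the xtype, decoding such cells by hand is straightforward. The following sketch uses only Python's json module; the cell content is made up for illustration and does not come from any actual service.

```python
import json

# Made-up cell content, standing in for what a client might find in a
# char[*] FIELD with an xtype of json.
cell = '{"filters": ["B", "V"], "exptime": 120.5}'

# Client side: turn the string literal back into Python structures.
value = json.loads(cell)
print(value["filters"], value["exptime"])

# The other direction: a dict serialised into a JSON literal in a string.
literal = json.dumps({"filters": ["B", "V"], "exptime": 120.5})
print(literal)
```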

    Copy coverage – sometimes two resources have the same spatial (and potentially temporal and spectral) coverage. Since obtaining the coverage is an expensive operation, it would be nice to be able to say “aw, look at that other resource and take its coverage.” The classic example in DaCHS is the system-wide SIAP2 service that really is just a parametric wrapper around obscore. In such cases, you can now say something like:

    <coverage fallbackTo="__system__/obscore"/>
    

    – and //siap2 already does. That's one more reason to occasionally run dachs limits //obscore if you offer an obscore table.

    First VOTable row in tests – if you have calls to getFirstVOTableRow in regression tests (you have regression tests, right?) that return multiple rows, these will fail now until you also pass rejectExtras=False to that call. I've had regressions that were hidden by the function's liberal acceptance of extra rows, and it's too easy to produce unstable tests (that magically succeed and fail depending on the current state of the database) with the old behaviour. I hence hope for your sympathy and understanding in case I broke one of your tests.

    ADQL extensions – there is now arr_count to complement the array extension added in 2.7. Also, our custom UDFs transform, normal_random, to_jd, to_mjd, and simbadpoint now have a prefix of ivo_ rather than the previous gavo_. In order not to break existing queries, DaCHS will still accept the gavo_-prefixed names for the foreseeable future, but it will no longer advertise them.
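
    If you keep queries in scripts or notebooks and would like to move them to the new names, a simple textual rewrite may do. The sketch below is deliberately naive: it works on the query text with a regular expression, so it would also touch matching words inside string literals, and the function name and the sample query are made up for this example.

```python
import re

# The UDFs named above as having moved from gavo_ to ivo_.
RENAMED_UDFS = ("transform", "normal_random", "to_jd", "to_mjd", "simbadpoint")

_GAVO_UDF = re.compile(
    r"\bgavo_(" + "|".join(RENAMED_UDFS) + r")\b", re.IGNORECASE)


def modernise_udf_names(query):
    """Replaces gavo_-prefixed names of the renamed UDFs with ivo_ ones."""
    return _GAVO_UDF.sub(lambda mat: "ivo_" + mat.group(1), query)


print(modernise_udf_names("SELECT gavo_to_mjd(dateobs) FROM some.table"))
```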

    Minor fixes – as usual, there are many minor bug fixes and improvements, the most visible of which is probably that DaCHS now correctly handles literal + chars in multipart-encoded (“uploads”) requests again; that was broken in 2.8 after the removal of the dependency on Python's cgi module. Also, MOC-valued columns can now be serialised into non-VOTable formats like JSON or CSV.

    If you have been using DaCHS' built-in HTTPS support, certain clients may have rejected its certificates. That was because we were pulling an expired intermediate certificate from letsencrypt. If you don't understand what I was just saying, don't worry. If you do understand that and know a good way to avoid this kind of calamity in the future, I'm grateful for advice.

    VCS move – when DaCHS was born, using the venerable subversion for version control was considered reputable. These days, fewer and fewer people can still deal with that, and thus I have moved the DaCHS source code into a git repository: https://gitlab-p4n.aip.de/gavo/dachs/.

    I hear you moan “why not github?” Well: don't get me started unless you are prepared to listen to a large helping of proselytising. Suffice it to say that we in academia invented the internet (for all intents and purposes) and it's a shame that we now rely so much on commercial entities to provide our basic services (and then without paying them, as a rule, which is always a dangerous proposition towards commercial entities).

    Anyway: Feel free to use that service's bug tracker; we try to find ways to let you log in there without undue hardship, too.

    At this point, I customarily urge: don't wait, upgrade. If you have our Debian repository enabled, apt update && apt upgrade should do the trick, except if you missed our announcement on dachs-users that our repository key has changed. If you have not updated it, please have a look at our repo page to see what needs to be done. Sorry about this, but our old 1024D key was being frowned upon, so we had to do something.

    Unless you are an old hand and have upgraded many times before, let me recommend a quick glance at our upgrading guide before doing the actual upgrade.

    [1]The reason we wait for the Interops is that we are generally promising to put something into DaCHS at or around these conferences. This time, the preliminary support for json-typed database columns is an example for that.
  • DaCHS 2.8 is out

    Today, I have released DaCHS 2.8 and uploaded it to our APT repository; it should also appear in Debian unstable within the next two weeks. This is the traditional post on what is new in this release.

    If I had to name the highlights of what was added since version 2.7, released last November, I would probably say it's HiPS support and the general move towards SIAPv2, although I would have to admit that both did not involve large amounts of code, in particular when compared to the various changes related to COOSYS and TIMESYS.

    So, what about HiPS support? As you probably know, HiPSes are zoomable images (or catalogues, too); if you have a survey-like image collection published through SIAP, you owe it to yourself to have a look at this.

    Given HiPSes are so interactive in Aladin and the like, it may be surprising that they do not really require an active server component: technically, they are just a directory tree created and organised in a very clever way. So, why would DaCHS have a HiPS renderer and boast about it? Well, there are a few amenities (such as auto-generated hips.params files and properties once you have your RD), and DaCHS will care about the Registry side of a HiPS publication. For details, see the HiPS section in the tutorial.

    The SIAP2 story is that (against my rather substantial skepticism) people insisted on creating a new image search protocol in the early 2010s. Since it doesn't have tangible benefits over the venerable SIA1 and even less over Obscore, DaCHS so far has limited its support for SIAP2 to a single global SIAP2 service based on the Obscore table. But then SIAP1 with its stinky UCDs does show its age, and since support for SIAP2 in various clients has been falling into place over the last few years, DaCHS now nudges you to publish your images through SIAP2, for instance by producing a template for a SIAP2 service in dachs start.

    SIAP2 is also what the image section of the tutorial now reflects. If you already have SIAP1 services, the migration should not be hard (except where you used the siapCutoutCore), but given occasional shakiness in the SIAP2 support of the various tools, I'd still wait for a year or two; I have certainly no plans to remove SIAP1 from DaCHS within the next ten years or so. If you still want to migrate, feel free to ask for a section on doing so in DaCHS' How Do I? document.

    From the department of “this update may break your service”: If you have SODA cutouts of cubes, this update will rather likely break the cutout on the non-spatial axis. To fix things, if that axis is spectral, pass its index in a spectralAxis parameter to //soda#fits_standardDLFuncs (or to //soda#fits_makeWCSParams, if that's what you use)[1]. On the other hand, you can now define a velocityAxis, too (and for other cases, there is still axisMetaOverrides).

    Among the more generally interesting new features may be the UnionGrammar. This is for when you have multiple sorts of inputs that require different parsers, for instance, when the data provider changes the formats in which they deliver the data in the midst of a project. I would hope the example from the unionGrammar documentation illustrates what this could be useful for:

    <unionGrammar>
      <handles pattern=".*\.txt$">
        <reGrammar...>
      </handles>
      <handles pattern=".*\.csv$">
        <csvGrammar...>
      </handles>
    </unionGrammar>
    

    Also note that you can create some uniformity between what the grammars yield (and thus avoid a lot of if-else-ing in the rowmaker) by using rowfilters.

    I would have needed the union grammar several times before but had always quickly hacked around that need with some custom grammar. Another itch that has in this way come up multiple times before and for which 2.8 has what I think is a reasonable solution: I occasionally want to share some logic between multiple RDs, but that logic is not general enough to go into DaCHS itself. For such situations, you can now drop a file local.py into your configuration directory (usually, /var/gavo/etc).

    In code saying from gavo import api (which is what you should in general do when programming against DaCHS; in procs, say <setup imports="gavo.api"/>), you can then access the names defined in there as api.local.<name>. For instance (and that's not contrived), say your observers have several particularly babylonian ways of writing times, and you have to parse these in several data collections (i.e., RDs). You could then add a function like this to your local.py:

    import re

    def parse_babylonian_time(raw_time:str) -> float:
      """Tries to interpret raw_time as a time in one of the many forms
      our observers like so much.
    
      Here are the syntaxes supported by the function:
    
      >>> parse_babylonian_time("1h")
      3600.0
      >>> parse_babylonian_time("4h30m")
      16200.0
      >>> parse_babylonian_time("1h30m20s")
      5420.0
      >>> parse_babylonian_time("20m")
      1200.0
      >>> parse_babylonian_time("10.5m")
      630.0
      >>> parse_babylonian_time("1m10s")
      70.0
      >>> parse_babylonian_time("15s")
      15.0
      >>> parse_babylonian_time("s23m")
      Traceback (most recent call last):
      ValueError: Cannot understand time 's23m'
      """
      mat = re.match(
        r"^(?P<hours>\d+(?:\.\d+)?h)?"
        r"(?P<minutes>\d+(?:\.\d+)?m)?"
        r"(?P<seconds>\d+(?:\.\d+)?s)?$", raw_time)
      if mat is None:
        raise ValueError(f"Cannot understand time '{raw_time}'")
      parts = mat.groupdict()
    
      return (float((parts["hours"] or "0h")[:-1])*3600
        + float((parts["minutes"] or "0m")[:-1])*60
        + float((parts["seconds"] or "0s")[:-1]))
    

    (or something similarly abominable). That way, the function is available to all RDs, there is just one implementation to maintain, and it can be centrally tested (dachs test could certainly do with a facility to execute local.py doctests, too).
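
    Until dachs test grows such a facility, the doctests in local.py can be run by hand with Python's standard doctest module. The following is only a minimal sketch: run_local_doctests is a hypothetical helper name and not part of DaCHS, and the default path simply assumes the usual configuration directory mentioned above.

```python
import doctest
import importlib.util


def run_local_doctests(path="/var/gavo/etc/local.py"):
    """Loads the Python file at path as a module named "local" and
    runs all doctests found in it.

    Returns a pair (failed, attempted).  This is a stand-alone helper,
    not a DaCHS API.
    """
    spec = importlib.util.spec_from_file_location("local", path)
    module = importlib.util.module_from_spec(spec)
    # Executing the module makes its functions (and docstrings) available.
    spec.loader.exec_module(module)
    results = doctest.testmod(module, verbose=False)
    return results.failed, results.attempted
```

    Calling run_local_doctests() on the DaCHS machine would print doctest's usual failure reports and return the counts; if local.py imports from gavo, this of course needs to run in an environment where DaCHS is importable.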

    DaCHS 2.8 also comes with yet another way to declare space-time metadata. That's a longer story, and while all this should have happened 10 years ago, there's no particular hurry now. I will therefore write about improvements in TIMESYS and COOSYS in a later post dedicated to votable:Coords and its products. Meanwhile, just two things: In the unlikely case you already have “stc2“ annotations in your RDs, you will have to rename the value attribute in space clauses to location. And: SSAP and SIAP now produce proper TIMESYS-es. If you happen to know the timescales and reference positions of your observation dates, starting in 2.8 you can define them in the respective mixins (the refposition and timescale mixin parameters).

    There are two notable additions in DaCHS' Datalink support (which is newly declared to support version 1.1): For one, you can now pass contentQualifier to descriptor.makeLink[FromFile], which will normally be a product type taken from http://www.ivoa.net/rdf/product-type (e.g., “image” or “dynamic-spectrum“). Because they can help clients select appropriate clients to send a datalink to, it is certainly a good thing to add them to your datalinks where applicable.

    Also, datalink meta makers can now return ProcLinkDef instances. This lets you have multiple distinct processing services within a single Datalink document. To make that a bit prettier, there is also a secret handshake (as in: an INFO element with a name of title) between DaCHS' datalink service and the XSLT that formats datalink documents in browsers (also available for third-party datalink documents). See multiple processing services in the reference for details.

    Let me briefly mention a few more changes you may be interested in:

    • condDescs can now be declared as inputOptional, which is useful when you want to have syntax-adaptive defaults.
    • you can now configure the size of DaCHS connection pools in [db]poolSize (in particular, set it to 0 to disable connection pooling).
    • in ADQL, you can now do things like CONTAINS(CIRCLE(23, 42, 1), some_moc) (i.e., compute boolean predicates between the classical geometries and MOCs).
    • DaCHS no longer fails with numpy versions later than 1.23 and no longer depends on the cgi module, which is scheduled for removal from Python. In consequence, there is a new dependency, python3-multipart.
    [1]That is, unless you already defined spectralAxis because DaCHS' heuristics were wrong before version 2.8. But then your service won't break, either.
  • DaCHS is now at Version 2.7

    Logo-ish 2.7 with a multi-array plot

    Last Friday, I released Version 2.7 of GAVO's Virtual Observatory server package DaCHS. As is customary, I will give a brief overview of the more noteworthy changes in this blog post. This is probably only of interest to people running DaCHS-based data centres. What I discuss here is both a bit more verbose and a bit less extensive than what you find in the Changes file (when installed from package, you would read it by running zless /usr/share/doc/python3-gavo/changelog.gz).

    From my point of view, the highlight of this release is simple, numpy-like vector operations in ADQL. Regular readers of this blog will already have seen an example for their use. This is altogether a prototype, which is why what specification there is lives only on the IVOA wiki. It is thus likely that some details of the vector math will change before they make it into any sort of standard (I am hoping for ADQL 2.2). This should not keep you from trying it out and telling your users about it.

    In that same vein, the FITS binary table grammar now copes with vectors, which makes it easier to populate tables that make these operations useful, and for the sort of large tables where the array magic has particularly much promise, it is now a lot simpler to feed array-valued columns with C boosters.

    Other ADQL work includes the addition of proper, standards-compliant epoch propagation (i.e., “application of proper motion and radial velocity“) in the form of the ivo_epoch_prop and ivo_epoch_prop_pos user defined functions. Regrettably, this will not immediately work for you, as it builds on a feature in pgsphere that upstream has not merged yet; comments on that PR will certainly help make that happen. Of course, if you want, you can just build the pgsphere branch containing the new feature yourself. To make up for this complication, DaCHS will no longer advertise UDFs that will not work given the database extensions present – which will help me be a bit more liberal in letting in UDFs wrapping functionality not in Postgres' default distribution in the future.

    If you run datalink services and have multiple items with the same semantics, you may be interested in using local_semantics in Datalink. The use case here is that clients like TOPCAT will remain on, say, light curves in a red filter when the user jumps between records rather than randomly switching between red and blue ones when both have #coderived semantics (Mark's proposal). If you have data of this kind: you can now pass a localSemantics parameter to the makeLink and makeLinkFromFile methods of datalink descriptors; what string you use is up to you, as long as it's the same between similar rows for different datasets.

    I tend to forget that surprisingly many people actually do something with the ADQL form you get on DaCHS' web interface rather than use a TAP client. Well, a DaCHS operator complained about really sub-standard table headings in the HTML tables coming out of this service. Looking again, I had to admit he was right. So, TAP columns now have more meaningful table headings; in particular, if you write expressions, up to a certain length these expressions will be used as table headings. At least in this respect the ADQL form now has an advantage over using a proper client.

    In case you have a processor doing astrometric calibration with astrometry.net (you probably don't because it would have been very hard to make that work without a lot of hacks so far) – have another look at the documentation because I have had various reasons to change api.AnetHeaderProcessor's API in quite a number of ways. It's now a lot easier to use with astrometry.net and source-extractor as distributed by Debian, but I'd still not have broken the API so badly if I had suspected anyone but me had significant code against this.

    I should also warn you that DaCHS now uses astropy to format sexagesimal times and coordinates. This is probably welcome news to those who ever encountered one of DaCHS' 05:59:60 outputs (which happened due to the way it did its rounding). Still, if you have regression tests testing for strings like that, you will need to update them.
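
    To see how an output like 05:59:60 can come about, here is a small sketch of the general mechanism. It is not DaCHS' former formatting code (and not what astropy does internally); naive_hms and carrying_hms are made-up names that merely contrast rounding the seconds last with rounding before splitting into fields.

```python
def naive_hms(hours):
    """Formats decimal hours by splitting into fields first and rounding
    only the seconds at the very end.

    If the seconds round up to 60, nothing carries over into the minutes.
    """
    h = int(hours)
    minutes = (hours - h) * 60
    m = int(minutes)
    s = (minutes - m) * 60
    return "%02d:%02d:%02.0f" % (h, m, s)


def carrying_hms(hours):
    """Rounds to whole seconds first and splits afterwards, so that a
    round-up propagates into minutes and hours.
    """
    total_seconds = round(hours * 3600)
    h, rest = divmod(total_seconds, 3600)
    m, s = divmod(rest, 60)
    return "%02d:%02d:%02d" % (h, m, s)
```

    With these definitions, naive_hms(5.99999) yields 05:59:60, whereas carrying_hms(5.99999) yields 06:00:00. Regression tests that expect strings of the first kind are the ones that will need updating.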

    From the many minor fixes I should probably mention that DaCHS is now ready for Postgres 15 (which will probably be the Postgres version in the next Debian stable). This used to be broken on new installations because Postgres 15 no longer lets normal users write to the public schema. DaCHS needs a database role that can do this, though, because it defines public functions. Since version 2.7, it does the necessary setup to make this possible. If you make your public schema non-world-writable manually – Postgres upgrades will not do that for you, and I would say there is no strong reason to do so for databases backing DaCHS –, do not forget to GRANT ALL ON SCHEMA public TO gavoadmin.

    With this – don't wait, upgrade. If you have GAVO's repository enabled, apt update && apt upgrade would probably do the trick, though of course I recommend having a look at our upgrading guide for robustness and good housekeeping.

  • What's new in DaCHS 2.6

    Rainbowy image with a DaCHS logo

    The transitions of four-times ionised Technetium, with the energies of the lower and upper states on the two axes and the colour a measure of the frequency of the emitted light. Well: DaCHS 2.6 has preliminary support for LineTAP.

    After six months of development, I have just released DaCHS 2.6. This blog post is the traditional discussion of major news for operators of DaCHS-based services. Also have a look at the changelog, which has finally made it to the Debian package; if you installed from package, you can now read it using zless /usr/share/doc/python3-gavo/changelog.gz.

    This post's title picture alludes to LineTAP, an upcoming standard for disseminating data on spectral lines intended to obviate SLAP and play nicely with VAMDC. The standard so far exists only as a rather preliminary draft, but there should be a working draft soon-ish. If you have line data to publish or can get your hands on some, consider trying //linetap#table-0 (the “-0” suggests that there will be changes, but I'd hope not terribly many).

    Quite a few changes resulted from a seemingly minor user request: “How do I put a form interface in front of my EPN-TAP table?“ I rather foolishly chose to use the obscore table as an example, which was about the worst choice I could have made, as ivoa.obscore is a view in DaCHS (which means, for instance, that you can't simply add indexes), and a rather large one in Heidelberg at that (more than 80 Megarows, which means that without indexes, interactive services are impossible).

    The first change in that direction was supporting form conditions over pairs of columns; you need that whenever your table has intervals in column pairs, as for instance em_min/em_max in obscore. With the new code, when users write something like 8000 .. 10000, you can instruct DaCHS to translate that into SQL computing whether or not the intervals overlap.
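
    The overlap test itself is simple: two closed intervals overlap exactly when each one starts before the other one ends. The sketch below spells that out in Python and as an SQL fragment with named placeholders. The function names and the placeholder names lo and hi are assumptions for illustration; this is not necessarily the SQL DaCHS generates.

```python
def intervals_overlap(lo1, hi1, lo2, hi2):
    """Returns True if the closed intervals [lo1, hi1] and [lo2, hi2]
    have at least one point in common.
    """
    return lo1 <= hi2 and lo2 <= hi1


def overlap_condition(min_col, max_col):
    """Returns an SQL condition that is true for rows whose interval
    [min_col, max_col] overlaps the query interval [lo, hi].

    lo and hi are left as psycopg2-style named placeholders so the
    values can be passed in safely as query parameters.
    """
    return f"{min_col} <= %(hi)s AND {max_col} >= %(lo)s"
```

    For a user input of 8000 .. 10000 against the obscore column pair, overlap_condition("em_min", "em_max") would be filled with lo=8000 and hi=10000 (after whatever unit conversion the form requires). Note that both halves of the condition are plain comparisons on a single column, which is what lets ordinary indexes on em_min and em_max help.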

    The spectral queries from that form still timed out, even after I had made sure there were indexes on the larger contributing tables' spectral columns. The reason for that was that the obscore mixin cast the spectral coordinates to double precision[1], and even if there is an index on a real-valued my_col, a condition like:

    my_col::double precision < 4
    

    will not use the index (unless it were over the cast expression, of course). I have hence shortened a few obscore columns (specifically, s_fov, s_resolution, em_min, em_max, em_res_power, and s_pixel_scale) to real; that's what they are in SSAP, and for now I cannot see a case where these would need to be double precision in a discovery protocol.

    Having this service reminded me that registering obscore as an independent resource (rather than just as a table in a tap service's tableset) was something I've been wanting to tackle for quite a time now. This needs proper metadata, in particular coverage metadata. Determining the coverage of obscore is now possible (run dachs limits //obscore), and using codeItems (more or less explicitly), you can inject that metadata where you need it.

    The cover story (“use case,” if you will) underlying this form-based service on top of obscore that started all that was that it was supposed to be friendly to optical astronomers, who by and large are still stuck with Ångström (that is, 10⁻¹⁰ m), and hence I wanted to write the spectral information in Ångström, too. In this case, the old displayUnit display hint would have done (because Obscore uses wavelengths, too), but by the time I noticed that, I had already written a spectralUnit display hint. With that, you can write something like:

    <column name="e_min"
      unit="J"
      description="Lower energy in the spectrum"
      displayHint="spectralUnit=Angstrom"/>
    

    This would convert e_min to Ångström when written to an HTML table (but not otherwise, following the assumption that non-HTML data will be consumed by machines that have no use for legacy units).
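
    For the record, the arithmetic behind a conversion between photon energy and wavelength is just λ = hc/E. The following sketch shows that relation in Python; it is an illustration of the physics and not the code DaCHS uses for the spectralUnit hint, and the function names are made up.

```python
PLANCK_H = 6.62607015e-34  # Planck constant in J s (exact in the SI)
LIGHT_C = 299792458.0      # speed of light in m/s (exact in the SI)
ANGSTROM = 1e-10           # one Ångström in metres


def joule_to_angstrom(energy):
    """Returns the vacuum wavelength in Ångström of a photon with the
    given energy in Joule.
    """
    return PLANCK_H * LIGHT_C / energy / ANGSTROM


def angstrom_to_joule(wavelength):
    """Returns the energy in Joule of a photon with the given vacuum
    wavelength in Ångström.
    """
    return PLANCK_H * LIGHT_C / (wavelength * ANGSTROM)
```

    Since the relation is reciprocal, the lower energy limit of a spectrum corresponds to its upper wavelength limit. That is worth keeping in mind when a column like e_min is displayed in wavelength units.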

    Talking about HTML: If your root template is derived from root-tree.html (it is not unless you made it so), you have to apply a minor update to it; locate the tmpl_resDetails “script” (it's actually some HTML) in /var/gavo/web/templates/root.html. In there, there's a $description, which for the javascript templater that interprets this thing means “insert the content of the description field, properly escaping it”. Since 2.6, however, DaCHS produces these descriptions in HTML. That's progress, since these descriptions often contain links or other formatting. But it means that you have to tell the templater to not escape things: Just write $!description instead.

    There are a few new things you can do in RDs. First, there are relocatable RDs: It is now recommended to have resdir="." in the opening resource (and dachs start's templates are nudging you to do that). Without that, the resource directory defaults to inputsDir/<schema>, which breaks as soon as you need to rename that directory. Now: renaming resource directories is never easy in DaCHS (for instance, because they are reflected in URLs). But for instance with mirrors, or when forking a resource, such renames happen, and relocatable RDs make that a lot simpler. You can obtain the current value of the resource directory from the new \resdir macro.

    Then, by popular request, you can now have index options. If you look at the documentation for create index in the postgres docs, you will notice that there are quite a few things you can do to an index. Acquainting DaCHS' index element with all of these seemed wrong to me, in particular because most of these things are only interesting in rather special circumstances beyond DaCHS' control. Instead, you can now add option elements to an index to change its behaviour, each of which can reflect some postgres configuration item. DaCHS will order your fragments so the resulting command fits Postgres' grammar.

    Since this is somewhat low-level, I recommend isolating the details in userconfig. For instance, you could add streams there saying:

    <STREAM id="staticindex">
      <doc>For indexes on tables that never change, save about 10% storage
      by feeding this.</doc>
      <option>WITH (fillfactor=100)</option>
    </STREAM>
    
    <STREAM id="onfastdisk">
      <doc>FEED this into an index to let it live on a fast disk</doc>
      <option>TABLESPACE fast</option>
    </STREAM>
    

    (the second stream assumes you have set up such a tablespace). You could then configure your indexes like this:

    <index columns="foo">
      <FEED source="%#staticindex"/>
      <FEED source="%#onfastdisk"/>
    </index>
    

    A feature I have put in mainly because of, say, due diligence is that you can now store the administrator password as a hash in /etc/gavo.rc. This has the advantage that people that get to read your configuration cannot (reasonably) become administrators on DaCHS' web interface; I'd consider the hash strong enough that you could put that into version control. Of course, that administrator can't do all that much in the first place.

    The drawback of hashing the admin password is that then DaCHS itself cannot use the password to authenticate against a running server. That is not a disaster, but it will keep it from automatically discarding the root page on changes and automatically clearing a few caches when you import a resource.

    As usual, there are many other changes; let me mention

    • the modern VOTables from SCS I have celebrated here before,
    • the makeIAUId(prefix, long, lat) rowmaker function that makes creating IAU-compliant identifiers a bit simpler,
    • a function utils.formatFloat that may be helpful when producing human-readable floating-point numbers (it's not in gavo.api yet, but I think it will migrate there),
    • the statistics property on columns that you can set to enumerate on TEXT-typed columns to make DaCHS collect preliminary statistics on those (more on that in a later post),
    • the -d option to dachs limits to dump the column statistics DaCHS has gathered (see the DaCHS 2.4 announcement for more on these stats), and
    • that the maximum order of a MOC is now given in ASCII-MOCs DaCHS produces.

    With this: If you have GAVO's repository enabled, you will get DaCHS 2.6 with the next apt upgrade. I will also try to get it into the Debian backports, and if I manage that, you will read about it on this blog.

    [1]

    In case you wonder why it did that: The obscore mixin basically fills out templates like:

    CAST(\em_min AS real) AS em_min,
    CAST(\em_max AS real) AS em_max,
    

    where the macro replacements are taken from whatever you give in the mixin's parameters. Now, if \em_min happens to work out to NULL, Postgres just picks any old type (text, IIRC) for the corresponding column. That is not a problem until the result of that table definition is UNION-ed together with another table where \em_min is a proper floating point number: Postgres will then complain about incompatible types in a union. To avoid that, I must give a type to anything contributing to the obscore view.
