Posts with the Tag DaCHS:

  • Queries Against My Obscore Are Slow!

    Content Warning: This is fairly deep nerd stuff. If you are just a normal VO user, you probably don't want to know about this. You probably even don't want to know about it if you are running a smallish DaCHS site. But perhaps you'll enjoy it anyway.

    Last May, I finally tried to get to the bottom of why certain queries against my obscore table – and in particular some joins I care about – were unnecessarily slow. The immediate use case was that I wanted to join the proposed radio extension for obscore to the main obscore table like this:

    SELECT COUNT(*)
        FROM ivoa.obscore
        JOIN ivoa.obs_radio
        USING (obs_publisher_did)
    

    This looks harmless, in particular since there are almost always indexes on obs_publisher_did columns for operational purposes: DaCHS uses them to locate rows in, for instance, Datalink operation.

    It is not. Harmless, I mean. On the contrary.

    Match Your Types Before You UNION ALL

    The main reason why there is a trap is that ivoa.obscore in DaCHS is a view (i.e., some sort of virtual table defined by a SQL query). This is because typically, multiple data collections contribute, and they can change independently of each other. We do not want to have to rebuild a full obscore table (which has almost 150 million rows in the Heidelberg data centre right now) just because we fix the metadata of a handful of images somewhere.

    Hence, ivoa.obscore is built somewhat like this in DaCHS[1]:

    CREATE OR REPLACE VIEW ivoa.obscore AS
     SELECT 'image'::text AS dataproduct_type,
        NULL::text AS dataproduct_subtype,
        2::smallint AS calib_level,
        'BGDS'::text AS obs_collection,
        ...
     FROM bgds.data
    UNION ALL
      SELECT 'image'::text AS dataproduct_type,
        NULL::text AS dataproduct_subtype,
        3::smallint AS calib_level,
      ...
    [and 42 further subqueries that are union-ed together]
    

    It turns out that this architecture is dangerous in Postgres.

    Laurenz Albe has a writeup on the underlying problem, which he summarises in a cartoon as “Before I UNION ALL you, be sure that your types match”. In short, UNION ALL becomes a planner barrier when the types of the columns of the relations being merged do not exactly match. For this purpose, a bigint is completely different from an integer.

    Full disclosure: it's not like I figured out the applicability of Laurenz' analysis to the DaCHS troubles by myself. It actually took multiple applications of the cluestick by Tom Lane, Laurenz, and others on pgsql-general.

    Known Problem Is Not Solved Problem

    Hence, since May, I have sort-of understood the problem. Fixing it, on the other hand, seemed rather overwhelming given the size of the view and the sometimes multiple levels of view building. In consequence, I procrastinated actually doing something about it until some time last November, when I realised that the computer could support the analysis of which types from which tables fail to match.

    I therefore wrote analyze-obscore.py and added it to the DaCHS repo. It will (presumably) never be part of the DaCHS package, but you can simply run it from a clone of the repo – and should do so if you have an obscore view fed from multiple tables.

    The output then is something like:

    ==== access_estsize ====
    
      bgds.data                      accsize/1024
      danish.data                    accsize/1024
      dfbsspec.ssa                   accsize/1024
      plts.data                      accsize/1024
      emi.main                       access_estsize (bigint)
      rosat.images                   accsize/1024
      califadr3.cubes                10
      robott.data                    accsize/1024
      k2c9vst.timeseries             accsize/1024
      dasch.narrow_plates            access_estsize (bigint)
      onebigb.ssa                    accsize/1024
      [...]
    
    ==== access_format ====
    
      bgds.data                      mime (text)
      danish.data                    mime (text)
      dfbsspec.ssa                   mime (text)
      plts.data                      mime (text)
      emi.main                       access_format (text)
      rosat.images                   mime (text)
      califadr3.cubes                'application/x-votable+xml;content=datalink'
      [...]
    
    ==== calib_level ====
    
      bgds.data                      2
      danish.data                    2
      dfbsspec.ssa                   2
      plts.data                      1
      emi.main                       calib_level (smallint)
      rosat.images                   2
      califadr3.cubes                3
      [...]
    

    and so on. That is: for each table contributing to a column, it either shows the source column together with its type, a literal, or the full expression. Literals are not problematic: as it turns out, DaCHS has always cast them to the appropriate type, so as long as the other source columns match what obscore thinks the columns ought to be, you should be fine.
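    The check analyze-obscore.py automates is conceptually simple. Here is a minimal sketch of the idea in plain Python, with invented table and type data (the real script, of course, extracts all this from the RDs and also deals with literals and expressions):

```python
def find_type_mismatches(contributions):
    """Return the obscore columns whose contributing types disagree.

    contributions maps an obscore column name to a dict of
    {source table: type of what that table contributes}.
    """
    mismatches = {}
    for column, sources in contributions.items():
        if len(set(sources.values())) > 1:
            mismatches[column] = sources
    return mismatches


# invented example data for illustration only
contributions = {
    "access_estsize": {"bgds.data": "integer", "emi.main": "bigint"},
    "calib_level": {"bgds.data": "smallint", "emi.main": "smallint"},
}

print(find_type_mismatches(contributions))
# → {'access_estsize': {'bgds.data': 'integer', 'emi.main': 'bigint'}}
```

    Any column this flags is a candidate for turning the UNION ALL into a planner barrier.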

    Expressions are more difficult. The only way to be sure is to ask Postgres, somewhat like this:

    select pg_typeof(accsize/1024) from bgds.data limit 1
    

    Changing Types En Masse

    In my case, I had lots of inconsistencies between columns coming from SSA and columns coming more directly from obscore-like tables. If you have spectra and other things in one obscore table created by DaCHS <2.12.2, you will have them, too.

    This is because in my obscore implementation I followed the somewhat ill-advised types written down in (but in my reading not actually required by) the obscore specification (p. 21). There is no conceivable scenario that would require more than 2³¹ polarisation states (the pol_xel column, which is supposed to be “adql:BIGINT”), and I do not feel overly future-skeptic when I say that it will also be some time until we have images with a linear dimension of more than two billion pixels. There is also no good reason to have an order-of-magnitude value like em_res_power to 16 significant digits (as implied by “adql:DOUBLE”)[2].
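    For scale: the “two billion” here is just the upper limit of a signed four-byte Postgres integer, which a quick computation confirms:

```python
# Signed 4-byte (integer) vs. 8-byte (bigint) ranges in Postgres
int_max = 2**31 - 1
bigint_max = 2**63 - 1
print(int_max)     # → 2147483647
print(bigint_max)  # → 9223372036854775807
```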

    I have cleaned this up in DaCHS 2.12.2. With this, the types of Obscore and the corresponding columns in SSA and SIAP are consistent within DaCHS' metadata declarations.

    However, the on-disk tables will keep their original types regardless of what DaCHS claims they are. You could fix this by re-importing the tables, but that would take quite a while, at least in my case. I have hence opted for targeted updates.

    The first step in that procedure is to figure out where Postgres' ideas of columns are now different from DaCHS' ideas given the recent metadata updates. For that, dachs val has had the -c (or --compare-db) flag for a long time. Running:

    dachs val -vc ALL
    

    gives you a list of all RDs that need work because the on-disk types (which are what actually determines the query plan) differ from DaCHS' expectations (which, since 2.12.2, are what will fix the UNION ALL trouble). Once they match, you can feel entitled to a good query plan.

    Based on this, I have incrementally built a fixing script on my development system. As I point out towards the end of Publishing a Service in the DaCHS tutorial, the recommended way to run a DaCHS-based data centre is to have test snippets of almost all the resources on the production system on a <cough> development system (presumably: your laptop). That's what I do, and in this way I built this script:

    import subprocess
    
    from gavo import api
    
    with api.getWritableAdminConn() as conn:
            conn.execute("DROP VIEW IF EXISTS ivoa.obscore")
            conn.execute("DROP VIEW IF EXISTS dasch.plates")
    
            for table_name in ["emi.main", "dasch.narrow_plates"]:
                    conn.execute(f"ALTER TABLE {table_name} ALTER t_xel TYPE integer")
    
            for table_name in [
                            "emi.main", "dasch.narrow_plates", "ppakm31.cubes", "applause.main",]:
                    conn.execute(f"ALTER TABLE {table_name} ALTER s_xel1 TYPE integer")
                    conn.execute(f"ALTER TABLE {table_name} ALTER s_xel2 TYPE integer")
    
            for table_name in [
                            "emi.main",
                            "dasch.narrow_plates"]:
                    conn.execute(f"ALTER TABLE {table_name} ALTER pol_xel TYPE integer")
    
            for table_name in [
                            "emi.main",
                            "dasch.narrow_plates",
                            "califadr3.cubes"]:
                    conn.execute(f"ALTER TABLE {table_name} ALTER em_xel TYPE integer")
                    conn.execute(f"ALTER TABLE {table_name} ALTER em_res_power TYPE real")
    
            for table_name in [
                            "emi.main",
                            "dasch.narrow_plates",
                            "ppakm31.cubes",
                            "applause.main",
                            "califadr3.cubes"]:
                    conn.execute(f"ALTER TABLE {table_name} ALTER em_min TYPE real")
                    conn.execute(f"ALTER TABLE {table_name} ALTER em_max TYPE real")
    
            for table_name in [
                            "emi.main",
                            "dasch.narrow_plates",
                            "applause.main"]:
                    conn.execute(f"ALTER TABLE {table_name} ALTER s_resolution TYPE real")
                    conn.execute(f"ALTER TABLE {table_name} ALTER s_pixel_scale TYPE real")
                    conn.execute(f"ALTER TABLE {table_name} ALTER s_fov TYPE real")
    
    
    for rd_id in ["emi/q", "califa/q3", "rome/q", "dasch/q", "ppakm31/q"]:
            subprocess.call(["dachs", "imp", "-m", rd_id])
    
    subprocess.call(["dachs", "imp", "dasch/q", "make-view"])
    subprocess.call(["dachs", "imp", "//obscore"])
    

    As I said: which columns to fix I learned from dachs val -vc; the extra DaCHS operations were necessary because Postgres refused the type changes as long as the views were still defined.

    Success?

    This entire operation has made quite a few obscore queries a lot faster.

    Regrettably, the motivating query, viz.:

    select count(*)
    from ivoa.obscore
    natural join ivoa.obs_radio
    

    is still slow. I have dug a bit into why Postgres does not find the seemingly obvious plan of just materialising the join with the tiny obs_radio table, and contented myself with the note that has been in section 9.21 of the Postgres documentation forever:

    Users accustomed to working with other SQL database management systems might be disappointed by the performance of the count aggregate when it is applied to the entire table. A query like:

    SELECT count(*) FROM sometable;
    

    will require effort proportional to the size of the table: PostgreSQL will need to scan either the entire table or the entirety of an index that includes all rows in the table.

    But at least a query like:

    select dataproduct_type, access_url, t_min, t_max
    from ivoa.obscore
    natural join ivoa.obs_radio
    where t_min between 56000 and 56005
    

    is fast, and until further trouble that's good enough for me.
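    (In case the t_min constraint looks cryptic: obscore times are MJD, i.e., days since 1858-11-17, so 56000 is a date in March 2012. A quick conversion, nothing DaCHS-specific:)

```python
import datetime

MJD_EPOCH = datetime.datetime(1858, 11, 17)  # MJD 0

def mjd_to_datetime(mjd):
    """Turn a Modified Julian Date into a (timezone-naive) datetime."""
    return MJD_EPOCH + datetime.timedelta(days=mjd)

print(mjd_to_datetime(56000))  # → 2012-03-14 00:00:00
```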

    Followup (2026-03-03)

    Well, further trouble was indeed afoot, and with DaCHS 2.12.3 you can therefore materialise your obscore table. This is as simple as saying:

    materialiseObscore: True
    

    in the [ivoa] section of your /etc/gavo.rc and then saying:

    dachs imp //obscore
    dachs limits //obscore
    

    in a shell. For large obscore tables, this will take a while (about 30 minutes for the imp in my data centre). I don't intend to do that more than once a month on average, and while queries to ivoa.obscore will block during that time, I think it's worth it: query plans and all become a lot more readable, and my count(*) query suddenly finishes in less than a second. That's a big win over the several minutes I had before.

    Well: At least I have learned quite a bit about UNION ALL, and also about gathering metadata from many RDs at a time. So, this whole investigation was not a total waste of time.

    And if you have to know: this is not actually a materialised view but rather a normal, full-fledged table. That is because you cannot drop tables that are part of a materialised view, whereas once their rows are in a table, Postgres lets you drop them as you like. And dropping is important if you want to develop your data collections.

    [1]In case you wonder: the individual parts of this union are kept in a table ivoa._obscoresources that you can inspect and even manipulate for special effects. The management of that table is among the more complex things one can do in DaCHS RDs. If you are curious, dachs adm dump //obscore will show you all the magic.
    [2]I put these type names into quotation marks because they were never formally defined. What Obscore does there has been identified as an antipattern in the meantime; newer specifications of similar schemas only distinguish floating point, integral, and string types and leave the choice of lengths to the implementations. If I may say so myself, I like the considerations on types within section 8 of RegTAP.
  • DaCHS 2.12 Is Out

    The DaCHS logo, a badger's head and the text "VO Data Publishing"

    A bit more than one month after the last Interop, I have released the next version of GAVO's data publication package, DaCHS. This is the customary post on what is new in this release.

    There is no major headline for DaCHS 2.12, but there is a fair number of nice conveniences in it. For instance, if you have a collection of time series to publish, the new time series service template might help you. You get it by calling dachs start timeseries; I will give you that it suffers from about the same malady as the existing ssap+datalink one: There is a datalink service built in from the start, which puts up a scary amount of up-front complexity you have to master before you get any sort of gratification.

    There is little we can do about that; the creators of time series data sets just have not come up with a good convention for how to write them. I might be moved to admit that putting them into nice FITS binary tables might count as acceptable. In practice, none of the time series I got from my data providers came in a format remotely fit for distribution. Perhaps Ada's photometric time series convention (which is what you will deliver with the template) is not the final word on how to represent time series, but it is much better than anything else I have seen. Turning what you get from your upstreams into something you can confidently hand out to your users just requires Datalink at this point, I'm afraid[1].

    I will add tutorial chapters on how to deal with the datalink-infested templates one of these days; in them, bulk commenting will play a fairly important role. For quite a while, I have recommended defining a lazy macro with a CDATA section in order to comment out a large portion of an RD. I have changed that recommendation now to open such comments with <macDef raw="True" name="todo"><![CDATA[ and close them with ]]></macDef>. The new (2.12) part is the raw="True". This only means that DaCHS will not try to expand macros within the macro definition. So far, it has done that, and that was a pain for the datalink-infested templates, because there are macro calls in the templates, but some of them will not work in the RD context the macDef is in, which then led to hard-to-understand RD parse errors.

    By the way, in case you would like to write your template to a file other than q.rd (perhaps because there already is one in your resdir), there is now an -o option to dachs start.

    Speaking of convenience, defining spectral coverage has become a lot less of a pain in 2.12. So far, whenever you had to manually define a resource's STC coverage (and that is not uncommon for the spectral axis, where dachs limits often will find no suitable columns or does not find large gaps in observations in multiple narrow bands), you had to turn the Ångströms or GHz into Joule by throwing in the right amounts of c, h, and math operators. Now, you just add the appropriate units in square brackets and let DaCHS work out the rest; DaCHS will also ensure that the lower limit actually is smaller than the upper limit. A resource covering a number of bands in various parts of the spectrum might thus say:

    <coverage>
      <spectral>100[kHz] 21.5[cm]</spectral>
      <spectral>2[THz] 1[um]</spectral>
      <spectral>653[nm] 660[nm]</spectral>
      <spectral>912[Angstrom] 10[eV]</spectral>
      <spectral>20[GeV] 100[GeV]</spectral>
    </coverage>
    

    DaCHS will produce a perfectly viable coverage declaration for the Registry from that.
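    In case you are curious what DaCHS has to work out behind the scenes: conceptually, the conversion is just h·ν for frequencies, h·c/λ for wavelengths, and a constant factor for energies, with the two limits sorted afterwards. A sketch of that principle (this is an illustration, not DaCHS' actual code):

```python
PLANCK = 6.62607015e-34   # J s
C = 299792458.0           # m/s
EV = 1.602176634e-19      # J per eV

def to_joule(value, unit):
    """Convert a spectral quantity in unit to photon energy in Joule."""
    freq = {"Hz": 1, "kHz": 1e3, "MHz": 1e6, "GHz": 1e9, "THz": 1e12}
    wl = {"m": 1, "cm": 1e-2, "mm": 1e-3, "um": 1e-6,
        "nm": 1e-9, "Angstrom": 1e-10}
    en = {"eV": 1, "keV": 1e3, "MeV": 1e6, "GeV": 1e9}
    if unit in freq:
        return PLANCK*value*freq[unit]       # E = h nu
    if unit in wl:
        return PLANCK*C/(value*wl[unit])     # E = h c / lambda
    if unit in en:
        return value*en[unit]*EV
    raise ValueError(f"Unknown spectral unit: {unit}")

def spectral_interval(limit1, limit2):
    """Return (lower, upper) in Joule, ordering the limits as needed."""
    e1, e2 = to_joule(*limit1), to_joule(*limit2)
    return min(e1, e2), max(e1, e2)

lo, hi = spectral_interval((100, "kHz"), (21.5, "cm"))
print(f"{lo:.3e} {hi:.3e}")  # → 6.626e-29 9.239e-25
```

    Note how the sorting matters: in 912[Angstrom] 10[eV] above, the second value is actually the lower energy.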

    Still in the convenience department, I have found myself defining a STREAM (in case you don't know what I'm talking about: read up on them in the tutorial) that creates pairs of columns for a value and its error once too often. Thus, there is now the //procs#witherror stream. Essentially, you can replace the <column in a column definition with <FEED source="//procs#witherror", and you get two columns: one with the name itself, the other with err_ prepended to that name, and both with suitable metadata. For instance:

    <FEED source="//procs#witherror"
      name="rv" type="double precision"
      unit="km/s" ucd="spect.dopplerVeloc"
      tablehead="RV_S"
      description="Radial velocity derived by the Serval pipeline"
      verbLevel="1"/>
    

    You cannot yet have values children with witherror, but it is fairly uncommon for such columns to want them: you won't enumerate values or set null values (things with errors will be floating point values, which have “natural” null values at least in VOTable), and column statistics these days are obtained automatically by dachs limits.
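    If you want to picture what the expansion amounts to: witherror essentially turns one column definition into two. A hypothetical Python rendering of the idea (the exact error UCD and description the real stream generates may differ, and DaCHS of course does this at the XML level):

```python
def with_error(name, ucd="", description="", **metadata):
    """Return (value column, error column) definitions, roughly
    mirroring what //procs#witherror expands to."""
    value_col = dict(metadata, name=name, ucd=ucd, description=description)
    error_col = dict(metadata,
        name="err_" + name,
        ucd="stat.error;" + ucd,
        description="Error in " + description)
    return value_col, error_col

rv, err_rv = with_error("rv", type="double precision", unit="km/s",
    ucd="spect.dopplerVeloc",
    description="Radial velocity derived by the Serval pipeline")
print(rv["name"], err_rv["name"])  # → rv err_rv
```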

    You can take this a step further and put witherror into a LOOP. For instance, to define ugriz photometry with errors, you would write:

    <LOOP>
      <csvItems>
      item, ucd
      u, U
      g, V
      r, R
      i, I
      z, I
      </csvItems>
      <events passivate="True">
        <FEED source="//procs#witherror" name="mag_\item"
          unit="mag" ucd="phot.mag;em.opt.\ucd"
          tablehead="m_\item"
          description="Magnitude in \item band"/>
      </events>
    </LOOP>
    

    There is a difficult part in this: the passivate="True" in the events element. If you like puzzlers, you may want to figure out why that is needed based on what I document about active tags in the reference documentation. Metaprogramming and macros become subtle, and not only in DaCHS.

    Far too few DaCHS operators define examples for their TAP services. Trust me, your users will love them. To ensure that they still are good, you can now pass an -x flag to dachs val (nb not dachs test); that will execute all of the TAP examples defined in the RD against the local server and complain when one does not return at least one valid row. The normal usage would be to say dachs val -x //tap if you define your examples in the userconfig RD; but with hierarchical examples, any RD might contain examples modern TAP clients will pick up.

    There is another option to have an example tested: you could put the query into a macro (remember macDef above?) and then use that macro both in the example and in a regTest element. That is because url attributes now expand macros. That may be useful for other and more mundane things, too; for instance, you could have DaCHS fill in the schema in queries.

    Actual new features in 2.12 are probably not very relevant to average DaCHS operators, at least for now:

    • users can add indexes to their persistent uploads (featured here before)
    • registration of VOEvent streams according to the current VOEvent 2.1 PR (ask if interested; there is minimal documentation on this at this point).
    • an \if macro that sometimes may be useful to skip things that make no sense with empty strings: \if{\relpath}{http://example.edu/p/\relpath} will not produce URLs if relpath is empty.
    • if you have tables with timestamps, it may be worth running dachs limits on them again, as DaCHS will now obtain statistics for them (in MJD, if you have to know) and consequently provide, e.g., placeholders.
    • our spatial WCS implementation no longer assumes the units are degrees (but still that it is dealing with spherical coordinates).
    • when params are array-valued, any limits defined in values are now validated component-wise.

    Finally, if you inspected a diff to the last release, you would see a large number of changes due to type annotation of gavo.base. I have promised my funders to type-annotate the entire DaCHS code (except perhaps for exotic stuff I shouldn't have written in the first place, viz., gavo.stc) in order to make it easier for the community to maintain DaCHS.

    From my current experience, I don't think I will keep this particular promise. After annotating several thousand lines of code, my impression is that annotation is a lot of effort even with automatic annotation helpers (the cases they can handle are the ones that would be reasonably quick for a human, too). The code generally improves in consequence (though not always), but not fundamentally, and it does not become dramatically more readable in most places (there are exceptions to that reservation, though).

    All in all, the cost/benefit ratio just does not seem to be small enough. And: the community members that I want to encourage to contribute code would feel obliged to write type annotations, too, which feels like an extra hurdle I would like to spare them.

    [1]Ok: you could also do an offline conversion of the data collection before ingestion, but I tend to avoid this, partly because I am reluctant to touch upstream data, but in this case in particular because with the current approach it will be much easier to adopt improved serialisations as they become defined.
  • DaCHS 2.11: Persistent TAP Uploads

    The DaCHS logo, a badger's head and the text "VO Data Publishing"

    The traditional autumn release of GAVO's server package DaCHS is somewhat late this year, but not so late that I could not still claim it comes after the Interop. So, here it is: DaCHS 2.11 and the traditional what's new post.

    But first, while I may have DaCHS operators' attention: If you have always wondered why things in DaCHS are as they are, you will probably enjoy the article Declarative Data Publication with DaCHS, which one day will be in the proceedings of ADASS XXXIV (and before that probably on arXiv). You can read it in a pre-preprint version already now at https://docs.g-vo.org/I301.pdf, and feedback is most welcome.

    Persistent TAP Uploads

    The potentially most important new feature of DaCHS 2.11 (in my opinion) will not be news to regular readers of this blog: Persistent TAP Uploads.

    At this point, no client supports this, and presumably when clients do support it, it will look somewhat different, but if you like the bleeding edge and have users that don't mind an occasional curl or requests call, you would be more than welcome to help try the persistent uploads. As an operator, it should be sufficient to type:

    dachs imp //tap_user
    

    To make this more useful, you probably want to hand out proper credentials (make them with dachs adm adduser) to people who want to play with this, and point the interested users to the demo jupyter notebook.

    I am of course grateful for any feedback, in particular on how people find ways to use these features to give operators a headache. For instance, I really would like to avoid writing a quota system. But I strongly suspect I will have to…

    On-loaded Execute-s

    DaCHS has a built-in cron-type mechanism, the execute Element. So far, you could tell it to run jobs every x seconds or at certain times of the day. That is fine for what this was made for: updates of “living” data. For instance, the RegTAP RD (which is what's behind the Registry service you are probably using if you are reading this) has something like this:

    <execute title="harvest RofR" every="40000">
      <job><code>
          execDef.spawnPython("bin/harvestRofR.py")
      </code></job>
    </execute>
    

    This will pull in new publishing registries from the Registry of Registries, though that is tangential; the main thing is that some code will run every 40 kiloseconds (or about 12 hours).

    Against using plain cron, the advantage is that DaCHS knows context (for instance, the RD's resdir is not necessary in the example call), that you can sync with DaCHS' own facilities, and most of all that everything is in one place and can be moved together. By the way, it is surprisingly simple to run a RegTAP service of your own if you already run DaCHS. Feel free to inquire if you are interested.

    In DaCHS 2.11, I extended this facility to include “events” in the life of an RD. The use case seems rather remote from living data: Sometimes you have code you want to share between, say, a datalink service and some ingestion code. This is too resource-bound for keeping it in the local namespace, and that would again violate RD locality on top.

    So, the functions somehow need to sit on the RD, and something needs to stick them there. To do that, I recommended a rather hacky technique with a LOOP with codeItems in the respective howDoI section. But that was clearly rather odious – and fragile on top because the RD you manipulated was just being parsed (but scroll down in the howDoI and you will still see it).

    Now, you can instead tell DaCHS to run your code when the RD has finished loading and everything should be in place. In a recent example I used this to have common functions to fetch photometric points. In an abridged version:

    <execute on="loaded" title="define functions"><job>
      <setup imports="h5py, numpy"/>
      <code>
      def get_photpoints(field, quadrant, quadrant_id):
        """returns the photometry points for the specified time series
        from the HDF5 as a numpy array.
    
        [...]
        """
        dest_path = "data/ROME-FIELD-{:02d}_quad{:d}_photometry.hdf5".format(
          field, quadrant)
        srchdf = h5py.File(rd.getAbsPath(dest_path))
        _, arr = next(iter(srchdf.items()))
    
        photpoints = arr[quadrant_id-1]
        photpoints = numpy.array(photpoints)
        photpoints[photpoints==0] = numpy.nan
        photpoints[photpoints==-9999.99] = numpy.nan
    
        return photpoints
    
    
      def get_photpoints_for_rome_id(rome_id):
        """as get_photpoints, but taking an integer rome_id.
        """
        field = rome_id//10000000
        quadrant = (rome_id//1000000)%10
        quadrant_id = (rome_id%1000000)
        base.ui.notifyInfo(f"{field} {quadrant} {quadrant_id}")
        return get_photpoints(field, quadrant, quadrant_id)
    
      rd.get_photpoints = get_photpoints
      rd.get_photpoints_for_rome_id = get_photpoints_for_rome_id
    </code></job></execute>
    

    (full version; if this is asking you to log in, tell your browser not to wantonly switch to https). What is done here in detail again is not terribly relevant: it's the usual messing around with identifiers and paths and more or less broken null values that is a data publisher's everyday lot. The important thing is that with the last two statements, you will see these functions wherever you see the RD, which in RD-near Python code is just about everywhere.
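    To make the identifier arithmetic concrete, here is the digit splitting from get_photpoints_for_rome_id as a standalone function, applied to an invented identifier:

```python
def split_rome_id(rome_id):
    """Split a rome_id into (field, quadrant, quadrant_id),
    using the decimal digit groups from the execute element above."""
    field = rome_id//10000000
    quadrant = (rome_id//1000000) % 10
    quadrant_id = rome_id % 1000000
    return field, quadrant, quadrant_id

print(split_rome_id(12345678))  # → (1, 2, 345678)
```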

    dachs start taptable

    Since 2018, DaCHS has supported kickstarting the authoring of RDs, which is, I claim, the fun part of a data publisher's tasks, through a set of templates mildly customised by the dachs start command. Nobody should start a data publication with an empty editor window any more. Just pass the sort of data you would like to publish and start answering sensible questions. Well, “sort of data” within reason:

    $ dachs start list
    epntap -- Solar system data via EPN-TAP 2.0
    siap -- Image collections via SIAP2 and TAP
    scs -- Catalogs via SCS and TAP
    ssap+datalink -- Spectra via SSAP and TAP, going through datalink
    taptable -- Any sort of data via a plain TAP table
    

    There is a new entry in this list in 2.11: taptable. In both my own work and watching other DaCHS operators, I have noticed that my advice “if you want to TAP-publish any old material, just take the SCS template and remove everything that has scs in it” was not a good one. It is not as simple as that. I hope taptable fits better.

    A plan for 2.12 would be to make the ssap+datalink template less of a nightmare. So far, you basically have to fill out the whole thing before you can start experimenting, and that is not right. Being able to work incrementally is a big morale booster.

    VOTable 1.5

    VOTable 1.5 (at this point still a proposed recommendation) is a rather minor, cleanup-type update to the VO's main table format. Still, DaCHS has to declare that that is what it produces if we want to be able to declare refposition in COOSYS (which we do). Operators should not notice much of this, but it is good to be aware of the change in case there are overeager VOTable parsers out there or in case you have played with DaCHS' MIVOT generator; in 2.10, you could ask it to do its spiel by requesting the format application/x-votable+xml;version=1.5. In 2.11, it's application/x-votable+xml;version=1.6. If you have no idea what I was just saying, relax. If this becomes important, you will meet it somewhere else.

    Minor Changes

    That's almost it for the more noteworthy news; as usual, there are a plethora of minor improvements, bug fixes and the like. Let me briefly mention a few of these:

    • The ADQL form interface's registry record now includes the site name. In case you are in this list, please say dachs pub //adql after upgrading.
    • More visible legal info, temporal, and spatial coverage in table and service infos; one more reason to regularly run dachs limits!
    • VOUnit's % is now known to DaCHS (it should have been since about 2.9)
    • More vocabulary validation for VOResource generation; so, dachs pub might now complain to you when it previously did not. It is now right and was wrong before.
    • If you annotate a column as meta.bib.bibcode, it will be rendered as ADS links
    • The RD info links to resrecs (non-DaCHS resources, essentially), too.

    Upgrade As Convenient

    As usual, if you have the GAVO repository enabled, the upgrade will happen as part of your normal Debian apt upgrade. Still, if you have not done so recently, have a quick look at upgrading in the tutorial. If, on the other hand, you use the Debian-distributed DaCHS package and you do not need any of the new features, you can let things sit and enjoy the new features after your next dist-upgrade.

  • What's new in DaCHS 2.10

    A part of the IVOA product-type vocabulary, and the DaCHS logo with a 2.10 behind it.

    About twice a year, I release a new version of our VO server package DaCHS; in keeping with tradition, this post summarises some of the more notable changes of the most recent release, DaCHS 2.10.

    productTypeServed

    The next version of VODataService will probably have a new element for service descriptions: productTypeServed. This allows operators to declare what sort of files will come out of a service: images, time series, spectra, or some of the more exotic stuff found in the IVOA product-type vocabulary (you can of course give multiple of these). More on where this is supposed to go is found in my Interop talk on this. DaCHS 2.10 now lets you declare what to put there using a productTypeServed meta item.

    For SIA and SSAP services, there is usually no need to give it, as RegTAP services will infer the right value from the service type. But if you serve, say, time series from SSAP, you can override the inference by saying something like:

    <meta name="productTypeServed">timeseries</meta>
    

    Where this really is important is in obscore, because you can serve any sort of product through a single obscore table. While you could manually declare what you serve by overriding obscore-extraevents in your userconfig RD, this may be brittle and will almost certainly get out of date. Instead, you can run dachs limits //obscore (and you should do that occasionally anyway if you have an obscore table). DaCHS will then feed the meta from what is in your table.

    A related change is that where a piece of metadata is supposed to be drawn from a vocabulary, dachs val will now complain if you use some other identifier. As of DaCHS 2.10 the only metadata item controlled in this way is productTypeServed, though.

    Registering Obscore Tables

    Speaking about Obscore: I have long been unhappy about the way we register Obscore tables. Until now, they rode piggyback in the registry record of the TAP services they were queryable through. That was marginally acceptable as long as we did not have much VOResource metadata specific to the Obscore table. In the meantime, we have coverage in space, time, and spectrum, and there are several meaningful relationships that may be different for the obscore table than for the TAP service. And since 2019, we have the Discovering Data Collections Note that gives a sensible way to write dedicated registry records for obscore tables.

    With the global dataset discovery (discussed here in February) that should come with pyVO 1.6 (and of course the productTypeServed thing just discussed), there even is a fairly pressing operational reason for having these dedicated obscore records. There is a draft of a longer treatment on the background on github (pre-built here) that I will probably upload into the IVOA document repository once the global discovery code has been merged. Incidentally, reviews of that draft before publication are most welcome.

    But what this really means: If you have an obscore table, please run dachs pub //obscore after upgrading (and don't forget to run dachs limits //obscore after you do notable changes to your obscore table).

    Ranking

    Arguably the biggest single usability problem of the VO is <drumroll> sorting! Indeed, it is safe to assume that when someone types “Gaia DR3” into any sort of search mask, they would like to find some way to query Gaia's gaia_source table (and then perhaps all kinds of other things, but those should reasonably be sorted below even mirrors of gaia_source). Regrettably, something like that is really hard to work out across the Registry outside of these very special cases.

    Within a data centre, however, you can sensibly give an order to things. For DaCHS, that in particular concerns the order of tables in TAP clients and the order of the various entries on the root page. For instance, a recent TOPCAT will show the table browser on the GAVO data centre like this:

    Screenshot of a hierarchical display, top-level entries are, in that order, ivoa, tap_schema, bgds, califadr3; ivoa is opened and shows obscore and obs_radio, califadr3 is opened and shows cubes first, then fluxpos tables and finally flux tables.

    The idea is that obscore and TAP metadata are way up, followed by some data collections with (presumably) high scientific value for which we are the primary site; within the califadr3 schema, the tables are again sorted by relevance, as most people will be interested in the cubes first, the somewhat funky fluxpos tables second, and in the entirely nerdy flux tables last.

    You can arrange this by assigning schema-rank metadata at the top level of an RD, and table-rank metadata to individual tables. In both cases, missing ranks default to 10'000, and the lower a rank, the higher up a schema or table will be shown. For instance, dfbsspec/q (if you wonder what that might be: see Byurakan to L2) has:

    <resource schema="dfbsspec">
      <meta name="schema-rank">100</meta>
        ...
        <table id="spectra" onDisk="True" adql="True">
          <meta name="table-rank">1</meta>
    

    This will put dfbsspec fairly high up on the root page, and the spectra table above all others in the RD (which have the implicit table rank of 10'000).

    Note that to make DaCHS notice your rank, you need to dachs pub the modified RDs so the ranks end up in DaCHS' dc.resources table; since the Registry does not much care for these ranks, this is a classic use case for the -k option that preserves the registry timestamp of the resource and will thus prevent a re-publication of the registry record (which wouldn't be a disaster either, but let's be good citizens). Ideally, you assign schema ranks to all the resources you care about in one go and then just say:

    dachs pub -k ALL
    

    The Obscore Radio Extension

    While the details are still being discussed, there will be a radio extension to Obscore, and DaCHS 2.10 contains a prototype implementation for the current state of the specification (or my reading of it). Technically, it comprises a few columns useful for, in particular, interferometry data. If you have such data, take a look at https://github.com/ivoa-std/ObsCoreExtensionForRadioData.git and then consider trying what DaCHS has to offer so far; now is the time to intervene if something in the standard is not quite the way it should be (from your perspective).

    The documentation for what to do in DaCHS is still a bit sparse – in particular, there is no tutorial chapter on obs-radio, nor will there be until the extension has converged a bit more –, but if you know DaCHS' obscore support, you will be immediately at home with the //obs-radio#publish mixin, and you can see it in (very limited) action in the emi/q RD.

    The FITS Media Type

    I have for a long time recommended to use a media type of image/fits for FITS “images” and application/fits for FITS (binary) tables. This was in gross violation of standards: I had freely invented image/fits, and you are not supposed to invent media types without then registering them with the IANA.

    To be honest, the invention was not mine (only). There are applications out there flinging around image/fits types, too, but never mind: It's still bad practice, and DaCHS 2.10 tries to rectify it by first using application/fits even where defaults have been image/fits before, and actually retroactively changing image/fits to application/fits in the database where it can figure out that a column contains a media type.

    DaCHS accepts image/fits as an alias for application/fits in SIAP's FORMAT parameter, and so I hope nothing will break. You may have to adapt a few regression tests, though.

    On the Way To pathlib.Path

    For quite a while, Python has had the pathlib module, which is actually quite nice; for instance, it lets you write dir / name rather than os.path.join(dir, name). I would like to slowly migrate towards Path-s in DaCHS, and thus when you ask DaCHS' configuration system for paths (something like base.getConfig("inputsDir")), you will now get such Path-s.
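    To illustrate what you now get back from the configuration system, here is a minimal sketch; the concrete directory and file names are made up, only the pathlib behaviour itself is what DaCHS relies on:

    ```python
    from pathlib import Path

    # base.getConfig("inputsDir") now returns a Path object; the
    # concrete value here is just an illustration.
    inputs_dir = Path("/var/gavo/inputs")

    # joining with / instead of os.path.join:
    rd_path = inputs_dir / "myres" / "q.rd"
    print(rd_path)         # /var/gavo/inputs/myres/q.rd
    print(rd_path.suffix)  # .rd
    ```
    
    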

    Most operator code, however, is still isolated from that change; in particular, the sourceToken you see in grammars mostly remains a string, and I do not expect that to change for the foreseeable future. This is mainly because the usual string operations many people do to remove extensions and the like (self.sourceToken[:-5]) will fail rather messily with Path-s:

    >>> n = pathlib.Path("/a/b/c.fits")
    >>> n[:-5]
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'PosixPath' object is not subscriptable
    

    So, if you don't call getConfig in any of your DaCHS-facing code, you are probably safe. If you do and get exceptions like this, you know where they come from. The solution, stringification, is rather straightforward:

    >>> str(n)[:-5]
    '/a/b/c'
    
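    If you would rather go Path-native than stringify, pathlib has methods for exactly this sort of extension surgery:

    ```python
    from pathlib import Path

    n = Path("/a/b/c.fits")
    # drop the extension without slicing strings:
    print(n.with_suffix(""))  # /a/b/c
    # or get just the file name without its extension:
    print(n.stem)             # c
    ```
    
    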

    Partly as a consequence of this, there were slight changes in the way processors work. I hope I have not damaged anyone's code, but if you do custom previews and you overrode classify, you will have to fix your code, as that now takes an accref together with the path to be created.

    Odds And Ends

    As usual, there are many minor improvements and additions in DaCHS. Let me mention security.txt support. This complies with RFC 9116 and is supposed to give folks discovering a vulnerability a halfway reliable way to figure out who to complain to. If you try http://<your-hostname>/.well-known/security.txt, you will see exactly what is in https://dc.g-vo.org/.well-known/security.txt. If this is in conflict with some bone-headed security rules your institution may have, you can replace security.txt in DaCHS' central template directory (most likely /usr/lib/python3/dist-packages/gavo/resources/templates/); but in that case please complain, and we will make this less of a hassle to change or turn off.
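    For reference, a minimal security.txt in the shape RFC 9116 prescribes might look like the following; the contact address and expiry date are placeholders, not what DaCHS actually ships:

    ```
    Contact: mailto:security@example.org
    Expires: 2026-01-01T00:00:00Z
    Preferred-Languages: en
    ```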

    You can no longer use dachs serve start and dachs serve stop on systemd boxes (i.e., almost all modern Linux boxes as configured by default). That is because systemd really likes to manage daemons itself, and it gets cross when DaCHS tries to do it itself.

    Also, it used to be possible to fetch datasets using /getproduct?key=some/accref. This was a leftover from some ancient design mistake, and DaCHS has not produced such links for twelve years. I have now removed DaCHS' ability to fetch accrefs from key parameters (the accrefs have been in the path forever, as in /getproduct/some/accref). I consider it unlikely that someone is bitten by this change, but I personally had to fix two ancient regression tests.

    If you use embedded grammars and so far did not like the error messages because they always said “unknown location”, there is help: just set self.location to some string you want to see when something is wrong with your source. For illustration, when your source token is the name of a text file you process line by line, you would write:

    <iterator><code>
      with open(self.sourceToken) as f:
        for line_no, line in enumerate(f):
          self.location = f"{self.sourceToken}, {line_no}"
          # now do whatever you need to do with the line
    </code></iterator>
    

    When regression-testing datalink endpoints, self.datalinkBySemantics may come in handy. This returns a mapping from concept identifiers to lists of matching rows (which often is just one). I have caught myself re-implementing what it does in the tests themselves once too often.

    Finally, and also datalink-related, when using the //soda#fromStandardPubDID descriptor generator, you sometimes want to add just an extra attribute or two, and defining a new descriptor generator class for that seems too much work. Well, you can now define a function addExtras(descriptor) in the setup element and mangle the descriptor in whatever way you like.

    For instance, I recently wanted to enrich the descriptor with a few items from the underlying database table, and hence I wrote:

    <descriptorGenerator procDef="//soda#fromStandardPubDID">
      <bind name="accrefPrefix">"dasch/q/"</bind>
      <bind name="contentQualifier">"image"</bind>
      <setup>
        <code>
          def addExtras(descriptor):
            descriptor.suppressAutoLinks = True
            with base.getTableConn() as conn:
              descriptor.extMeta = next(conn.queryToDicts(
                "SELECT * FROM dasch.plates"
                " WHERE obs_publisher_did = %(did)s",
                {"did": descriptor.pubDID}))
        </code>
      </setup>
    </descriptorGenerator>
    

    Upgrade As Convenient

    That's it for the notable changes in DaCHS 2.10. As usual, if you have the GAVO repository enabled, the upgrade will happen as part of your normal Debian apt upgrade. Still, if you have not done so recently, have a quick look at upgrading in the tutorial. If, on the other hand, you use the Debian-distributed DaCHS package and you do not need any of the new features, you can let things sit and enjoy the new features after your next dist-upgrade.

    Oh, by the way: If you are still on buster (or some other distribution that still has astropy 4): A few (from my perspective minor) things will be broken; astropy is evolving too fast, but in general, I am trying to hack around the changes to make DaCHS work at least with the astropys in oldstable, stable, and unstable. However, where working around a failure seems more trouble than it is worth, I am giving up. If any of the broken things do bother you, do let me know, but also consider installing a backport of astropy 5 or higher – or, better, dist-upgrading to bookworm. Sorry about that.

  • DaCHS 2.9 is out

    Our VO server package DaCHS almost always sees two releases per year, each time roughly after the Interops[1]. So, with the Tucson Interop over, it's time for DaCHS 2.9, and this is the traditional what's new post.

    Data Origin – the big headline for this release could be “curation”, in that three upcoming standardoid entities in that field are prototyped in 2.9. One is Data Origin, which is a note on how to embed some very basic provenance information into VOTables.

    This is going to help your users figure out how they came up with a VOTable when the referee has clever questions about the paper they submitted half a year earlier. The good news is: if you defined your metadata in your RD with sufficient care, with DaCHS 2.9 you will automatically do Data Origin.
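    In the VOTable, this surfaces as a handful of INFO elements. The sketch below follows my reading of the Data Origin note; the exact INFO names and the values are illustrative rather than normative:

    ```
    <!-- Illustrative only: INFO names follow the Data Origin note,
         values are made up -->
    <INFO name="publisher" value="GAVO Data Center"/>
    <INFO name="server_software" value="DaCHS/2.9"/>
    <INFO name="request_date" value="2023-11-20T12:00:00Z"/>
    ```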

    Feed your D links – another curation-related new thing in DaCHS is an implementation of what will hopefully be known as BibVO in the future. At this point, it is an unpublished note on Github. In essence, the purpose is to feed bibliographic services – and in particular the ADS – “D links”, i.e., links from publications to data. A part of this works automatically (the source metadatum), but the more advanced biblinks need a bit of manual intervention.

    If you have, say, an observatory bibliography consisting of pairs of papers and data used by these papers, you will probably have to write a bit of code. See biblinks in the reference documentation for details if any of this sounds as if it could apply to you. In this context, I have also enabled passing multiple accrefs to the /get endpoint. Users will then receive a tar file of the referenced data products.

    altIdentifiers in relationships – still in the bibliographic realm, VOResource 1.2 will (almost certainly) let you set altIdentifiers, in particular DOIs, when you declare relationships to other resources. That is probably of interest in particular when you want to declare relationships to things outside of the VO to services like b2find that themselves do not understand ivoids. In that situation, you would write something like:

    Cites: Some external thing
    Cites.altIdentifier: doi:10.fake/123412349876
    

    in a <meta> tag in your RD.

    json columns – postgresql has the very tempting and apparently all-powerful json type; it lets you stick complex structures into database columns and thus apparently relieve you of all the tedious tasks of designing database tables and documenting metadata.

    Written like this, you probably notice it's a slippery slope at best. Still, there are some non-hazardous uses for such columns, and thus you can now say type="json" or (probably preferably) type="jsonb" in column definitions. You can feed these columns with dicts, lists, or JSON literals in strings. Clients will receive the values as JSON string literals in char[*] FIELDs with an xtype of json. Neither astropy nor TOPCAT do anything with that xtype yet, but I expect that will change soon.
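    In an RD, such a column might be declared like this; the column name and description are made up for illustration:

    ```
    <column name="aux_meta" type="jsonb"
      description="Non-critical auxiliary metadata as JSON"/>
    ```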

    Copy coverage – sometimes two resources have the same spatial (and potentially temporal and spectral) coverage. Since obtaining the coverage is an expensive operation, it would be nice to be able to say “aw, look at that other resource and take its coverage.” The classic example in DaCHS is the system-wide SIAP2 service that really is just a parametric wrapper around obscore. In such cases, you can now say something like:

    <coverage fallbackTo="__system__/obscore"/>
    

    – and //siap2 already does. That's one more reason to occasionally run dachs limits //obscore if you offer an obscore table.

    First VOTable row in tests – if you have calls to getFirstVOTableRow in regression tests (you have regression tests, right?) that return multiple rows, these will fail now until you also pass rejectExtras=False to that call. I've had regressions that were hidden by the function's liberal acceptance of extra rows, and it's too simple to produce unstable tests (that magically succeed and fail depending on the current state of the database) with the old behaviour. I hence hope for your sympathy and understanding in case I broke one of your tests.

    ADQL extensions – there is now arr_count to complement the array extension added in 2.7. Also, our custom UDFs transform, normal_random, to_jd, to_mjd, and simbadpoint now have a prefix of ivo_ rather than the previous gavo_. In order not to break existing queries, DaCHS will still accept the gavo_-prefixed names for the forseeable future, but it will no longer advertise them.

    Minor fixes – as usual, there are many minor bug fixes and improvements, the most visible of which is probably that DaCHS now correctly handles literal + chars in multipart-encoded (“uploads”) requests again; that was broken in 2.8 after the removal of the dependency on python's CGI module. Also, MOC-valued columns can now be serialised into non-VOTable formats like JSON or CSV.

    If you have been using DaCHS' built-in HTTPS support, certain clients may have rejected its certificates. That was because we were pulling an expired intermediate certificate from letsencrypt. If you don't understand what I was just saying, don't worry. If you do understand that and know a good way to avoid this kind of calamity in the future, I'm grateful for advice.

    VCS move – when DaCHS was born, using the venerable subversion for version control was considered reputable. These days, fewer and fewer people can still deal with that, and thus I have moved the DaCHS source code into a git repository: https://gitlab-p4n.aip.de/gavo/dachs/.

    I hear you moan “why not github?” Well: don't get me started unless you are prepared to listen to a large helping of proselytising. Suffice it to say that we in academia invented the internet (for all intents and purposes) and it's a shame that we now rely so much on commercial entities to provide our basic services (and then without paying them, as a rule, which is always a dangerous proposition towards commercial entities).

    Anyway: Feel free to use that service's bug tracker; we try to find ways to let you log in there without undue hardship, too.

    At this point, I customarily urge: don't wait, upgrade. If you have our Debian repository enabled, apt update && apt upgrade should do the trick, except if you missed our announcement on dachs-users that our repository key has changed. If you have not updated it, please have a look at our repo page to see what needs to be done. Sorry about this, but our old 1024D key was being frowned upon, so we had to do something.

    Unless you are an old hand and have upgraded many times before, let me recommend a quick glance at our upgrading guide before doing the actual upgrade.

    [1]The reason we wait for the Interops is that we are generally promising to put something into DaCHS at or around these conferences. This time, the preliminary support for json-typed database columns is an example for that.
