• At the Malta Interop

    A bronze statue of a running man with a newspaper in his hand in front of a massive stone wall.

    The IVOA meets in Malta, which sports lots of walls and fortifications. And a “socialist martyr” boldly stepping forward (like the IVOA, of course): Manwel Dimech.

    It is Interop time again! Most people working on the Virtual Observatory are convened in Malta at the moment and will discuss the development and reality of our standards for the next two days. As usual, I will report here on my thoughts and the proceedings as I go along, even though it will be a fairly short meeting: in northern autumn, the Interop is always back-to-back with ADASS, which means that most participants already have 3½ days of intense meetings behind them and will probably be particularly glad when we conclude the Interop on Sunday noon.

    The TCG discusses (Thursday, 15:00)

    Right now, I am sitting in a session of the Technical Coordination Group, where the chairs and vice-chairs of the Working and Interest Groups meet and map out where they want to go and how it all will fit together. If you look at this meeting's agenda, you can probably guess that this is a roller coaster of tech and paperwork, quickly changing from extremely boring to extremely exciting.

    For me up to now, the discussion about whether or not we want LineTAP at all was the most relevant agenda item; while I do think VAMDC would win by taking up the modern IVOA TAP and Registry standards (VAMDC was forked from the VO in the late 2000s), takeup has been meagre so far, and so perhaps this is solving a problem that nobody feels[1]. I have frankly (almost) only started LineTAP to avoid a SLAP2 with an accompanying data model that would then compete with XSAMS, the data model below VAMDC.

    On the other hand: I think LineTAP works so much more nicely than VAMDC for its use case (identify spectral lines in a plot) that it would be a pity to simply bury it.

    By the way, if you want, you can follow the (public; the TCG meeting is closed) proceedings online; zoom links are available from the programme page. There will be recordings later.

    At the Exec Session (Thursday, 16:45)

    The IVOA's Exec is where the heads of the national projects meet, with the most noble task of endorsing our recommendations and otherwise providing a certain amount of governance. The agenda of Exec meetings is public, and so will the minutes be, but otherwise this again is a closed meeting so everyone feels comfortable speaking out. I certainly will not spill any secrets in this post, but rest assured that there are not many of those to begin with.

    I am in here because GAVO's actual head, Joachim, is not on Malta and could not make it for video participation either. But then someone from GAVO ought to be here, if only because a year down the road we will host the Interop: in the northern autumn of 2025, ADASS and the Interop will take place in Görlitz (regular readers of this blog have heard of that town before), and so I see part of my role in this session as reconfirming that we are on it.

    Meanwhile, the next Interop – and determining places is also the Exec's job – will be at the beginning of June 2025 in College Park, Maryland. So much for avoiding flight shame for me (which I could for Malta, which is still reachable by train and ferry, if not very easily).

    Opening Plenary (Friday 9:30)

    A lecture hall with people, a slide “The University of Malta” at the wall.

    Alessio welcomes the Interop crowd to the University of Malta.

    Interops always begin with a plenary with reports from the various functions: the chair of the Exec, the chair of the Committee on Science Priorities, and the chair of the Technical Coordination Group. Most importantly, though, the chairs of the working and interest groups report on what has happened in their groups in the past semester, and what they are planning for the Interop (“Charge to the working groups”).

    For me personally, the kind words during Simon's State of the IVOA report on my VO lecture (parts of which he has actually reused) were particularly welcome.

    But of course there was other good news in that talk. With my Registry grandmaster hat on, I was happy to learn that NOIRLab has released a simple publishing registry implementation, and that ASVO's (i.e., Australia's) large TAP server will finally be properly registered, too. The prize for the coolest image, though, goes to VO France and in particular their solar system folks, who have used TOPCAT to visualise data on a model of comet 67P Churyumov–Gerasimenko (PDF page 20).

    Self-Agency (Friday, 10:10)

    A slide with quite a bit of text.  Highlighted: “Dropped freq_min/max”

    I have to admit it's kind of silly to pick out this particular point from all the material discussed by the IG and WG chairs in the Charge to the Working Groups, but a part of why this job is so gratifying is experiences of self-agency. I just had one of these during the Radio IG report: They have dropped the duplication of spectral information in their proposed extension to obscore.

    Yay! I have lobbied for that one for a long time on grounds that if there are both em_min/em_max and f_min/f_max in an obscore record (which express the same thing, with em_X being wavelengths in metres, and f_X frequencies in… something else, where proposals included Hz, MHz and GHz), it is virtually certain that at least one pair is wrong. Most likely, both of them will be. I have actually created a UDF for ADQL queries to make that point. And now: Success!
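    To see why keeping both pairs is redundant: wavelength and frequency are related by ν = c/λ, so either one is always computable from the other. A minimal sketch of that conversion in plain Python (no VO machinery, and not the UDF mentioned above):

```python
# nu = c/lambda: wavelength and frequency express the same spectral
# coordinate, which is why storing both em_min/em_max and f_min/f_max
# invites inconsistencies.
C = 299792458.0  # speed of light in m/s


def em_to_freq(em_metres: float) -> float:
    """Convert a wavelength in metres to a frequency in Hz."""
    return C / em_metres


def freq_to_em(f_hz: float) -> float:
    """Convert a frequency in Hz to a wavelength in metres."""
    return C / f_hz


# The 21 cm line is at about 1.43 GHz:
print(round(em_to_freq(0.21) / 1e9, 2))  # prints 1.43
```

    Note that the conversion is an involution: applying it twice gets you back where you started, which makes storing both values pure redundancy.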

    Focus Session: High Energy and Time Domain (Friday, 12:00)

    The first “working” session of the Interop is a plenary on High Energy and Time Domain, that is, instruments that look for messenger particles that may have the energy of a tennis ball, as well as ways to let everyone else know about them quickly.

    Incidentally, that “quickly” is a reason why the two apparently unconnected topics share a session: particles in the tennis ball range are fortunately rare (or our DNA would be in trouble), and so when you have found one, you might want to make sure everyone else gets to look whether something odd shows up where that particle came from in other messengers (as in: optical photons, say). This is also relevant because many detectors in that energy (and particle) range do not have a particularly good idea of where the signal came from, and followups in other wavelengths may help figuring out what sort of thing may have produced a signal.

    I enjoyed a slide by Jutta, who reported on VO publication of km3net data, that is, neutrinos detected in a large detector cube below the Mediterranean sea, using the Earth as a filter:

    Screenshot of a slide: “What we do: Point source analysis, Alerts and follow-ups; What we don't do: Mission planning, Nice pictures.”

    “We don't do pretty pictures” is of course a very cool thing one can say, although I bet this is not 120% honest. But I am willing to give Jutta quite a bit of slack; after all, km3net data is served through DaCHS, and I am still hopeful that we will use it to prototype serving more complex data products than just plain event lists in the future.

    A bit later in the session, an excellent question was raised by Judy Racusin in her talk on GCN:

    A talk slide, with highlighted text: “Big Question: Why hasn't this [VOEvent] continued to serve the needs of various transient astrophysics communities?”

    The background of the question is that there is a rather reasonable standard for the dissemination of alerts and similar data, VOEvent. This has seen quite a bit of takeup in the 2000s, but, as evinced by page 17 of Judy's slides, all the current large time-domain projects decided to invent something new, and it seems each one invented something different.

    I don't have a definitive answer to why and how that happened (as opposed to, for instance, everyone cooperating on evolving VOEvent to match whatever requirements these projects have), although outside pressures (e.g., the rise of Apache Avro and Kafka) certainly played a role.

    I will, however, say that I strongly suspect that if the VOEvent community back then had had more public and registered streams consumed by standard software, it would have been a lot harder for these new projects to (essentially) ignore it. I'd suggest as a lesson to learn from that: make sure your infrastructure is public and widely consumed as early as you can. That ought to help a lot in ensuring that your standard(s) will live long and prosper.

    In Apps I (Friday 16:30)

    I am now in the Apps session. This is the most show-and-telly event you will get at an Interop, with the largest likelihood of encountering the pretty pictures that Jutta had flamboyantly expressed disinterest in this morning. In the first talk already, Thomas delivers with, for instance, mystic pictures from Mars:

    A photo of Olympus Mons on Mars with overplotted contour lines.

    Most of the magic was shown in a live demo; once the recordings are online, consider checking this one out (I'll mention in passing that HiPS2MOC looks like a very useful feature, too).

    My talk, in contrast, had extremely boring slides; you're not missing out at all by simply reading the notes. The message is not overly nice, either: better to have fewer features than optional ones; as a server operator, please take up new standards as quickly as you can; and in the same role, please provide good metadata. This last point happened to be a central message in Henrik's talk on ESASky (which aptly followed mine) as well, and his talk, like Thomas', featured a live performance of eye candy.

    Mario Juric's talk on something called HATS then featured this nice plot:

    A presentation slide headed “partition hierarchically“, with all-sky heatmap featuring pixels of varying size.

    That's Gaia's source catalogue pixelated such that the sources in each pixel require about a constant processing time. The underlying idea, hierarchical tiling, is great and has proved itself extremely capable, not least with HiPS, which is what is behind basically anything in the VO that lets you smoothly zoom, in particular Aladin's maps. HATS' basic premise seems to be to put tables (rather than JPEGs or FITS images as usual) into a HiPS structure. That has been done before, as with the catalogue HiPSes; Aladin users will remember the Gaia or Simbad layers. HATS, now, stores Parquet files, provides Pandas-like interfaces on top of them, and in particular has the nice property of handing out data chunks of roughly equal size.

    That is certainly great, in particular for the humongous data sets that Rubin (née LSST) will produce. But I wondered how well it will stand up when you want to combine different data collections of this sort. The good news: they have already tried it, and they have even thought about how to pack HATS' API behind a TAP/ADQL interface. Excellent!

    Further great news in Brigitta's talk [warning: link to google]: It seems you can now store ipython (“Jupyter”) notebooks in, ah well, Markdown – at least in something that seems version-controllable. Note to self: look at that.

    Data Access Layer (Saturday 9:30)

    I am now sitting in the first session of the Data Access Layer Working Group. This is where we talk about the evolution of the protocols you will use if you “use the VO”: TAP, SIAP, and their ilk.

    Right at the start, Anastasia Laity spoke about a topic that has given me quite a bit of headache several times already: How do you tell simulated data from actual observations when you have just discovered a resource that looks relevant to your research?

    There is prior art for that in that SSAP has a data source metadata item on complete services, with values survey, pointed, custom, theory, or artificial (see also SimpleDALRegExt sect. 3.3, where the operational part of this is specified). But that's SSAP only. Should we have a place for that in registry records in general? Or even at the dataset level? This seems rather related to the recent addition of productTypeServed in the brand-new VODataService 1.3. Perhaps it's time for a dataSource element in VODataService?

    A large part of the session was taken up by the question of persistent TAP uploads that I have covered here recently. I have summarised this in the session, and after that, people from ESAC (who have built their machinery on top of VOSpace) and CADC (who have inspired my implementation) gave their takes on the topic of persistent uploads. I'm trying hard to like ESAC's solution, because it is using the obvious VO standard for users to manage server-side resources (even though the screenshot in the slides,

    A cutout of a presentation slide showing a browser screenshot with a modal dialog with a progress bar for an upload.

    suggests it's just a web page). But then it is an order of magnitude more complex in implementation than my proposal, and the main advantage would be that people can share their tables with other users. Is that a use case important enough to justify that significant effort?

    Then Pat's talk on CADC's perspective presented a hierarchy of use cases, which perhaps offers a way to reconcile most of the opinions: is there a point in having the same API on /tables and /user_tables, depending on whether we want the tables to be publicly visible?

    Data Curation and Preservation (Saturday, 11:15)

    This Interest Group's name sounds like something only a librarian could become agitated about: Data curation and preservation. Yawn.

    Fortunately, I consider myself a librarian at heart, and hence I am participating in the DCP session now. In terms of engagement, we have already started to quarrel about a topic that must seem rather like bikeshedding from the outside: should we bake the DOI resolver into the way we write DOIs (like http://doi.org/10.21938/puTViqDkMGcQZu8LSDZ5Sg; actually, since a few years: https instead of http?) or should we continue to use the doi URI scheme, as we do now: doi:10.21938/puTViqDkMGcQZu8LSDZ5Sg?

    This discussion came up because the doi foundation asks you to render DOIs in an actionable way, which some people understand as them asking people to write DOIs with their resolver baked in. Now, I am somewhat reluctant to do that, mainly on grounds of user freedom. Sure, as long as you consider the whole identifier an opaque string, their resolver is not actually implied, but that's largely fictitious, as evinced by the fact that somehow identifiers with http and with https would generally be considered equivalent. I do claim that we should make it clear that alternative resolvers are totally an option. Including ours: RegTAP lets you resolve DOIs to ivoids and VOResource metadata, which to me sounds like something you might absolutely want to do.
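    To make the RegTAP route concrete, here is a hedged sketch of such a resolution: RegTAP 1.1 registries keep alternate identifiers (DOIs among them) in the rr.alt_identifier table. The DOI is the one from the text; whether your registry of choice has ingested it is, of course, another matter.

```python
# Sketch: build the ADQL that resolves a DOI to ivoids via RegTAP's
# rr.alt_identifier table (RegTAP 1.1).  Only the query construction
# runs here; the actual TAP call needs network access and pyvo.

def doi_to_ivoid_query(doi: str) -> str:
    """Return ADQL mapping a DOI (in doi: URI form) to registry ivoids."""
    # Double single quotes to keep the literal safe in ADQL.
    return ("SELECT ivoid FROM rr.alt_identifier"
            " WHERE alt_identifier='{}'".format(doi.replace("'", "''")))


query = doi_to_ivoid_query("doi:10.21938/puTViqDkMGcQZu8LSDZ5Sg")
print(query)

# To actually run it against a RegTAP endpoint (network required):
# import pyvo
# svc = pyvo.dal.TAPService("http://dc.g-vo.org/tap")
# print(svc.run_sync(query).to_table())
```

    Whether DOIs are stored with or without a resolver baked in is exactly the kind of detail the discussion above is about; the sketch assumes the doi: URI form.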

    Another (similarly biased) point: Not everything on the internet is http. There are other identifier types that are resolvable (including ivoids). Fortunately, writing DOIs as HTTP URIs is not actually what the doi foundation is asking you to do. Thanks to Gus for clarifying that.

    These kinds of questions also turned up in the discussion after my talk on BibVO. Among other things, that draft standard proposes to deliver information on what datasets a paper used or produced in a very simple JSON format. That parsimony has been put into question, and in the end the question is: do we want to make our protocols a bit more complicated to enable interoperability with other “things”, probably from outside of astronomy? Me, I'm not sure in this case: I consider all of BibVO some sort of contract essentially between the IVOA and SciX (née ADS), and I doubt that someone else than SciX will even want to read this or has use for it.

    But then I (and others) have been wrong with predictions like this before.

    Registry (Saturday 14:30)

    Now it's registry time, which for me is always a special time; I have worked a lot on the Registry, and I still do.

    Hence, in Christophe's statistics talk, I was totally blown away by the number of authorities and registries from Germany, considering how small GAVO is. Oh wow. In this graph of authorities in the VO, we are the dark green slice far at the bottom of the pie:

    A presentation slide with two pie charts.  In the larger one, there are many small and a couple of large slices.  A dark green one makes up a bit less than 10%.

    I will give you that, as usual with metrics, to understand what they mean you have to know so much that you then don't need the metrics any more. But again there is an odd feeling of self-agency in that slide.

    The next talk, Robert Nikutta's announcement of generic publishing registry code, was – as already mentioned above – particularly good news for me, because it let me add something particularly straightforward into my overview of OAI-PMH servers for VO use, and many data providers (those unwise enough to not use DaCHS…) have asked for that.

    For the rest of the session, I entertained folks with the upcoming RFC of VOResource 1.2 and the somewhat sad state of affairs in full-text searches in the VO. Hence, I was too busy to report on how gracefully the speaker made his points. Ahem.

    Semantics and Solar System (Saturday 16:30)

    Ha! A session in which I don't talk. That's even more remarkable because I'm the chair emeritus of the Semantics WG and the vice-chair of the Solar Systems IG at the moment.

    Nevertheless, my plan has been to sit back and relax. Except that some of Baptiste's proposals for the evolution of the IVOA vocabularies are rather controversial. I was therefore too busy to add to this post again.

    But at least there is hope to get rid of the ugly “(Obscure)” as the human-readable label of the geo_app reference frame that entered that vocabulary via VOTable; you see, this term was allowed in COOSYS/@system since VOTable 1.0, but when we wrote the vocabulary, nobody who reviewed it could remember what it meant. In this session, JJ finally remembered. Ha! This will be a VEP soon.

    It was also oddly gratifying to read this slide from Stéphane's talk on fetching data from PDS4:

    A presentation slide with bullet points complaining about missing metadata, inconsistent metadata, and other unpleasantries.

    Lists like these are rather characteristic of a data publisher's diary. Of course, I know that's true. But seeing it in public still gives me a warm feeling of comradeship. Stéphane then went on to tell us how to make the cool 67P images in TOPCAT (I had already mentioned those above when I talked about the Exec report):

    A 3D-plot of an odd shape with colours indicating some physical quantity.

    Operations (Sunday 10.00)

    I am now in the session of the Operations IG, where Henrik is giving the usual VO Weather Report. VO weather reports discuss how many of our services are “valid” in the sense of “will work reasonably well with our clients”. As usual for these kinds of metrics, you need to know quite a bit to understand what's going on and how bad it is when a service is “not compliant”. In particular for the TAP stats, things look a lot bleaker than they actually are:

    A bar graph showing the temporal evolution of the number of TAP servers failing (red), just passing (yellow) or passing (green) validation over the past year or so.  Yellow is king.

    Green is “fully compliant”, yellow is “mostly compliant”, red is “not compliant”. For whatever that means.

    These assessments are based on stilts taplint, which is really fussy (and rightly so). In reality, you can usually use even the red services without noticing something is wrong. Except… if you are not doing things quite right yourself.

    That was the topic of my talk for Ops. It is another outcome of this summer semester's VO course, where students were regularly confused by diagnostics they got back. Of course, while on the learning curve, you will see more such messages than if you are a researcher who is just gently adapting some sample code. But anyway: Producing good error messages is both hard and important. Let me quote my faux quotes in the talk:

    Writing good error messages is great art: Do not claim more than you know, but state enough so users can guess how to fix it.

    —Demleitner's first observation on error messages

    Making a computer do the right thing for a good request usually is not easy. It is much harder to make it respond to a bad request with a good error message.

    —Demleitner's first corollary on error messages

    Later in the session there was much discussion about “denial of service attacks” that services occasionally face. For us, that does not seem to be malicious people in general, but people basically well-meaning but challenged to do the right thing (read documentation, figure out efficient ways to do what they want to do).

    For instance, while far below DoS, turnitin.com was for a while harvesting all VO registry records from some custom, HTML-rendering endpoint every few days, firing off 30'000 relatively expensive (admittedly because I have implemented that particular endpoint in the laziest fashion imaginable) requests in a rather short time. They could have done the same thing using OAI-PMH with a single request that, on top, would have taken up almost no CPU on my side. For the record, it seems someone at turnitin.com has seen the light; at least, as far as I can tell (without actually checking the logs), they don't do that mass harvesting any more. Still, with a single computer, it is not hard to bring down your average VO server, even if you don't plan to.

    Operators that are going into “the cloud” (which is a thinly disguised euphemism for “voluntarily becoming hostages of amazon.com”) or that are severely “encouraged” to do that by their funding agencies have the additional problem that, for them, indiscriminate downloads might quickly become extremely costly on top. Hence, we were talking a bit about mitigations, from HTTP 429 status codes (“too many requests”) to going for various forms of authentication, in particular handing out API keys. Oh, sigh. It would really suck if people ended up needing to get and manage keys for all the major services. Perhaps we should have VO-wide API keys? I already have a plan for how we could pull that off…
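    On the client side, the least a well-behaved harvester can do is honour 429 responses and their Retry-After header. Here is a minimal sketch of such a polite fetch loop; it assumes a requests-style response object (status_code and headers attributes) and is, of course, not mandated by any VO standard:

```python
import time


def fetch_politely(do_request, max_tries=5, default_wait=10):
    """Call do_request() until it succeeds, honouring HTTP 429.

    do_request must return an object with .status_code and .headers
    (requests-style).  This is client-side throttling only; servers
    may of course enforce stricter limits.
    """
    for _ in range(max_tries):
        response = do_request()
        if response.status_code != 429:
            return response
        # Retry-After may be a number of seconds; it may also be an
        # HTTP date, in which case we just fall back to a default pause.
        try:
            wait = int(response.headers.get("Retry-After", default_wait))
        except ValueError:
            wait = default_wait
        time.sleep(wait)
    raise RuntimeError("Server keeps saying 429; giving up for now.")
```

    A harvester built around something like this would have spared my endpoint the 30'000-request bursts mentioned above.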

    Winding down (Monday 7:30)

    The Interop concluded yesterday noon with reports from the working groups and another (short) one from the Exec chair. Phewy. It's been a straining week ever since ADASS' welcome reception almost exactly a week earlier.

    Reviewing what I have written here, I notice I have not even mentioned a topic that pervaded several sessions and many of the chats on the corridors: The P3T, which expands to “Protocol Transition Tiger Team”.

    This was an informal working group that was formed because some adopters of our standards felt that they (um: the standards) are showing their age, in particular because of the wide use of XML and because they do not always play well with “modern” (i.e., web browser-based) “security” techniques, which of course mostly gyrate around preventing cross-site information disclosure.

    I have to admit that I cannot get too hung up on both points; I think browser-based clients should be the exception rather than the norm in particular if you have secrets to keep, and many of the “modern” mitigations are little more than ugly hacks (“pre-flight check“) resulting from the abuse of a system designed to distribute information (the WWW) as an execution platform. But then this ship has sailed for now, and so I recognise that we may need to think a bit about some forms of XSS mitigations. I would still say we ought to find ways that don't blow up all the sane parts of the VO for that slightly insane one.

    On the format question, let me remark that XML is not only well-thought out (which is not surprising given its designers had the long history of SGML to learn from) but also here to stay; developers will have to handle XML regardless of what our protocols do. More to the point, it often seems to me that people who say “JSON is so much simpler” often mean “But it's so much simpler if my web page only talks to my backend”.

    Which is true, but that's because then you don't need to be interoperable and hence don't have to bother with metadata for other peoples' purposes. But that interoperability is what the IVOA is about. If you were to write the S-expressions that XML encodes at its base in JSON, it would be just as complex, just a bit more complicated because you would be lacking some of XML's goodies from CDATA sections to comments.
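    To illustrate the point about S-expressions: a faithful JSON rendering of even a toy XML element tree (the element below is invented for illustration and not any actual VO format) carries the same structure and is not meaningfully simpler:

```python
import json
import xml.etree.ElementTree as ET

# A made-up toy record in XML...
xml_doc = ('<resource ivoid="ivo://example/x">'
           '<title>My Data</title></resource>')
elem = ET.fromstring(xml_doc)

# ...and a JSON encoding of the same element tree: tag, attributes,
# and children all have to reappear, just with different punctuation.
json_doc = json.dumps({
    "tag": elem.tag,
    "attrib": elem.attrib,
    "children": [
        {"tag": c.tag, "attrib": c.attrib, "text": c.text}
        for c in elem]})
print(json_doc)
```

    What JSON cannot carry in this translation, as noted above, are XML's extras such as comments and CDATA sections.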

    Be that as it may, the P3T turned out to do something useful: it tried to write OpenAPI specifications for some of our protocols, and because that alone smoked out some points I would consider misfeatures (case-insensitive parameter names, for starters), it was certainly a worthwhile effort. That, as some people pointed out, you can generate code from OpenAPI is, I think, not terribly valuable: what code that generates probably shouldn't be written in the first place and should rather be replaced by some declarative input (such as, cough, OpenAPI) to a program.

    But I will say that I expect OpenAPI specs to be a great help to validators, and possibly also to implementors because they give some implementation requirements in a fairly concise and standard form.

    In that sense: P3T was not a bad thing. Let's see what comes out of it now that, as Janet also reported in the closing session, the tiger is sleeping:

    A presentation slide with a sleeping tiger and the proclamation that “We feel the P3T has done its job”.
    [1]“feels” as opposed to “has”, that is. I do still think that many people would be happy if they could say something like: “I'm interested in species A, B, and C at temperature T (and perhaps pressure p). Now let me zoom into a spectrum and show me lines from these species; make it so the lines don't crowd too much and select those that are plausibly the strongest with this physics.”
  • A Proposal for Persistent TAP Uploads

    From its beginning, the IVOA's Table Access Protocol TAP has let users upload their own tables into the services' databases, which is an important element of TAP's power (cf. our upload crossmatch use case for a minimal example). But these uploads only exist for the duration of the request. Having more persistent user-uploaded tables, however, has quite a few interesting applications.
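    For readers who have not seen the upload crossmatch pattern: within a request, an uploaded table is visible as TAP_UPLOAD.<name>. The following sketch builds such a query; the remote table, its column names, and the assumption that the upload has ra and dec columns are all illustrative, so adapt them to your service.

```python
# Sketch of the classic TAP upload crossmatch.  Only the query
# construction runs here; the commented-out part shows how one would
# submit it with pyvo (network required).

def make_xmatch_query(remote_table, ra_col="ra", dec_col="dec",
                      radius_deg=1 / 3600.):
    """ADQL joining an uploaded table (as TAP_UPLOAD.mine, with ra/dec
    columns assumed) against a remote catalogue within radius_deg."""
    return (
        "SELECT mine.*, db.* FROM {} AS db"
        " JOIN TAP_UPLOAD.mine AS mine"
        " ON 1=CONTAINS(POINT('ICRS', db.{}, db.{}),"
        " CIRCLE('ICRS', mine.ra, mine.dec, {}))"
    ).format(remote_table, ra_col, dec_col, radius_deg)


print(make_xmatch_query("gaia.dr3lite"))

# import pyvo
# svc = pyvo.dal.TAPService("http://dc.g-vo.org/tap")
# result = svc.run_sync(make_xmatch_query("gaia.dr3lite"),
#                       uploads={"mine": "mytable.vot"})
```

    The crucial limitation, as the text says, is that TAP_UPLOAD.mine ceases to exist once the request is done; that is the gap persistent uploads are meant to fill.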

    Inspired by Pat Dowler's 2018 Interop talk on youcat I have therefore written a simple implementation for persistent tables in GAVO's server package DaCHS. This post discusses what is implemented, what is clearly still missing, and how you can play with it.

    If all you care about is using this from Python, you can jump directly to a Jupyter notebook showing off the features; it by and large explains the same things as this blogpost, but using Python instead of curl and TOPCAT. Since pyVO does not know about the proposed extensions, the code necessarily is still a bit clunky in places, but if something like this will become more standard, working with persistent uploads will look a lot less like black art.

    Before I dive in: This is certainly not what will eventually become a standard in every detail. Do not do large implementations against what is discussed here unless you are prepared to throw away significant parts of what you write.

    Creating and Deleting Uploads

    Where Pat's 2018 proposal re-used the VOSI tables endpoint that every TAP service has, I have provisionally created a sibling resource user_tables – and I found that the usual VOSI tables and the persistent uploads share virtually no server-side code, so for now this seems a smart thing to do. Let's see what client implementors think about it.

    What this means is that for a service with a base URL of http://dc.g-vo.org/tap[1], you would talk to (children of) http://dc.g-vo.org/tap/user_tables to operate the persistent tables.

    As with Pat's proposal, to create a persistent table, you do an http PUT to a suitably named child of user_tables:

    $ curl -o tmp.vot https://docs.g-vo.org/upload_for_regressiontest.vot
    $ curl -H "content-type: application/x-votable+xml" -T tmp.vot \
      http://dc.g-vo.org/tap/user_tables/my_upload
    Query this table as tap_user.my_upload
    

    The actual upload at this point returns a reasonably informative plain-text string, which feels a bit ad-hoc. Better ideas are welcome, in particular after careful research of the rules for 30x responses to PUT requests.

    Trying to create tables with names that will not work as regular ADQL table identifiers will fail with a DALI-style error. Try something like:

    $ curl -H "content-type: application/x-votable+xml" -T tmp.vot
      http://dc.g-vo.org/tap/user_tables/join
    ... <INFO name="QUERY_STATUS" value="ERROR">'join' cannot be used as an
      upload table name (which must be regular ADQL identifiers, in
      particular not ADQL reserved words).</INFO> ...
    
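    The validation behind that error can be sketched in a few lines. This is an illustration, not DaCHS' actual code, and the reserved-word set is a tiny subset of the full list in the ADQL specification:

```python
import re

# Illustrative subset of ADQL reserved words; the real list in the
# ADQL specification is much longer.
ADQL_RESERVED_SUBSET = {
    "select", "from", "where", "join", "table", "group", "order"}

# A regular ADQL identifier: a letter followed by letters, digits,
# or underscores.
REGULAR_IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_]*\Z")


def is_valid_upload_name(name: str) -> bool:
    """Hypothetical check mirroring the error message above."""
    return (bool(REGULAR_IDENTIFIER.match(name))
            and name.lower() not in ADQL_RESERVED_SUBSET)


print(is_valid_upload_name("my_upload"))  # True
print(is_valid_upload_name("join"))       # False
```

    With a check like this, a name such as join or 123abc is rejected before the server ever tries to create a table for it.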

    After a successful upload, you can query the VOTable's content as tap_user.my_upload:

    A TOPCAT screenshot with a query 'select avg("3.6mag") as blue, avg("5.8mag") as red from tap_user.my_upload' that has a few red warnings, and a result window showing values for blue and red.

    TOPCAT (which is what painted these pixels) does not find the table metadata for tap_user tables (yet), as I do not include them in the “public” VOSI tables. This is why you see the reddish syntax complaints here.

    I happen to believe there are many good reasons for why the volatile and quickly-changing user table metadata should not be mixed up with the public VOSI tables, which can be several 10s of megabytes (in the case of VizieR). You do not want to have to re-read that (or discard caches) just because of a table upload.

    If you have the table URL of a persistent upload, however, you can inspect its metadata by GET-ting the table URL:

    $ curl http://dc.g-vo.org/tap/user_tables/my_upload | xmlstarlet fo
    <vtm:table [...]>
      <name>tap_user.my_upload</name>
      <column>
        <name>"_r"</name>
        <description>Distance from center (RAJ2000=274.465528, DEJ2000=-15.903352)</description>
        <unit>arcmin</unit>
        <ucd>pos.angDistance</ucd>
        <dataType xsi:type="vs:VOTableType">float</dataType>
        <flag>nullable</flag>
      </column>
      ...
    

    – this is a response as from VOSI tables for a single table. Once you are authenticated (see below), you can also retrieve a full list of tables from user_tables itself as a VOSI tableset. Enabling that for anonymous uploads did not seem wise to me.

    When done, you can remove the persistent table, which again follows Pat's proposal:

    $ curl -X DELETE http://dc.g-vo.org/tap/user_tables/my_upload
    Dropped user table my_upload
    

    And again, the text/plain response seems somewhat ad hoc, but in this case it is somewhat harder to imagine something less awkward than in the upload case.

    If you do not delete yourself, the server will garbage-collect the upload at some point. On my server, that's after seven days. DaCHS operators can configure that grace period on their services with the [ivoa]userTableDays setting.

    Authenticated Use

    Of course, as long as you do not authenticate, anyone can drop or overwrite your uploads. That may be acceptable in some situations, in particular given that anonymous users cannot browse their uploaded tables. But obviously, all this is intended to be used by authenticated users. DaCHS at this point can only do HTTP basic authentication with locally created accounts. If you want one in Heidelberg, let me know (and otherwise push for some sort of federated VO-wide authentication, but please do not push me).

    To just play around, you can use uptest as both username and password on my service. For instance:

      $ curl -H "content-type: application/x-votable+xml" -T tmp.vot \
      --user uptest:uptest \
      http://dc.g-vo.org/tap/user_tables/privtab
    Query this table as tap_user.privtab
    

    In recent TOPCATs, you would enter the credentials once you hit the Log In/Out button in the TAP client window. Then you can query your own private copy of the uploaded table:

    A TOPCAT screenshot with a query 'select avg("3.6mag") as blue, avg("5.8mag") as red from tap_user.my_upload' that has a few red warnings, and a result window showing values for blue and red; there is now a prominent Log In/Out-button showing we are logged in.

    There is a second way to create persistent tables (one that would also work for anonymous users): run a query and prepend it with CREATE TABLE. For instance:

    A TOPCAT screenshot with a query 'create table tap_user.smallgaia AS SELECT * FROM gaia.dr3lite TABLESAMPLE(0.001)'. Again, TOPCAT flags the create as an error, and there is a dialog "Table contained no rows".

    The “error message” about the empty table here is to be expected; since this is a TAP query, it stands to reason that some sort of table should come back for a successful request. Sending the entire newly created table back without solicitation seems a waste of resources, and so for now I am returning a “stub” VOTable without rows.
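    To make concrete what such a stub might look like: here is a sketch of a rowless VOTable of this kind, with the table and column names taken from the examples above (the exact serialisation DaCHS produces may well differ):

```python
import xml.etree.ElementTree as ET

# A sketch of a rowless "stub" VOTable: all the column metadata is there,
# but the TABLEDATA element is empty. Names follow the examples above.
STUB = """<?xml version="1.0"?>
<VOTABLE version="1.4" xmlns="http://www.ivoa.net/xml/VOTable/v1.3">
  <RESOURCE type="results">
    <TABLE name="tap_user.my_upload">
      <FIELD name="_r" datatype="float" unit="arcmin" ucd="pos.angDistance"/>
      <DATA><TABLEDATA/></DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>"""

NS = {"vot": "http://www.ivoa.net/xml/VOTable/v1.3"}
root = ET.fromstring(STUB)
fields = root.findall(".//vot:FIELD", NS)
rows = root.findall(".//vot:TR", NS)
print(len(fields), len(rows))  # 1 0
```

    A client can hence still pick up the new table's column metadata from the response even though no rows come back.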

    As an authenticated user, you can also retrieve a full tableset of the user-uploaded tables you have:

    $ curl --user uptest:uptest http://dc.g-vo.org/tap/user_tables | xmlstarlet fo
    <vtm:tableset ...>
      <schema>
        <name>tap_user</name>
        <description>A schema containing users' uploads. ...  </description>
        <table>
          <name>tap_user.privtab</name>
          <column>
            <name>"_r"</name>
            <description>Distance from center (RAJ2000=274.465528, DEJ2000=-15.903352)</description>
            <unit>arcmin</unit>
            <ucd>pos.angDistance</ucd>
            <dataType xsi:type="vs:VOTableType">float</dataType>
            <flag>nullable</flag>
          </column>
          ...
        </table>
        <table>
          <name>tap_user.my_upload</name>
          <column>
            <name>"_r"</name>
            <description>Distance from center (RAJ2000=274.465528, DEJ2000=-15.903352)</description>
            <unit>arcmin</unit>
            <ucd>pos.angDistance</ucd>
            <dataType xsi:type="vs:VOTableType">float</dataType>
            <flag>nullable</flag>
          </column>
          ...
        </table>
      </schema>
    </vtm:tableset>
    

    Open Questions

    Apart from the question whether any of this will gain community traction, there are a few obvious open points:

    1. Indexing. For tables of non-trivial sizes, one would like to give users an interface to say something like “create an index over ra and dec interpreted as spherical coordinates and cluster the table according to it”. Because this kind of thing can change runtimes by many orders of magnitude, enabling it is not just some optional embellishment.

      On the other hand, what I just wrote already suggests that even expressing the users' requests in a sufficiently flexible cross-platform way is going to be hard. Also, indexing can be a fairly slow operation, which means it will probably need some sort of UWS interface.

    2. Other people's tables. It is conceivable that people might want to share their persistent tables with other users. If we want to enable that, one would need some interface on which to define who should be able to read (write?) what table, some other interface on which users can find what tables have been shared with them, and finally some way to let query writers reference these tables (tap_user.<username>.<tablename> seems tricky since with federated auth, user names may be just about anything).

      Given all this, for now I doubt that this is a use case sufficiently important to make all the tough nuts delay a first version of user uploads.

    3. Deferring destruction. Right now, you can delete your table early, but you cannot tell my server that you would like to keep it for longer. I suppose POST-ing to a destruction child of the table resource in UWS style would be straightforward enough. But I'd rather wait and see whether the other lacunae require a completely different pattern before I touch this; for now, I don't believe many persistent tables will remain in use beyond a few hours after their creation.

    4. Scaling. Right now, I am not streaming the upload, and several other implementation details limit the size of realistic user tables. Making things more robust (and perhaps scalable) hence will certainly be an issue. Until then I hope that the sort of table that worked for in-request uploads will be fine for persistent uploads, too.

    Implemented in DaCHS

    If you run a DaCHS-based data centre, you can let your users play with the stuff I have shown here already. Just upgrade to the 2.10.2 beta (you will need to enable the beta repo for that to happen) and then type the magic words:

    dachs imp //tap_user
    

    It is my intention that users cannot create tables in your DaCHS database server unless you say these words. And once you say dachs drop --system //tap_user, you are safe from their huge tables again. I would consider any other behaviour a bug – of which there are probably still quite a few. Which is why I am particularly grateful to all DaCHS operators that try persistent uploads now.

    [1]As already said in the notebook, if http bothers you, you can write https, too; but then it's much harder to watch what's going on using ngrep or friends.
  • GAVO at the AG-Tagung in Köln

    People standing and sitting around a booth-like table. There's a big GAVO logo and a big screen on the left-hand side, a guy in a red hoodie is clearly giving a demo.

    As every year, GAVO participates in the fall meeting of the Astronomische Gesellschaft (AG), the association of astronomers working in Germany. This year, the meeting is hosted by the Universität zu Köln (a.k.a. University of Cologne), and I want to start with thanking them and the AG staff for placing our traditional booth smack next to a coffee break table. I anticipate with glee our opportunities to run our pitches on how much everyone is missing out if they're not doing VO while people are queueing up for coffee. Excellent.

    As every year, we are co-conveners for a splinter meeting on e-science and the virtual observatory, where I will be giving a talk on global dataset discovery (you heard it here first; lecture notes for the talk) late on Thursday afternoon.

    And as every year, there is a puzzler, a little problem rather easily solvable using VO tools; I was delighted to see people apparently already waiting for it when I handed out the problem sheet during the welcome reception tonight. You are very welcome to try your hand on it, but you only get to enter our raffle if you are on site. This year, the prize is a towel (of course) featuring a great image from ESA's Mars Express mission, where Phobos floats in front of Mars' limb:

    A 2:1 landscape black-and-white image with a blackish irregular spheroid floating in front of a deep horizon.

    I will update this post with the hints we are going to give out during the coffee breaks tomorrow and on Wednesday. And I will post our solution here late on Thursday.

    At our booth, you will also find various propaganda material, mostly covering matters I have mentioned here before; for posteriority and remoteriority, let me link to PDFs of the flyers/posters I have made for this meeting (with re-usability in mind). To advertise the new VO lectures, I am asking Have you ever wished there was a proper introduction to using the Virtual Observatory? with lots of cool DOIs and perhaps less-cool QR codes. Another flyer trying to gain street cred with QR codes is the Follow us flyer advertising our Fediverse presence. We also still show a pitch for publishing with us and hand out the inevitable who we are flyer (which, I'll readily admit, has never been an easy sell).

    A fediverse screenshot and URIs for following us.

    Bonferroni for Open Data?

    I got a lot more feedback on a real classic that I have shown at many AG meetings since the 2013 Tübingen meeting than on the QR code-heavy posters: Lame excuses for not publishing data.

    A tricky piece of feedback on that was an excuse that may actually be a (marginally) valid criticism of open data in general. You see, in particular in astroparticle physics (where folks are usually particularly uptight with their data), people run elaborate statistics on their results, inspired by the sort of statistics they do in high energy physics (“this is a 5-sigma detection of the Higgs particle”). When you do this kind of thing, you do run into a problem when people run new “tests” against your data because of the way test theory works. If you are actually talking about significance levels, you would have to apply Bonferroni corrections (or worse) when you do new tests on old data.

    This is actually at least not untrue. If you do not account for the slight abuse of data and tests of this sort, the usual interpretation of the significance level – more or less the probability that you will reject a true null hypothesis and thus claim a spurious result – breaks down, and you can no longer claim things like “aw, at my significance level of 0.05, I'll do spurious claims only one out of twenty times tops”.
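    To make the arithmetic concrete: the plain Bonferroni correction simply divides the family-wise significance level by the number of tests. This is a toy sketch for illustration only, not a recipe for actual test design:

```python
# Plain Bonferroni correction: to keep a family-wise significance level
# alpha over m tests, each individual test must clear alpha/m.
def bonferroni_threshold(alpha, m):
    return alpha / m

# Twenty re-uses of a data set at a family-wise level of 0.05 leave each
# individual test with a threshold of 0.0025.
print(bonferroni_threshold(0.05, 20))  # 0.0025
```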

    Is this something people opening their data would need to worry about when they do their original analysis? It seems obvious to me that that's not the case and it would actually be impossible to do, in particular given that there is no way to predict what people will do in the future. But then there are many non-obvious results in statistics going against at least my gut feelings.

    Mind you, this definitely does not apply to most astronomical research and data re-use I have seen. But the point did make me wonder whether we may actually need some more elaborate test theory for re-used open data. If you know about anything like that: please do let me know.

    Followup (2024-09-10)

    The first hint is out. It's “Try TOPCAT's TAP client to solve this puzzler; you may want to look for 2MASS XSC there.” Oh, and we noticed that the problem was stated rather awkwardly in the original puzzler, which is why we have issued an erratum. The online version is fixed, it now says “where we define obscure as covered by a circle of four J-magnitude half-light radii around an extended object”.

    Followup (2024-09-10)

    After our first splinter – with lively discussions on the concept and viability of the “science-ready data” we have always had in mind as the primary sort of thing you would discover in the VO –, I have revealed the second hint: “TOPCAT's Examples button is always a good idea, in particular if you are not too proficient in ADQL. What you would need here is known as a Cone Selection.”

    Oh, in case you are curious where the discussion on the science-ready data gyrated to: Well, the plan of supplying data usable without having to have reduction pipelines in place is a good one. However, there undoubtedly are cases in which transparent provenance and the ability to do one's own re-reductions enable important science. With datalink [I am linking to a 2015 poster on that written by me; don't read that spec just for fun], we have an important ingredient for that. But I give you that in particular the preservation of the software that makes up reduction pipelines is a hard problem. It may even be an impossible problem if “preservation” is supposed to encompass malleability and fixability.

    Followup (2024-09-11)

    I've given the last two hints today: “To find the column with the J half-light radius, it pays to sort the columns in the Columns tab in TOPCAT by name or, for experts using VizieR's version of the XSC, by UCD.” and “ADQL has aggregate functions, which let you avoid downloading a lot of data when all you need are summary properties. This may not matter with what little data you would transfer here, but still: use server-side SUM.”

    Followup (2024-09-12)

    I have published the (to me, physically surprising) puzzler solution to https://www.g-vo.org/puzzlerweb/puzzler2024-solution.pdf. In case it matters to you: The towel went to Marburg again. Congratulations to the winner!

    Followup (2024-09-13)

    On the way home I notice this might be a suitable place to say how I did the QR codes I was joking about above. Basis: The embedding documents are written in LaTeX, and I'm using make to build them. To include a QR code, I am writing something like:

    \includegraphics[height=5cm]{vo-qr.png}
    

    in the LaTeX source, and I am declaring a dependency on that file in the makefile:

    fluggi.pdf: fluggi.tex vo-qr.png <and possibly more images>
    

    Of course, this will error out because there is no file vo-qr.png at that point. The plan is to programmatically generate it from a file containing the URL (or whatever you want to put into the QR code), named, in this case, vo.url (that is, whatever is in front of -qr.png in the image name). In this case, that file contains:

    https://doi.org/10.21938/avVAxDlGOiu0Byv7NOZCsQ
    

    The automatic image generation then is effected by a pattern rule in the makefile:

    %-qr.png: %.url
            python qrmake.py $<
    

    And then all it takes is a short script qrmake.py, which is based on python3-qrcode:

    import sys
    import qrcode

    # Read the payload (here, a URL) from the file given on the command line.
    with open(sys.argv[1], "rb") as f:
        content = f.read().strip()

    # Build the QR code; border=0 because the embedding document already
    # provides whitespace around the image.
    output_code = qrcode.QRCode(border=0)
    output_code.add_data(content)

    # Turn, e.g., vo.url into vo-qr.png, matching the makefile's pattern rule.
    dest_name = sys.argv[1].replace(".url", "")+"-qr.png"
    output_code.make_image().save(dest_name)
    
  • Learn To Use The VO

    Thumbnails of the first 60 pages of the lecture notes, grayish goo with occasional colour spots thrown in.

    The first 60 pages of the lecture notes as they currently are. I give you that a modern textbook would probably look a bit more colorful from this distance, but perhaps this will still do.

    About ten years ago, I had planned to write something I tentatively called VadeVOcum: A guide for people wanting to use the Virtual Observatory somewhat more creatively than just following and slightly adapting tutorials and use cases. If you will, I had planned to write a textbook on the VO.

    For all the usual reasons, that project never went far. Meanwhile, however, GAVO's courses on ADQL and on pyVO grew and matured. When, some time in 2021, I was asked whether I could give a semester-long course “on the VO”, I figured that would be a good opportunity to finally make the pyVO course publishable and complement the two short courses with enough framing that some coherent story would emerge, close enough to the VO textbook I had in mind in about 2012.

    Teaching Virtual Observatory Matters

    The result was a course I taught at Universität Heidelberg in the past summer semester together with Hendrik Heinl and Joachim Wambsganss. I have now published the lecture notes, which I hope are textbooky enough that they work for self-study, too. But of course I would be honoured if the material were used as a basis of similar courses in other places. To make this simpler, the sources are available on Codeberg without relevant legal restrictions (i.e., under CC0).

    The course currently comprises thirteen “lectures”. These are designed so I can present them within something like 90 minutes, leaving a bit of space for questions, contingencies, and the side tracks. You can build the slides for each of these lectures separately (see the .pres files in the source repository), which makes the PDF you work from while teaching less cumbersome. In addition to that main trail, there are seven “side tracks”, which cover more fundamental or more general topics.

    In practice, I sprinkled in the side tracks when I had some time left. For instance, I showed the VOTable side track at the ends of the ADQL 2 and ADQL 3 lectures; but that really had no didactic reason, it was just about filling time. It seemed the students did not mind the topic switches too much. Still, I wonder if I should not bring at least some of the side tracks, like those on UCDs, identifiers, and vocabularies, into the main trail, as it would be unfortunate if their content fell through the cracks.

    Here is a commented table of contents:

    • Introduction: What is the VO and why should you care? (including a first demo)
    • Simple Protocols and their clients (which is about SIAP, SSAP, and SCS, as well as about TOPCAT and Aladin)
    • TAP and ADQL (that's typically three lectures going from the first SELECT to complex joins involving subqueries)
    • Interlude: HEALPix, MOC, HiPS (this would probably be where a few of the other side tracks might land, too)
    • pyVO Basics (using XService objects and a bit of SAMP, mainly along an image discovery task)
    • pyVO and TAP (which is developed around a multi-catalogue SED building case)
    • pyVO and the Registry (which, in contrast to the rest of the course, is employing Jupyter notebooks because much of the Registry API makes sense mainly in interactive use)
    • Datalink (giving a few pyVO examples for doing interesting things with the protocol)
    • Higher SAMP Magic (also introducing a bit of object oriented programming, this is mainly about tool building)
    • At the Limit: VO-Wide TAP Queries (cross-server TAP queries with query building, feature sensing and all that jazz; I admit this is fairly scary and, well, at the limit of what you'd want to show publicly)
    • Odds and Ends (other pyVO topics that don't warrant a full section)
    • Side Track: Terminology (client, server, dataset, data collection, oh my; I had expected this to grow more than it actually did)
    • Side Track: Architecture (a deeper look at why we bother with standards)
    • Side Track: Standards (a very brief overview of what standards the IVOA has produced, with a view of guiding users away from the ones they should not bother with – and perhaps towards those they may want to read after all)
    • Side Track: UCDs (including hints on how to figure out which would denote a concept one is interested in)
    • Side Track: Vocabularies (I had some doubts whether that is too much detail, but while updating the course I realised that vocabularies are now really user-visible in several places)
    • Side Track: VOTable (with the intention of giving people enough confidence to perform emergency surgery on VOTables)
    • Side Track: IVOA Identifiers (trying to explain the various ivo:// URIs users might see).

    Pitfalls: Technical, Intellectual, and Spiritual

    The course was accompanied by lab work, again 90 minutes a week. There are a few dozen exercises embedded in the course, and in the lab sessions we worked on some suitable subset of those. With the particular students I had and the lack of grading pressure, the fact that solutions for most of the exercises come with the lecture notes did not turn out to be a problem.

    The plan was that the students would explain their solutions and, more importantly, the places they got stuck in to their peers. This worked reasonably well in the ADQL part, somewhat less for the side tracks, and regrettably a lot less well in the pyVO part of the course. I cannot say I have clear lessons to be learned from that yet.

    A piece of trouble for the student-generated parts I had not expected was that the projector only interoperated with rather few of the machines the students brought. Coupling computers and projectors was occasionally difficult even in the age of universal VGA. These days, even in the unlikely event one has an adapter for the connectors on the students' computers, there is no telling what part of a computer screen will end up on the wall, which distortions and artefacts will be present and how much the whole thing will flicker.

    Oh, and better forget about trying to fix things by lowering the resolution or the refresh rate or whatever: I have not had one instance during the course in which any plausible action on the side of the computer improved the projected image. Welcome to the world of digital video signals. Next time around, I think I will bring a demonstration computer and figure out a way in which the students can quickly transfer their work there.

    Talking about unexpected technical hurdles: I am employing PDF-attached source code quite extensively in the course, and it turned out that quite a few PDF clients in use no longer do something reasonable with that. With pdf.js, I see why that would be, and it's one extra reason to want to avoid it. But even desktop readers behaved erratically, including some Windows PDF reader that had the .py extension on some sort of blacklist and refused to store the attached files on grounds that they may “damage the computer”. Ah well. I was tempted to have a side track on version control with git when writing the course. This experience is probably an encouragement to follow through with that and at least for the pyVO part to tell students to pull the files out of a checkout of the course's source code.

    Against the outline in the lecture as given, I have now promoted the former HEALPix side track to an interlude session, going between ADQL and pyVO. It logically fits there, and it was rather popular with the students. I have also moved the SAMP magic lecture to a later spot in the course; while I am still convinced it is a cool use case, and giving students a chance to get to like classes is worthwhile, too, it seems to be too much tool building to have much appeal to the average participant.

    As is to be expected, when doing live VO work I regularly had interesting embarrassments. For instance, in the pyvo-tap lecture, where we do something like primitive SEDs from three catalogues (SDSS, 2MASS and WISE), the optical part of the SEDs was suddenly gone in the lecture and I really wondered what I had broken. After poking at things for longer than I should have, I eventually promised to debug after class and report next time, only to notice right after the lecture that I had, to make some now-forgotten point, changed the search position – and had simply left the SDSS footprint.

    But I believe that was actually a good thing, because showing actual errors (it does not hurt if they are inadvertent) and at least brief attempts to understand them (and, possibly later, explain how one actually understood them) is a valuable part of any sort of (IT-related) education. Far too few people routinely attempt to understand what a computer is trying to tell them when it shows a message – at their peril.

    Reruns, House Calls, TV Shows

    Of course, there is a lot more one could say about the VO, even when mainly addressing users (as opposed to adopters). An obvious addition will be a lecture on the global dataset discovery API I have recently discussed here, and I plan to write it when the corresponding code will be in a pyVO release. I am also tempted to have something on stilts, perhaps in a side track. For instance, with a view to students going on to do tool development, in particular stilts' validators would deserve a few words.

    That said, and although I still did quite a bit of editing based on my experiences while teaching, I believe the material is by and large sound and up-to-date now. As I said: everyone is welcome to use the material for tinkering and adoption. Hendrik and I are also open to give standalone courses on ADQL (about a day) or pyVO (two to three days) at astronomical institutes in Germany or elsewhere in not-too remote Europe as long as you house (one of) us. The complete course could be a 10-days block, but I don't think I can be booked with that[1].

    Another option would be a remote-teaching version of the course. Hendrik and I have discussed whether we have the inclination and the resources to make that happen, and if you believe something like that might fit into your curriculum, please also drop us a note.

    And of course we welcome all sorts of bug reports and pull requests on codeberg, first and foremost from people using the material to spread the VO gospel.

    [1]Well… let me hedge that I don't think I'd find a no in myself if the course took place on the Canary Islands…
  • What's new in DaCHS 2.10

    A part of the IVOA product-type vocabulary, and the DaCHS logo with a 2.10 behind it.

    About twice a year, I release a new version of our VO server package DaCHS; in keeping with tradition, this post summarises some of the more notable changes of the most recent release, DaCHS 2.10.

    productTypeServed

    The next version of VODataService will probably have a new element for service descriptions: productTypeServed. This allows operators to declare what sort of files will come out of a service: images, time series, spectra, or some of the more exotic stuff found in the IVOA product-type vocabulary (you can of course give multiple of these). More on where this is supposed to go is found in my Interop talk on this. DaCHS 2.10 now lets you declare what to put there using a productTypeServed meta item.

    For SIA and SSAP services, there is usually no need to give it, as RegTAP services will infer the right value from the service type. But if you serve, say, time series from SSAP, you can override the inference by saying something like:

    <meta name="productTypeServed">timeseries</meta>
    

    Where this really is important is in obscore, because you can serve any sort of product through a single obscore table. While you could manually declare what you serve by overriding obscore-extraevents in your userconfig RD, this may be brittle and will almost certainly get out of date. Instead, you can run dachs limits //obscore (and you should do that occasionally anyway if you have an obscore table). DaCHS will then fill the meta item from what is in your table.

    A related change is that where a piece of metadata is supposed to be drawn from a vocabulary, dachs val will now complain if you use some other identifier. As of DaCHS 2.10 the only metadata item controlled in this way is productTypeServed, though.

    Registering Obscore Tables

    Speaking about Obscore: I have long been unhappy about the way we register Obscore tables. Until now, they rode piggyback in the registry record of the TAP services they were queryable through. That was marginally acceptable as long as we did not have much VOResource metadata specific to the Obscore table. In the meantime, we have coverage in space, time, and spectrum, and there are several meaningful relationships that may be different for the obscore table than for the TAP service. And since 2019, we have the Discovering Data Collections Note that gives a sensible way to write dedicated registry records for obscore tables.

    With the global dataset discovery (discussed here in February) that should come with pyVO 1.6 (and of course the productTypeServed thing just discussed), there even is a fairly pressing operational reason for having these dedicated obscore records. There is a draft of a longer treatment on the background on github (pre-built here) that I will probably upload into the IVOA document repository once the global discovery code has been merged. Incidentally, reviews of that draft before publication are most welcome.

    But what this really means: If you have an obscore table, please run dachs pub //obscore after upgrading (and don't forget to run dachs limits //obscore after you do notable changes to your obscore table).

    Ranking

    Arguably the biggest single usability problem of the VO is <drumroll> sorting! Indeed, it is safe to assume that when someone types “Gaia DR3” into any sort of search mask, they would like to find some way to query Gaia's gaia_source table (and then perhaps all kinds of other things, but those should reasonably be sorted below even mirrors of gaia_source). Regrettably, something like that is really hard to work out across the Registry outside of these very special cases.

    Within a data centre, however, you can sensibly give an order to things. For DaCHS, that in particular concerns the order of tables in TAP clients and the order of the various entries on the root page. For instance, a recent TOPCAT will show the table browser on the GAVO data centre like this:

    Screenshot of a hierarchical display, top-level entries are, in that order, ivoa, tap_schema, bgds, califadr3; ivoa is opened and shows obscore and obs_radio, califadr3 is opened and shows cubes first, then fluxpos tables and finally flux tables.

    The idea is that obscore and TAP metadata are way up, followed by some data collections with (presumably) high scientific value for which we are the primary site; within the califadr3 schema, the tables are again sorted by relevance, as most people will be interested in the cubes first, the somewhat funky fluxpos tables second, and in the entirely nerdy flux tables last.

    You can arrange this by assigning schema-rank metadata at the top level of an RD, and table-rank metadata to individual tables. In both cases, missing ranks default to 10'000, and the lower a rank, the higher up a schema or table will be shown. For instance, dfbsspec/q (if you wonder what that might be: see Byurakan to L2) has:

    <resource schema="dfbsspec">
      <meta name="schema-rank">100</meta>
        ...
        <table id="spectra" onDisk="True" adql="True">
          <meta name="table-rank">1</meta>
    

    This will put dfbsspec fairly high up on the root page, and the spectra table above all others in the RD (which have the implicit table rank of 10'000).
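    In code, the ordering rule is as simple as it sounds; this is a sketch in Python rather than actual DaCHS code, and sort_by_rank is a made-up name:

```python
# Missing ranks default to 10000; lower ranks sort first, name breaks ties.
DEFAULT_RANK = 10000

def sort_by_rank(items):
    # items are (name, rank-or-None) pairs
    return sorted(items,
        key=lambda item: (DEFAULT_RANK if item[1] is None else item[1], item[0]))

print(sort_by_rank([("califadr3", None), ("ivoa", 1), ("dfbsspec", 100)]))
# [('ivoa', 1), ('dfbsspec', 100), ('califadr3', None)]
```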

    Note that to make DaCHS notice your rank, you need to dachs pub the modified RDs so the ranks end up in DaCHS' dc.resources table; since the Registry does not much care for these ranks, this is a classic use case for the -k option that preserves the registry timestamp of the resource and will thus prevent a re-publication of the registry record (which wouldn't be a disaster either, but let's be good citizens). Ideally, you assign schema ranks to all the resources you care about in one go and then just say:

    dachs pub -k ALL
    

    The Obscore Radio Extension

    While the details are still being discussed, there will be a radio extension to Obscore, and DaCHS 2.10 contains a prototype implementation for the current state of the specification (or my reading of it). Technically, it comprises a few columns useful for, in particular, interferometry data. If you have such data, take a look at https://github.com/ivoa-std/ObsCoreExtensionForRadioData.git and then consider trying what DaCHS has to offer so far; now is the time to intervene if something in the standard is not quite the way it should be (from your perspective).

    The documentation for what to do in DaCHS is a bit scarce yet – in particular, there is no tutorial chapter on obs-radio, nor will there be until the extension has converged a bit more –, but if you know DaCHS' obscore support, you will be immediately at home with the //obs-radio#publish mixin, and you can see it in (very limited) action in the emi/q RD.

    The FITS Media Type

    I have for a long time recommended to use a media type of image/fits for FITS “images” and application/fits for FITS (binary) tables. This was in gross violation of standards: I had freely invented image/fits, and you are not supposed to invent media types without then registering them with the IANA.

    To be honest, the invention was not mine (only). There are applications out there flinging around image/fits types, too, but never mind: It's still bad practice, and DaCHS 2.10 tries to rectify it by first using application/fits even where defaults have been image/fits before, and actually retroactively changing image/fits to application/fits in the database where it can figure out that a column contains a media type.

    DaCHS still accepts image/fits as an alias for application/fits in SIAP's FORMAT parameter, so I hope nothing will break. You may have to adapt a few regression tests, though.

    On the Way To pathlib.Path

    For quite a while, Python has had the pathlib module, which is actually quite nice; for instance, it lets you write dir / name rather than os.path.join(dir, name). I would like to slowly migrate towards Path-s in DaCHS, and thus when you ask DaCHS' configuration system for paths (something like base.getConfig("inputsDir")), you will now get such Path-s.
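
    To illustrate the difference (the directory below is just DaCHS' default inputsDir, standing in for whatever base.getConfig returns on your box):

```python
from pathlib import Path

# stand-in for base.getConfig("inputsDir"), which now returns a Path
inputs_dir = Path("/var/gavo/inputs")

# joining becomes an operator rather than an os.path.join call
rd_path = inputs_dir / "myres" / "q.rd"
print(rd_path)  # /var/gavo/inputs/myres/q.rd
```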

    Most operator code, however, is still isolated from that change; in particular, the sourceToken you see in grammars mostly remains a string, and I do not expect that to change for the foreseeable future. This is mainly because the usual string operations many people use to remove extensions and the like (self.sourceToken[:-5]) will fail rather messily with Path-s:

    >>> n = pathlib.Path("/a/b/c.fits")
    >>> n[:-5]
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'PosixPath' object is not subscriptable
    

    So, if you don't call getConfig in any of your DaCHS-facing code, you are probably safe. If you do and get exceptions like this, you know where they come from. The solution, stringification, is rather straightforward:

    >>> str(n)[:-5]
    '/a/b/c'
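
    By the way, if you would rather do without the magic number in the slice, pathlib has methods for exactly this kind of manipulation:

```python
import pathlib

n = pathlib.Path("/a/b/c.fits")

# stringify and slice, as above
print(str(n)[:-5])        # /a/b/c

# or stay within pathlib and drop the extension
print(n.with_suffix(""))  # /a/b/c
print(n.stem)             # c
```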
    

    Partly as a consequence of this, there were slight changes in the way processors work. I hope I have not broken anyone's code, but if you do custom previews and have overridden classify, you will have to adapt it, as that method now takes an accref in addition to the path to be created.

    Odds And Ends

    As usual, there are many minor improvements and additions in DaCHS. Let me mention security.txt support. This complies with RFC 9116 and is supposed to give folks discovering a vulnerability a halfway reliable way to figure out whom to complain to. If you try http://<your-hostname>/.well-known/security.txt, you will see exactly what is in https://dc.g-vo.org/.well-known/security.txt. If this conflicts with some bone-headed security rules your institution may have, you can replace security.txt in DaCHS' central template directory (most likely /usr/lib/python3/dist-packages/gavo/resources/templates/); but in that case please complain, and we will make this less of a hassle to change or turn off.
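
    In case you want to write your own: a minimal RFC 9116 file needs little more than a Contact and an Expires field, so a compliant replacement could look like this (with the placeholders exchanged for your site's actual contact and a sensible expiry date):

    Contact: mailto:security@example.org
    Expires: 2030-01-01T00:00:00.000Z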

    You can no longer use dachs serve start and dachs serve stop on systemd boxes (i.e., almost all modern Linux boxes as configured by default). That is because systemd really likes to manage daemons itself, and it gets cross when DaCHS tries to do it itself.

    Also, it used to be possible to fetch datasets using /getproduct?key=some/accref. This was a remnant of some ancient design mistake, and DaCHS has not produced such links for twelve years. I have now removed DaCHS' ability to fetch accrefs from key parameters (the accrefs have been in the path forever, as in /getproduct/some/accref). I consider it unlikely that anyone will be bitten by this change, but I personally had to fix two ancient regression tests.

    If you use embedded grammars and so far did not like the error messages because they always said “unknown location”, there is help: just set self.location to some string you want to see when something is wrong with your source. For illustration, when your source token is the name of a text file you process line by line, you would write:

    <iterator><code>
      with open(self.sourceToken) as f:
        for line_no, line in enumerate(f, 1):
          self.location = f"{self.sourceToken}, {line_no}"
          # now do whatever you need to do with line
    </code></iterator>
    

    When regression-testing datalink endpoints, self.datalinkBySemantics may come in handy. It returns a mapping from concept identifiers to lists of matching rows (lists which frequently have just one element). I have caught myself re-implementing what it does in the tests themselves once too often.
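
    In a regression test, that might look somewhat like the following. Take this as a sketch rather than gospel: the URL, the PUBDID, and the semantics key are made up for illustration, and whether datalinkBySemantics is an attribute rather than a method is worth checking before you copy this:

    <regTest title="Datalink declares a preview">
      <url PUBDID="ivo://example/data?foo">dl/dlmeta</url>
      <code>
        by_sem = self.datalinkBySemantics
        assert "#preview" in by_sem
      </code>
    </regTest>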

    Finally, and also datalink-related, when using the //soda#fromStandardPubDID descriptor generator, you sometimes want to add just an extra attribute or two, and defining a new descriptor generator class for that seems too much work. Well, you can now define a function addExtras(descriptor) in the setup element and mangle the descriptor in whatever way you like.

    For instance, I recently wanted to enrich the descriptor with a few items from the underlying database table, and hence I wrote:

    <descriptorGenerator procDef="//soda#fromStandardPubDID">
      <bind name="accrefPrefix">"dasch/q/"</bind>
      <bind name="contentQualifier">"image"</bind>
      <setup>
        <code>
          def addExtras(descriptor):
            descriptor.suppressAutoLinks = True
            with base.getTableConn() as conn:
              descriptor.extMeta = next(conn.queryToDicts(
                "SELECT * FROM dasch.plates"
                " WHERE obs_publisher_did = %(did)s",
                {"did": descriptor.pubDID}))
        </code>
      </setup>
    </descriptorGenerator>
    

    Upgrade As Convenient

    That's it for the notable changes in DaCHS 2.10. As usual, if you have the GAVO repository enabled, the upgrade will happen as part of your normal Debian apt upgrade. Still, if you have not done so recently, have a quick look at upgrading in the tutorial. If, on the other hand, you use the Debian-distributed DaCHS package and do not urgently need any of the new features, you can let things sit and pick them up with your next dist-upgrade.

    Oh, by the way: if you are still on buster (or some other distribution that still has astropy 4), a few (from my perspective minor) things will be broken; astropy is evolving too fast. In general, I am trying to hack around the changes to make DaCHS work at least with the astropys in oldstable, stable, and unstable, but where working around a failure seems more trouble than it is worth, I am giving up. If any of the broken things do bother you, do let me know, but also consider installing a backport of astropy 5 or higher – or, better, dist-upgrading to bookworm. Sorry about that.
