It's Interop Time Again

2022-04-26 Markus Demleitner

A little ego booster in DAL I: Baptiste and Chloe discuss a feature for incremental harvesting of remote databases using odbcGrammar that I have implanted into DaCHS late last year.

This morning at seven CEST the first Interop of this year started: It's time again for everyone involved in the VO to come together, tell each other what happened since the last Interop and plan for the next steps. The meeting is purely digital again, and again the schedule is a bit crazy in order to evenly spread time painsj across the globe: there are sessions in the relatively early morning CET, in the late afternoon, and fairly late at night.

Fairly late at night (by my standards) is now, when I'm listening to the talks in a session of the Data Access Layer working group trying to work out how to do multiple cutouts in one request using SODA, something I've been rather skeptical about while we were coming up with the spec in the mid-2010s: Going from “single value“ to “sequence“ generally complicates matters by something like an order of magnitudes, and with HTTP 1.1 – which lets you run multiple requests in a single connection – doing multiple requests is cheap.

In contrast, SODA doesn't really say what a service should do if, say, there are multiple positions in a cutout request: should the regions be merged (that's what DaCHS does)? Should multiple images come back? If so, how: in a tar, in a multi-extension FITS, in some other way? What happens if you give both multiple positional and spectral ranges: should there be one result per element of the cartesian product? And if it works that way: should clients have a chance to figure out what combination of parameters produced which result dataset?

In all that mess, it's gratifying to see that my compromise proposal from way back when – if we do multi-cutout, let's do it by uploading a table specifying one cutout, including a label, per row – to be floated again. But very frankly: My vote would still be to deprecate repeated POS, CIRCLE, BAND, and friends in SODA: requests are cheap these days.

Oh, and while I'm confessing emotions of perhaps not entirely unselfish gratification: I still rejoice when I see DaCHS applications discussed in public, as Chloé and Baptiste did in their talk.

Update at 2022-04-27, Morning

The “virtual” Interop may not be quite as exciting as the real thing, but at least the jetlag is back.

Yesterday at midnight I gave a talk on requirements and validators, which really was an elaboration of some of the ideas I developed on this blog a month ago. If I may say so myself, I've grown fond of the classification of MUST-s into, in the end, items the machines need, items the users need, admonishments for implementors, and items that we believe the future may need. I'm sure there are more, but even for these I found it remarkable that the less will immediately break if someone violates a piece of a spec, the more important validation becomes. This again is one of these thoughts that feel as if someone probably has pondered them a lot more deeply before…

I also was really happy about Mark's pitch for validating specifications themselves that kept me awake until one a.m. CEST. In my authoring system ivoatex, I've introduced a hook to allow for a test target, and Mark kindly supported that effort by adding an xsdvalidate subcommand to the excellent stilts. The ivoatex documentation then grew some advice on what and how to test; in case you're writing or maintaining IVOA specs: do have a look. Mark's talk has a few great examples where spec-time validation would have saved a lot of effort and embarrassment.

Only six hours later, I was back in <expletive deleted> zoom to listen to the Grid session, which again featured Mark, apparently unfazed by the lack of sleep, talking about (potentially) federated authentication outside of the browser (which is something I really want for persistent TAP uploads).

And then there was the joint time domain/radio session. The slides are not yet there, but once they are, do yourself a favour and at least look at the beautiful images Dougal showed – Radio by now can make about as pretty pictures as Optical – and Alan's talk with the hypnotic sensitivity maps that again showed that low-frequency radio astronomy, seen from outside, is even more of an arcane art than is its high-frequency sibling.

Update at 2022-04-27, late evening

For me, this Interop has a strong proper motion slant. In this afternoon's Apps session, I tried to sell an extension to COOSYS I've wanted for a long time, just enough to do epoch propagation.

You see, ever since my first serious contribution to the VO standards universe, the proposal on doing STC annotation in VOTable in 2010, failed miserably because almost nobody took it up, I have struggled to still somehow get enough annotation added to VOTables to let clients apply proper motions automatically.

Given there are now data models for Coordinates and what we call Measurements (which roughly is errors and, well, a bit of physics) on the way, I figured this might be a good time to finally fix the COOSYS VOTable element. For one, data centers will revisit the STC annotation anyway if the models and the VOTable data model annotation will pass the reviews, and producing an improved COOSYS would then almost come for free.

But I can't lie: after the experiences of the past I'd also love to have a fallback position in case we spend another ten years on data models and annotations without getting anywhere. 25 years after the VO's birth epoch (if you will) of J2000.0, many stars have already moved of order of an arcsecond from where our first big catalogues saw them, and so we can ill afford to wait these extra ten years.

Not surprisingly, the proposal resulted in quite a bit of pushback, perhaps even a bit more than I had expected. Well: I should have given this talk years ago.

The proper motion topic will come back tomorrow in the second DAL session, when I will talk about ADQL user defined functions to do epoch propagation. This talk will feature one of the prettier plots I've produced in the last few months:

What happens if you propagate positions when all you have are proper motions (i.e., no parallaxes and no distances) and you do that naively (blue), in the tangential plane (red), and under the assumption of a purely tangential motion. The lecture notes tell you how to come up with the data plotted here.

I think I can safely predict you will read more about some of these UDFs on this very blog later this year.

Update 2022-04-28, late evening

Today felt the most conferency so far for this Interop, and perhaps for any “virtual conference“ I've attended. I believe there's a technical reason for that. After the second proper motion-flavoured talk I've just mentioned – that was still using, sigh, zoom –, things mostly happened in gathertown, a platform you can actually walk around in, stand together and don't always talk on stage as in zoom. Fervently believing in the mantra of “protocols, not platforms” (of course: this is the VO), I shouldn't be saying this, but: I actually like gathertown.

And so I guess we made quite a bit of progress in little side meetings and a hackathon on things like LineTAP (which, I hope, will bring all the rich data on spectral lines from VAMDC to the VO); how to let people have continuous integration checks against their Jupyter notebooks to notice in time when we're breaking something (my recent brown-bag pyvo bug that has somwhat started this was actually mentioned as a positive example in a talk (slide 19); and: it turned out I'm not the only notebook skeptic on this planet!); how we ought to define “facility” and “instrument“ in Obscore and the Registry (and, probably particularly insiduously, in SSAP, where what's called “facility“ there should probably be what's called “instrument“ elsewhere – sigh), a topic we already had touched yesterday, which in turn has resulted in Tamara's mail; an interesting service DaCHS operators want to run that would return PDF files as what DaCHS calls a “product” (which would normally be a thing like a FITS file); and then some more, including, of course, idle chatting.

That was almost as good as an actual meeting.

Update 2022-04-29, afternoon

This morning, I chaired a nice and lively Semantics session, where I talked about the move of our Vocabulary maintenance to github. That particular thing did not elicit a lot of comments, not even when I extended an invitation to perhaps amend Vocabularies in the VO 2 in other weys. I'll take that as some sort of reassurance that I did a reasonably good job designing that thing, although I cannot entirely rule out that people just did not have enough time to find the warts.

One thing I will call out at tonight's closing penary is Stéphane's talk on vocabularies in EPN-TAP. The way he was looking at the various word lists involved in that standard, looking at what “just works“, where the concepts are probably too special to worry about, and then the clumsy space in between – where there are or should be vocabularies that almost, but not quite fit – was exemplary. I'm looking forward to followups on the mailing lists, trying to work out where we can perhaps align different concept hierarchies so we spare implementors duplicate efforts. And figuring out where that's impossible, too expensive, or in other ways undesirable, and where the problems are. I suppose there's a lot to be learned from that.

Another high point was the identification of Wikidata as a valuable resource for the never-ending story of creating identifiers for instruments and facilities in Baptiste's talk. There is some special gratification in making our activities matter beyond the VO, link our resources with the wider RDF world – and hack SPARQL.

What's left for me is the Registry session, where I will briefly report, in particular, on my most recent effort of getting rid of my venerable GloTS service by adding a table of TAP-queriable tables to RegTAP. Let's see what people say – but in the end the challenge will be to convince the other operators of RegTAP services to take up the proposed changes. The central challenge there is that part of it is built on MOCs, and while the ESAC registry is built on Postgres that can already taught to deal with them, the one at MAST is based on SQLServer, which, I think, cannot yet. Let's see.

Another thing I'm looking forward to is Hendrik's pitch for registring tutorials and similar educational material. I'd really like to see more stuff on VOTT, which is fed from such registrations.

Update 2022-04-29, late evening

Interops for me always have something of an ego trip when I see traces of my activities in other people's work. And I've just discovered such a trace in a place I had not expected it: Gilles' talk on extra metadata in service responses, where he showed metadata DaCHS returns with its TAP responses. This was in this morning's session of the Data Curation and Preservation interest group that, I have to admit, I skipped in favour of a proper breakfast without a screen in front of me.

And he touched a topic that's dear to my heart, too. Really, I've been struggling to give applications enough metadata such that they can simply spit out a bunch of BibTeX for the sources used in a particular VO workflow for quite a while. In typcial DaCHS responses, you will find a bibcode and often a link to BibTeX (example), and at least the container element I got standardised in DALI 1.1. Let's see what else we can specify so that machines can reliably extract such information: Authors? Technical contact addresses? Date and time of production (could be very relevant for evolving data)? Full provenance? Well: If you've ever missed some piece of metadata, this would be a good time to bring it up.

All that's left now is the reports of the Working Groups (which will be another midnight talk for me) and a bit of farewell ceremony. After that, I'll go to sleep, and so that's it for my Interop reporting.

Category: Meetings

Requirements and Validators

2022-03-07 Markus Demleitner

Content Warning: this is mainly VO lore. I am not claiming any immediate applicability to the use or publication of astronomical data.

This morning, I set out to reply to a mail by Mark Taylor and noticed after a while that I was writing a philosophical piece on how to write standards – and how not to – that I may want to refer to again later. So, I'll make this a blog post.

The story started when the excellent stilts taplint during my monthly validation routine produced an error when exercising my data centre's TAP endpoint:

I-OBS-QSUB-5 Submitting query: SELECT TOP 1 obs_id FROM ivoa.ObsCore WHERE obs_id IS NULL
E-OBS-QERR-1 TAP query failed [Service error: "Field query: Query timed out (took too long).

What happened is that stilts tried to ascertain that all rows in my obscore table satisfy the standard's requirement that the obs_id column is non-NULL (see page 20). This made Postgres – the database system actually executing the queries – run what is known as a sequential scan through the tables involved in obscore; the reason underlying this bad judgement is a bit involved and has to do with the fact that in DaCHS, ivoa.obscore is a view composed of many tables. I will spare you the details, but the net effect of that is that it is not easy to tell Postgres that rows with obs_id NULL, if they exist at all, will be few and far between.

By now, the number of data sets in my obscore table approaches 100'000'000, and fetching all that data simply takes time, more time than a synchronous query has on my site[1].

Granted, I could fix that by adding indexes on the columns involved, but since these come from several dozen tables, that would be quite a bit of work for both me and the computer. Is that work worth it? Well, it certainly is if otherwise I'm breaking the standard, but since it is a serious amount of work, I am tempted to wonder: does the requirement actually make sense? And this leads to the question:

Why do we require things in standards?

In the end, there is just one reason to require something in a standard: Without the requirement, something important breaks. When one thinks about this a bit more deeply, one can distinguish two somewhat finer classes of requirements.

(a) “Internal requirements“. These are rules imposed so machines can do their job. The most obvious examples here are requirements on how to write things. For instance, if a client writes an interval as lower/upper and the service expects lower upper, it just won't work. Hence, a standard has to say “The separator in intervals MUST be whitespace” (or whatever).

There are more subtle requirements in that department. For instance, many tables need a primary key because other tables may want to refer to them. For Obscore, this becomes relevant just about now, when we think about having extensions for it. Those would add specific metadata for, say, radio or gamma observations. We will probably create them by adding per-extension tables holding a foreign key into ivoa.obscore. This is nice because then you can write something like:

SELECT ...
FROM ivoa.obscore
JOIN ivoa.obs_visibility
  USING (obs_publisher_did)
WHERE (some visiblity-specific constraint)

– and almost everything just works without further thought or effort: No plethora of columns that are NULL in ivoa.obscore for anything that is not a visibility, and no manual filtering out of non-visibilities either: JOIN does it all nicely for you. Isn't relational algebra great?

But this only is possible if obs_publisher_did (well: it's not certain yet whether that actually will be obscore's designated primary key, but bear with me there) really is non-NULL, and if there are no two rows with the same publisher DID (which are the general criteria to make something a primary key in a relation). Hence, these two constraints are something we simply MUST (pun intended) require.

(b) “Functional requirements”. These are requirements resulting from considerations of the use of the standard. I have just encountered a nice example when working on LineTAP, a future standard on how to access data about spectral lines. An important use case there is that the client displays the lines on top of a spectrum, and it will want to put something next to the lines so the user has at least a first indication just what would cause the line to show up. That it can only do if the service provides it with a plausible label – asking clients to invent a label based on the data it has is likely to produce very unsatifying results, as no machine is smart enough to figure out nice, idiomatic strings like „21 cm HI“ or „Hα“. Hence, we simply have to require that each row in such a LineTAP table has a title (technically: the corresponding column has a non-NULL constraint).

Going back to the obs_id example, it does not seem there is a strong case to invoke either (a) or (b) – since the column explicitly has no uniqueness requirement, it will not work as a primary key, and users will probably only want to use it for “grouped” data, where multiple artefacts belong to one “observation”. For data sets not within such groups, there really is no application for obs_id I can see. Of course, I may be missing something, which is why I asked around on the mailing lists.

If we figure out nothing breaks when we remove the requirement, then we should drop it: Every requirement causes some overhead in implementation and validation. In the present case, the implementation overhead would be all the indexes on the various obs_id columns, which I would not otherwise need. The validation overhead are the extra queries that taplint needs to do. Having overhead for no benefit (in terms of things not breaking) goes against sensible parsimony in what we ask our adopters to do (and I'll officially admit here that we do ask quite a bit already).

… and why do we validate them?

In the mail I have cited above, Mark has kindly offered to just not run the query in the validation suite, and all this philosophy was really intended to lead up to a “thanks, but no thanks”.

That is because, first of all, requirements that are not checked by a machine are requirements that are not met. You see, what we do is hard. Sure, there are harder problems in computing, but globally distributed information systems run by only loosely connected parties are rather non-trivial. People writing code to solve non-trivial problems will get it wrong.

The common way to deal with this fact is to test with one client and call it a day when that client seems to work for whatever was chosen as a test case. To mention a non-VO standard where this implement-to-the-client method failed horribly and continues to fail horribly: ACPI, the part of the firmware that's supposed to make, for instance, suspend-to-RAM something one doesn't have to think about. Vendors usually stop developing their ACPI code when the current version of Windows does not fail horribly with their implementation. A paper in the proceedings of the 2007 Linux symposium discusses some of the consequences in the least offensive way conceivable – and in a way that I, as a VO developer running quite a few Linux boxes, can very much relate to.

The bottom line is that if an unmet requirement breaks things and validators do not check for that requirement, then services will work to some degree with a certain client and break as soon as people switch to a different client (or perhaps only try to be smart). That's in stark contrast to one of my main selling points when I do VO teaching: „Hey, you can prototype with TOPCAT, and when you've figured out things, just switch to pyVO so you can scale, automate, and make your work reproducable“.

So, let's try to avoid unvalidated requirements.

Instead, let's have as few requirements as we can while covering the use cases we envision. And then let's have great validators that make sure these requirements are met by the services (or instance documents, or whatever it may be). Such validators not only help making the VO an effective environment that's fun to work with. They also give service operators – like… me – a peace of mind that nothing else can provide.

[1]

I keep a rather tight limit on the sync queries because the system also answers registry discovery queries, and these should be reasonably snappy. If I let long sync queries run, it is very easy to overload the system by accident. If I don't, people who want to run long queries can move to async. There, jobs are queued and only let in one or two at a time. That will not (usually) overload anything.

Small Change, Big Win

2022-02-23 Markus Demleitner

That's SCS 1.03 Erratum 2 rendered in my browser with a bit of image processing to celebrate that there's one painful VO legacy less on this world.

PSA: what follows is VO lore that may be entertaining but will not help you use or publish astronomical data.

Today, I've made a very small commit to my VO publication package DaCHS (revision 8452):
```
--- gavo/web/vodal.py (revision 8451)
+++ gavo/web/vodal.py (working copy)
@@ -260,7 +260,6 @@
        version = "1.0"
        parameterStyle = "dali"
        standardId = "ivo://ivoa.net/std/ConeSearch"
-     defaultOutputFormat = "votable1.1"
```
One deleted line, small cause, huge effect.

This story starts with the oldest „operational“ VO standard, Simple Cone Search, which was formally published in 2008 but really got its current shape a lot earlier.

I've not been there back then, but I think the authors expected that clients would be parsing the VOTables that the services were returning using something called XML binding. That, well, was a technique where code was generated from an XML schema, and only instance documents conforming to that exact schema could be parsed with that code.

That is of course the opposite of the golden rule of interoperability (“be strict in what you produce and lenient in what you accept”) and thus would have been a terrible implementation choice for interoperable clients (and I believe nobody ever tried it). But somehow – or that is my explanation – the XML binding reasoning translated into the requirement that SCS services could only return VOTable 1.0 or VOTable 1.1, and that made it into the standard. It was hence the law. And that it DaCHS had to keep alive VOTable 1.1 for writing (which the above commit of course doesn't remove, but I can remove it now any time I feel like it). And that it couldn't do a lot of useful things that required features not present in VOTable 1.1.

Nobody dared to touch the problem for about a decade, as it was actually unclear whether some ancient code might still be doing useful work with SCS and XML binding. And I shouldn't be scoulding them after I have recently broken ESO examples under the assumption that “aw, nobody's gonna do this“. Then, starting about five years ago, we had a couple of discussions at various conferences about how we might bring SCS into the present VO (where it, it has to be said, sticks out a bit for several other reasons, too, like its funky error reporting and the funny UCDs it uses). But these weren't easy: What exactly are we allowed to break within a minor version under the above assumption (“aw, nobody… “)? If we do a major version, how do we plan for co-existence for two parallel major version?

Well: For the version restriction, in the end a simple Erratum was enough. On January 26, 2022, the IVOA Technical Coordination Group accepted SCS 1.03 Erratum 2. And now I can return whatever VOTable version suits me. Phewy.

I can now have GROUPs in GROUPs (which I need to annotate photometry), I can finally return tables with my old proposal for STC in VOTable in SCS results (where they would have mattered most – not that anyone cares any more, as that ship has sailed somewhere completely different).

Hey, I can have xtypes. Doesn't mean anything to you? Well, try this: In TOPCAT, open VO/Cone Search. Type “Constellations” and select the “cslt cone“ service. Run a query for some part of the sky, with a size of a few 10s of degrees. Open a sky plot, and in there, do Layers → Add Area Control, and in that control select the table you have just pulled in. Presto: You'll see the constellation boundaries without further configuration, and that's because TOPCAT has the xtype to figure out that the odd numbers it sees are really the vertex coordinates of a spherical polygon in DALI serialisation.

Not a big deal, you say? Perhaps. But lots of small deals accumulated make the difference between what you can do and what you cannot, in particular across services (which is what the VO is about).

Removing the erroneous constraint on VOTable versions in SCS opened the standard up for quite a few small deals. Thanks, TCG!

Category: Standards
Towards Data Discovery in pyVO

2022-01-10 Markus Demleitner

When I struggled with ways to properly integrate TAP services – which may have hundreds or thousands of different resources in one service – into the VO Registry without breaking what we already had, I realised that there are really two fundamentally different modes of using the VO Registry. In Discovering Data Collections's abstract I wrote:

the Registry must support both VO-wide discovery of services by type ("service enumeration") and discovery by data collection ("data discovery").

To illustrate the difference in a non-TAP case, suppose I have archived images of lensed quasars from Telescopes A, B, and C. All these image collections are resources in their own right and should be separately findable when people look for “resources with data from Telescope A“ or perhaps “images obtained between 2011-01-01 and 2011-12-31”.

However, when a machine wants to find all images at a certain position, publishing the three resources through three different services would mean that that machine has to do three requests where one would work just as well. That is very relevant when you think about how the VO will evolve: At this point there are 342 SIAP services in the VO, and when you read this, that number may have grown further. Adding one service per collection will simply not scale when we want to keep the possibility of all-VO searches. Since I claim that is a very desirably thing, we need to enable collective services covering multiple subordinate resources.

So, while in the first (“data discovery”) case one wants to query (or at least discover) the three resources separately, in the second case they should be ignored, and only a collective “images of lensed quasars” service should be queried.

The technical solution to this requirement was creating “auxliary capabilities” as discribed in the endorsed note on discoving data collections cited above. But these of course need client support; VO clients up to now by and large do service enumeration, as that has been what we started with in the VO Registry. Client support would, roughly, mean that clients would present their users with data collections, and then offer the various ways to to access them.

There are quite a number of technicalities involved in why that's not terribly straightforward for the “big” clients like TOPCAT and Aladin (though Aladin's discovery tree already comes rather close).

Now that quite a number of people use pyVO interactively in jupyter notebooks, extending pyVO's registry interface to do data discovery in addition to the conventional service enumeration becomes an attractive target to have data discovery in practice.

I have hence created pyVO PR #289. I think some the rough edges will need to be smoothed out before it can be merged, but meanwhile I'd be grateful if you could try it out already. To facilitate that, I have prepared a jupyter notebook that shows the basic ideas.

Followup (2023-12-15)

I have just prepared a slightly updated version of the notebook.

To run it while the PR is not merged, you need to install the forked pyVO. In order to not clobber your main installation, you can install astropy using your package manager and then do the following (assuming your shell is bash or something suitably similar):
```
virtualenv --system-site-packages try-discoverdata
. try-discoverdata/bin/activate
cd try-discoverdata
git clone https://github.com/msdemlei/pyvo
cd pyvo
git checkout add-discoverdata
python3 setup.py develop
ipython3 notebook
```
That should open a browser window in which you can open the notebook (you probably want to download it into the pyvo checkout in order to make the notebook selector see it). Enjoy!

Category: Demo
DaCHS 2.5: Check your UCDs

2021-11-17 Markus Demleitner

In the background of the DaCHS 2.5 release picture: UCDs grabbed from the Registry. The factual background: DaCHS 2.5 will now moan at you when you invent or mistype UCDs

This afternoon, I have released DaCHS 2.5. As usual, I will discuss the more important changes in a blog post – this one.

A change many of you will not like too much is that DaCHS now validates UCDs you give it, and it will warn you when you do not follow the UCD rules. This may seem like nit-picking, but as blind discovery is on the verge of becoming usable in the VO, making sure these strings actually are what they should be is becoming operationally important: If I want to find resources that give errors for their photometry, I have to know whether it's stat.error;phot.mag.b or phot.mag.b;stat.error, or else I will miss half the resources out there.

So, I'm sorry if DaCHS starts complaining about half of your RDs after you update, but it's for a good cause. And don't feel bad about the complaints: DaCHS complained about close to half of my RDs after I had put in that feature.

By the way, this comes as part of a larger effort on the side of the Operations IG to improve the validity of UCDs and units in the VO, an effort that has unearthed bugs in the SSAP and SLAP specifications in that they require UCDs forbidden by the UCD standard. DaCHS 2.5 still follows SSAP and SLAP, and hence external tools like stilts will protest because of bad UCDs even if DaCHS is happy. Errata for the specifications are being worked on, and once they are accepted, DaCHS and stilts will finally agree on UCD validity, or so I hope.

Code-wise, a much more intrusive change was that asynchronous services (in particular, async TAP) now use the same formalism for parsing parameters as their synchronous counterparts. It may seem odd that that hasn't been the case up to now, but there were good reasons for that; for instance, with async, people can post incomplete parameter sets that would be rejected by normal sync processing.

Unless you are running User UWS services, you should not notice anything. If you do run User UWS services, please contact me before upgrading. I would like to work with you on how these should look like in the future.

Another change that might break your services is that DaCHS now actually complies to VOUnits, which has always forbidden whitespace of all kinds in unit strings. DaCHS, on the other hand, has foolishly encouraged putting whitespace between scale factors and pure units, as in 1e-10 m. That's not interoperable, and hence DaCHS now rejects such units. This may lead to hidden failures when dachs val doesn't notice something is a unit, and things only break during execution. I'm aware of one place where that's relevant: spectral cutout services that need to know the spectral unit If you're running those, make double sure that the spectralUnit in the SSAP mixin does not contain any whitespace. It's 0.1nm according to VOUnits, not 0.1 nm.

An update that should silently make your services more compliant is that DaCHS' representation of EPN-TAP is updated to what is currently under IVOA review. After you upgrade, DaCHS will try to update your EPN tables' metadata, which in turn should make stilts taplint a lot happier. It will also make DaCHS pass on the new, IVOA table utype to the Registry, which is how people should in the future find EPN-TAP data.

DaCHS now also contains some code that may help you import data from HDF5 files. For one, there is the HDF5 grammar, which rather directly pulls data from HDF5s written by astropy or vaex. But, really: HDF5 is a rather low-level format not particularly well suited for relational data, and it is virtually impossible to write generic code for doing something sensible with it. The two flavours DaCHS supports have very little in common, and it is therefore almost certain that if you have HDF5s coming from somewhere else, hdf5Grammar will not understand them. Still, let us know what you've got, we may be able to put support for it in.

Hdf5grammar is written in Python, and thus imports perhaps a few thousand rows per second. For Gigarow-sized data collections, that's nowhere near fast enough, and hence for vaex-written HDF5s, there is booster support. As before, if you have bulk data in HDF5 that you want to put into a database and that was not written by vaex, let us know and we'll see what we can do.

A surprisingly minor change enabled DaCHS to deal with materialised views, database views that are turned into actual tables by postgres. See the corresponding section in the tutorial for how you can use them. We do not have any materialised views in our Heidelberg data center yet. So, if you use them and notice something is clunky, your feedback is particularly appreciated.

There are many smaller changes and improvements; let me mention what the changelog euphemistically calls ”better systemd integration”, which really means that so far systemctl restart dachs simply didn't do anything at all. Apologies. And shame on everyone who was bewildered but failed to report this to dachs-support.

Also, you can use float arrays in boosters now, and DaCHS' ADQL has just leared about COALESCE. That's a SQL feature that lets you deal sensibly with NULLs in some cases: COALESCE(arg1, arg2, ...) will return the first non-NULL argument it encounters. That may sound like a slightly exotic function. Until you need it, at which point you wonder how ADQL could reach its ripe age without COALESCE.

Finally, let me mention something that is not part of the release, though it is DaCHS-related and is new since the last release: I have cleaned up the access log processing machinery we have used in Heidelberg in the past 15 years or so, and I have packaged it up for general consumption. It is, of course, a DaCHS RD that you can just check out and use in your own DaCHS installation if you have to keep access logs and want to do that with at least some basic respect for your user's rights. See http://docs.g-vo.org/DaCHS/tutorial.html#access-logs for details.

Category: Software

« Page 8 / 22 »

Update at 2022-04-27, Morning

Update at 2022-04-27, late evening

Update 2022-04-28, late evening

Update 2022-04-29, afternoon

Update 2022-04-29, late evening

Why do we require things in standards?

… and why do we validate them?