I recently got embarrassed by ADQL NULLs, i.e., the magic value indicating that a value in a given column is missing. And since that’s a common source of errors when writing ADQL queries, I’ll take this as a cue for a blog post.
The concrete background is fairly technical and registry-ish; suffice it to say that some data providers who implemented interfaces conforming to some standard didn’t properly say so in their registry records. Back in RegTAP 1.0 (that’s the standard that says how a client like TOPCAT talks to the VO Registry), I decided to work around that by fudging the pattern for how to discover those interfaces so they’d still be found.
In RegTAP 1.1, which is now under review by the VO community, I wanted to do away with that workaround. But would that break anything? This question translates to “are there vs:ParamHTTP interfaces that don’t have a role attribute of std”. Whatever “ParamHTTP” and “role attribute” actually mean, just appreciate that it looks like it might translate into SQL like
select * from rr.interface
where intf_type='vs:paramhttp'
and not intf_role='std'
I ran that query, rejoiced because it didn’t return anything, removed the workaround from the standard, and then was shot down when I read Mark’s mail (politely) saying I’m wrong and there are services still requiring the workaround. As usual: If a query returns what you expect, be doubly careful.
What went wrong? Well, NULL semantics. You see, in SQL NULL is never equal to anything, not even itself (it’s like NaN in IEEE floats in that: try n = float('nan');print(n==n) in Python and look again if you’re cool about it). It’s also not unequal. Don’t take my word for it. Try
select * from tap_schema.schemas where NULL=NULL
select * from tap_schema.schemas where NULL!=NULL
– you’ll get empty results in both cases.
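The same experiment can be replayed offline. Here is a minimal sketch that uses Python’s built-in sqlite3 module as a stand-in for a TAP service. That is an assumption made for convenience: SQLite is not ADQL, but it follows the same three-valued logic when comparing against NULL. The toy table schemas and its two rows are made up.

```python
import sqlite3

# In-memory toy database; table name and content are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schemas (schema_name TEXT)")
conn.executemany("INSERT INTO schemas VALUES (?)", [("ivoa",), ("tap_schema",)])


def count_rows(condition):
    """Return the number of rows of the toy table satisfying condition."""
    query = "SELECT COUNT(*) FROM schemas WHERE " + condition
    return conn.execute(query).fetchone()[0]


equal_count = count_rows("NULL = NULL")      # comparison is unknown: no rows
unequal_count = count_rows("NULL != NULL")   # still unknown: no rows
is_null_count = count_rows("NULL IS NULL")   # IS NULL is the test that works

# A bare comparison with NULL evaluates to NULL, which Python sees as None.
raw_comparison = conn.execute("SELECT NULL = NULL").fetchone()[0]

print(equal_count, unequal_count, is_null_count, raw_comparison)
```

Only the IS NULL variant lets rows through; both comparisons evaluate to “unknown”, and WHERE drops everything that is not true.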
What does that mean for science queries? Well, whenever there are NULLs in columns (and the only safe assumption for now is that they may hide in there; we should probably add non-null as a column property in the tap schema and in VODataService some day), you need to be careful, in particular with inverted logic.
Here’s an example: Suppose you want to investigate NGC objects, with those brighter than 10 mag in B in one bin and everything else in another. The ones brighter are simple:
select count(*) from openngc.data where mag_b<10
(try it on the TAP server at http://dc.g-vo.org/tap, it’s 383 in the current release). It becomes difficult for “the rest”. If you write
select count(*) from openngc.data where not mag_b<10
or
select count(*) from openngc.data where mag_b>=10
you’ll get (for the current release) 10887. However, the whole catalogue has 13954 entries, so there’s 13954-10887-383=2684 rows missing. Your “rest” has missed everything for which mag_b isn’t given. Sure enough,
select count(*) from openngc.data where mag_b is null
(and this is the only good way to compare against null) gives 2684.
The right way to say “anything for which mag_b is not smaller than 10” thus is
select count(*) from openngc.data
where not mag_b<10
or mag_b is null
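To see how the three bins add up, here is a small sketch with Python’s sqlite3 module standing in for the TAP service (an assumption; SQLite handles NULL comparisons the same way). The six rows are invented and have nothing to do with the actual OpenNGC content.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (name TEXT, mag_b REAL)")
# Invented magnitudes; None becomes NULL in the database.
rows = [("A", 8.5), ("B", 9.9), ("C", 10.0), ("D", 12.3), ("E", None), ("F", None)]
conn.executemany("INSERT INTO data VALUES (?, ?)", rows)


def count_where(condition):
    """Count the rows of the toy table for which condition is true."""
    query = "SELECT COUNT(*) FROM data WHERE " + condition
    return conn.execute(query).fetchone()[0]


total = conn.execute("SELECT COUNT(*) FROM data").fetchone()[0]
bright = count_where("mag_b < 10")
naive_rest = count_where("NOT (mag_b < 10)")      # silently loses the NULLs
missing = count_where("mag_b IS NULL")
full_rest = count_where("NOT (mag_b < 10) OR mag_b IS NULL")

print(total, bright, naive_rest, missing, full_rest)
```

The naive complement and the bright bin do not add up to the table size; the difference is exactly the number of rows with a missing magnitude.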
Moral: Unless you’re sure there are no missing values (i.e., NULLs) in a column you’re looking at, think about what these mean to your research (or other) question: Should these rows just vanish? Then you usually don’t need to do anything and the SQL semantics magically do the right thing (which is why things are defined as they are). If, however, the corresponding rows would mean something to your question, you need to be explicit, and you must have some condition involving IS NULL or IS NOT NULL.
The trouble, of course, is that just knowing this still isn’t enough. You need to remember it at the right moment. Or you’ll share my fate of suffering some public embarrassment.
Last Friday, I uploaded a first working draft of VODataService 1.2 to the IVOA documents repository. That’s the first major step in updating a standard, and it’s an invitation to everyone to have a look and comment.
Foof, you might say, what do I care? I’ve not even heard of that standard.
Well, but you’ve probably used it. VODataService is (among several other things) the standard that governs how a TAP service tells clients (TOPCAT, say) what tables it has and what’s inside of them. So, if you see in TOPCAT that there is a column named ang_error with a unit of deg, a UCD of stat.error;pos and the meaning “1 σ confidence radius of the position”, that most likely came in a document standardised by VODataService.
The question of what (TAP) services can tell clients about their table set is one major open point: Do we want additional metadata there? This article’s image, for inspiration, shows a screenshot of extended metadata Grégory delivers to browsers on his ARI-Gaia service; among this are minima, maxima, means, standard deviations, quartiles, and fill factors (i.e., what fraction of a column’s values are not NULL). He even shows histograms of the values’ distributions and HEALPix maps showing how (the means of) the values vary on the sky. Another example of extended metadata could be footnotes as you will find them on many of my resources’ reference URLs (example; footnotes are, unsurprisingly, near the foot of that page).
We could define interoperable means to communicate information like this. The question is: does the added value justify the complication in implementation? This is where it would be great if you weighed in, in particular if you are a “mere” TAP user: Are there any such pieces of metadata you’ve always wanted to see in your TAP interfaces? Oh, and metadata of course can also be added to tables rather than columns. The current draft already lets services communicate the number of rows in each table – is there more “simple”, table-specific metadata of this sort?
VODataService furthermore deals with several other topics; for instance, the STC in the registry business I’ve blogged about in February is going to be standardised here (update on this: spectral coverage is no longer in wavelength but in energy). Other changes are rather more technical in nature, like several new resource types that will improve the discovery of tables and other such resources, or a careful adjustment of some features to keep them in line with TAP evolution.
But don’t let the technicalities scare you away – just have a peek, and if you have thoughts on any of the VODataService topics: I’m just a mail away.
If you’ve always wanted to be part of a standardisation process within the IVOA (and who would not?), the time has rarely been as good as now. Because: We’re updating ADQL! Yes! The ADQL you are writing your queries in will receive a few more language elements, and we’re carefully trying to heal a few things that turned out to be warts. And while some of the changes are as dull and boring as you may expect standards work to be, on some of them you may wish to have a say.
Also, you can try things out – the GAVO data center TAP endpoint at http://dc.g-vo.org/tap already has most of the proposed features, and the new DaCHS beta 1.1.2 (out since last Friday) does, too. So, if you’re running DaCHS yourself, you can start playing after switching to the beta repository.
You’re now supposed to write the standard crossmatch as DISTANCE(ra1, dec1, ra2, dec2)<dist. This replaces the old dance with 1=CONTAINS(POINT(), CIRCLE()) that you’ve probably learned to hate. Finally: Crossmatching without having to resort to TOPCAT’s example menu…
ADQL geometries used to require a first argument that would give the reference frame, as in POINT('ICRS', ra, dec). The hope was that services could then automagically make a statement like CONTAINS(point_in_icrs, circle_in_galactic) work as presumably intended. Few services ever did (DaCHS still tries reasonably hard), and when they did, there were all kinds of opaque oddities. One of the most common sources of confusion is the question of what a service is supposed to do with POINT('GALACTIC', ra, dec), assuming it knows that ra and dec are in, say, B1950 FK4. Also, is there any expectation that services attempt to do anything beyond a simple rotation (FK4, for instance, rotates noticeably against the ICRS, so proper motions would need to get fixed, too)? In all, the frame as a first argument was ill thought-out, and it’s been deprecated. Simply don’t put in the string-typed first argument any more. POINT(long, lat) does it. True: This, more than ever, calls for an ADQL astrometry library so you can easily convert, at least, between Galactic and ICRS (probably a few more would be useful, too). More on this in some future post.
Services should have CAST now. Sometimes you want to turn a number into a string or a string into a timestamp. In such cases, you can write CAST('1991-02-01', TIMESTAMP) now. The details are not quite, excuse me, cast in stone yet, so if you have a use case for this kind of thing, speak up now. The current draft also calls for a TIMESTAMP(tx) function – but since that’s really not different from CAST(tx, TIMESTAMP), I’m trying to dissuade people from adding it.
Services should have an IN_UNIT function now. That’s a nifty thing in particular when you’re re-using queries on different services. Just write, say, IN_UNIT(pmra, 'deg/yr') and never worry again if it’s arcsec/yr, mas/yr, rad/cy, or whatever. The second argument, by the way, is written according to the Units in the Virtual Observatory standard. It’s an optional feature according to the current standard, so perhaps it’s too early to party, but I’ve found this extremely useful, and so I hope we’ll see widespread adoption.
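To make concrete what such a conversion amounts to, here is a rough Python sketch for the proper motion units mentioned above. It is not how any service implements IN_UNIT: the function name in_unit, the explicit from_unit argument (a service would take the source unit from the column metadata), and the hand-written factor table are all assumptions for illustration.

```python
import math

# Factors converting one unit of the given kind into deg/yr.
# "cy" is a (Julian) century, i.e. 100 yr.
TO_DEG_PER_YR = {
    "deg/yr": 1.0,
    "arcsec/yr": 1.0 / 3600.0,
    "mas/yr": 1.0 / 3.6e6,
    "rad/cy": math.degrees(1.0) / 100.0,
}


def in_unit(value, from_unit, to_unit):
    """Convert value from from_unit to to_unit (hypothetical helper)."""
    return value * TO_DEG_PER_YR[from_unit] / TO_DEG_PER_YR[to_unit]


pm_deg = in_unit(3600.0, "arcsec/yr", "deg/yr")
pm_mas = in_unit(1.0, "arcsec/yr", "mas/yr")
pm_from_rad = in_unit(1.0, "rad/cy", "deg/yr")

print(pm_deg, pm_mas, pm_from_rad)
```

A real implementation would parse arbitrary VOUnits strings rather than look them up in a fixed table.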
Services should now have set operations. These are UNION, EXCEPT, and INTERSECT and are useful when you have two queries that result in the same table schema (because they won’t work otherwise). Say you have two complex ways to filter rows from the table source, but you want to process both sorts of results further on – you can then say something like
SELECT <whatever complex> FROM (
(SELECT a,b,c FROM source
WHERE <crazy stuff>
GROUP BY a, b, c)
UNION
(SELECT a,b,c FROM source
WHERE <other crazy stuff>
GROUP BY a, b, c)) AS q
WHERE <more complex stuff over a, b, and c>
– and similarly, EXCEPT lets you “punch a hole” in a result table. Another interesting use case would be to query many tables on a service like VizieR in one go; that still works if you make sure the tables defined by the sub-queries have the same columns. Given that a lot of cross-table operations actually boil down to JOINs and WHERE clauses, the set operations are used less than one would expect. But if you need them, there’s no real alternative (short of downloading far too much and performing the operation locally, which of course defeats the purpose of TAP).
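If you want to get a feeling for the three operators without a TAP service at hand, the following sketch runs them in SQLite through Python’s sqlite3 module. SQLite standing in for ADQL is an assumption; the table source and its four rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (a INTEGER, b INTEGER)")
conn.executemany("INSERT INTO source VALUES (?, ?)",
                 [(1, 10), (2, 20), (3, 30), (4, 40)])


def run(query):
    """Run query and return its rows in sorted order."""
    return sorted(conn.execute(query).fetchall())


# Two selections with identical column lists, as set operations require.
first = "SELECT a, b FROM source WHERE a <= 3"
second = "SELECT a, b FROM source WHERE a >= 2"

union_rows = run(first + " UNION " + second)          # rows in either
intersect_rows = run(first + " INTERSECT " + second)  # rows in both
except_rows = run(first + " EXCEPT " + second)        # first minus second

print(union_rows, intersect_rows, except_rows)
```

Note that plain UNION removes duplicates, which is why the two rows matched by both selections appear only once in the union.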
Common table expressions (“WITH”). DaCHS doesn’t do these yet, and it will only pick them up if someone else implements them first. In the way ADQL 2.1 has them (“nonrecursive”), CTEs are little more than syntactic sugar, and I’m not quite sure if the additional implementation complexity is worth it. If you’re curious, check CTEs in the postgres manual. If that makes you drool for WITH in ADQL, let me know. It’ll not be too hard to sway me to put them in.
Bitwise Operations. That’s when integers are treated as bit patterns. If this sounds like nerd stuff to you, well, it happens quite a bit in actual catalogs. See, for instance, Note 3 for the PPMXL. If you wanted to use the flags column described there to pick out PPMXL objects that replaced multiple USNO-B1.0 objects (bit 3), you would right now have to write something like MOD(flags,16)>7. That’s a bit of magic that everyone will have to think about for a while. With bitwise operations, you’ll just write BITWISE_AND(flags,8)=8, which will look familiar to everyone who has used the pattern before (in particular, it’s clear we’re talking about bit 3). There still is discussion whether bitwise operations are common enough to warrant special syntax – the draft currently says the above should be written as flags&8=8 – or whether the functions DaCHS has at the moment (they’re called BITWISE_AND, BITWISE_OR, BITWISE_XOR, and BITWISE_NOT) are good enough.
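In case the MOD trick looks suspicious, here is a quick Python check that it really tests the same bit as the bitwise formulation. The function names are made up for this sketch; only the two expressions come from the text above.

```python
def bit3_set_with_mod(flags):
    """The ADQL 2.0 workaround: MOD(flags, 16) > 7."""
    return flags % 16 > 7


def bit3_set_with_and(flags):
    """The bitwise formulation: BITWISE_AND(flags, 8) = 8."""
    return (flags & 8) == 8


# Compare the two tests for every possible 8-bit flag value.
agree = all(bit3_set_with_mod(f) == bit3_set_with_and(f) for f in range(256))

print(agree, bit3_set_with_and(12), bit3_set_with_and(7))
```

MOD(flags,16) throws away bits 4 and up, and what remains exceeds 7 exactly when bit 3 (value 8) is set.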
Offset. If you’ve ever done anything with ADQL, you’ll know that SELECT TOP 10 * FROM hipparcos.main ORDER BY parallax DESC will give you the 10 objects with the largest parallaxes. But what if you want the next 10 closest stars? Well, OFFSET to the rescue:
SELECT TOP 10 *
FROM hipparcos.main
ORDER BY parallax DESC
OFFSET 10
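Here is a toy version of this paging pattern, run in SQLite via Python’s sqlite3 module. Mind the assumptions: SQLite writes LIMIT n where ADQL writes TOP n, and the table and its parallaxes are invented (star number i simply gets a parallax of 100 - i).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE main (hip INTEGER, parallax REAL)")
# Invented data: star 0 has the largest parallax, star 29 the smallest.
conn.executemany("INSERT INTO main VALUES (?, ?)",
                 [(i, 100.0 - i) for i in range(30)])

# SQLite spells ADQL's "TOP 10" as "LIMIT 10".
top10 = [row[0] for row in conn.execute(
    "SELECT hip FROM main ORDER BY parallax DESC LIMIT 10")]
next10 = [row[0] for row in conn.execute(
    "SELECT hip FROM main ORDER BY parallax DESC LIMIT 10 OFFSET 10")]

print(top10, next10)
```

Without the ORDER BY, the offset would be applied to rows in an unspecified order, so the two pages would not be guaranteed to be disjoint.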
There is another, more sinister, application for OFFSET, which happens to be the actual reason I’ve put it into DaCHS’ ADQL ages ago: Written as OFFSET 0, several databases use it to denote a barrier for the query planner. This is explained to some degree in the classic DaCHS TAP example Crossmatch for a Guide Star – which still mentions the first hack I had built into DaCHS to let query authors rein in overzealous query planners.
LOWER and ILIKE. ADQL has been extremely weak on the side of text processing, so weak indeed that it wasn’t nearly enough to cover the use cases for the registry when it moved to RegTAP. ADQL 2.1 adds two basic features – LOWER, a function that lets people query in a case-insensitive fashion, and ILIKE, an operator that is like LIKE, but again ignores case. While both features are obviously great as soon as people dump any kind of text (think object names) into their databases, I’m not terribly happy with ILIKE, as it does the same as RegTAP’s ivo_nocasematch user defined function, and it’s always bad when two standards foresee two different mechanisms for the same thing.
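For those who prefer to see the semantics spelled out, here is a small Python model of LIKE, ILIKE and the LOWER idiom. It is a sketch under assumptions: the helper names are invented, escape characters are ignored, and the object name is just an example.

```python
import re


def like_to_regex(pattern):
    """Translate an SQL LIKE pattern (% and _ wildcards) into a regex."""
    parts = []
    for char in pattern:
        if char == "%":
            parts.append(".*")      # any sequence of characters
        elif char == "_":
            parts.append(".")       # exactly one character
        else:
            parts.append(re.escape(char))
    return "".join(parts)


def like(value, pattern):
    """Case-sensitive match, as LIKE does it in ADQL."""
    return re.fullmatch(like_to_regex(pattern), value, re.DOTALL) is not None


def ilike(value, pattern):
    """Case-insensitive match, modelling ILIKE."""
    flags = re.DOTALL | re.IGNORECASE
    return re.fullmatch(like_to_regex(pattern), value, flags) is not None


name = "NGC 1275"
print(like(name, "ngc%"), ilike(name, "ngc%"), like(name.lower(), "ngc%"))
```

The last call is the LOWER idiom: lowercasing the column value and matching against an all-lowercase pattern gives the same answer as ILIKE.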
Geometry-typed arguments. CIRCLE and POLYGON now accept POINTs in alternative constructor functions. That is, you can now say CIRCLE(POINT(ra, dec), radius) in addition to the traditional CIRCLE(ra, dec, radius). In itself, that’s probably not terribly exciting, but when you have actual POINTs in your database, it’s much more compact to write, say,
(which would return rows for those spectra for which the declared aperture does not contain the declared target). Before, you’d had to write some fairly ugly expression involving COORD1 and whatnot in order to achieve the same effect.
Boolean expressions. That’s another one that’s still a bit up in the air. First, the rough goal is to allow boolean values in ADQL-accessible tables, which so far have been a hack at best. In the future, you should be able to say WHERE is_broken=True. However, people coming from other languages will find that odd, and indeed, in Python I’d cringe at if is_broken==True:. What I’d expect is if is_broken:. Do we want this in ADQL? Currently, it’s in the grammar (more or less like this), but this kind of thing makes it still harder to produce useful syntax error messages. Is it worth it, either way? I’m not sure.
That about concludes my quick review of the new features of ADQL 2.1. If you’d like to know more, the current draft is on the IVOA document repository, and if you can deal with version control (you should!), you can follow the bleeding edge in the ADQL document in Volute. Discussion happens on the DAL mailing list.
Update (2018-04-13): Well, as to the CTEs, I couldn’t resist after all, and they’re in with DaCHS 1.1.3. And I have to say I love them – they weren’t hard to put in, and once they’re there they make so many queries a good deal more readable than before. I’ve even put in a server-defined example for CTEs on the Heidelberg TAP service showcasing a particularly compelling use case.
A histogram of times for which the Palomar-Leiden service has images: That’s temporal service coverage right there.
If you are an astronomer and you’ve ever tried looking for data in the Virtual Observatory Registry, chances are you have wondered “Why can’t I enter my position here?” Or perhaps “So, I’m looking for images in [NIII] – where would I go?”
Both of these are examples for the use of Space-Time Coordinates (STC) in data discovery – yes, spectral coordinates count as STC, too, and I could make an argument for it. But this post is about something else: None of this has worked in the Registry up to now.
It’s time to mend this blatant omission. To take the next steps, after a bit of discussion on some of the IVOA’s mailing lists, I have posted an IVOA note proposing exactly those last Thursday. It is, perhaps with a bit of over-confidence, called A Roadmap for Space-Time Discovery in the VO Registry. And I’d much appreciate feedback, in particular if you are a VO user and have ideas on what you’d like to do with such a facility.
In this post, I’d like to give a very quick run-down on what is in it for (1) VO users, (2) service operators in general, and (3) service operators who happen to run DaCHS.
First, users. We already are pretty good on spatial coverage (for about 13000 of almost 20000 resources), so it might be worth experimenting with that. For now, the corresponding table is only available on the RegTAP mirror at http://dc.g-vo.org/tap. There, you can try queries like
select ivoid from
natural join rr.stc_spatial
and ucd like 'phot.flux;em.radio%'
to find – in this case – services that have radio fluxes in the area of the Hubble Deep Field. If these lines scare you or you don’t know what to do with the stupid ivoids, check the previous post on this blog – it explains a bit more about RegTAP and why you might care.
Similarly cool things will, hopefully, some day be possible in spectrum and time. For instance, if you were interested in SII fluxes in the crab nebula in the early sixties, you could, some day, write
SELECT ivoid FROM
NATURAL JOIN rr.stc_spectral
NATURAL JOIN rr.stc_spatial
As you can see, the spectral coordinate will, following (admittedly broken) VO convention, be given in meters of vacuum wavelength, and time in MJD. In particular the thing with the wavelength isn’t quite settled yet – personally, I’d much rather have energy there. For one, it’s independent of the embedding medium, but much more excitingly, it even remains somewhat sensible when you go to non-electromagnetic messengers.
A pattern I’m trying to establish is the use of the user-defined function ivo_interval_overlaps, also defined in the Note. This is intended to allow robust query patterns in the presence of two intrinsically interval-valued things: The service’s coverage and the part of the spectrum you’re interested in, say. With the proposed pattern, either of these can degenerate to a single point and things still work. Things only break when both the service and you figure that “Aw, Hα is just 656.3 nm” and one of you omits a digit or adds one.
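The logic behind such an overlap test is simple enough to sketch in a few lines of Python. The argument order (lower and upper limit of the first interval, then of the second), the 0/1 return value, and the wavelength numbers are assumptions for illustration; the Note has the authoritative definition.

```python
def interval_overlaps(l1, h1, l2, h2):
    """Return 1 if [l1, h1] and [l2, h2] share at least one point, else 0.

    Touching limits count as overlap, and either interval may
    degenerate to a single point (l == h).
    """
    return 1 if (l1 <= h2 and l2 <= h1) else 0


h_alpha = 656.3e-9  # metres; the "interval" of interest is a single point

# A service covering 600 to 700 nm versus a point-like request.
point_in_range = interval_overlaps(6.0e-7, 7.0e-7, h_alpha, h_alpha)

# Both sides degenerate to points that differ in the last digits.
point_vs_point = interval_overlaps(656.28e-9, 656.28e-9, h_alpha, h_alpha)

print(point_in_range, point_vs_point)
```

The second call is the failure mode described above: two point-like intervals only match if they are exactly equal.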
But that’s academic at this point, because really few resources define their coverage in time and spectrum. Try it yourself:
SELECT COUNT(*) FROM (
SELECT DISTINCT ivoid FROM rr.stc_temporal) AS q
(the subquery with the DISTINCT is necessary because a single resource can have multiple rows for time and spectrum when there’s multiple distinct intervals – think observation campaigns). If this gives you more than a few dozen rows when you read this, I strongly suspect it’s no longer 2018.
To improve this situation, the service operators need to provide the information on the coverage in their resource records. Indeed, the registry schemas already have the notion of a coverage, and the Note, in its core, simply proposes to add three elements to the coverage element of VODataService 1.1. Two of these new elements – the coverage in time and spectrum – are simple floating-point intervals and can be repeated in order to allow non-contiguous coverage. The third element, the spatial coverage, uses a nifty data structure called a MOC, which expands to “HEALPix Multi-Order Coverage map” and is the main reason why I claim we can now pull off STC in the Registry: MOCs let databases and other programs easily and quickly manipulate areas on the sphere. Without MOCs, that’s a pain.
So, if you have registry records somewhere, please add the elements as soon as you can – if you don’t know how to make a MOC: CDS’ Aladin is there to help. In the end, your coverage elements should look somewhat like this:
The waveband elements are remnants from VODataService 1.1. They are still in use (prominently, for one, in SPLAT), and it’s certainly still a good idea to keep giving them for the foreseeable future. You can also see how you would represent multiple observing campaigns and different spectral ranges.
Finally, if you’re running DaCHS and you’re using it to generate registry records (and there’s almost no excuse for not doing so), you can simply write a coverage element into your RD starting with DaCHS 1.2 (or, if you run betas, 1.1.1, which is already available). You’ll find lots of examples at the usual place. A relatively interesting example is the resource descriptor of plts. It has this:
This particular service archives plate scans from the Palomar-Leiden Trojan surveys; these were looking for Trojan asteroids (of Jupiter) using the Palomar 122 cm Schmidt and were conducted in several shortish campaigns between 1960 and 1977 (incidentally, if you’re looking for things near the Ecliptic, this stuff might still hold valuable insights for you). Because the fill factor for the whole time period is rather small, I manually extracted the time coverage; for that, I ran select dateobs from plts.data via TAP and made the histogram plot above. Zooming in a bit, I read off the limits in TOPCAT’s coordinate display.
The other coverages, however, were put in automatically by DaCHS. That’s what the updater element does: for each axis, you can say where DaCHS should look, and it will then fill in the appropriate data from what it guesses gives the relevant coordinates – that’s straightforward for standard tables like the ones behind SSAP and SIAP services (or obscore tables, for that matter), perhaps a bit more involved otherwise. To say “just do it for all axes”, give the updater a single sourceTable attribute.
Finally, in this case I’m overriding mocOrder, the order down to which DaCHS tries to resolve spatial features. I’m doing this here because in determining the coverage of image services DaCHS right now only considers the centers of the images, and that’s severely underestimating the coverage here, where the data products are the beautiful large Schmidt plates. Hence, I’m lowering the resolution from the default 6 (about one degree linearly) to still give some approximation to the actual data coverage. We’ll fix the underlying deficit as soon as pgsphere, the postgres extension which is actually dealing with all the MOCs, has support for turning circles and polygons into MOCs.
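To get a feeling for what a given MOC order means in terms of resolution, the following Python sketch computes the linear size of a HEALPix pixel. It uses the standard HEALPix relation that order k divides the sphere into 12·4^k equal-area pixels; the function name and taking the square root of the pixel area as its “linear size” are choices made for this illustration.

```python
import math


def healpix_pixel_size_deg(order):
    """Return the approximate side length (degrees) of a HEALPix pixel."""
    npix = 12 * 4**order
    # Full sky in square degrees: 4 pi steradians times (180/pi)^2.
    full_sky_deg2 = 4 * math.pi * math.degrees(1.0)**2
    return math.sqrt(full_sky_deg2 / npix)


for order in range(3, 8):
    print(order, round(healpix_pixel_size_deg(order), 3))
```

Each step down in order doubles the linear pixel size, so lowering the order makes every image center account for a correspondingly larger patch of sky.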
When you have defined an updater, just run dachs limits q.rd, and DaCHS will carefully (preserving your indentation) re-write the RD to contain what DaCHS has worked out from your table (but careful: it will overwrite what was previously there; so, make sure you ask DaCHS to only deal with axes you’re not dealing with manually).
If you feel like writing code discovering holes in the intervals, ideally already in the database: that would be great, because the tighter the intervals defined, the fewer false positives people will have in data discovery.
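As a starting point for discussion, here is one very simple way to find such holes, sketched in Python rather than in the database: sort the values and start a new interval whenever two neighbours are further apart than some threshold. This is not what DaCHS does; the function name, the threshold parameter max_gap, and the sample MJDs are all invented.

```python
def split_at_gaps(times, max_gap):
    """Return (low, high) intervals covering times.

    A new interval starts wherever two consecutive (sorted) values
    are more than max_gap apart.
    """
    ordered = sorted(times)
    if not ordered:
        return []
    intervals = []
    low = prev = ordered[0]
    for current in ordered[1:]:
        if current - prev > max_gap:
            intervals.append((low, prev))
            low = current
        prev = current
    intervals.append((low, prev))
    return intervals


# Invented observation dates (MJD) forming three short campaigns.
dateobs = [37190.0, 37195.0, 37200.0, 38400.0, 38410.0, 43200.0]
campaigns = split_at_gaps(dateobs, max_gap=100.0)

print(campaigns)
```

The hard part, which this sketch dodges, is picking the threshold so that the number of intervals stays manageable while the fill factor goes up.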
The take-away for DaCHS operators is:
Add STC coverage to your resources as soon as you’ve updated to DaCHS 1.2
If you don’t have to have the tightest coverage declaration conceivable, all you have to do to have that is add
to your RD (where my_table is the id of your service’s “main” table) and then run dachs limits q.rd
For special effects and further information, see Coverage Metadata in the DaCHS reference documentation
If you have a nice postgres function that splits a simple coverage interval up so the filling factor of a set of new intervals increases (or know a nice, database-compatible algorithm to do so) – please let me know.
RegTAP is one of those standards that a scientist will normally not see – it works in the background and makes, for instance, TOPCAT display the Cone Search services matching some key words. And it’s behind the services like WIRR, our Web Interface to the Relational Registry (“Relational Registry” being the official name for RegTAP) that lets you do some interesting data discovery beyond what current clients support. In the screenshot above, for instance (try it yourself), I’m looking for cone search services having parallaxes presumably from radio observations. You could now transmit the services you’ve found to, say, TOPCAT or your own pyvo-based program to start querying them.
The key point of this query is the use of UCDs – these let services declare fairly unambiguously what kind of physics (if you take that word with a grain of salt) they are talking about. In the example, pos.parallax means, well, a parallax, and the percent character is a wildcard (coming not from UCDs, but from ADQL). That wildcard is a good idea here because without it we might miss things like pos.parallax;obs and pos.parallax;stat.fit that people might have used to distinguish “raw” and “processed” estimates.
UCDs are great for data discovery. Really.
Sometimes, however, clicking around in menus just isn’t good enough. That’s when you want the full power of RegTAP and write your very own queries. The good news: If you know ADQL (and you should!), you’re halfway there already.
Here’s one example of direct RegTAP use I came up with the other day. The use case was discovering data collections that give the effective temperatures of components of binary star systems.
If you check the UCD list, that “physics” translates into data that has columns with UCDs of phys.temperature and meta.code.multip at the same time. To translate that into a RegTAP query, have a look at the tables that make up a RegTAP service: its ”schema”. Section 8 of the standard lists all the tables there are, and there’s an ADASS poster that has an image of the schema with the more common columns illustrated. Oh, and if you’re new to RegTAP, you’re probably better off briefly studying the examples first to get a feeling for how RegTAP is supposed to work.
You will find that a pair of ivoid – the VO’s global resource identifier – and a per-resource table index uniquely identify a table within the entire registry. So, an ADQL query to pick out all tables containing temperatures and component identifiers would look like this:
SELECT DISTINCT ivoid, table_index
FROM rr.table_column AS t1
JOIN rr.table_column AS t2
USING (ivoid, table_index)
WHERE t1.ucd='phys.temperature'
AND t2.ucd='meta.code.multip'
– the DISTINCT makes it so even tables that have lots of temperatures or codes only turn up once in our result set, and the somewhat odd self-join of the rr.table_column table with itself lets us say “make sure the two columns are actually in the same table”. Note that you could catch multi-table resources that define the components in one table and the temperatures in another by just joining on ivoid rather than ivoid and table_index.
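The self-join is easier to digest on a toy table. The sketch below rebuilds a miniature table_column in SQLite via Python’s sqlite3 module (an assumption; a real RegTAP service has many more columns) and fills it with three invented resources: one with both UCDs in the same table, one with the UCDs spread over two tables, and one with only the component code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_column "
             "(ivoid TEXT, table_index INTEGER, name TEXT, ucd TEXT)")
# Invented registry content.
conn.executemany("INSERT INTO table_column VALUES (?, ?, ?, ?)", [
    ("ivo://example/a", 0, "teff", "phys.temperature"),
    ("ivo://example/a", 0, "teff2", "phys.temperature"),
    ("ivo://example/a", 0, "comp", "meta.code.multip"),
    ("ivo://example/b", 0, "teff", "phys.temperature"),
    ("ivo://example/b", 1, "comp", "meta.code.multip"),
    ("ivo://example/c", 0, "comp", "meta.code.multip"),
])

# Both UCDs must occur in the same table of the same resource.
same_table = conn.execute("""
    SELECT DISTINCT ivoid, table_index
    FROM table_column AS t1
    JOIN table_column AS t2 USING (ivoid, table_index)
    WHERE t1.ucd='phys.temperature' AND t2.ucd='meta.code.multip'
    ORDER BY ivoid, table_index""").fetchall()

# Relaxed variant: both UCDs anywhere within the same resource.
same_resource = conn.execute("""
    SELECT DISTINCT ivoid
    FROM table_column AS t1
    JOIN table_column AS t2 USING (ivoid)
    WHERE t1.ucd='phys.temperature' AND t2.ucd='meta.code.multip'
    ORDER BY ivoid""").fetchall()

print(same_table, same_resource)
```

Resource a shows up only once although it has two temperature columns, which is the DISTINCT at work; resource b only appears once the join is relaxed to ivoid alone.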
You can run this query on any RegTAP endpoint: GAVO operates a small network of mirrors behind http://reg.g-vo.org/tap, there’s the ESAC one at http://registry.euro-vo.org/regtap/tap, and STScI runs one at http://vao.stsci.edu/RegTAP/TapService.aspx. Just use your usual TAP client.
But granted, the result isn’t terribly user-friendly: just identifiers and numbers. We’d at least like to see the names and descriptions of the tables so we know if the data is somehow relevant.
RegTAP is designed so you can locate the columns you would like to retrieve or constrain and then just NATURAL JOIN everything together. The table_description and table_name columns are in rr.res_table, so all it takes to see them is to take the query above and join its result like this:
SELECT table_name, table_description
FROM rr.res_table
NATURAL JOIN (
SELECT DISTINCT ivoid, table_index
FROM rr.table_column AS t1
JOIN rr.table_column AS t2
USING (ivoid, table_index)
WHERE t1.ucd='phys.temperature'
AND t2.ucd='meta.code.multip') as q
If you try this, you’ll see that we’d like to get the descriptions of the resources embedding the tables, too, in order to get an idea what we can expect from a given data collection. And if we later want to find services exposing the tables (WIRR is nice for that – try the ivoid constraint –, but for this example all resources currently come from VizieR, so you can directly use VizieR’s TAP service to interact with the tables), you want the ivoids. Easy: Just join rr.resource and pick columns from there:
SELECT table_name, table_description, res_description, ivoid
FROM rr.res_table
NATURAL JOIN rr.resource
NATURAL JOIN (
SELECT DISTINCT ivoid, table_index
FROM rr.table_column AS t1
JOIN rr.table_column AS t2
USING (ivoid, table_index)
WHERE t1.ucd='phys.temperature'
AND t2.ucd='meta.code.multip') as q
If you’ve made it this far and know a bit of ADQL, you probably have all it really takes to solve really challenging data discovery problems – as far as Registry metadata reaches, that is, which currently does not include space-time coverage. But stay tuned, more on this soon.
The IVOA’s committee on science priorities (CSP) declared the “time domain” one of its focus topics quite a while ago, an action boiling down to a call to the IVOA member projects to think about support for time series and their analysis in services, standards, and clients.
While for several years, response has been lackluster, work on time series has gathered quite a bit of steam recently. For instance, the spectral client SPLAT (co-maintained by GAVO) has grown some preliminary support to properly display time series (very rudimentary in what’s currently released), and lively discussions on proper metadata for time series have been going on on the Data Models mailing list of the IVOA – if you’re interested in the time domain, this would be a good time to subscribe for a while and comment as appropriate.
Meanwhile, in our Heidelberg data center, we’ve joined the fray by publishing our first time series service (science background: searching for exoplanets in the Milky Way bulge using gravitational lensing), which is available through SSA (look for k2c9vst) and through ObsCore (at http://dc.g-vo.org/tap, collection name k2c9vst), too. For details see also the service info.
Since right now future standards are being worked out, this is a perfect time to publish your time series; this way you get to influence what people will be able to tell machines about their time series in the next couple of years. Ask our staff (contact below) if you want us to publish for you. But you can also self-publish using the DaCHS publication package. Refer to the resource descriptor of the k2c9vst service to get started.
At its heart is the table definition of the time series, which is basically
<column name="hjd" type="double precision"
description="Time this photometry corresponds to."
<column name="df" type="double precision"
description="Difference as defined by 2008MNRAS.386L..77B"
description="Error in difference flux."
– in the actual service, there are a few more columns, but time, value, and error actually make up a full time series.
Except that a machine can’t really tell what this is yet (well, perhaps it could using UCDs, but that’s a different matter). What it needs to work out is what’s the independent axis, what the frames are, etc. And to do that, the machine needs annotation, i.e., machine-readable, structured declarations alongside the data and the “classic” metadata like units and descriptions.
DaCHS, however, isolates you from the concrete details of writing VOTables. Instead, you write annotations in a JSON-inspired little language we’ve christened SIL (“Simple Instance Language”; reference). The complicated part is to know what types and attributes you have to declare, which is exactly what the data models are about. As said initially, the details are still in flux here, but this is what things look like right now:
If you consider this for a moment, you’ll see that each dm element corresponds to something like an object template of a certain “type”. The first, for instance, defines a measurement with a value and a statistical error. Both happen to be given as references to columns in the table defined above (as indicated by the @ signs).
The last annotation defines a data cube; a time series in this definition is simply a data cube with just a single non-degenerate independently varying axis (the independent_axis attribute; in the value the square brackets indicate a sequence) that happens to be time-like. And that hjd is time-like, VO-DML-enabled clients will work out when interpreting the STC (“Space-Time-Coordinates”) annotation. In there, you will see that hjd is referenced from the time attribute and with a time-like frame that also defines that this particular flavor of HJD is what a hypothetical clock at the solar system’s barycenter would measure if it stood in the gravitational potential in Greenwich, and had leap seconds thrown in now and then. And that long story is communicated through “literals”, constant strings like “BARYCENTER” or “TT”, which are also legal within DaCHS data model annotations.
This may seem a bit complicated at first. I argue, though, that given what time series clients will have to do anyway, going through the cube and STC annotations is actually about the most straightforward thing you can do.
But perhaps I’m wrong, so again: None of this is cast in stone right now. Comments are even more welcome than usual, either below or at firstname.lastname@example.org.
The 3rd Asterics DADI Tech Forum took place last week in Strasbourg – and many GAVO members made contributions as well.
This time, there were 3 slots for hackathon sessions, which were also used for discussions. We’ll mention two highlights of our contributions here.
We took the opportunity to push our Provenance Data Model efforts and used the hackathon slots for provenance discussions.
One topic was the links between the simulation data model and ProvenanceDM, and how to map from SimDM to ProvenanceDM classes. This mapping works quite well and will be included in the working draft for the data model. We also had an interesting talk by José Enrique Ruiz on his view on Provenance, workflows, and – very important – the “deployer” and “system” provenance for storing all the environment variables that may be needed to rerun the processing of some observational data. Michèle Sanguillon also presented for the first time her extension of the (W3C) prov Python library with features from our IVOA Provenance Data Model. We also had people from outside the usual provenance crowd joining in, e.g. from the Astron project. More about our Provenance modelling efforts can be found at the IVOA Provenance wiki page.
A world premiere (of sorts) was the first discussion of RegTAP 1.1. RegTAP is a search interface to the VO Registry; it is what TOPCAT or other VO clients use when you type in keywords to locate services. A fairly direct web-based interface is our WIRR registry interface. RegTAP will need a bit of a makeover since VOResource, the underlying metadata scheme, is currently receiving one, allowing, in particular, for including DOIs and ORCIDs (John Does of this world, rejoice: People can finally uniquely find your data and not that of all the other J. Does) in Registry records and figuring out licenses on data. Licensing may not matter when you use data in a paper but it does matter if you want to redistribute data, e.g. for planetarium programs with catalog data or pretty pictures, or when re-mixing data.
But of course the GAVOistas happily joined the fray on the many other topics discussed, from a standard format for a time series to interoperable authentication, from datalink applications to figuring out if data coming into a program should be treated as a collection of spectra or rather an object catalog – the latter in the context of the upcoming version 10 of the VO’s premier image tool Aladin, which we saw (probably another premiere) demoed. We can already promise you an exciting update!
DaCHS, the Data Center Helper Suite, is a comprehensive suite for publishing astronomical data to the Virtual Observatory, supporting most major protocols out there. On Dec 12, GAVO released a new version, 0.9.8. The most notable change is that now SODA is supported as specified in the last IVOA Proposed Recommendation.
This is fairly big news, as SODA is the VO’s answer to providing cutout services and the like, which obviously is important with datasets in the multi-gigabyte range and for the VO’s wider programme of trying to enable users to only download what they need. But even for spectra, which aren’t typically terribly large, we have been using SODA; for instance, when you just want to see the development of a single line over time, it’s nice to not have to bother with the full spectrum. The spectral client SPLAT has been offering such functionality for a couple of years now; watch out for the scissors icon in discovery results, which indicates SODA support on the respective services.
Another client that will support SODA and its basis Datalink is Aladin – we’ve seen a promising demo of that during the last Interop in Trieste. Until the clients are there, DaCHS contains a (largely re-usable) stylesheet that generates simple UIs for Datalink documents and SODA services. Some examples:
Note again that all of these are not actually web pages, they’re machine-readable metadata collections; if you don’t believe it, pull the URLs with curl. To learn more about the combo of Datalink and SODA, check out this ADASS 2015 poster (preferably before even looking at the not terribly readable standards texts).
UWS stands for Universal Worker Service and is an IVOA standard that provides a protocol for accessing databases and other web services, for instance from the command line using the Python uws-client.
It lets you create (asynchronous) jobs for a web service (e.g. an SQL query), check their status, retrieve their results, and abort or delete them.
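To make that a bit more concrete, here is a minimal sketch of how these job operations map to HTTP requests in UWS's REST pattern. It only builds the requests and does not talk to any server. The endpoint in JOBS_URL and the helper name uws_request are made up for illustration, and this is not how the uws-client is implemented.

```python
from urllib.parse import urlencode

# Hypothetical job list endpoint; substitute the URL of a real UWS service.
JOBS_URL = "http://example.org/uws/jobs"


def uws_request(action, job_id=None, params=None):
    """Return (HTTP method, URL, form body) for a basic UWS job operation.

    This is an illustrative helper, not part of any UWS library.
    """
    if action == "create":
        # Creating a job: POST the job parameters to the job list.
        return "POST", JOBS_URL, urlencode(params or {})

    job_url = f"{JOBS_URL}/{job_id}"
    if action == "run":
        # A freshly created job is PENDING until its phase is set to RUN.
        return "POST", f"{job_url}/phase", "PHASE=RUN"
    if action == "status":
        return "GET", f"{job_url}/phase", None
    if action == "results":
        return "GET", f"{job_url}/results", None
    if action == "abort":
        return "POST", f"{job_url}/phase", "PHASE=ABORT"
    if action == "delete":
        return "DELETE", job_url, None
    raise ValueError(f"unknown action: {action}")


if __name__ == "__main__":
    print(uws_request("create", params={"QUERY": "select 1"}))
    print(uws_request("status", "42"))
```

What parameters a job accepts on creation (QUERY here is just an example) depends on the service behind the UWS interface.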
The updated version 1.1 was approved at the InterOperability Meeting last week and brings some nice new features:
Job list filtering: When retrieving the job list, one can now retrieve only jobs created after a certain date, the latest n jobs, or jobs with a certain phase (e.g. EXECUTING or COMPLETED).
WAIT: When asking for job details, it is now possible to append a WAIT parameter and provide an integer as wait-time in seconds. This means that the job details will only be returned when the wait-time is over or the job’s phase has changed, whichever comes first.
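As a rough sketch of what the two features above look like on the wire, the following builds the corresponding URLs. The parameter names PHASE, AFTER, LAST and WAIT are the ones I understand UWS 1.1 to define; the endpoint and the two helper functions are hypothetical, so check against the service you are actually using.

```python
from urllib.parse import urlencode

# Hypothetical job list endpoint; substitute the URL of a real UWS 1.1 service.
JOBS_URL = "http://example.org/uws/jobs"


def job_list_url(phase=None, after=None, last=None):
    """Build a job list URL, optionally filtered by phase, date, or count."""
    query = []
    if phase is not None:
        query.append(("PHASE", phase))      # e.g. "EXECUTING" or "COMPLETED"
    if after is not None:
        query.append(("AFTER", after))      # ISO 8601 timestamp
    if last is not None:
        query.append(("LAST", str(last)))   # only the latest n jobs
    return JOBS_URL + ("?" + urlencode(query) if query else "")


def job_details_url(job_id, wait=None):
    """Build a job details URL; with wait, the server may block up to that many seconds."""
    url = f"{JOBS_URL}/{job_id}"
    if wait is not None:
        url += "?" + urlencode({"WAIT": int(wait)})
    return url


if __name__ == "__main__":
    print(job_list_url(phase="EXECUTING", last=5))
    print(job_details_url("42", wait=30))
```

A client polling for completion could then fetch job_details_url(job_id, wait=30) in a loop instead of sleeping between requests; the server answers early as soon as the phase changes.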