Posts with the Tag TAP:

Persistent TAP Uploads Update: A Management Interface

2025-05-21 Markus Demleitner

Setting Lifetimes
Creating Indexes
Special Index Types

a screenshot of a python notebook; a few lines of python yield the current date, a few more a date a year from now.

There is a new version of the jupyter notebook showing off the persistent TAP uploads in python coming with this post, too: Get it.

Six months ago, I reported on my proposal for persistent uploads into TAP services on this very blog: Basically, you could have and keep your own tables in databases of TAP servers supporting this, either by uploading them or by creating them with an ADQL query. Each such table has a URI; you PUT to it to create it, you GET from it to inspect its metadata, VOSI-style, and you DELETE to it to drop the table once you're done.

Back then, I enumerated a few open issues; two of these I have recently addressed: Lifetime management and index creation. Here is how:

Setting Lifetimes

In my scheme, services assign a lifetime to user-uploaded tables, mainly in order to nicely recover when users don't keep books on what they created. The service will eventually clean up their tables after them, in the case of the reference implementation in DaCHS after a week.

However, there are clearly cases when you would like to extend the lifetime of your table beyond that week. To let users do that, my new interface copies the pattern of UWS. There, jobs have a destruction child. You can post DALI-style timestamps[1] there, and both the POST and the GET return a DALI timestamp of the destruction time actually set; this may be different from what you asked for because services may set hard limits (in my case, a year).

For instance, to find out when the service will drop the table you will create when you follow last October's post you could run:

curl http://dc.g-vo.org/tap/user_tables/my_upload/destruction

To request that the table be preserved until the Epochalypse you would say:

curl -F DESTRUCTION=2038-01-19T03:14:07Z http://dc.g-vo.org/tap/user_tables/my_upload/destruction

Incidentally (and as you can see from the POST response), until January 2037, my service will reduce your request to “a year from now”.

I can't say I'm too wild about posting a parameter called “DESTRUCTION” to an endpoint that's called “destruction” (even if it weren't such a mean word). UWS did that because they wanted it make it easy to operate a compliant service from a web browser. Whether that still is a reasonable design goal (in particular because everyone seems to be wild on dumping 20 metric tons of Javascript on their users even things like UWS would make it easy to not do that) is certainly debatable. But I thought it's better to have a single questionable pattern throughout rather than have something a little odd in one place and something a little less odd in another place.

Creating Indexes

For many applications of database systems, having indexes is crucial. You really, really don't want to have to go through a table with 2 billion rows just to find a single object (that's a quarter of a day when you manage to pull through 100'000 rows a second; ok: today's computers are faster than that). While persistently uploaded tables won't (regularly) have two billion rows any time soon, indexes are already very valuable even for tables in the million-row range.

On the other hand, there are many sorts of indexes, and there are many ways to qualify indexes. To get an idea of what you might want to tell a database about an index, see Postgres' CREATE INDEX docs. And that's just for Postgres; other database systems still do it differently, and of course when you index on expressions, there is no limit to the complexity you can build into your indexes.

Building a cross-database API that would reflect all that is entirely out of the question. Hence, I went for the other extreme: You just specify which column(s) you would like to have indexed, and the service is supposed to choose a plausible index type for you.

Following the model of destruction (typography matters!), this is done by POST-ing one or more column names in INDEX parameters to the index child of the table url. For instance, if you have put a table my_upload that has a column Kmag (e.g., from last October's post), you would say:

curl -L -F INDEX=Kmag http://dc.g-vo.org/tap/user_tables/my_upload/index

The -L makes curl follow the redirect that this issues. Why would it redirect, you ask? The index request creates a UWS job behin the scenes, that is, something like a TAP async job. What you get redirected to is that job.

The background is that for large tables and complex indexes, you may easily get your (appartently idle) connection cut while the index is being created, and you would never learn if a problem had materialised or when the indexing is done. Against that, UWS lets us keep running, and you have a URI at which to inspect the progress of the indexing operation (well, frankly: nothing yet beyond “is it done?”).

Speaking UWS with curl is no fun, but then you don't need to: The job starts in QUEUED and will automatically execute when the machine next has time. In case you are curious, see the notebook linked above, where there is an example for manually following the job's progress. You could use generic UWS clients to watch it, too.

A weak point of the scheme (and one that's surprisingly hard to fix) is that the index is immediately shown in the table metadata the notebook linked to above shows this; I'll spare you the VODataService XML that curl-ing the table URL will spit at you, but in there you will see the Kmag index whether or not the indexer job has run.

It shares this deficit with another way to look at indexes. You see, since there is so much backend-specific stuff you may want to know about an index, I am also proposing that when you GET the index child, you get back the actual database statements, or at least something rather similar. This is expressly not supposed to be machine readable, if only because what you see is highly dependent on the underlying database.

Here is how this looks like on DaCHS over postgres after the index call on Kmag:

$ curl http://dc.g-vo.org/tap/user_tables/my_upload/index
Indexes on table tap_user.my_upload

CREATE INDEX my_upload_Kmag ON tap_user.my_upload (Kmag)

I would not want to claim that this particular human-readable either. But humans that try to understand why a computer does not behave as they expect will certainly appreciate something like this.

Special Index Types

If you look at the tmp.vot from last october's post, you will see that there is an a pair of equatorial coordinates in _RAJ2000 and _DEJ2000. It is nicely marked up with pos.eq UCDs, and the units are deg: This is an example of a column set that DaCHS has special index magic for. Try it:

curl -L -F INDEX=_RAJ2000 -F INDEX=_DEJ2000 \
  http://dc.g-vo.org/tap/user_tables/my_upload/index > /dev/null

Another GET against index will show you that this index is a bit different, stuttering something about q3c (or perhaps spoint at another time or on another service):

Indexes on table tap_user.my_upload

CREATE INDEX my_upload__RAJ2000__DEJ2000 ON tap_user.my_upload (q3c_ang2ipix("_RAJ2000","_DEJ2000"))
CLUSTER my_upload__RAJ2000__DEJ2000 ON tap_user.my_upload
CREATE INDEX my_upload_Kmag ON tap_user.my_upload (Kmag)

DaCHS will also recognise spatial points. Let's quickly create a table with a few points by running:

CREATE TABLE tap_user.somepoints AS
SELECT TOP 30 preview, ssa_location
FROM gdr3spec.ssameta

on the TAP server at http://dc.g-vo.org/tap, for instance in TOPCAT (as explained in post one, the “Table contained no rows” message you will see then is to be expected). Since TOPCAT does not know about persistent uploads yet, you have to create the index using curl:

curl -LF INDEX=ssa_location http://dc.g-vo.org/tap/user_tables/somepoints/index

GET-ting the index URL after that will yield:

Indexes on table tap_user.somepoints

CREATE INDEX ndpmaliptmpa_ssa_location ON tap_user.ndpmaliptmpa USING GIST (ssa_location)
CLUSTER ndpmaliptmpa_ssa_location ON tap_user.ndpmaliptmpa

The slightly shocking name of the table is an implementation detail that I might want to hide at some point; the important thing here is the USING GIST that indicates DaCHS has realised that for spatial queries to be able to use the index, a special method is necessary.

Incidentally, I was (and still am) not entirely sure what to do when someone asks for this:

curl -L -F INDEX=_Glon -F INDEX=_DEJ2000 \
  http://dc.g-vo.org/tap/user_tables/my_upload/index > /dev/null

That's a latitude and a longitude all right, but of course they don't belong together. Do I want to treat these as two random columns being indexed together, or do I decide that the user very probably wants to use a very odd coordinate system here?

Well, try it and see how I decided; after this post, you know what to do.

[1]	Many people call that “ISO format”, but I cannot resist pointing out that ISO, in addition to charging people who want to read their standards an arm and leg, admits a panic-inducing variety of date formats, and so “ISO format” not a particularly useful term.

A New Constraint Class in PyVO's Registry API: UAT

2025-02-14 Markus Demleitner

A scan of a book page: lots of astronomy-relevant topics ranging from "Cronometrie" to "Kosmologie, Relativitätstheorie". Overlaid a title page stating "Astronomischer Jahresbericht. Die Literatur des Jahres 1967".

This was how they did what I am talking about here almost 60 years ago: a page of the table of contents of the “Astronomischer Jahresbericht” for 1967, the last volume before it was turned into the English-language Astronomy and Astrophysics Abstracts, which were the main tool for literature work in astronomy until the ADS came along in the late 1990ies.

Thesauri and the UAT
Why Keywords?
The UAT constraint
Implementation

I have recently created a pull request against pyVO to furnish the library with a new constraint to search for data and services: Search by a concept drawn from the Unified Astronomy Thesaurus UAT. This is not entirely different from the classical search by subject keywords that was what everyone did before we had the ADS, which is what I am trying to illustrate above. But it has some twists that, I would argue, still make it valuable even in the age of full-text indexes.

To make my argument, let me first set the stage.

Thesauri and the UAT

(Disclaimer: I am currently a member of the UAT steering committee and therefore cannot claim neutrality. However, I would not claim neutrality otherwise, either: the UAT is not perfect, but it's already great)

Librarians (and I am one at heart) love thesauri. Or taxonomies. Or perhaps even ontologies. What may sound like things out of a Harry Potter novel are actually ways to organise a part of the world (a “domain”) into “concepts”. If you are suitably minded, you can think of a “concept“ as a subset of the domain; “suitably minded“ here means that you consider the world as a large set of things and a domain a subset of this world. The IVOA Vocabularies specification contains some additional philosophical background on this way of thinking in sect. 5.2.4.

On this other hand, if you are not suitably minded, a “concept” is not much different from a topic.

There are differences in how each of thesaurus, taxonomy, and ontology does that organising (and people don't always agree on the differences). Ontologies, for instance, let you link concepts in every way, as in “a (bicycle) (is steered) (using) a (handle bar) (made of) ((steel) or (aluminum))“; every parenthesised phrase would be a node (which is a better term in ontologies than “concept”) in a suitably general ontology, and connecting these nodes creates a fine-graned representation of knowledge about the world.

That is potentially extremely powerful, but also almost too hard for humans. Check out WordNet for how far one can take ontologies if very many very smart people spend very many years.

Thesauri, on the other hand, are not as powerful, but they are simpler and within reach for mere humans: there, concepts are simply organised into something like a tree, perhaps (and that is what many people would call a taxonomy) using is-a relationships: A human is a primate is a mammal is a vertebrate is an animal. The UAT actually is using somewhat vaguer notions called “narrower” and “wider”. This lets you state useful if somewhat loose relationships like “asteroid-rotation is narrower than asteroid-dynamics”. For experts: The UAT is using a formalism called SKOS; but don't worry if you can't seem to care.

The UAT is standing on the shoulders of giants: Before it, there has been the IAU thesaurus in 1993, and an astronomy thesaurus was also produced under the auspices of the IVOA. And then there were (and to some extent still are) the numerous keyword schemes designed by journal publishers that would also count as some sort of taxonomy or astronomy.

“Numerous” is not good when people have to assign keywords to their journal articles: If A&A use something drastically or only subtly different from ApJ, and MNRAS still something else, people submitting to multiple journals will quite likely lose their patience and diligence with the keywords. For reasons I will discuss in a second, that is a shame.

Therefore, at least the big American journals have now all switched to using UAT keywords, and I sincerely hope that their international counterparts will follow their example where that has not already happened.

Why Keywords?

Of course, you can argue that when you can do full-text searches, why would you even bother with controlled keyword lists? Against that, I would first argue that it is extremely useful to have a clear idea of what a thing is called: For example, is it delta Cephei stars, Cepheids, δ Cep stars or still something else? Full text search would need to be rather smart to be able to sort out terminological turmoil of this kind for you.

And then you would still not know if W Virginis stars (or should you say “Type II Cepheids”? You see how useful proper terminology is) are included in whatever your author called Cepheids (or whatever they called it). Defining concepts as precisely as possible thus is already great.

The keyword system becomes even more useful when the hiearchy we see in the Cepheid example becomes visible to computers. If a computer knows that there is some relationship between W Virgins stars and classical Cepheids, it can, for instance, expand or refine your queries (“give me data for all kinds of Cepheids”) as necessary. To give you an idea of how this looks in practice, here is how SemBaReBro displays the Cepheid area in the UAT:

Arrows between texts like "Type II Cepheid variable stars", "Cepheid variable stars", and "Young disk Cepheid variable stars"

In that image, only concepts associated with resources in the Registry have a spiffy IVOA logo; that so few VO resources claim to deal with Cepheids tells you that our data providers can probably improve their annotations quite a bit. But that is for another day; the hope is that as more people search using UAT concepts, the data providers will see a larger benefit in choosing them wisely[1].

By the way, if you are a regular around here, you will have seen images like that before; I have talked about Sembarebro in 2021 already, and that post contains more reasons for having and maintaining vocabularies.

Oh, and for the definitions of the concepts, you can (in general; in the UAT, there are still a few concepts without definitions) dereference the concept URI, which in the VO is always of the form <vocabulary uri>#<term identifier>, where the vocabulary URI starts with http://www.ivoa.net/rdf, after which there is the vocabulary name.

Thus, if you point your web browser to https://www.ivoa.net/rdf/uat#cepheid-variable-stars [2], you will learn that a Cepheid is:

A class of luminous, yellow supergiants that are pulsating variables and whose period of variation is a function of their luminosity. These stars expand and contract at extremely regular periods, in the range 1-50 days [...]

The UAT constraint

Remember? This was supposed to be a blog post about a new search constraint in pyVO. Well, after all the preliminaries I can finally reveal that once pyVO PR #649 is merged, you can search by UAT concepts:

>>> from pyvo import registry
>>> print(registry.search(registry.UAT("variable-stars")))
<DALResultsTable length=2010>
              ivoid               ...
                                  ...
              object              ...
--------------------------------- ...
         ivo://cds.vizier/b/corot ...
          ivo://cds.vizier/b/gcvs ...
           ivo://cds.vizier/b/vsx ...
          ivo://cds.vizier/i/280b ...
           ivo://cds.vizier/i/345 ...
           ivo://cds.vizier/i/350 ...
                              ... ...
            ivo://cds.vizier/v/97 ...
         ivo://cds.vizier/vii/293 ...
   ivo://org.gavo.dc/apass/q/cone ...
ivo://org.gavo.dc/bgds/l/meanphot ...
     ivo://org.gavo.dc/bgds/l/ssa ...
     ivo://org.gavo.dc/bgds/q/sia ...

In case you have never used pyVO's Registry API before, you may want to skim my post on that topic before continuing.

Since the default keyword search also queries RegTAP's res_subject table (which is what this constraint is based on), this is perhaps not too exciting. At least there is a built-in protection against typos:

>>> print(registry.search(registry.UAT("varialbe-stars")))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/msdemlei/gavo/src/pyvo/pyvo/registry/rtcons.py", line 713, in __init__
    raise dalq.DALQueryError(
pyvo.dal.exceptions.DALQueryError: varialbe-stars does not identify an IVOA uat concept (see http://www.ivoa.net/rdf/uat).

It becomes more exciting when you start exploiting the intrinsic hierarchy; the constraint constructor supports optional keyword arguments expand_up and expand_down, giving the number of levels of parent and child concepts to include. For instance, to discover resources talking about any sort of supernova, you would say:

>>> print(registry.search(registry.UAT("supernovae", expand_down=10)))
<DALResultsTable length=593>
                 ivoid                   ...
                                         ...
                 object                  ...
---------------------------------------- ...
                   ivo://cds.vizier/b/sn ...
                 ivo://cds.vizier/ii/159 ...
                 ivo://cds.vizier/ii/189 ...
                 ivo://cds.vizier/ii/205 ...
                ivo://cds.vizier/ii/214a ...
                 ivo://cds.vizier/ii/218 ...
                                     ... ...
           ivo://cds.vizier/j/pasp/122/1 ...
       ivo://cds.vizier/j/pasp/131/a4002 ...
           ivo://cds.vizier/j/pazh/30/37 ...
          ivo://cds.vizier/j/pazh/37/837 ...
ivo://edu.gavo.org/eurovo/aida_snconfirm ...
                ivo://mast.stsci/candels ...

There is no overwhelming magic in this, as you can see when you tell pyVO to show you the query it actually runs:

>>> print(registry.get_RegTAP_query(registry.UAT("supernovae", expand_down=10)))
SELECT
  [crazy stuff elided]
WHERE
(ivoid IN (SELECT DISTINCT ivoid FROM rr.res_subject WHERE res_subject in (
  'core-collapse-supernovae', 'hypernovae', 'supernovae',
  'type-ia-supernovae', 'type-ib-supernovae', 'type-ic-supernovae',
  'type-ii-supernovae')))
GROUP BY [whatever]

Incidentally, some services have an ADQL extension (a “user defined function“ or UDF) that lets you do these kinds of things on the server side; that is particularly nice when you do not have the power of Python at your fingertips, as for instance interactively in TOPCAT. This UDF is:

gavo_vocmatch(vocname STRING, term STRING, matchagainst STRING) -> INTEGER

(documentation at the GAVO data centre). There are technical differences, some of which I try to explain in amoment. But if you run something like:

SELECT ivoid FROM rr.res_subject
WHERE 1=gavo_vocmatch('uat', 'supernovae', res_subject)

on the TAP service at http://dc.g-vo.org/tap, you will get what you would get with registry.UAT("supernovae", expand_down=1). That UDF also works with other vocabularies. I particularly like the combination of product-type, obscore, and gavo_vocmatch.

If you wonder why gavo_vocmatch does not go on expanding towards narrower concepts as far as it can go: That is because what pyVO does is semantically somewhat questionable.

You see, SKOS' notions of what is wider and narrower are not transitive. This means that just because A is wider than B and B is wider than C it is not certain that A is wider than C. In the UAT, this sometimes leads to odd results when you follow a branch of concepts toward narrower concepts, mostly because narrower sometimes means part-of (“Meronymy”) and sometimes is-a (“Hyponymy“). Here is an example discovered by my colleague Adrian Lucy:

interstellar-medium wider nebulae wider emission-nebulae wider planetary-nebulae wider planetary-nebulae-nuclei

Certainly, nobody would argue that that the central stars of planetary nebulae somehow are a sort of or are part of the interstellar medium, although each individual relationship in that chain makes sense as such.

Since SKOS relationships are not transitive, gavo_vocmatch, being a general tool, has to stop at one level of expansion. By the way, it will not do that for the other flavours of IVOA vocabularies, which have other (transitive) notions of narrower-ness. With the UAT constraint, I have fewer scruples, in particular since the expansion depth is under user control.

Implementation

Talking about technicalities, let me use this opportunity to invite you to contribute your own Registry constraints to pyVO. They are not particularly hard to write if you know both ADQL and Python. You will find several examples – between trivial and service-sensing complex in pyvo.registry.rtcons. The code for UAT looks like this (documentation removed for clarity[3]):

class UAT(SubqueriedConstraint):
    _keyword = "uat"
    _subquery_table = "rr.res_subject"
    _condition = "res_subject in {query_terms}"
    _uat = None

    @classmethod
    def _expand(cls, term, level, direction):
        result = {term}
        new_concepts = cls._uat[term][direction]
        if level:
            for concept in new_concepts:
                result |= cls._expand(concept, level-1, direction)
        return result

    def __init__(self, uat_keyword, *, expand_up=0, expand_down=0):
        if self.__class__._uat is None:
            self.__class__._uat = vocabularies.get_vocabulary("uat")["terms"]

        if uat_keyword not in self._uat:
            raise dalq.DALQueryError(
                f"{uat_keyword} does not identify an IVOA uat"
                " concept (see http://www.ivoa.net/rdf/uat).")

        query_terms = {uat_keyword}
        if expand_up:
            query_terms |= self._expand(uat_keyword, expand_up, "wider")
        if expand_down:
            query_terms |= self._expand(uat_keyword, expand_down, "narrower")

        self._fillers = {"query_terms": query_terms}

Let me briefly describe what is going on here. First, we inherit from the base class SubqueriedConstraint. This is a class that takes care that your constraints are nicely encapsulated in a subquery, which generally is what you want in pyVO. Calmly adding natural joins as recommended by the RegTAP specification is a dangerous thing for pyVO because as soon as a resource matches your constraint more than once (think “columns with a given UCD”), the RegistryResult lists in pyVO will turn funny.

To make a concrete SubqueriedConstraint, you have to fill out:

the table it will operate on, which is in the _subquery_table class attribute;
an expression suitable for a WHERE clause in the _condition attribute, which is a template for str.format. This is often computed in the constructor, but here it is just a constant expression and thus works fine as a class attribute;
a mapping _fillers mapping the substitutions in the _condition string template to Python values. PyVO's RegTAP machinery will worry about making SQL literals out of these, so feel free to just dump Python values in there. See the make_SQL_literal for what kinds of types it understands and expand it as necessary.

There is an extra class attribute called _keyword. This is used by the pyvo.regtap machinery to let users say, for instance, registry.search(uat="foo.bar") instead of registry.search(registry.UAT("foo.bar")). This is a fairly popular shortcut when your constraints can be expressed as simple strings, but in the case of the UAT constraint you would be missing out on all the interesting functionality (viz., the query expansion that is only available through optional arguments to its constructor).

This particular class has some extra logic. For one, we cache a copy of the UAT terms on first use at the class level. That is not critical for performance because caching already happens at the level of get_vocabulary; but it is convenient when we want query expansion in a class method, which in turn to me feels right because the expansion does not depend on the instance. If you don't grok the __class__ magic, don't worry. It's a nerd thing.

More interesting is what happens in the _expand class method. This takes the term to expand, the number of levels to go, and whether to go up or down in the concept trees (which are of the computer science sort, i.e., with the root at the top) in the direction argument, which can be wider or narrower, following the names of properties in Desise, the format we get our vocabulary in. To learn more about Desise, see section 3.2 of Vocabularies in the VO 2.

At each level, the method now collects the wider or narrower terms, and if there are still levels to include, calls itself on each new term, just with level reduced by one. I consider this a particularly natural application of recursion. Finally. everything coming back is merged into a set, which then is the return value.

And that's really it. Come on: write your own RegTAP constraints, and also have fun with vocabularies. As you see here, it's really not that magic.

[1]	Also, just so you don't leave with the impression I don't believe in AI tech at all, something like SciX's KAILAS might also help improving Registry subject keywords.

[2]

Yes, in a little sleight of hand, I've switched the URI scheme to https here. That's not really right, because the term URIs are supposed to be opaque, but some browsers currently forget the fragment identifiers when the IVOA web server redirects them to https, and so https is safer for this demonstration. This is a good example of why the web would be a better place if http had been evolved to support transparent, client-controlled encryption (rather than inventing https).

[3]	I've always wanted to write this.

Doing Large-Scale ADQL Queries

2025-01-27 Markus Demleitner

Raising the Match Limit
Giving the Query More Time
Planning for Large Result Sets? Get in Contact!
Uniform-Length Partitions
Equal-Size Partitions

You can do many interesting things with TAP and ADQL while just running queries returning a few thousand rows after a few seconds. Most examples you would find in tutorials are of that type, and when the right indexes exist on the queried tables, the scope of, let's say, casual ADQL goes far beyond toy examples.

Actually, arranging things such that you only fetch the data you need for the analysis at hand – and that often is not much more than the couple of kilobytes that go into a plot or a regression or whatever – is a big reason why TAP and ADQL were invented in the first place.

But there are times when the right indexes are not in place, or when you absolutely have to do something for almost everything in a large table. Database folks call that a sequential scan or seqscan for short. For larger tables (to give an order of magnitude: beyond 10⁷ rows in my data centre, but that obviously depends), this means you have to allow for longer run times. There are even times when you may need to fetch large portions of such a large table, which means you will probably run into hard match limits when there is just no way to retrieve your full result set in one go.

This post is about ways to deal with such situations. But let me state already that having to go these paths (in particular the partitioning we will get to towards the end of the post) may be a sign that you want to re-think what you are doing, and below I am briefly giving pointers on that, too.

Raising the Match Limit

Most TAP services will not let you retrieve arbitrarily many rows in one go. Mine, for instance, at this point will snip results off at 20'000 rows by default, mainly to protect you and your network connection against being swamped by huge results you did not expect.

You can, and frequently will have to (even for an all-sky level 6 HEALPix map, for instance, as that will retrieve 49'152 rows), raise that match limit. In TOPCAT, that is done through a little combo box above the query input (you can enter custom values if you want):

A screenshot with a few widgets from TOPCAT. A combo box is opened, and the selection is on "20000 (default)".

If you are somewhat confident that you know what you are doing, there is nothing wrong with picking the maximum limit right away. On the other hand, if you are not prepared to do something sensible with, say, two million rows, then perhaps put in a smaller limit just to be sure.

In pyVO, which we will be using in the rest of this post, this is the maxrec argument to run_sync and its sibling methods on TAPService.

Giving the Query More Time

When dealing with non-trivial queries on large tables, you will often also have to give the query some extra time. On my service, for instance, you only have a few seconds of CPU time when your client uses TAP's synchronous mode (by calling TAPService.run_sync method). If your query needs more time, you will have to go async. In the simplest case, all that takes is write run_async rather than run_sync (below, we will use a somewhat more involved API; find out more about this in our pyVO course).

In async mode, you have two hours on my box at this point; this kind of time limit is, I think, fairly typical. If even that is not enough, you can ask for more time by changing the job's execution_duration parameter (before submitting it to the database engine; you cannot change the execution duration of a running job, sorry).

Let us take the example of a colour-magnitude diagram for stars in Gaia DR3 with distances of about 300 pc according to Bailer-Jones et al (2021); to make things a bit more entertaining, we want to load the result in TOPCAT without first downloading it locally; instead, we will transmit the result's URI directly to TOPCAT[1], which means that your code does not have to parse and re-package the (potentially large) data.

On the first reading, focus on the main function, though; the SAMP fun is for later:

import time
import pyvo

QUERY = """
SELECT
    source_id, phot_g_mean_mag, pseudocolour,
    pseudocolour_error, phot_g_mean_flux_over_error
FROM gedr3dist.litewithdist
WHERE
    r_med_photogeo between 290 and 310
    AND ruwe<1.4
    AND pseudocolour BETWEEN 1.0 AND 1.8
"""

def send_table_url_to_topcat(conn, table_url):
    client_id = pyvo.samp.find_client_id(conn, "topcat")
    message = {
        "samp.mtype": "table.load.votable",
        "samp.params": {
            "url": table_url,
            "name": "TAP result",}
    }
    conn.notify(client_id, message)


def main():
    svc = pyvo.dal.TAPService("http://dc.g-vo.org/tap")
    job = svc.submit_job(QUERY, maxrec=3000)
    try:
        job.execution_duration=10000  # that's 10000 seconds
        job.run()
        job.wait()
        assert job.phase=="COMPLETED"

        with pyvo.samp.connection(addr="127.0.0.1") as conn:
            send_table_url_to_topcat(conn, job.result_uri)
    finally:
        job.delete()

if __name__=="__main__":
    main()

As written, this will be fast thanks to maxrec=3000, and you wouldn't really have to bother with async just yet. The result looks nicely familiar, which means that in that distance range, the Bailer-Jones distances are pretty good:

A rather sparsely populated colour-magnitude diagram with a pretty visible main sequence.

Now raise the match limit to 30000, and you will already need async. Here is what the result looks like:

A more densely populated colour-magnitude diagram with a pretty visible main sequence, where a giant branch starts to show up.

Ha! Numbers matter: at least we are seeing a nice giant branch now! And of course the dot colours do not represent the colours of the stars with the respective pseudocolour; the directions of blue and red are ok, but most of what you are seeing here will look rather ruddy in reality.

You will not really need to change execution_duration here, nor will you need it even when setting maxrec=1000000 (or anything more, for that matter, as the full result set size is 330'545), as that ends up finishing within something like ten minutes. Incidentally, the result for the entire 300 pc shell, now as a saner density plot, looks like this:

A full colour-magnitude diagram with densities coded in colours. A huge blob is at the red end of the main sequence, and there is a well-defined giant branch and a very visible horizontal branch.

Ha! Numbers matter even more. There is now even a (to me surprisingly clear) horizontal branch in the plot.

Planning for Large Result Sets? Get in Contact!

Note that if you were after a global colour-magnitude diagram as the one I have just shown, you should probably do server-side aggregation (that is: compute the densities in a few hundred or thousand bins on the server and only retrieve those then) rather than load ever larger result sets and then have the aggregation be performed by TOPCAT. More generally, it usually pays to try and optimise ADQL queries that are slow and have huge result sets before fiddling with async and, even more, with partitioning.

Most operators will be happy to help you do that; you will find some contact information in TOPCAT's service tab, for instance. In pyVO, you could use the get_contact method of the objects you get back from the Registry API[2]:

>>> pyvo.registry.search(ivoid="ivo://org.gavo.dc/tap")[0].get_contact()
'GAVO Data Centre Team (+49 6221 54 1837) <gavo@ari.uni-heidelberg.de>'

That said: sometimes neither optimisation nor server-side aggregation will do it: You just have to pull more rows than the service's match limit. You see, most servers will not let you pull billions of rows in one go. Mine, for instance, will cap the maxrec at 16'000'000. What you need to do if you need to pull more than that is chunking up your query such that you can process the whole sky (or whatever else huge thing makes the table large) in manageable chunks. That is called partitioning.

Uniform-Length Partitions

To partition a table, you first need something to partition on. In database lingo, a good thing to partition on is called a primary key, typically a reasonably short string or, even better, an integer that maps injectively to the rows (i.e., not two rows have the same key). Let's keep Gaia as an example: the primary key designed for it is the source_id.

In the simplest case, you can “uniformly” partition between 0 and the largest source_id, which you will find by querying for the maximum:

SELECT max(source_id) FROM gaia.dr3lite

This should be fast. If it is not, then there is likely no sufficiently capable index on the column you picked, and hence your choice of the primary key probably is not a good one. This would be another reason to turn to the service's contact address as above.

In the present case, the query is fast and yields 6917528997577384320. With that number, you can write a program like this to split up your problem into N_PART sub-problems:

import pyvo

MAX_ID, N_PART = 6917528997577384320+1, 100
partition_limits = [(MAX_ID//N_PART)*i
  for i in range(N_PART+1)]

svc = pyvo.dal.TAPService("http://dc.g-vo.org/tap")
main_query = "SELECT count(*) FROM ({part}) AS q"

for lower, upper in zip(partition_limits[:-1], partition_limits[1:]):
  result = svc.run_sync(main_query.format(part=
    "SELECT * FROM gaia.dr3lite"
    "  WHERE source_id BETWEEN {} and {} ".format(lower, upper-1)))
  print(result)

Exercise: Can you see why the +1 is necessary in the MAX_ID assignment?

This range trick will obviously not work when the primary key is a string; I would probably partition by first letter(s) in that case.

Equal-Size Partitions

However, this is not the end of the story. Gaia's (well thought-out) enumeration scheme reflects to a large degree sky positions. So do, by the way, the IAU conventions for object designations. Since most astronomical objects are distributed highly unevenly on the sky, creating partitions with of equal size in identifier space will yield chunks of dramatically different (a factor of 100 is not uncommon) sizes in all-sky surveys.

In the rather common event that you have a use case in which you need a guaranteed maximum result size per partition, you will therefore have to use two passes, first figuring out the distribution of objects and then computing the desired partition from that.

Here is an example for how one might go about this:

from astropy import table
import pyvo

MAX_ID, ROW_TARGET = 6917528997577384320+1, 10000000

ENDPOINT = "http://dc.g-vo.org/tap"

# the 10000 is just the number of bins to use; make it too small, and
# your inital bins may already overflow ROW_TARGET
ID_DIVISOR = MAX_ID/10000

DISTRIBUTION_QUERY = f"""
select round(source_id/{ID_DIVISOR}) as bin, count(*) as ct
from gaia.dr3lite
group by bin
"""


def get_bin_sizes():
  """returns a ordered sequence of (bin_center, num_objects) rows.
  """
  # since the partitioning query already is expensive, cache it,
  # and use the cache if it's there.
  try:
    with open("partitions.vot", "rb") as f:
      tbl = table.Table.read(f)
  except IOError:
    # Fetch from source; takes about 1 hour
    print("Fetching partitions from source; this will take a while"
      " (provide partitions.vot to avoid re-querying)")
    svc = pyvo.dal.TAPService(ENDPOINT)
    res = svc.run_async(DISTRIBUTION_QUERY, maxrec=1000000)
    tbl = res.table
    with open("partitions.vot", "wb") as f:
      tbl.write(output=f, format="votable")

  res = [(row["bin"], row["ct"]) for row in tbl]
  res.sort()
  return res


def get_partition_limits(bin_sizes):
  """returns a list of limits of source_id ranges exhausting the whole
  catalogue.

  bin_sizes is what get_bin_sizes returns (and it must be sorted by
  bin center).
  """
  limits, cur_count = [0], 0
  for bin_center, bin_count in bin_sizes:
    if cur_count+bin_count>MAX_ROWS:
      limits.append(int(bin_center*ID_DIVISOR-ID_DIVISOR/2))
      cur_count = 0
    cur_count += bin_count
  limits.append(MAX_ID)
  return limits


def get_data_for(svc, query, low, high):
  """returns a TAP result for the (simple) query in the partition
  between low and high.

  query needs to query the ``sample`` table.
  """
  job = svc.submit_job("WITH sample AS "
    "(SELECT * FROM gaia.dr3lite"
    "  WHERE source_id BETWEEN {} and {}) ".format(lower, upper-1)
    +query, maxrec=ROW_TARGET)
  try:
    job.run()
    job.wait()
    return job.fetch_result()
  finally:
    job.delete()


def main():
  svc = pyvo.dal.TAPService(ENDPOINT)
  limits = get_partition_limits(get_bin_sizes())
  for ct, (low, high) in enumerate(zip(limits[:-1], limits[1:])):
    print("{}/{}".format(ct, len(limits)))
    res = get_data_for(svc, <a query over a table sample>, low, high-1)
    # do your thing here

But let me stress again: If you think you need partitioning, you are probably doing it wrong. One last time: If in any sort of doubt, try the services' contact addresses.

[1]

Of course, if you are doing long-running queries, you probably will postpone the deletion of the service until you are sure you have the result wherever you want it. Me, I'd probably print the result URL (for when something goes wrong on SAMP or in TOPCAT) and a curl command line to delete the job when done. Oh, and perhaps a reminder that one ought to execute the curl command line once the data is saved.

[2]

Exposing the contact information in the service objects themselves would be a nice little project if you are looking for contributions you could make to pyVO; you would probably do a natural join between the rr.interface and the rr.res_role tables and thus go from the access URL (you generally don't have the ivoid in pyVO service objects) to the contact role.

DaCHS 2.11: Persistent TAP Uploads

2024-12-16 Markus Demleitner
The traditional autumn release of GAVO's server package DaCHS is somewhat late this year, but not so late that could not still claim it comes after the interop. So, here it is: DaCHS 2.11 and the traditional what's new post.

But first, while I may have DaCHS operators' attention: If you have always wondered why things in DaCHS are as they are, you will probably enjoy the article Declarative Data Publication with DaCHS, which one day will be in the proceedings of ADASS XXXIV (and before that probably on arXiv). You can read it in a pre-preprint version already now at https://docs.g-vo.org/I301.pdf, and feedback is most welcome.
Persistent TAP Uploads

The potentially most important new feature of DaCHS 2.11 (in my opinion) will not be news to regular readers of this blog: Persistent TAP Uploads.

At this point, no client supports this, and presumably when clients do support it, it will look somewhat different, but if you like the bleeding edge and have users that don't mind an occasional curl or requests call, you would be more than welcome to help try the persistent uploads. As an operator, it should be sufficient to type:
```
dachs imp //tap_user
```
To make this more useful, you probably want to hand out proper credentials (make them with dachs adm adduser) to people who want to play with this, and point the interested users to the demo jupyter notebook.

I am of course grateful for any feedback, in particular on how people find ways to use these features to give operators a headache. For instance, I really would like to avoid writing a quota system. But I strongly suspect will have to…
On-loaded Execute-s

DaCHS has a built-in cron-type mechanism, the execute Element. So far, you could tell it to run jobs every x seconds or at certain times of the day. That is fine for what this was made for: updates of “living” data. For instance, the RegTAP RD (which is what's behind the Registry service you are probably using if you are reading this) has something like this:
```
<execute title="harvest RofR" every="40000">
  <job><code>
      execDef.spawnPython("bin/harvestRofR.py")
  </code></job>
</execute>
```
This will pull in new publishing registries from the Registry of Registries, though that is tangential; the main thing is that some code will run every 40 kiloseconds (or about 12 hours).

Against using plain cron, the advantage is that DaCHS knows context (for instance, the RD's resdir is not necessary in the example call), that you can sync with DaCHS' own facilities, and most of all that everything is in once place and can be moved together. By the way, it is surprisingly simple to run a RegTAP service of your own if you already run DaCHS. Feel free to inquire if you are interested.

In DaCHS 2.11, I extended this facility to include “events” in the life of an RD. The use case seems rather remote from living data: Sometimes you have code you want to share between, say, a datalink service and some ingestion code. This is too resource-bound for keeping it in the local namespace, and that would again violate RD locality on top.

So, the functions somehow need to sit on the RD, and something needs to stick them there. To do that, I recommended a rather hacky technique with a LOOP with codeItems in the respective howDoI section. But that was clearly rather odious – and fragile on top because the RD you manipulated was just being parsed (but scroll down in the howDoI and you will still see it).

Now, you can instead tell DaCHS to run your code when the RD has finished loading and everything should be in place. In a recent example I used this to have common functions to fetch photometric points. In an abridged version:
```
<execute on="loaded" title="define functions"><job>
  <setup imports="h5py, numpy"/>
  <code>
  def get_photpoints(field, quadrant, quadrant_id):
    """returns the photometry points for the specified time series
    from the HDF5 as a numpy array.

    [...]
    """
    dest_path = "data/ROME-FIELD-{:02d}_quad{:d}_photometry.hdf5".format(
      field, quadrant)
    srchdf = h5py.File(rd.getAbsPath(dest_path))
    _, arr = next(iter(srchdf.items()))

    photpoints = arr[quadrant_id-1]
    photpoints = numpy.array(photpoints)
    photpoints[photpoints==0] = numpy.nan
    photpoints[photpoints==-9999.99] = numpy.nan

    return photpoints


  def get_photpoints_for_rome_id(rome_id):
    """as get_photpoints, but taking an integer rome_id.
    """
    field = rome_id//10000000
    quadrant = (rome_id//1000000)%10
    quadrant_id = (rome_id%1000000)
    base.ui.notifyInfo(f"{field} {quadrant} {quadrant_id}")
    return get_photpoints(field, quadrant, quadrant_id)

  rd.get_photpoints = get_photpoints
  rd.get_photpoints_for_rome_id = get_photpoints_for_rome_id
</code></job></execute>
```
(full version; if this is asking you to log in, tell your browser not to wantonly switch to https). What is done here in detail again is not terribly relevant: it's the usual messing around with identifiers and paths and more or less broken null values that is a data publisher's everyday lot. The important thing is that with the last two statements, you will see these functions whereever you see the RD, which in RD-near Python code is just about everywhere.
dachs start taptable

Since 2018, DaCHS has supported kickstarting the authoring of RDs, which is, I claim, the fun part of a data publisher's tasks, through a set of templates mildly customised by the dachs start command. Nobody should start a data publication with an empty editor window any more. Just pass the sort of data you would like to publish and start answering sensible questions. Well, “sort of data” within reason:
```
$ dachs start list
epntap -- Solar system data via EPN-TAP 2.0
siap -- Image collections via SIAP2 and TAP
scs -- Catalogs via SCS and TAP
ssap+datalink -- Spectra via SSAP and TAP, going through datalink
taptable -- Any sort of data via a plain TAP table
```
There is a new entry in this list in 2.11: taptable. In both my own work and watching other DaCHS operators, I have noticed that my advice “if you want to TAP-publish any old material, just take the SCS template and remove everything that has scs in it” was not a good one. It is not as simple as that. I hope taptable fits better.

A plan for 2.12 would be to make the ssap+datalink template less of a nightmare. So far, you basically have to fill out the whole thing before you can start experimenting, and that is not right. Being able to work incrementally is a big morale booster.
VOTable 1.5

VOTable 1.5 (at this point still a proposed recommendation) is a rather minor, cleanup-type update to the VO's main table format. Still, DaCHS has to say it is what it is if we want to be able to declare refposition in COOSYS (which we do). Operators should not notice much of this, but it is good to be aware of the change in case there are overeager VOTable parsers out there or in case you have played with DaCHS MIVOT generator; in 2.10, you could ask it to do its spiel by requesting the format application/x-votable+xml;version=1.5. In 2.11, it's application/x-votable+xml;version=1.6. If you have no idea what I was just saying, relax. If this becomes important, you will meet it somewhere else.
Minor Changes

That's almost it for the more noteworthy news; as usual, there are a plethora of minor improvements, bug fixes and the like. Let me briefly mention a few of these:
- The ADQL form interface's registry record now includes the site name. In case you are in this list, please say dachs pub //adql after upgrading.
- More visible legal info, temporal, and spatial coverage in table and service infos; one more reason to regularly run dachs limits!
- VOUnit's % is now known to DaCHS (it should have been since about 2.9)
- More vocabulary validation for VOResource generation; so, dachs pub might now complain to you when it previously did not. It is now right and was wrong before.
- If you annotate a column as meta.bib.bibcode, it will be rendered as ADS links
- The RD info links to resrecs (non-DaCHS resources, essentially), too.
Upgrade As Convenient

As usual, if you have the GAVO repository enabled, the upgrade will happen as part of your normal Debian apt upgrade. Still, if you have not done so recently, have a quick look at upgrading in the tutorial. If, on the other hand, you use the Debian-distributed DaCHS package and you do not need any of the new features, you can let things sit and enjoy the new features after your next dist-upgrade.

Category: Software
A Proposal for Persistent TAP Uploads

2024-10-11 Markus Demleitner
From its beginning, the IVOA's Table Access Protocol TAP has let users upload their own tables into the services' databases, which is an important element of TAP's power (cf. our upload crossmatch use case for a minimal example). But these uploads only exist for the duration of the request. Having more persistent user-uploaded tables, however, has quite a few interesting applications.

Inspired by Pat Dowler's 2018 Interop talk on youcat I have therefore written a simple implementation for persistent tables in GAVO's server package DaCHS. This post discusses what is implemented, what is clearly still missing, and how you can play with it.

If all you care about is using this from Python, you can jump directly to a Jupyter notebook showing off the features; it by and large explains the same things as this blogpost, but using Python instead of curl and TOPCAT. Since pyVO does not know about the proposed extensions, the code necessarily is still a bit clunky in places, but if something like this will become more standard, working with persistent uploads will look a lot less like black art.

Before I dive in: This is certainly not what will eventually become a standard in every detail. Do not do large implementations against what is discussed here unless you are prepared to throw away significant parts of what you write.
Creating and Deleting Uploads

Where Pat's 2018 proposal re-used the VOSI tables endpoint that every TAP service has, I have provisionally created a sibling resource user_tables – and I found that usual VOSI tables and the persistent uploads share virtually no server-side code, so for now this seems a smart thing to do. Let's see what client implementors think about it.

What this means is that for a service with a base URL of http://dc.g-vo.org/tap [1], you would talk to (children of) http://dc.g-vo.org/tap/user_tables to operate the persistent tables.

As with Pat's proposal, to create a persistent table, you do an http PUT to a suitably named child of user_tables:
```
$ curl -o tmp.vot https://docs.g-vo.org/upload_for_regressiontest.vot
$ curl -H "content-type: application/x-votable+xml" -T tmp.vot \
  http://dc.g-vo.org/tap/user_tables/my_upload
Query this table as tap_user.my_upload
```
The actual upload at this point returns a reasonably informative plain-text string, which feels a bit ad-hoc. Better ideas are welcome, in particular after careful research of the rules for 30x responses to PUT requests.

Trying to create tables with names that will not work as ADQL regular table identifiers will fail with a DALI-style error. Try something like:
```
$ curl -H "content-type: application/x-votable+xml" -T tmp.vot
  http://dc.g-vo.org/tap/user_tables/join
... <INFO name="QUERY_STATUS" value="ERROR">'join' cannot be used as an
  upload table name (which must be regular ADQL identifiers, in
  particular not ADQL reserved words).</INFO> ...
```
After a successful upload, you can query the VOTable's content as tap_user.my_upload:

TOPCAT (which is what painted these pixels) does not find the table metadata for tap_user tables (yet), as I do not include them in the “public“ VOSI tables. This is why you see the reddish syntax complaints here.

I happen to believe there are many good reasons for why the volatile and quickly-changing user table metadata should not be mixed up with the public VOSI tables, which can be several 10s of megabytes (in the case of VizieR). You do not want to have to re-read that (or discard caches) just because of a table upload.

If you have the table URL of a persistent upload, however, you inspect its metadata by GET-ting the table URL:
```
$ curl http://dc.g-vo.org/tap/user_tables/my_upload | xmlstarlet fo
<vtm:table [...]>
  <name>tap_user.my_upload</name>
  <column>
    <name>"_r"</name>
    <description>Distance from center (RAJ2000=274.465528, DEJ2000=-15.903352)</description>
    <unit>arcmin</unit>
    <ucd>pos.angDistance</ucd>
    <dataType xsi:type="vs:VOTableType">float</dataType>
    <flag>nullable</flag>
  </column>
  ...
```
– this is a response as from VOSI tables for a single table. Once you are authenticated (see below), you can also retrieve a full list of tables from user_tables itself as a VOSI tableset. Enabling that for anonymous uploads did not seem wise to me.

When done, you can remove the persistent table, which again follows Pat's proposal:
```
$ curl -X DELETE http://dc.g-vo.org/tap/user_tables/my_upload
Dropped user table my_upload
```
And again, the text/plain response seems somewhat ad hoc, but in this case it is somewhat harder to imagine something less awkward than in the upload case.

If you do not delete yourself, the server will garbage-collect the upload at some point. On my server, that's after seven days. DaCHS operators can configure that grace period on their services with the [ivoa]userTableDays setting.
Authenticated Use

Of course, as long as you do not authenticate, anyone can drop or overwrite your uploads. That may be acceptable in some situations, in particular given that anonymous users cannot browse their uploaded tables. But obviously, all this is intended to be used by authenticated users. DaCHS at this point can only do HTTP basic authentication with locally created accounts. If you want one in Heidelberg, let me know (and otherwise push for some sort of federated VO-wide authentication, but please do not push me).

To just play around, you can use uptest as both username and password on my service. For instance:
```
  $ curl -H "content-type: application/x-votable+xml" -T tmp.vot \
  --user uptest:uptest \
  http://dc.g-vo.org/tap/user_tables/privtab
Query this table as tap_user.privtab
```
In recent TOPCATs, you would enter the credentials once you hit the Log In/Out button in the TAP client window. Then you can query your own private copy of the uploaded table:

There is a second way to create persistent tables (that would also work for anonymous): run a query and prepend it with CREATE TABLE. For instance:

The “error message” about the empty table here is to be expected; since this is a TAP query, it stands to reason that some sort of table should come back for a successful request. Sending the entire newly created table back without solicitation seems a waste of resources, and so for now I am returning a “stub” VOTable without rows.

As an authenticated user, you can also retrieve a full tableset for what user-uploaded tables you have:
```
$ curl --user uptest:uptest http://dc.g-vo.org/tap/user_tables | xmlstarlet fo
<vtm:tableset ...>
  <schema>
    <name>tap_user</name>
    <description>A schema containing users' uploads. ...  </description>
    <table>
      <name>tap_user.privtab</name>
      <column>
        <name>"_r"</name>
        <description>Distance from center (RAJ2000=274.465528, DEJ2000=-15.903352)</description>
        <unit>arcmin</unit>
        <ucd>pos.angDistance</ucd>
        <dataType xsi:type="vs:VOTableType">float</dataType>
        <flag>nullable</flag>
      </column>
      ...
    </table>
    <table>
      <name>tap_user.my_upload</name>
      <column>
        <name>"_r"</name>
        <description>Distance from center (RAJ2000=274.465528, DEJ2000=-15.903352)</description>
        <unit>arcmin</unit>
        <ucd>pos.angDistance</ucd>
        <dataType xsi:type="vs:VOTableType">float</dataType>
        <flag>nullable</flag>
      </column>
      ...
    </table>
  </schema>
</vtm:tableset>
```
Open Questions

Apart from the obvious question whether any of this will gain community traction, there are a few obvious open points:
1. Indexing. For tables of non-trivial sizes, one would like to give users an interface to say something like “create an index over ra and dec interpreted as spherical coordinates and cluster the table according to it”. Because this kind of thing can change runtimes by many orders of magnitude, enabling it is not just some optional embellishment.
  
  On the other hand, what I just wrote already suggests that even expressing the users' requests in a sufficiently flexible cross-platform way is going to be hard. Also, indexing can be a fairly slow operation, which means it will probably need some sort of UWS interface.
2. Other people's tables. It is conceivable that people might want to share their persistent tables with other users. If we want to enable that, one would need some interface on which to define who should be able to read (write?) what table, some other interface on which users can find what tables have been shared with them, and finally some way to let query writers reference these tables (tap_user.<username>.<tablename> seems tricky since with federated auth, user names may be just about anything).
  
  Given all this, for now I doubt that this is a use case sufficiently important to make all the tough nuts delay a first version of user uploads.
3. Deferring destruction. Right now, you can delete your table early, but you cannot tell my server that you would like to keep it for longer. I suppose POST-ing to a destruction child of the table resource in UWS style would be straightforward enough. But I'd rather wait whether the other lacunae require a completely different pattern before I will touch this; for now, I don't believe many persistent tables will remain in use beyond a few hours after their creation.
4. Scaling. Right now, I am not streaming the upload, and several other implementation details limit the size of realistic user tables. Making things more robust (and perhaps scalable) hence will certainly be an issue. Until then I hope that the sort of table that worked for in-request uploads will be fine for persistent uploads, too.
Implemented in DaCHS

If you run a DaCHS-based data centre, you can let your users play with the stuff I have shown here already. Just upgrade to the 2.10.2 beta (you will need to enable the beta repo for that to happen) and then type the magic words:
```
dachs imp //tap_user
```
It is my intention that users cannot create tables in your DaCHS database server unless you say these words. And once you say dachs drop --system //tap_user, you are safe from their huge tables again. I would consider any other behaviour a bug – of which there are probably still quite a few. Which is why I am particularly grateful to all DaCHS operators that try persistent uploads now.

[1] As already said in the notebook, if http bothers you, you can write https, too; but then it's much harder to watch what's going on using ngrep or friends.
Category: Standards

Page 1 / 4 »

Posts with the Tag TAP:

Setting Lifetimes

Creating Indexes

Special Index Types

Thesauri and the UAT

Why Keywords?

The UAT constraint

Implementation

Raising the Match Limit

Giving the Query More Time

Planning for Large Result Sets? Get in Contact!

Uniform-Length Partitions

Equal-Size Partitions

Persistent TAP Uploads

On-loaded Execute-s

dachs start taptable

VOTable 1.5

Minor Changes

Upgrade As Convenient

Creating and Deleting Uploads

Authenticated Use

Open Questions

Implemented in DaCHS