The week of March 24, in Tampere (Finland), the joint international database conference EDBT/ICDT was held, with EDBT being the community that studies new technologies in practice, and ICDT being the community that studies its theoretical foundation. The fact is that, this year, inLab FIB had a presence at ICDT with a full paper by the author of this very article. From here, the inevitable question is… what is a software engineering researcher doing at a pure database theory conference? Can more practice-oriented researchers contribute in a purely theoretical context? Does it make sense to try in a context where AI is eating everything?
The key to all of this is hidden in an even tougher question: why do you want to do research? Indeed, if your goal is to make a name for yourself in research, with a bit of luck every paper you submit will be rejected with enough force for you to give up before you end up in therapy (if not you, then your loved ones). And yes, you tend to need a lot, and a lot of luck in research.
Suppose, however, that you are genuinely interested in a problem. For example, the problem discussed in the previous article is query containment in a database (DB). In a DB we say that a query q1 is contained in a query q2 if the results of q1 are always contained in q2, regardless of the DB. To simplify, suppose we only have boolean queries. In this case, we say that q1 is contained in q2 if and only if whenever q1 is true, q2 is also true.
A common requirement in academic papers is the use of small didactic examples with a motivational tone. To continue paraphrasing Foster Wallace, let us suppose we want to run a query on an animal database asking whether there are recorded animals that live in water. In Datalog language, we would write it more or less like this:
q1() :- Animal(x), LivesIn(x,Water)
The fact is that we could also be interested in another similar query, for example, a query asking whether we have aquatic animals that are mammals:
q2() :- Animal(y), LivesIn(y,Water), IsMammal(y)
Clearly, query q2 is contained in q1, since, in short, q2 is asking the same as q1, but applying an additional filter (which would be the IsMammal check). Realising query containment can help us in several problems, such as optimizing database performance, or even finding which data providers offer data that may be of interest to a client.
All of this is very good, but how on earth can we solve this problem? Well, here we have very good news: for more than 40 years we have known that, if queries have a specific (and very common!) form, a query q is contained in another query q’ if, and only if, we can rewrite q’ in terms of q. In our example, q2 is literally the same query as q1 but changing the variable y to x and adding an additional IsMammal check. The best thing about all of this is that this check can be done with a very simple test: we just need to create a DB that contains all the elements of query q2 (replacing each variable with a constant, for example, replacing the variable y with a constant #y), and if we run query q1 against this DB, the DB will take care of finding that q2 is like q1 but changing the variable x to y.
But what is this specific form that queries must satisfy for all of this to work? Essentially, the query must be positive, which means it cannot include mathematical operations (sums, subtractions, or greater-than/less-than comparisons), nor negated conditions. In other words, we are talking about SQL queries that only use EQUI-JOINS. Formally speaking, we say that a conjunctive query is contained in another conjunctive query, under set semantics, if there exists a homomorphism from the second query to the first (Homomorphism Theorem). The interested reader can find more details in the article itself, but don’t worry, none of these formalisms end up being excessively complicated, nor do they drift too far from the intuitions described here. The article aims to be understandable, to the point of losing the two great advantages of unintelligible texts that Foster Wallace mentioned: 1) the fewer people understand the article, the more erudite some authors feel (reaching Wilde’s maxim where not even the author himself understands what he is saying), and 2) the fewer people understand the article, the fewer people can criticize it (reaching the maxim of seeing papers accepted in conferences because nobody has understood them).
But let’s not be so negative, and let’s talk about what happens when we add negation to queries. For example, suppose that now we ask whether there exist animals that live in water but do not live on land:
q3() :- Animal(y), LivesIn(y,Water), not LivesIn(y,Land).
In short, when queries include negation, the Homomorphism Theorem breaks. That is, there are queries contained in others even though it is not possible to rewrite one in terms of the other. The breakdown of the Homomorphism Theorem is quite problematic, since while implementing a homomorphism test is fairly simple, the tests that have been devised to check containment between two queries with negation are, in general, difficult and computationally complex. As a software engineer, I like simple implementation ideas, so I admit I got quite stuck here trying to find a characterization that tells when a query, even with negation, satisfies the Homomorphism Theorem, so as to be able to use this test. You don’t want to know the mess I was getting myself into, incrementally building a highly clever characterization in Wildean terms to solve an increasingly frustrating and increasingly demotivating problem.
In This Is Water, Foster Wallace said that the main skill acquired at university is not how to think, but how to decide what to think about. Faced with adversity, before continuing to curse our bones, Foster Wallace invites us to step out of our internal monologue and take conscious attention to what is happening around us.
Let’s forget for a moment the problem we have with negation, and focus on another formalism called tuple-generating dependencies (TGD). A TGD is a rule that, from some data in the DB, generates other data. For example, we can have a TGD that says that if I know that x is a Child, I can generate a new piece of data saying that x is also a Person:
Child(x) -> Person(x)
The main problem with TGDs is that they can generate an infinite amount of data, since the generation of a new piece of data can trigger the execution of the same TGD, and so on indefinitely. For example, suppose a TGD that says that every father has had another father:
FatherOf(x, y) -> FatherOf(y, z)
Clearly, when we create a FatherOf tuple, we need to create another FatherOf tuple, and so on ad infinitum.
To solve this problem, the database theory community has studied conditions that guarantee that a TGD does not generate data indefinitely. If a TGD can only be executed, recursively, a maximum of 5 times, we say that the TGD is 5-bounded. Similarly, if when we execute a TGD we cannot re-execute it recursively again, we say that the TGD is one-bounded. For example, imagine a TGD that, in summary, says that if an animal lives in water, it must also live on land:
Animal(x), ViuA(x, Aigua) -> ViuA(x, Terra).
This TGD is one-bounded since, when creating new LivesIn tuples, these cannot trigger the same rule again.
Going back to queries, there is a certain relationship between them and TGDs. In particular, the previous TGD represents the data that must be created to make query q3 unsatisfied. Indeed, each data instance that can satisfy q3 (for example, if we find an animal “Turtle” that lives in Water but not on Land) can be used in the TGD to create a new fact (LivesIn(Turtle, Land)) that invalidates q3. Given a query with negation, it is fairly simple to construct its corresponding TGD that generates data to falsify it. In fact, it is enough to take the negated condition and add it to the right-hand side of the TGD.
The fact is that, in the paper, we prove that a query satisfies the Homomorphism Theorem if and only if its corresponding TGD is one-bounded. The main intuition is that, to prove that a query q1 is contained in another query q2, it is enough to analyze whether, over a DB where q1 is true, the TGD of q2 can be executed in such a way that we make q2 false while still keeping q1 true. The point is that the homomorphism test takes care of validating that q1 remains true after executing the TGD once, but it fails to verify whether q1 remains true if the TGD is executed recursively multiple times.
It is still poetic to see that, in the world of database queries, just as in research, or as Foster Wallace would say, in the trenches of everyday adult life, the inability to stop looping within ourselves prevents us from finding a simple and sincere answer in our surroundings. Perhaps research is about building bridges between different fields. Perhaps it is about communicating them in a straightforward way to facilitate that connection to someone else. Perhaps it is a conscious exercise of collectively building something positive to reverse so much negativity. Wherever it comes from, if you want to dedicate yourself to research, I wish you much more than luck.
