Normalization of Relational Databases

This page will try to summarize the normalization theory for BCNF and 3NF normal forms. We will start at the beginning so that all of our terms and concepts will be well defined. In fact, we will begin with the definition of an attribute and work our way, carefully, up to the two desired normal forms and their decomposition algorithms.

1.) An attribute is a string of letters from a specific alphabet.

The choice of alphabet does not matter here. For most of our discussions, the attributes will be single capital letters, like "A", "B" and "C". But in real applications, the attributes are usually names like "AccountNo", "StudentID", or "SSN".

2.) A type is a set of values.

Some examples of types are,

  • the set of all integers,
  • the set of all positive integers,
  • the set of all strings of length 3,
  • the set of integers {1, 2, 4, 6, 8, 10},
  • the set of characters {'r', 'g', 'b'}.
Any set of values can be a type. We will not need to make much use of types in the following discussion.

3.) A schema is a set of pairs where each pair contains an attribute and a type. No two pairs can contain the same attribute.

A simple example of a schema might be S={ A::Integer, B::Char, C::{1,2,3} }. This schema has three attributes, A, B and C, and each attribute has associated with it a type. Notice that the condition that no two attribute::type pairs can contain the same attribute means that you can't have the attribute A with type Integer and in the same schema also have the attribute A with the type Char.

In what follows, the types are not crucial, so we will often omit them (or just assume that all the attributes have type "Positive Integer"). We will usually abbreviate the notation for a schema as S={A, B, C}, or even S(A, B, C). But remember, each attribute really does have a type.

4.) A tuple over a schema S is a set of pairs where each pair contains an attribute from S and a value from that attribute's type. A tuple over the schema S should have as many pairs as S has attributes, and each attribute in S should also be in the tuple.

For the schema example from paragraph (3), some example tuples are { (A,5), (B,c), (C,1) }, or { (C,3), (A,2008), (B,m) }. Notice that since a tuple is a set, the order of the pairs is not important (and that is why the pairs need attributes in them, so that you can tell who the values belong to). If we are willing to take the order of the attributes in a schema's definition as a kind of "preferred order", then we can abbreviate the notation for a tuple by dropping the attributes from the tuple (this is pretty much what SQL does in its INSERT INTO statement). So the last two tuples could be written as (5, c, 1) and (2008, m, 3).

5.) A functional dependency (abbreviated as FD) over a schema S is a symbol of the form X->Y where X and Y are subsets of the attributes from S.

We will refrain in this paragraph from saying what functional dependencies mean. For now, they are just symbols. For the schema S(A, B, C), some examples of functional dependencies are {A}->{B,C} or {A,C}->{A}. When our attributes are just single letters, it is convenient to adopt an abbreviated notation for functional dependencies and not explicitly use the set notation. So the last two FD's could be written as A->BC and AC->A.

6.) Given a schema S, a functional dependency X->Y over S, and two tuples t1 and t2 over S, we say that the tuples are constrained by the FD if the following statement is true. If for each attribute in X the tuples t1 and t2 have the same value, then the tuples have the same value for each of the attributes in Y.

Here is another way to write this definition. Let the expression t1[X] = t2[X] mean that for every attribute in X, the value of that attribute in tuple t1 is equal to the value of the same attribute in t2. Then tuples t1 and t2 are constrained by the FD X->Y if

t1[X] = t2[X] ⇒ t1[Y] = t2[Y]

For example, given the schema S(A,B,C) and the FD A->C, the tuples (1,2,3) and (1,1,3) are constrained by the FD (but they would not be constrained by the FD A->B).

Notice that if you are given just one tuple, it does not make any sense to say that the tuple is constrained by a FD. It takes two (or more) tuples to make a FD constraint meaningful.
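The constraint test from this paragraph can be checked mechanically. Here is a small sketch in Python (not part of the original notes); representing a tuple as a dictionary from attribute names to values is our own choice:

```python
def constrained(t1, t2, X, Y):
    """Two tuples (dicts) are constrained by the FD X->Y when agreement
    on every attribute of X forces agreement on every attribute of Y."""
    if all(t1[a] == t2[a] for a in X):
        return all(t1[a] == t2[a] for a in Y)
    return True  # the tuples differ somewhere on X, so the FD says nothing

# The example from paragraph (6): schema S(A,B,C)
t1 = {"A": 1, "B": 2, "C": 3}
t2 = {"A": 1, "B": 1, "C": 3}
print(constrained(t1, t2, {"A"}, {"C"}))  # True
print(constrained(t1, t2, {"A"}, {"B"}))  # False
```

Note the vacuous case: if the tuples already disagree on X, the implication holds no matter what Y is.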

7.) A relation R(S, F) is a schema S and a set F of functional dependencies over that schema.

An example of a simple relation would be R( S(A,B,C), {AB->C, C->A} ). This relation has the schema S(A,B,C) with three attributes and the relation has two functional dependencies. Another simple relation would be R( S(A,B,C), {A->BC, C->A} ). This relation has the same schema, S(A,B,C), and two FD's, but one of the FD's is different than in our first example. This second relation is in fact a very different relation than the first one. As we will soon see, these two relations have very different properties.

8.) An instance of a relation R(S, F) (or, a table over R) is a set of tuples over the schema S such that any two of the tuples are constrained by each of the FD's in F.

For example, given the first relation R( S(A,B,C), {AB->C, C->A} ) from paragraph (7), the table on the left is an example of an instance of R.
A B C        A B C
1 2 4        1 2 4
2 3 5        2 2 5
1 4 4        3 4 6
2 1 5        4 4 7
Notice however that the table on the left is not an instance of the second relation R( S(A,B,C), {A->BC, C->A} ) from paragraph (7) (which shows that these two relations are not the same). The table on the right however is an instance of this second relation.
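Checking that a table really is an instance amounts to testing every pair of tuples against every FD. A Python sketch (our own helper names; the data is the left-hand table above):

```python
from itertools import combinations

def constrained(t1, t2, X, Y):
    # t1[X] = t2[X] implies t1[Y] = t2[Y]
    return (not all(t1[a] == t2[a] for a in X)) or all(t1[a] == t2[a] for a in Y)

def is_instance(table, fds):
    """A set of tuples is an instance when every pair of tuples is
    constrained by every FD in fds (a list of (X, Y) set pairs)."""
    return all(constrained(t1, t2, X, Y)
               for t1, t2 in combinations(table, 2)
               for X, Y in fds)

# The left-hand table above, over S(A,B,C)
T = [{"A": 1, "B": 2, "C": 4}, {"A": 2, "B": 3, "C": 5},
     {"A": 1, "B": 4, "C": 4}, {"A": 2, "B": 1, "C": 5}]
print(is_instance(T, [({"A", "B"}, {"C"}), ({"C"}, {"A"})]))  # True
print(is_instance(T, [({"A"}, {"B", "C"}), ({"C"}, {"A"})]))  # False
```

The second call is False because the first and third tuples agree on A but not on B, violating A->BC.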

9.) When we defined FD's in paragraph (5) we did not say anything about what they "mean", we just defined them as symbols. Functional dependencies gain their "meaning" from the above definition of tables over a relation. First of all, functional dependencies are constraints since they constrain the values that can be placed in a table that is an instance of the relation that the functional dependencies come from. But here is another, more "semantic", way to think about functional dependencies. Given a functional dependency X->Y from a relation R(S,F) and a tuple t1 from a table over the relation R(S,F), we imagine that the attribute values from X identify some particular entity and the attribute values from Y provide some information about that entity. Now suppose that some other tuple t2 from the same table agrees with tuple t1 on the set X. So both of the tuples are in the table and both of the tuples are referring to the exact same entity. So it would be strange if the two tuples contained in their Y attributes conflicting information about that entity. So it makes sense to say that if two tuples refer to the same entity (that is, if t1[X] = t2[X]) then we should require that (i.e., constrain that) they provide the same information about that entity (that is, t1[Y] = t2[Y]).

10.) Given a relation R(S,F) and a functional dependency f over S, we say that F entails f if the relations R(S, F) and R(S, F ∪ {f}) have the exact same set of tables. That is, adding the FD f to the FD's in the relation R(S, F) does not change the set of allowed tables.

11.) Given a relation R(S,F), the closure F+ of the set F of FD's is the set of all FD's entailed by F.

This and the previous definition are very difficult to work with since they imply the need to check an infinite number of conditions. For example, to determine if a FD f is entailed by the set F of FD's we would need to check every one of a (possibly) infinite set of tables and determine if each table is an instance of the new set of FD's F ∪ {f}. This is not practical. The next few paragraphs make working with entailment and closures a bit easier.

12.) Theorem: (The Soundness of Armstrong's Axioms) Let R(S,F) be a relation and let X, Y and Z be subsets of attributes from the schema S. Then

  1. If Y ⊆ X then the FD X->Y is entailed by F. (That is, trivial FD's are always in F+.)
  2. If X->Y and Y->Z are entailed by F, then X->Z is entailed by F. (This is the transitivity axiom.)
  3. If X->Y is entailed by F, then XZ->YZ is entailed by F. (This is the augmentation axiom.)

Rather than proving this theorem, let us see an example of how it can be used. In paragraphs (7) and (8) we considered the relations

R( S(A,B,C), {AB->C, C->A} )
R( S(A,B,C), {A->BC, C->A} ).
Let us show that the FD AB->C from the first relation is entailed by the FD's F={A->BC, C->A} from the second relation.
  1. We have AB->A entailed by F since this is a trivial FD.
  2. Since we have AB->A and A->BC entailed by F, we have AB->BC entailed by transitivity.
  3. We have BC->C entailed by F since this is a trivial FD.
  4. Since we have AB->BC and BC->C entailed by F, we have AB->C entailed by transitivity.
This result proves that every table of the second relation is also a table for the first relation. (But not the other way around, as we saw in paragraph (8).)

13.) (Three more "axioms") Let R(S,F) be a relation and let X, Y, Z and W be subsets of attributes from the schema S. Then

  1. If X->YZ is entailed by F, then X->Y and X->Z are entailed by F. (This is the decomposition axiom.)
  2. If X->Y and X->Z are entailed by F, then X->YZ is entailed by F. (This is the union axiom.)
  3. If X->Y and ZY->W are entailed by F, then XZ->W is entailed by F. (This is the pseudotransitivity axiom.)

Try to derive these axioms from the three Armstrong axioms. (Note: The notation XZ is just an abbreviation of the notation X∪Z for the union of X and Z.)

14.) Theorem: (The Completeness of Armstrong's Axioms) Let R(S,F) be a relation and let f be a FD in F+. Then we can derive, in a finite number of applications of the three Armstrong Axioms, f from the FD's in F.

This theorem does not tell you how to derive f from the FD's in F, it just tells you that it can always be done. Discovering the right sequence of axioms can be challenging. (We will not be using this theorem, so we will not prove it.)

15.) Let R(S,F) be a relation and let X be a subset of the attributes in the schema S. The attribute closure of X, denoted by X+F, is the set of all attributes A from S such that the FD X->A is entailed by F (that is X->A ∈ F+).

The following theorem shows how closely related are the notions of closure and attribute closure. Its proof is an exercise in using Armstrong's axioms.

Theorem: Let R(S,F) be a relation. Then X->Y ∈ F+ if and only if Y ⊆ X+F.

Attribute closure turns out to be a very useful way to work with the entailment and closure of FD's. The reason why is that there is a simple algorithm for computing attribute closures.

16.) Let R(S,F) be a relation and let X be a subset of the attributes in the schema S. Then the following Attribute Closure Algorithm computes X+F.

   closure = X;
   repeat
      previous_closure = closure;
   if there exist sets Z, Y such that (Z->Y ∈ F) && (Z ⊆ closure) && (Y ⊈ closure) then
      closure = closure ∪ Y
   until ( closure == previous_closure );
   return closure;
To see how this algorithm works, first notice that if A is an attribute in X, then X->A is entailed by F by Armstrong's first axiom, since X->A is a trivial FD. So we see that X ⊆ X+F. So the algorithm starts by letting the closure begin as X. The algorithm then grows the closure one step at a time by finding an FD Z->Y where all the attributes of Z are already in the closure but not all the attributes of Y are in the closure yet. We need to show that every attribute in Y is in fact in the attribute closure X+F. If A is an attribute from Y, then Y->A is entailed by F since this is a trivial FD. We are assuming that for every attribute B ∈ Z we have X->B entailed by F. So by the Union axiom in paragraph (13) above we have X->Z entailed by F. So now we have X->Z, Z->Y, and Y->A, so by transitivity, X->A, so A is in the closure for every attribute A in Y. The algorithm halts when we can no longer find another FD Z->Y with all the attributes of Z already in the closure but some attribute of Y not in the closure. (Note: So far, we have really only proven that the attribute closure algorithm gives us a set that is a subset of the attribute closure, not the whole closure.)

17.) Let R(S,F) be a relation. A subset X of attributes from the schema S is a super key for the relation R if X+F = S, that is, the attribute closure of X is the whole schema S.

18.) Let R(S,F) be a relation. A subset X of attributes from S is a key for the relation R if X is a super key for R and no proper subset of X is also a super key. In other words, keys are minimal super keys. If you remove even a single attribute from a key, the result is not even a super key, let alone a key.

Note that many textbooks make the incorrect statement that the values of a tuple on a key determine the values of the tuple on all the other attributes. Here is an example that shows this to be wrong. Suppose we have the simple relation R( S(A,B,C), {A->BC} ). Notice that the single FD says that the single attribute set {A} is a key. Suppose we begin to fill in an instance of this relation and we start with the following partially filled in tuple.
A B C
1 - -
Does the value of 1 for the key attribute A determine the values for the other two attributes? No, not at all. The other two attributes are not at all constrained (except by their respective types). Suppose that we fill in these two attributes.
A B C
1 8 4
Suppose that we now wish to update this tuple. Does the key value of 1 influence in any way how we can update this tuple? Again, the answer is no. We can change the values in this tuple any way we wish. The key value of 1 in no way determines the values of the other attributes.

19.) The following theorem could be taken as the definition of a super key. However, the definition for super key that we gave above will be easier for us to compute with. As an exercise in using some of the above ideas, you should try to prove this theorem.

Theorem: If X is a subset of the attributes from a relation R(S,F), then X is a super key for the relation if and only if the FD X->S is entailed by F. In other words, X is a super key if and only if X->S ∈ F+.

Notice that in this theorem you cannot replace "super key" with "key". The resulting theorem will not be true. However, it will be true in one direction (which direction?) and false in the other direction.

20.) Here is an algorithm that determines if a subset X of attributes from a relation R(S,F) is a key for the relation. First, compute the attribute closure X+F (using the algorithm from paragraph 16). If the attribute closure is not S, then X is not a super key, and so it is not a key. If the attribute closure is S, then X is a super key, and we have to determine if it is in fact a key. For each attribute A ∈ X, compute the attribute closure of X - {A}. If any one of these attribute closures is S, then X is not a key (since X contains a super key). If every one of these attribute closures is not S, then X is a key.
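The key-testing algorithm of this paragraph, sketched in Python on top of the attribute closure algorithm from paragraph (16) (helper names are our own):

```python
def attribute_closure(X, fds):
    closure = set(X)
    changed = True
    while changed:
        changed = False
        for Z, Y in fds:
            if Z <= closure and not Y <= closure:
                closure |= Y
                changed = True
    return closure

def is_superkey(X, S, fds):
    # X is a super key when its attribute closure is the whole schema S.
    return attribute_closure(X, fds) == set(S)

def is_key(X, S, fds):
    # A key is a super key none of whose one-attribute-smaller
    # subsets is still a super key.
    return (is_superkey(X, S, fds) and
            not any(is_superkey(set(X) - {a}, S, fds) for a in X))

S = {"A", "B", "C"}
F = [({"A", "B"}, {"C"}), ({"C"}, {"A"})]
print(is_key({"A", "B"}, S, F))       # True
print(is_key({"A", "C"}, S, F))       # False: not even a super key
print(is_key({"A", "B", "C"}, S, F))  # False: a super key, but not minimal
```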

21.) Let us show that every super key contains a key. We prove this by providing an algorithm for finding a key contained in any super key. Let X be a super key for the relation R(S,F). Use the algorithm from the last paragraph to determine if X is a key. If it is, we are done. If X is not a key, then there is an attribute A in X such that X - {A} is a super key. Let X - {A} be our new super key and apply this algorithm to it. Eventually we must find a key, since our super key keeps on shrinking and it can't shrink forever. In particular, if the super key shrinks to a single attribute, then the super key is a key since the empty set cannot be a super key (the attribute closure of the empty set is the empty set).

22.) By using the algorithm for finding a key in a super key, we can prove that every relation R(S,F) has a key. Since the attribute closure of S is S, the set S is always a super key for the relation R(S,F). And since every super key must contain a key, the relation R(S,F) must have a key.

23.) Here is an algorithm that finds all the keys of a relation R(S,F). For each attribute A ∈ S, determine if the set {A} is a key for the relation. After finding all the single attribute keys, find all the double attribute keys. That is, for each pair of attributes {A,B} where neither {A} nor {B} is a key, determine if the set {A,B} is a key. After finding all the double attribute keys, find all the triple attribute keys. That is, for each triple of attributes {A,B,C} such that no single attribute and no pair of attributes from the triple is a key, determine if the set {A,B,C} is a key. Keep on going this way until you have checked for every possible size of key.
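This enumeration can be sketched in Python; the "skip any set containing a smaller key" test implements the pruning described above:

```python
from itertools import combinations

def attribute_closure(X, fds):
    closure = set(X)
    changed = True
    while changed:
        changed = False
        for Z, Y in fds:
            if Z <= closure and not Y <= closure:
                closure |= Y
                changed = True
    return closure

def all_keys(S, fds):
    """Enumerate candidate subsets by increasing size; a set containing a
    previously found (smaller) key is not minimal, so it is skipped."""
    keys = []
    for size in range(1, len(S) + 1):
        for combo in combinations(sorted(S), size):
            X = set(combo)
            if any(k <= X for k in keys):
                continue
            if attribute_closure(X, fds) == set(S):
                keys.append(X)
    return keys

print(all_keys({"A", "B", "C"}, [({"A", "B"}, {"C"}), ({"C"}, {"A"})]))
print(all_keys({"A", "B", "C"}, [({"A"}, {"B", "C"}), ({"C"}, {"A"})]))
```

On the two example relations this reproduces the keys computed by hand in paragraph (24): {A,B} and {B,C} for the first relation, {A} and {C} for the second.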

24.) Let us do a concrete example involving the calculation of keys. Let us compute the keys for the two relations from paragraphs (7), (8) and (12)

R( S(A,B,C), {AB->C, C->A} )
R( S(A,B,C), {A->BC, C->A} ).
Let us start with the first relation. Notice that the attribute closure of {A} is {A} and the attribute closure of {B} is {B}, so neither A nor B is a key. Now let us compute the attribute closure of {C}. The FD C->A allows us, after one iteration of the attribute closure algorithm, to grow the set {C} to the set {A,C}. But we cannot iterate the attribute closure algorithm again, so the attribute closure of {C} is {A,C}, so {C} is not a key. So this relation does not have any single attribute keys. Let us now look at pairs of attributes. The set {A,B} is a super key since the FD AB->C all by itself implies that the attribute closure of {A,B} is all of the attributes. Since neither of the individual attributes is a key, the super key {A,B} is in fact a key. It is not hard to see that the attribute closure of the set {A,C} is itself, so {A,C} is not a key. Now let us compute the attribute closure of {B,C}. The FD C->A allows us, after one iteration of the attribute closure algorithm, to grow the set {B,C} to the set {A,B,C}, so {B,C} is a super key. Since neither of the individual attributes is a key, the super key {B,C} is in fact a key. So this relation has two keys, the sets {A,B} and {B,C}.

Now let us find the keys for the second relation. The set {A} is a key since the FD A->BC all by itself implies that the attribute closure of {A} is all of the attributes. Now let us compute the attribute closure of {C}. The FD C->A allows us, after one iteration of the attribute closure algorithm, to grow the set {C} to the set {A,C}. Then the FD A->BC allows us, after another iteration of the attribute closure algorithm, to grow the set {A,C} to the set {A,B,C}. So now we know that the attribute closure of {C} is S. So {C} is also a key for the second relation. Now, notice that the attribute closure of {B} is {B}, so {B} is not a key. So we know that the two attributes A and C are keys. There are no pairs of attributes which do not contain a key, so none of the pairs of attributes is a key. So A and C are the only two keys for the second relation.

25.) A relation R(S,F) is in Third Normal Form, or 3NF, if, for every functional dependency X->Y ∈ F, one of the following three statements is true.

  1. Y ⊂ X. (In other words, this is a trivial functional dependency.)
  2. X is a super key.
  3. Each attribute in Y - X is contained in some key.
Note that these three statements are not mutually exclusive. So for some functional dependencies it may be that more than one of them is true. The requirement for 3NF is that for each functional dependency in F at least one of these three statements is true.

26.) A relation R(S,F) is in Boyce-Codd Normal Form, or BCNF, if, for every functional dependency X->Y ∈ F, one of the following two statements is true.

  1. Y ⊂ X. (In other words, this is a trivial functional dependency.)
  2. X is a super key.
Note that these two statements are the first two statements from the definition of 3NF. So any relation that is in BCNF is certainly in 3NF.

27.) Notice that both the definition of 3NF and the definition of BCNF is a statement about a relation. It is not a statement about tables over a relation. It doesn't make sense to state that a table is in 3NF or BCNF. Those are statements that we make about the relation that the table is an instance of.

It is worth thinking for a moment about what it would take to determine if a relation is in 3NF or BCNF. Given a relation R(S,F), the question of whether it is in 3NF or BCNF is mostly a question about the functional dependencies in F. In particular, it is mostly a question about the super keys and keys for the relation. To test if R(S,F) is in BCNF, for each functional dependency in F we first should check if it is trivial, and this is easy to do. If the FD X->Y is not trivial, then we should check if the left hand side X of the FD is a super key. This is not too hard to do. We just use the Attribute Closure Algorithm to compute X+F and see if it is S. So testing if a relation is in BCNF is quite doable. To test if a relation is in 3NF, we might need to test for the third of the three statements. To test the third statement, we need to take each attribute that is in Y, but not in X, and determine if that attribute is in a key. To determine if an attribute A is in some key, we either need to know what all the keys are (which is very difficult to compute) or we need to take every subset of S that contains A and determine if that subset is a key. This is a lot of work (though not as much work as finding all of the keys). So BCNF is computationally an easier normal form to check for than 3NF.
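The easy BCNF test described above can be sketched in Python (the 3NF test would additionally need the keys, as just discussed, so only BCNF is shown here):

```python
def attribute_closure(X, fds):
    closure = set(X)
    changed = True
    while changed:
        changed = False
        for Z, Y in fds:
            if Z <= closure and not Y <= closure:
                closure |= Y
                changed = True
    return closure

def is_bcnf(S, fds):
    S = set(S)
    for X, Y in fds:
        if Y <= set(X):                     # trivial FD: allowed
            continue
        if attribute_closure(X, fds) == S:  # left side is a super key
            continue
        return False                        # this FD violates BCNF
    return True

F1 = [({"A", "B"}, {"C"}), ({"C"}, {"A"})]  # 3NF but not BCNF
F2 = [({"A"}, {"B", "C"}), ({"C"}, {"A"})]  # BCNF
print(is_bcnf({"A", "B", "C"}, F1))  # False: C->A has no super key on the left
print(is_bcnf({"A", "B", "C"}, F2))  # True
```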

28.) Let us determine if the two relations from paragraphs (7), (8), (12) and (24) are in 3NF or BCNF.

R( S(A,B,C), {AB->C, C->A} )
R( S(A,B,C), {A->BC, C->A} ).
Let us do the second relation first. We showed that it has only two keys, {A} and {C}. But this shows us right away that each of the two FD's in the relation has a key as its left hand side and therefore satisfies the second statement in the definition of BCNF. So the second relation is in BCNF.

For the first of the two relations, recall that it has two keys, {A,B} and {B,C}. So the first of the two FD's in the relation satisfies the second statement in the definitions of both 3NF and BCNF. But the second FD in this relation is not a trivial FD and it does not satisfy the second statement in the definition of BCNF, so this relation is not in BCNF. But the second FD in this relation does have a right hand side that is contained in the key {A,B}. So the second FD satisfies the third statement in the definition of 3NF. So the first relation is in 3NF.

29.) A decomposition of the schema from a relation R(S,F) is a pair of proper subsets S1,S2 of S such that S1 ∪ S2 = S.

A decomposition of a schema is said to be a lossless decomposition if the following identity holds for every table T over the relation R(S,F).

T = proj(S1, T) join proj(S2, T)
where join is the relational algebra natural join operator and proj is the relational algebra projection operator.

30.) Lemma: Let S1, S2 be a decomposition of the schema from the relation R(S,F). Then for every table T over the relation R(S,F) we have the following.

T ⊆ proj(S1, T) join proj(S2, T)
In other words, the join of the two projections will always contain all of the original tuples from T, but it might also contain extra tuples. To prove this fact, let t be a tuple from T, let t[S1] denote the tuple in proj(S1,T) that t projects onto, and let t[S2] denote the tuple in proj(S2,T) that t projects onto. Recall that the first step in computing the natural join is to take the product of the tables proj(S1,T) and proj(S2,T). In this product table there will be a tuple that is the concatenation of the tuples t[S1] and t[S2]. The next step in computing the natural join is to remove from the product table those tuples that do not agree in the duplicated attributes that come from S1 ∩ S2. But since t[S1] and t[S2] are projected from the same tuple t, these two tuples will agree on the attributes in S1 ∩ S2. So the tuple in the product that is the concatenation of these two tuples will not be deleted from the product. The last step in computing the natural join is to remove from the product table one copy of each duplicate attribute from S1 ∩ S2. But this step will make the concatenation of t[S1] and t[S2] back into the tuple t. This shows that t is in the join, and so it proves that T is a subset of the join.

To see that T might be a proper subset of the above join, let us look at a specific example. Consider the relation R( S(A,B,C), {A->BC, C->A} ) and let us use the decomposition S1={A,B} and S2={B,C}. A table T over this relation is
A B C
1 2 4
2 2 5
The projections of T look like the following tables
A B        B C
1 2        2 4
2 2        2 5
And proj(S1, T) join proj(S2, T), the join of these two projections, looks like this.
A B C
1 2 4
1 2 5
2 2 4
2 2 5
Notice two facts about this last table. Not only is it larger than the original table T, but this table is not even an instance of the relation R(S,F) that we started with. In the process of taking an instance of the relation R(S,F), decomposing it, and then joining the two projections, some constraint (i.e., some FD) from R(S,F) ended up getting lost.
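This lossy example can be reproduced with a small Python sketch; project and natural_join below are our own minimal versions of the relational operators, with rows represented as dictionaries:

```python
def project(table, attrs):
    """Projection: keep only attrs, removing duplicate rows."""
    rows = []
    for t in table:
        row = {a: t[a] for a in attrs}
        if row not in rows:
            rows.append(row)
    return rows

def natural_join(t1, t2):
    """Natural join: combine rows that agree on the shared attributes."""
    common = set(t1[0]) & set(t2[0])
    return [{**r1, **r2}
            for r1 in t1 for r2 in t2
            if all(r1[a] == r2[a] for a in common)]

# The table T from this example, over S(A,B,C)
T = [{"A": 1, "B": 2, "C": 4}, {"A": 2, "B": 2, "C": 5}]
joined = natural_join(project(T, {"A", "B"}), project(T, {"B", "C"}))
print(len(joined))  # 4: the two original tuples plus two extra ones
```

The join on the shared attribute B produces four rows, confirming that T is a proper subset of the join of its projections.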

31.) Theorem: If S1, S2 is a decomposition of the schema from the relation R(S,F), and if either of the following two statements is true

  1. S1 ⊆ (S1 ∩ S2)+F, or
  2. S2 ⊆ (S1 ∩ S2)+F
then the decomposition is a lossless decomposition.

Notice that the test for a lossless decomposition is that the attribute closure of S1∩S2 contains either S1 or S2. Some books state this condition as "the set S1∩S2 must contain a key for S1 or a key for S2". The problem with this statement is that neither S1 nor S2 is a relation, so they do not have keys (but one of the steps below shows how, under certain circumstances, to make S1 and S2 into relations).
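This lossless test is a one-line check on top of the attribute closure algorithm. A Python sketch (function names are our own):

```python
def attribute_closure(X, fds):
    closure = set(X)
    changed = True
    while changed:
        changed = False
        for Z, Y in fds:
            if Z <= closure and not Y <= closure:
                closure |= Y
                changed = True
    return closure

def is_lossless(S1, S2, fds):
    # Lossless when the closure of the intersection covers S1 or S2.
    cl = attribute_closure(set(S1) & set(S2), fds)
    return set(S1) <= cl or set(S2) <= cl

F = [({"A"}, {"B", "C"}), ({"C"}, {"A"})]
print(is_lossless({"A", "B"}, {"B", "C"}, F))  # False: the lossy example of paragraph (30)
print(is_lossless({"A", "B"}, {"A", "C"}, F))  # True: {A} is a key, so its closure covers everything
```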

32.) Let S1,S2 be a decomposition of the schema from the relation R(S,F). Suppose we have a way of choosing functional dependencies F1 and F2 for the schemas S1 and S2. Then we can define the relations R1(S1,F1) and R2(S2,F2). Once we have the relations R1(S1,F1) and R2(S2,F2) we can talk about tables which are instances of those relations. This lets us ask the following two questions (which are really questions about the two sets of functional dependencies F1 and F2).

  1. If table T is an instance of R(S,F), is proj(S1,T) an instance of R1(S1,F1), and is proj(S2,T) an instance of R2(S2,F2)?
  2. If T1 is an instance of R1(S1,F1) and T2 is an instance of R2(S2,F2), is T1 join T2 an instance of R(S,F)?
Notice, for example, that if the set F1 contains a large number of FD's (more than in F say), then there are a lot of constraints placed on the instances of the relation R1(S1,F1), and so it is unlikely that proj(S1,T) will be an instance of R1(S1,F1). On the other hand, if the sets F1 and F2 contain very few FD's, then there are very few constraints placed on the instances T1 and T2 of the relations R1(S1,F1) and R2(S2,F2), and so it is unlikely that T1 join T2 will be an instance of R(S,F). What we want to do in the next few paragraphs is discuss conditions on the sets of functional dependencies F1 and F2 under which the answers to the above two questions are yes.

33.) Let S1,S2 be a decomposition of the schema from the relation R(S,F) and let F1 and F2 be sets of functional dependencies on the schemas S1,S2. We say that the relations R1(S1,F1) and R2(S2,F2) are a dependency-preserving decomposition of R(S,F) if T1 join T2 is an instance of R(S,F) for any instances T1 and T2 of relations R1(S1,F1) and R2(S2,F2).

34.) Theorem: Let S1,S2 be a decomposition of the schema from the relation R(S,F) and let F1 and F2 be sets of functional dependencies on the schemas S1,S2. Then R1(S1,F1) and R2(S2,F2) is a dependency-preserving decomposition of R(S,F) if and only if (F1 ∪ F2)+ = F+.

Unfortunately, this theorem is not very useful. Because working with closures is so difficult, this theorem does not give us a practical way to determine the sets of functional dependencies F1 and F2.

35.) Whenever we decompose a relation, we definitely need the decomposition to be lossless and we would like the decomposition to be dependency preserving. However, not all relations have a lossless, dependency preserving decomposition. Here is an example of one. In paragraph (28) we showed that the relation

R(S,F) = R({A,B,C}, {AB->C, C->A})
is in 3NF but, as we also showed, not in BCNF. Because the relation is not in BCNF, its tables can contain redundancy which we would like to remove by decomposing the relation. Unfortunately, there is no lossless, dependency-preserving decomposition of this relation. To show this, notice that any schema with exactly three attributes has only three decompositions into two two-attribute subsets, S1={A,B}, S2={A,C} or S1={A,B}, S2={B,C} or S1={C,A}, S2={C,B}. (A decomposition in which one subset is a single attribute must have an empty intersection, and since no FD in F has an empty left hand side, such a decomposition cannot be lossless.) It is not hard to use the condition for a lossless decomposition from paragraph (31) to see that only S1={C,A}, S2={C,B} works as a lossless decomposition. Let us now show that this cannot be made into a dependency preserving decomposition. So far we have just the schemas for a decomposition. We need to show that for any choice of functional dependencies F1 and F2, the relations R1(S1,F1) and R2(S2,F2) are not dependency preserving. Consider the following two tables, which are both instances of R(S,F).
A B C        A B C
3 1 1        1 4 1
3 2 2        2 4 2
Here are the projections of these two tables onto the schemas S1={C,A} and S2={C,B} respectively. Notice that no matter what choices we make for the functional dependencies F1 and F2, these two tables must be instances of the relations R1(S1,F1) and R2(S2,F2) respectively, otherwise we cannot even make the decompositions.
A C        B C
3 1        4 1
3 2        4 2
Here is the natural join of the above two tables. Notice that this table is not an instance of R(S,F). This shows that there is no way to make our decomposition S1={C,A}, S2={C,B} into a dependency-preserving decomposition.
A B C
3 4 1
3 4 2
As we will see shortly, the fact that some relations do not have lossless, dependency-preserving decompositions means that the algorithm for decomposing relations into BCNF cannot guarantee that the resulting decompositions are dependency-preserving (but the algorithm for decomposing relations into 3NF will guarantee that the decompositions are dependency-preserving).

NOTE: You should check that the example we did in this paragraph is the HasAccount example from page 217 of our textbook.

36.) Let S1,S2 be a decomposition of the schema from the relation R(S,F). Here is one way to choose the functional dependencies F1 and F2 for the schemas S1 and S2. The projection of F onto S1 is the set of all functional dependencies from F+ that involve attributes only from S1. Similarly, the projection of F onto S2 is the set of all functional dependencies from F+ that involve attributes only from S2.

Notice the use of F+ in the definition of the projection of F onto a subset of attributes. If we replace F+ in this definition with just F, then our decomposed relations R1(S1,F1) and R2(S2,F2) may not contain enough functional dependencies and the resulting decomposition may not be dependency-preserving (remember, the fewer constraints there are in F1 and F2, the more likely it is that when we join tables from R1(S1,F1) and R2(S2,F2), the resulting table will not obey all the constraints in R(S,F) ).

It is important to realize that giving the decomposition the functional dependencies defined by the above projection does not guarantee that the decomposition will be dependency-preserving. Some relations just do not have dependency-preserving decompositions. (We will see later what the 3NF decomposition algorithm has to do in order to guarantee that its decompositions are dependency-preserving.)

37.) Let R(S,F) be a relation that is not in BCNF. Since the relation is not in BCNF, we can find a functional dependency of the form X->A where A is a single attribute and X is not a super key. Let

S1=S-{A}
S2=X ∪ {A}.
Let F1 and F2 be the projections of F onto S1 and S2. Then the relations R1(S1,F1) and R2(S2,F2) are a lossless decomposition of R(S,F).

To prove that the decomposition is lossless, we need to compute (S1 ∩ S2)+F. Since the FD X->A violates BCNF, it is not a trivial FD, so A is not in X. Therefore, S1 ∩ S2 = X. Since X->A, one iteration of the attribute closure algorithm grows X to X ∪ {A}, so we have S2 ⊆ X+F. Therefore S2 ⊆ (S1 ∩ S2)+F, which by paragraph (31) means that our decomposition is lossless.

38.) BCNF Decomposition Algorithm Let R(S,F) be a relation that is not in BCNF. Apply the step in paragraph (37) to get the lossless decomposition R1(S1,F1) and R2(S2,F2). If one (or both) of the relations R1(S1,F1) and R2(S2,F2) is not in BCNF, then recursively apply this step to that relation.
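The whole algorithm can be sketched in Python. The project_fds helper computes a cover of the projection of F from paragraph (36) by taking the attribute closure of every subset of the smaller schema; function names are our own, and this brute-force projection is exponential in the number of attributes:

```python
from itertools import combinations

def attribute_closure(X, fds):
    closure = set(X)
    changed = True
    while changed:
        changed = False
        for Z, Y in fds:
            if Z <= closure and not Y <= closure:
                closure |= Y
                changed = True
    return closure

def project_fds(fds, attrs):
    """A cover of the projection of F onto attrs: for every subset X of
    attrs, record X -> (closure(X) ∩ attrs) when that adds something."""
    attrs = set(attrs)
    out = []
    for size in range(1, len(attrs) + 1):
        for combo in combinations(sorted(attrs), size):
            X = set(combo)
            Y = (attribute_closure(X, fds) & attrs) - X
            if Y:
                out.append((X, Y))
    return out

def bcnf_decompose(S, fds):
    S = set(S)
    for X, Y in fds:
        if Y <= set(X) or attribute_closure(X, fds) == S:
            continue                    # this FD does not violate BCNF
        A = sorted(Y - set(X))[0]       # X->A is a BCNF violation
        S1, S2 = S - {A}, set(X) | {A}
        return (bcnf_decompose(S1, project_fds(fds, S1)) +
                bcnf_decompose(S2, project_fds(fds, S2)))
    return [S]                          # every FD is fine: S is in BCNF

print(bcnf_decompose({"A", "B", "C", "D"}, [({"A"}, {"B"}), ({"C"}, {"D"})]))
```

On the relation of paragraph (39) this yields the schemas {A,C}, {C,D} and {A,B}; the splits happen in a different order than in the text (this sketch picks the violation A->B first), but the final schemas agree.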

39.) Example Let us decompose the relation R(S,F) = R({A,B,C,D}, {A->B, C->D}). You should check that this relation has just one key, {A,C}. So both of the FD's in R(S,F) violate BCNF. Let us apply one step of the BCNF decomposition algorithm using the FD C->D. Then S1={A,B,C} and S2={C,D}. And F1={A->B} and F2={C->D}.

You should check that the relation R2(S2,F2) = R2({C,D}, {C->D}) is in BCNF. On the other hand, the relation R1(S1,F1) = R1({A,B,C}, {A->B}) has the single key {A,C}, so the FD A->B still violates BCNF. So we apply one step of the decomposition algorithm using A->B. Then S3={A,C} and S4={A,B}. And F3={} and F4={A->B}.

You should check that the relations R3(S3,F3) = R3({A,C}, {}) and R4(S4,F4) = R4({A,B}, {A->B}) are in BCNF.

So the final decomposition is
R2(S2,F2) = R2({C,D}, {C->D}),
R3(S3,F3) = R3({A,C}, {}),
R4(S4,F4) = R4({A,B}, {A->B}).

Let us apply some semantics to this example. Change the attribute A to SupplierID, change B to ZIP, change C to PartNo, and change D to Weight. An instance of this relation might look like this.
SupplierID  ZIP    PartNo  Weight
1           46323  10      3
2           46394  20      7
2           46394  10      3
1           46323  20      7
There is a lot of redundancy in this table. The normalized tables look like this.
SupplierID  PartNo
1           10
1           20
2           10
2           20

SupplierID  ZIP
1           46323
2           46394

PartNo  Weight
10      3
20      7
Notice that this is just the famous Suppliers-Parts database. The normalization algorithm factored the original table into two "entity" tables and one "relationship" table.

40.) As a partial summary of the above material, here is a list of calculations that you should be able to do when given a relation R(S,F).

  1. Given a subset X of S, compute the attribute closure X+F.
  2. Given a subset X of S, determine if X is a super key.
  3. Given a subset X of S, determine if X is a key.
  4. Compute all the keys of R(S,F).
  5. Determine if the relation R(S,F) is in 3NF or BCNF (or neither).
  6. Given a decomposition of the schema S, determine if the decomposition is lossless.
  7. Given a decomposition S1, S2 of the schema S, compute the projections F1, F2 of F onto S1 and S2.
  8. Given a decomposition of the schema S, determine if the decomposition is dependency-preserving.
    If it is not, be able to give an example of how a specific FD is not preserved in a natural join.
  9. Given a FD from F that violates BCNF, compute the lossless decomposition R1(S1,F1) and R2(S2,F2).
  10. Compute the complete BCNF decomposition of a relation that is not in BCNF.

