** An Extensive Examination of Data Structures **
** Part 6: Efficiently Representing Sets **
Scott Mitchell
4GuysFromRolla.com
April 2004
** Summary: ** Scott Mitchell discusses data structures for implementing general and disjoint sets. A set is an unordered collection of unique items that can be enumerated and compared to other sets in a variety of ways. (20 printed pages)
Download the Sets.msi sample file .
** Contents **
Introduction
The Fundamentals of Sets
Implementing an Efficient Set Data Structure
Maintaining a Collection of Disjoint Sets
References
Related Books
** Introduction **
One of the most basic mathematical constructs is a set , which is an unordered collection of unique objects. The objects contained within a set are referred to as the set's elements . Formally, a set is denoted as a capital, italic letter, with its elements appearing within curly braces ( {...} ). Examples of this notation can be seen below:
_ S _ = { 1, 3, 5, 7, 9 }
_ T _ = { Scott, Jisun, Sam }
_ U _ = { -4, 3.14159, Todd, x }
In mathematics, sets are typically comprised of numbers, such as set S above, which contains the odd positive integers less than 10. However, notice that the elements of a set can be anything—numbers, people, strings, letters, variables, and so on. Set T , for example, contains peoples' names, and set U contains a mix of numbers, names, and variables.
In this article, we'll start with a basic introduction of sets, including common notation and the operations that can be performed on sets. Following that, we'll examine how to efficiently implement a set data structure with a defined universe. The article concludes with an examination of disjoint sets, and the best data structures to use.
** The Fundamentals of Sets **
Recall that a set is simply a collection of elements. The "element of" operator, denoted x Î S , implies that x is an element in the set S . For example, if set S contains the odd positive integers less than 10, then 1 Î S . When reading such notation, you'd say, "1 is an element of S ." In addition to 1 being an element of S , we have 3 Î S , 5 Î S , 7 Î S , and 9 Î S . The "not an element of" operator, denoted x Ï S , means that x is not an element of set S .
The number of unique elements in a set is the set's cardinality . The set {1, 2, 3} has cardinality 3, just as does the set {1, 1, 1, 1, 1, 1, 1, 2, 3} (because it only has three unique elements). A set may have no elements in it at all. Such a set is called the empty set , and is denoted as {} or Æ , and has a cardinality of 0.
When first learning about sets, many developers assume they are the same as collections, like an ArrayList. However, there are some subtle differences. An ArrayList is an ordered collection of elements—each element in an ArrayList has an associated ordinal index, which implies order. Too, there can be duplicate elements in an ArrayList.
A set, on the other hand, is unordered and contains unique items. Since sets are unordered, the elements of a set may be listed in any order. That is, the sets {1, 2, 3} and {3, 1, 2} are considered equivalent. Also, any duplicates in a set are considered redundant. The set {1, 1, 1, 2, 3} and the set {1, 2, 3} are equivalent. Two sets are equivalent if they have the same elements. (Equivalence is denoted with the = sign; if S and T are equivalent they are written as S = T .)
** Note??? ** In mathematics, an ordered collection of elements that allows duplicates is referred to as a list . Two lists, L 1 and L 2 are considered equal if and only if for i ranging from 1 to the number of elements in the list the i th element in L 1 equals the i th element in L 2 .
Typically the elements that can appear in a set are restricted to some universe . The universe is the set of all possible values that can appear in a set. For example, we might only be interested in working with sets whose universe are integers. By restricting the universe to integers, we can't have a set that has a non-integer element, like 8.125, or Sam. (The universe is denoted as the set U .)
** Relational Operators of Sets **
There are a bevy of relational operators that are commonly used with numbers. Some of the more often used ones, especially in programming languages, include < , <= , = , != , > , and >= . A relational operator determines if the operand on the left hand side is related to the operand on the right hand side based on criteria defined by the relational operator. Relational operators return a "true" or "false" value, indicating whether or not the relationship holds between the operands. For example, x < y returns true if x is less than y , and false otherwise. (Of course the meaning of "less than" depends on the data type of x and y .)
Relational operators like < , <= , = , != , > , and >= are typically used with numbers. Sets, as we've seen, use the = relational operator to indicate that two sets are equivalent (and can likewise use != to denote that two sets are not equivalent), but relational operators < , <= , > , and >= are not defined for sets. After all, how is one to determine if the set {1, 2, 3} is less than the set {Scott, 3.14159}?
Instead of notions of < and <= , sets use the relational operators subset and proper subset , denoted Í and Ì , respectively. (Some older texts will use Ì for subset and Í for proper subset.) S is a subset of T —denoted S Í T —if every element in S is in T . That is, S is a subset of T if it is contained within T . If S = {1, 2, 3}, and T = {0, 1, 2, 3, 4, 5}, then S Í T since every element in S —1, 2 and 3—is an element in T . S is a proper subset of T —denoted S Ì T —if S Í T and S ¹ T . That is, if S = {1, 2, 3} and T = {1, 2, 3}, then S Í T since every element in S is an element in T , but S T since S = T . (Notice that there is a similarity between the relational operators < and <= for numbers and the relational operators Í and Ì for sets.)
Using the new subset operator, we can more formally define set equality. Given sets S and T , S = T if and only if S Í T and T Í S . In English, S and T are equivalent if and only if every element in S is in T , and every element in T is in S .
** Note??? ** Since Í is analogous to <= , it would make sense that there exists a set relational operator analogous to >= . This relational operator is called superset , and is denoted Ê ; a proper superset is denoted . Like with <= and >= , S Ê T if and only if T Í S .
** Set Operations **
As with the relational operators, many operations defined for numbers don't translate well to sets. Common operations on numbers include addition, multiplication, subtraction, exponentiation, and so on. For sets, there are four basic operations:
1. ????????????????? ** Union ** – the union of two sets, denoted S È T , is akin to addition for numbers. The union operator returns a set that contains all of the elements in S and all of the elements in T . For example, {1, 2, 3} È {2, 4, 6} equals {1, 2, 3, 2, 4, 6}. (The duplicate 2 can be removed to provide a more concise answer, yielding {1, 2, 3, 4, 6}.) Formally, S È T = { x : x Î S or x Î T }. In English, this translates to S union T results in the set that contains an element x if x is in S or in T .
2. ????????????????? ** Intersection __ ** – **__ ** the intersection of two sets, denoted S Ç T , is the set of elements that S and T have in common. For example, {1, 2, 3} Ç {2, 4, 6} equals {2}, since that's the only element both {1, 2, 3} and {2, 4, 6} share in common. Formally, S Ç T = { x : x Î S and x Î T }. In English, this translates to S intersect T results in the set that contains an element x if x is both in S and in T .
3. ????????????????? ** Difference ** – the difference of two sets, denoted S - T , are all of the elements in S that are not in T . For example, {1, 2, 3} - {2, 4, 6} equals {1, 3}, since 1 and 3 are the elements in S that are not in T . Formally, S - T = { x : x Î S and x Ï T }. In English, this translates to S set difference T results in the set that contains an element x if x is in S and not in T .
4. ????????????????? ** Complement ** – Earlier we discussed how typically sets are limited to a known universe of possible values, such as the integers. The complement of a set, denoted S ', is U - S . (Recall that U is the universe set.) If our universe is the integers 1 through 10, and S = {1, 4, 9, 10}, then S ' = {2, 3, 5, 6, 7, 8}. (Complementing a set is akin to negating a number. Just like negating a number twice will give you the original number back—that is, -- x = x —complementing a set twice will give you the original set back— S '' = S .)
When examining new operations, it is always important to get a solid grasp on the nature of the operations. Some questions to ask yourself when learning about any operation—be it one defined for numbers or one defined for sets—are:
· ???????????????????? Is the operation commutative? An operator op is commutative if x op y is equivalent to y op x . In the realm of numbers, addition is an example of a commutative operator, while division is not commutative.
· ???????????????????? Is the operation associative? That is, does the order of operations matter. If an operator op is associative, then x op ( y op z ) is equivalent to ( x op y ) op z . Again, in the realm of numbers addition is associative, but division is not.
For sets, the union and intersection operations are both commutative and associative. S È T is equivalent to T È S , and S È ( T È V ) is equivalent to ( S È T ) È V . Set difference, however, is neither commutative nor associative. (To see that set difference is not commutative, consider that {1, 2, 3} - {3, 4, 5} = {1, 2}, but {3, 4, 5} - {1, 2, 3} = {4, 5}.)
** Finite Sets and Infinite Sets **
All of the set examples we've looked at thus far have dealt with finite sets. A finite set is a set that has a finite number of elements. While it may seem counterintuitive at first, a set can contain an infinite number of elements. The set of positive integers, for example, is an infinite set since there are no bounds to the number of elements in the set.
In mathematics, there are a couple infinite sets that are used so often that they are given a special symbol to represent them. These include:
· ???????????????????? N = {0, 1, 2, …}
· ???????????????????? Z = {…, -2, -1, 0, 1, 2, …}
· ???????????????????? Q = {a/b: a Î Z , b Î Z, and b ¹ 0}
· ???????????????????? R = set of real numbers
N is the set of natural numbers , or positive integers greater than or equal to 0. Z is the set of integers . Q is the set of rational numbers , which are numbers that can be expressed as a fraction of two integers. Finally, R is the set of real numbers , which are all rational numbers, plus irrational numbers as well (numbers that cannot be expressed as a fraction of two integers, such as pi , and the square root of 2).
Infinite sets, of course, can't be written down in their entirety, as you'd never finish jotting down the elements, but instead are expressed more tersely using mathematical notation like so:
_ S _ = { x : x Î N and x > 100}
Here S would be the set of all natural numbers greater than 100.
In this article we will be looking at data structures for representing finite sets. While infinite sets definitely have their place in mathematics, rarely will we need to work with infinite sets in a computer program. Too, there are unique challenges with representing and operating upon infinite sets, since an infinite set's contents cannot be completely stored in a data structure or enumerated.
** Note??? ** Computing the cardinality of finite sets is simple—just count up the number of elements. But how does one compute the cardinality of an infinite set? This discussion is beyond the scope of this article, but realize that there are different types of cardinality for infinite sets. Interestingly, there is the same "number" of positive integers, as there are all integers, but there are more real numbers than there are integers.
** Sets in Programming Languages **
C++, C#, Visual Basic .NET, and Java don't provide inherent language features for working with sets. If you want to use sets, you need to create your own set class with the appropriate methods, properties, and logic. (We'll do precisely this in the next section!) There have been programming languages in the past, though, that have offered sets as a fundamental building block in the language. Pascal, for example, provides a set construct that can be used to create sets with an explicitly defined universe. To work with sets, Pascal provides the in operator to determine if an element is in a particular set. The operators + , * , and – are used for union, intersection, and set difference. The following Pascal code illustrates the syntax used to work with sets:
** /* declares a variable named possibleNumbers, a set whose universe is **
** the set of integers between 1 and 100... */ **
** var ? **
** ? possibleNumbers = set of 1..100; **
** ? **
** ... **
** ? **
** /* Assigns the set {1, 45, 23, 87, 14} to possibleNumbers */ **
** possibleNumbers := [1, 45, 23, 87, 14]; **
** ? **
<P class=MsoNormal style="BACKGROUND: #eeeeee; MARGIN: 12pt 0cm; TEXT-ALIGN: left; mso-pagination: widow-orphan; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt" a