Indexing

Indexing Basics

This section includes the fundamentals of indexing including basic concepts, data structures, and their applications.

Dense & Sparse Index

Definition

Dense Index: Dense indexes usually have one-to-one relationship with relation tuples. (There can be extra indexes, but for every tuple, there should be an index pointing to it)

Sparse Index: Sparse indexes have one-to-many relationship with relation tuples. Usually, each sparse index points to a single block of data containing multiple tuples, but there should be at least one sparse index for each block

Secondary Index: Secondary index are indexing to non-primary attributes. Therefore, secondary indexes do not point to tuples/blocks in sequential orders.

Key Points

Dense index reveals the exact location of a target tuple
Sparse index refer to the relative region of the target tuple, we might need extra searching.
Secondary index must be a dense index because files are not sequentially ordered for attributes indexed by the secondary index. Therefore, we cannot infer the location of target tuples from addresses pointed by the secondary index.

Indirection & Multilevel Index

Similar to operating systems, levels of indirection can be really helpful to improve space efficiency in database systems indexing. Indexes (regardless of types) can point to arrays of pointers (a.k.a buckets) where each pointer points to a target tuple.

Furthermore, we can combine all indexing structures mentioned above to form a multi-level indexing. That is, we can add indexes to indexes that index our relation tuples.

Note

All higher level indexes (2+: indexes that do not directly point toward relation tuples) should be sparse indexes because dense indexing has an one-to-one relationship, and hence does not serve any purpose for optimizing space/performance.

Inverted Index

While indexing increases our speed to lookup tuples without having to fetch in all datablocks from the disk and scan for target tuples, aggresive searching in documents, such as text searching, needs better indexing structure than existing ones. Therefore, we can use inverted index which points to a bucket that contains multiple pointers for tuples (documents) containing specific texts. We can then create a sparse index that indexes this array of buckets to conduct operations such as document text search efficiently.

B-Tree & B+Tree

B-Tree and B+Tree are data structures that efficiently store indirect indexing. The only difference between B-Tree and B+Tree is that for a B-Tree, the target node can be everywhere along the traversal path whereas in a B+Tree, the target node can only be in the leaf nodes, and they should all be on the same level.

Operations

In this section, I'll go through how common operations can be done on a B-Tree and a B+Tree. The animation is done using David Galley's algorithm visualization tool

Lookup

B-TreeB+Tree

Insertion

B-TreeB+Tree

Case1: sufficient space in the leaf node (Insert 6000)

Case2: Insufficient leaf node (Insert 6500)

Case3: Insufficient leaft node and parent node (Insert 7006)

Case1: sufficient space in the leaf node (Insert 6000)

Case2: Insufficient leaf node (Insert 6500)

Case3: Insufficient leaft node and parent node (Insert 7006)

Deletion

B-TreeB+Tree

Dynamic Hash Table

We need to dynamically increase the size of hash tables since we do not know how many unique key existing before doing the computation. If size of the hash table is too small, we might face a lot of overflow.

Extensible Hash Table

Uses n most significant bits
Needs a bucket of pointers to keep track of currently using blocks
Bucket double upon expansion
Block split into two for every insertion into a full block, and increment block counter
Whenver block counter == bucket counter (before block splitting), double the bucket, and increment the bucket counter.

Linear HashTable

Uses n least significant bits
Does not need a bucket of pointers
A threshold is set (e.g. 80%) on the ratio of filled spaces and total spaces in the hashtable. Before going above this threshold, the hashtable does not scale up.
When inserting a value into a block that is full, simply use overflow blocks.
Upon breaking the threshold, and another block to the hashtable. If the current encoding bits i cannot encode the number of blocks (not including overflown ones) currently in the hashtable, increment i by 1.