Hash Join AlgorithmThe Hash Join algorithm is used to perform the natural join or equi join operations. The concept behind the Hash join algorithm is to partition the tuples of each given relation into sets. The partition is done on the basis of the same hash value on the join attributes. The hash function provides the hash value. The main goal of using the hash function in the algorithm is to reduce the number of comparisons and increase the efficiency to complete the join operation on the relations. For example, suppose there are two tuples a and b where both of them satisfy the join condition. It means they have the same value for the join attributes. Suppose that both a and b tuples consist of a hash value as i. It implies that tuple a should be in a_{i}, and tuple b should be in b_{i}. Thus, we only compare a tuples in a_{i} with b tuples of b_{i}. We do not need to compare the b tuples in any other partition. Therefore, in this way, the hash join operation works. Hash Join Algorithm//Partition s// for each tuple t_{s} in s do begin i = h(t_{s} [JoinAttrs]); Hsi = Hsi U {t_{s}}; end //Partition r// for each tuple t_{r} in r do begin i = h(t_{r}[JoinAttrs]); Hri = Hri U {t_{r}}; end //Perform the join operation on each partition// for i= 0 to nh do begin read Hsi and build an inmemory hash index on it; for each tuple t_{r} in Hri do begin probe the hash index on Hsi to locate all tuples such that t_{s}[JoinAttrs] = t_{r}[JoinAttrs]; for each matching tuple t_{s} in Hsi do begin add t_{r} ⋈ t_{s} to the result; end end end It is the Hash join algorithm in which we have computed the natural join of two given relations r and s. In the algorithm, there are various terms used: t_{r} ⋈ t_{s}: It defines the concatenation of the attributes of tuple t_{r} and t_{s}, which is further followed by projecting out the repeated attributes. t_{r} and t_{s}: These are the tuples of relations r and s, respectively. Let's understand the hash join algorithm with the following steps: Step 1: In the algorithm, firstly, we have partitioned both relations r and s. Step 2: After partitioning, we perform a separate indexed nestedloop join on each of the partition pairs i using for loop as i = 0 to n_{h}. Step 3: For performing the nestedloop join, it initially creates a hash index on each s_{i }and then probes with tuples from r_{i}. In the algorithm, relation r is the probe input, and relation s is the build input. There is a benefit of using the Hash Join algorithm i.e., the hash index on si is builtin memory, so for fetching the tuples, we do not need to access the disk. It is good to use smaller input relations as the build relations. Recursive Partitioning in Hash JoinRecursive partitioning is the one in which the system repeats the partitioning of the input until each partition of the build input fits into the memory. The recursive partitioning is needed when the value of n_{h} is greater than or equal to the number of memory blocks. It becomes difficult to split the relation in one pass since there can be insufficient buffer blocks. So, it's better to split the relation in repeated passes. In one pass, we can split the input as several partitions because there are sufficient blocks available to be used as output buffers. Each bucket build by the pass is read separately and further partitioned in the next pass so as to create smaller partitions. Also, the hash functions are different in different passes. So, it is better to use recursive partitioning for handling such cases. Overflows in Hash JoinThe overflow condition in hashtable occurs in any partition i of the build relation s due to the following cases: Case 1: When the hash index on s_{i} is greater than the main memory, the overflow condition occurs. Case 2: When there are multiple tuples in the build relation with the same values for the join attributes. Case 3: When the hash function does not hold randomness and uniformity characteristics. Case 4: When some of the partitions have more tuples than the average and others have fewer tuples, then such type of partitioning is known as skewed. Handling the OverflowsWe can handle such cases of hashtable overflows using various methods.
We can handle a small amount of skew by increasing the number of partitions with the use of the fudge factor. The fudge factor is a small value that increases the number of partitions. So, it will help to reduce the expected size of each partition, including their hash index less than the memory size. Unfortunately, the use of a fudge factor makes the user conservative on the size of the partitions. Thus, the chances of overflow are still possible. However, the use of the fudge factor is suitable for handling small overflows, but it is not sufficient for handling large overflows in the hashtable. As a result, we have two more methods for handling the overflows. 1. Overflow Resolution The overflow resolution method is applied during the build phase when a hash index overflow is detected. The overflow resolution works in the following way: It finds s_{i} for any partition i if having size larger than the memory size. It again partitions such build relation s_{i} into smaller partitions through a different hash function. Similarly, it partitions the probe relation r_{i} through the new hash function, and only those tuples are joined, which are having matching partitions. But, it is a less careful approach because this method waits for such conditions to occur, and then take the necessary actions to resolve the problem. 2. Overflow Avoidance The overflow avoidance method uses a careful approach while partitioning in order to avoid the occurrence of overflow in the build phase. The overflow avoidance works in the following way: It initially partitions the build relation s into several small partitions and then combines some of the partitions. These partitions are combined in such a way that each combined partition fits in the memory. Similarly, it partitions the probe relation r as the combined partitions on s. But, the size of r_{i} does not matter in this method. Both overflow resolution and overflow avoidance methods may fail on some partitions if a large number of tuples in s have the same value for the join attributes. In such a case, it is better to use block nestedloop join rather than applying the hash join technique for completing the join operation on those partitions. Cost Analysis of Hash JoinFor analyzing the cost of a hash join, we consider that no overflow occurs in the hash join. We will consider only two cases where: 1. Recursive partitioning is not needed We need to read and write relations r and s completely for partitioning them. For this, a total of 2(b _{r} + b _{s} ) block transfers are required. The term b _{r} and b _{s} are the number of blocks holding records of relations r and s. Both relations read each partition once for more b_{r} + b_{s} blocks transfers. However, the partitions might have occupied slightly more number of blocks than b_{r} + b_{s,} which results in partially filled blocks. To access such partially filled blocks can include the overhead of 2n_{h} approximately for each relation. Thus, a hash join cost estimates need: Number of block transfers = 3(b_{r} + b_{s}) + 4n_{h} Here, we can neglect the overhead value of 4n_{h} since it is much smaller than b_{r} + b_{s} value. Number of disk seeks = 2(Γb_{r}/b_{b}ꓶ + Γb_{s}/b_{b}ꓶ) + 2n_{h} Here, we have assumed that each input buffers are allocated with b_{b} blocks, and the build, as well as probe phase, needs only one seek for each n_{h} partition of the relation, as we can read each partition sequentially. 2. Recursive partition is required In this case, each pass reduces the size of each partition by M1 expected factor, and also passes are repeated until it makes the size of each partition as M blocks at most. Therefore, for partitioning the relation s, we need: Number of passes = Γlog_{M1}(b_{s})  1ꓶ The number of passes required in the partitioning of the build and probe relations is the same. As in each pass, each block of s is read and written out and needs a total of 2b_{s}Γlog_{M1}(b_{s})  1ꓶ block transfers for splitting relation s. Thus, a hash join cost estimates need: Number of block transfers = 2(b_{r} + b_{s})Γlog_{M1}(b_{s})  1ꓶ + b_{r} + b_{s} Number of disk seeks = 2(Γb_{r}/b_{b}ꓶ + Γb_{s}/b_{b}ꓶ)Γlog_{M1}(b_{s})  1ꓶ Here, we assume that for buffering each partition we allocate b_{b} blocks to them. Also, we have neglected a relatively small number of seeks during the build and probe phase. As a result, the hash join algorithm can be further improved if the size of the main memory increases or is large. Hybrid Hash JoinIt is a type of hash join that is useful for performing the join operations in which the memory size is relatively large. But still, the build relation does not fit in the memory completely. So, the hybrid hash join algorithm resolves the drawback of the hash join algorithm.
Next TopicMaterialization in Query Processing
