## Hash Join Algorithm

The hash join algorithm is used to perform natural join or equi-join operations. The idea is to partition the tuples of each relation into sets that have the same hash value on the join attributes, where the hash value is computed by a hash function. Because a tuple can only match tuples that fall into the same partition, hashing reduces the number of comparisons and makes the join more efficient.
## Hash Join Algorithm

//Partition s// for each tuple t

This is the hash join algorithm, which computes the natural join of two given relations r and s. The algorithm uses the following terms:
t_{r} ⋈ t_{s}: the concatenation of the attributes of tuples t_{r} and t_{s}, followed by projecting out the repeated attributes.
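The partition, build, and probe steps of a hash join can be sketched in Python as follows. This is a minimal in-memory illustration under stated assumptions, not the article's pseudocode: relations are lists of dicts, the names `hash_join`, `join_attrs`, and `n_h` are hypothetical, Python's built-in `hash` stands in for the partitioning hash function, and the partitions are held in memory rather than written to disk. It also assumes that the join attributes are the only attributes the two relations share.

```python
from collections import defaultdict


def hash_join(r, s, join_attrs, n_h=4):
    """Natural join of probe relation r and build relation s (lists of dicts)."""

    def key(t):
        # Values of the join attributes of tuple t.
        return tuple(t[a] for a in join_attrs)

    def h(t):
        # Partitioning hash function (assumption: built-in hash modulo n_h).
        return hash(key(t)) % n_h

    # Partition both relations with the same hash function.
    s_parts = defaultdict(list)
    r_parts = defaultdict(list)
    for t_s in s:
        s_parts[h(t_s)].append(t_s)
    for t_r in r:
        r_parts[h(t_r)].append(t_r)

    result = []
    for i in range(n_h):
        # Build phase: in-memory hash index on partition s_i.
        index = defaultdict(list)
        for t_s in s_parts[i]:
            index[key(t_s)].append(t_s)
        # Probe phase: look up each tuple of r_i in the index.
        for t_r in r_parts[i]:
            for t_s in index.get(key(t_r), []):
                # Merging the dicts keeps each shared attribute once,
                # which plays the role of projecting out repeated attributes.
                result.append({**t_r, **t_s})
    return result
```

Only tuples in matching partitions r_i and s_i are ever compared, which is where the saving in comparisons comes from.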
One benefit of the hash join algorithm is that the hash index on s_{i} is built in memory, so fetching the tuples does not require accessing the disk. It is best to use the smaller input relation as the build relation.

## Recursive Partitioning in Hash Join

Recursive partitioning means that the system repeats the partitioning of the input until each partition of the build input fits in memory. Recursive partitioning is needed when the value of n

## Overflows in Hash Join

A hash-table overflow occurs in a partition i of the build relation s due to the following cases:
## Handling the Overflows

Hash-table overflows can be handled using several methods.

**Using Fudge Factor**
A small amount of skew can be handled by increasing the number of partitions with the use of the fudge factor. Overflows can still occur even with this adjustment, so there are two more methods for handling them.
The overflow resolution method is applied during the build phase when a hash index overflow is detected. The overflow resolution works in the following way: It finds s
The overflow avoidance method partitions carefully so that overflow does not occur in the build phase. It works as follows: it initially partitions the build relation s into many small partitions and then combines some of them in such a way that each combined partition fits in memory. The probe relation r is partitioned in the same way as the combined partitions on s. But, the size of r

Both overflow resolution and overflow avoidance may fail on some partitions if a large number of tuples in s have the same value for the join attributes. In such a case, it is better to use block nested-loop join rather than hash join on those partitions.

## Cost Analysis of Hash Join

To analyze the cost of a hash join, we assume that no overflow occurs. We consider only two cases, depending on whether recursive partitioning is required:
When recursive partitioning is not required, relations r and s must each be read and written completely to partition them. This takes a total of 2(b_{r} + b_{s}) block transfers, where b_{r} and b_{s} are the number of blocks holding records of relations r and s. The build and probe phases then read each partition once, for a further b_{r} + b_{s} block transfers. However, the partitions may occupy slightly more than b_{r} + b_{s} blocks because of partially filled blocks. Accessing these partially filled blocks can add an overhead of approximately 2n_{h} for each relation. Thus, the cost estimate of a hash join is:

Number of block transfers = 3(b_{r} + b_{s}) + 4n_{h}

Here, we can neglect the overhead of 4n_{h}, which is usually small compared with b_{r} + b_{s}.

Number of disk seeks = 2(⌈b_{r}/b_{b}⌉ + ⌈b_{s}/b_{b}⌉) + 2n_{h}

Here, we have assumed that each input buffer is allocated b_{b} blocks.
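As a quick check of the arithmetic, the estimate for the case without recursive partitioning can be computed with a small helper. This is only a sketch: it assumes the standard textbook estimates of 3(b_{r} + b_{s}) + 4n_{h} block transfers and 2(⌈b_{r}/b_{b}⌉ + ⌈b_{s}/b_{b}⌉) + 2n_{h} seeks, the function name `hash_join_cost` is hypothetical, and the input values in the usage line are made up for illustration.

```python
from math import ceil


def hash_join_cost(b_r, b_s, n_h, b_b=1):
    """Estimated (block transfers, disk seeks) without recursive partitioning.

    b_r, b_s: blocks holding relations r and s
    n_h:      number of partitions
    b_b:      blocks allocated to each buffer (assumed parameter)
    """
    # Partitioning reads and writes both relations (2x); build and probe
    # read them once more (1x); 4*n_h covers partially filled blocks.
    transfers = 3 * (b_r + b_s) + 4 * n_h
    # Seeks while partitioning, plus one per partition of each relation
    # during the build and probe phases.
    seeks = 2 * (ceil(b_r / b_b) + ceil(b_s / b_b)) + 2 * n_h
    return transfers, seeks


# Illustrative values only: 100 and 400 blocks, 5 partitions, 3-block buffers.
print(hash_join_cost(100, 400, 5, 3))  # (1520, 346)
```

With these inputs the 4n_{h} term contributes only 20 of the 1520 transfers, which is why it is commonly dropped from the estimate.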
When recursive partitioning is required, each pass is expected to reduce the size of each partition by a factor of M-1, where M is the number of memory blocks available, and passes are repeated until each partition is at most M blocks in size. Therefore, partitioning the relation s requires:

Number of passes = ⌈log_{M-1}(b_{s}) - 1⌉

The number of passes required to partition the build and probe relations is the same. In each pass, every block of s is read and written out, which needs a total of 2b_{s}⌈log_{M-1}(b_{s}) - 1⌉ block transfers for partitioning s. Thus, the cost estimate of the hash join is:

Number of block transfers = 2(b_{r} + b_{s})⌈log_{M-1}(b_{s}) - 1⌉ + b_{r} + b_{s}

Number of disk seeks = 2(⌈b_{r}/b_{b}⌉ + ⌈b_{s}/b_{b}⌉)⌈log_{M-1}(b_{s}) - 1⌉

Here, we assume that b_{b} blocks are allocated for buffering each partition.

As a result, the hash join algorithm performs better when more main memory is available.

## Hybrid Hash Join

Hybrid hash join is a variant of hash join that is useful when the memory size is relatively large but the build relation still does not fit in memory completely. The hybrid hash join algorithm addresses this drawback of the hash join algorithm.