Merge Join Algorithm

The merge join algorithm, also known as the sort-merge join algorithm, is used to compute natural joins and equi-joins of two given relations r and s.

The merge join algorithm is given below:

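pr = address of first tuple of relation r;
ps = address of first tuple of relation s;
while (ps != null && pr != null) do begin
	// Collect in S_s all tuples of s that share the current join value
	t_s = tuple to which ps points;
	S_s = {t_s};
	set ps to point to the next tuple of relation s;
	done = false;
	while (!done && ps != null) do begin
		t_s' = tuple to which ps points;
		if (t_s'[JoinAttrs] = t_s[JoinAttrs])
		begin
			S_s = S_s ∪ {t_s'};
			set ps to point to the next tuple of relation s;
		end
		else
			done = true;
	end
	// Skip tuples of r whose join value is smaller than that of S_s
	t_r = tuple to which pr points;
	while (pr != null && t_r[JoinAttrs] < t_s[JoinAttrs]) do begin
		set pr to point to the next tuple of relation r;
		t_r = tuple to which pr points;
	end
	// Join every matching tuple of r with every tuple in S_s
	while (pr != null && t_r[JoinAttrs] = t_s[JoinAttrs]) do begin
		for each t_s in S_s do begin
			add t_s ⋈ t_r to result;
		end
		set pr to point to the next tuple of relation r;
		t_r = tuple to which pr points;
	end
end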

The algorithm uses the following terms:

JoinAttrs: The attributes in r ∩ s.

r ∩ s: The attributes that are common to relations r and s.

t_s ⋈ t_r: The concatenation of the attributes of tuples t_s and t_r, followed by projecting out the repeated attributes.

t_s and t_r: Tuples of s and r, respectively, that have the same values on JoinAttrs.

S_s: The set of tuples of relation s that have the same values on the join attributes.

The merge join algorithm associates a pointer with each relation. Initially, each pointer points to the first tuple of its relation and moves forward as the algorithm proceeds. The algorithm requires that each set of tuples S_s fit in main memory, even if relation s as a whole is large. If, for some join attribute values, S_s is larger than the available memory, a block nested-loop join can be performed for those sets. If either of the input relations r and s is not sorted on the join attributes, it must be sorted before the merge join algorithm is applied.
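The pointer-based procedure described above can be sketched in Python. This is only a minimal in-memory illustration, not a definitive implementation: it assumes that tuples are represented as dictionaries, that both inputs are already sorted on the join attributes, and that list indices stand in for the pointers pr and ps. The names merge_join, Row and join_attrs are hypothetical and chosen for this example.

```python
from typing import Any, Dict, List, Sequence

Row = Dict[str, Any]  # assumption: a tuple is a dict of attribute -> value


def merge_join(r: Sequence[Row], s: Sequence[Row],
               join_attrs: Sequence[str]) -> List[Row]:
    """Natural join of r and s; both must already be sorted on join_attrs."""

    def key(t: Row) -> tuple:
        return tuple(t[a] for a in join_attrs)

    result: List[Row] = []
    pr, ps = 0, 0  # list indices play the role of the pointers pr and ps
    while pr < len(r) and ps < len(s):
        # Build S_s: the run of s-tuples sharing the current join value.
        group_key = key(s[ps])
        group: List[Row] = []
        while ps < len(s) and key(s[ps]) == group_key:
            group.append(s[ps])
            ps += 1
        # Skip r-tuples whose join value is smaller than that of S_s.
        while pr < len(r) and key(r[pr]) < group_key:
            pr += 1
        # Join every matching r-tuple with every tuple in S_s.
        while pr < len(r) and key(r[pr]) == group_key:
            for t_s in group:
                # Merging the dicts concatenates the attributes and keeps
                # a single copy of the repeated join attributes.
                result.append({**t_s, **r[pr]})
            pr += 1
    return result


# Hypothetical sample data, sorted on "id".
r_rows = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"},
          {"id": 2, "a": "z"}, {"id": 4, "a": "w"}]
s_rows = [{"id": 2, "b": 10}, {"id": 2, "b": 20},
          {"id": 3, "b": 30}, {"id": 4, "b": 40}]
joined = merge_join(r_rows, s_rows, ["id"])
```

If an input is not sorted, it could first be sorted with, for example, sorted(rows, key=lambda t: t["id"]) before calling the function.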

Cost Analysis of Merge Join Algorithm

If the relations are sorted, tuples having the same value on the join attributes are placed consecutively. Each tuple therefore needs to be read only once, and so each block is also read only once. Thus,

Number of block transfers = b_r + b_s

That is, the number of block transfers equals the total number of blocks in both files.

Number of disk seeks = ⌈b_r/b_b⌉ + ⌈b_s/b_b⌉

Here, b_r and b_s are the numbers of blocks of the given relations r and s, and b_b is the number of buffer blocks allocated to each relation. Because disk seeks are much more expensive than data transfer, it makes sense to allocate multiple buffer blocks to each relation, provided that extra memory is available.
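The effect of b_b on the number of seeks can be checked with a small calculation. The following sketch simply evaluates the two formulas above; the function name merge_join_cost and the block counts used in the example (100 and 400 blocks) are hypothetical values chosen for illustration, and both relations are assumed to be already sorted.

```python
def merge_join_cost(b_r: int, b_s: int, b_b: int) -> tuple:
    """Return (block transfers, disk seeks) for a merge join on sorted inputs."""
    transfers = b_r + b_s
    # -(-x // y) is integer ceiling division, i.e. the ceiling of x / y.
    seeks = -(-b_r // b_b) + -(-b_s // b_b)
    return transfers, seeks


# Hypothetical sizes: r occupies 100 blocks and s occupies 400 blocks.
one_buffer_block = merge_join_cost(100, 400, 1)     # (500, 500)
fifty_buffer_blocks = merge_join_cost(100, 400, 50)  # (500, 10)
```

With a single buffer block per relation, every block read costs a seek; with 50 buffer blocks per relation, the number of transfers is unchanged but the number of seeks drops from 500 to 10.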

Hybrid Merge Join Algorithm

The hybrid merge join is a variation of the merge join. As we saw, the merge join requires both relations to be sorted on the join attributes before it is applied. However, if secondary indices exist on the join attributes of both relations, a variation of the merge join can be performed on unsorted tuples as well. In that case, the algorithm scans the records through the indices, which retrieves them in sorted order. This variation, however, has a significant drawback:

The records may be scattered across several file blocks. Accessing each tuple may therefore require accessing a separate disk block, which is costly.

To avoid such expensive accesses, we use a technique known as the 'Hybrid Merge Join'. The hybrid merge join operation combines indices with merge join.

To understand the hybrid merge join operation, let's take an example:

Consider two relations, one of which is sorted and the other unsorted, where the unsorted relation has a secondary B⁺-tree index on the join attributes.

In such a case, the hybrid merge join first merges the sorted relation with the leaf entries of the secondary B⁺-tree index. The resulting file contains tuples from the sorted relation and addresses of tuples of the unsorted relation. The result file is then sorted on the addresses of the unsorted relation's tuples, so that the corresponding tuples can be retrieved efficiently in physical storage order to complete the join. This method is known as the hybrid merge join.
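The three steps of this example (merge with the index leaves, sort on addresses, fetch in address order) can be sketched in Python. This is a simplified in-memory model under stated assumptions, not a definitive implementation: the unsorted relation is a list whose positions stand in for physical addresses, the B⁺-tree leaf level is modelled as a sorted list of (key, address) pairs, and the join is on a single attribute. The names hybrid_merge_join, s_file and s_index_leaves are hypothetical.

```python
from typing import Any, Dict, List, Sequence, Tuple

Row = Dict[str, Any]  # assumption: a tuple is a dict of attribute -> value


def hybrid_merge_join(sorted_r: Sequence[Row], s_file: Sequence[Row],
                      s_index_leaves: Sequence[Tuple[Any, int]],
                      attr: str) -> List[Row]:
    # Step 1: merge the sorted relation with the (key, address) leaf entries.
    intermediate: List[Tuple[Row, int]] = []
    pr, pi = 0, 0
    while pr < len(sorted_r) and pi < len(s_index_leaves):
        k = s_index_leaves[pi][0]
        addrs: List[int] = []
        while pi < len(s_index_leaves) and s_index_leaves[pi][0] == k:
            addrs.append(s_index_leaves[pi][1])
            pi += 1
        while pr < len(sorted_r) and sorted_r[pr][attr] < k:
            pr += 1
        while pr < len(sorted_r) and sorted_r[pr][attr] == k:
            for addr in addrs:
                intermediate.append((sorted_r[pr], addr))
            pr += 1
    # Step 2: sort the intermediate result on the addresses of s-tuples.
    intermediate.sort(key=lambda pair: pair[1])
    # Step 3: fetch the s-tuples in address order and complete the join.
    return [{**s_file[addr], **t_r} for t_r, addr in intermediate]


# Hypothetical data: s_file is unsorted; list positions act as addresses.
s_file = [{"id": 4, "b": 40}, {"id": 2, "b": 20},
          {"id": 3, "b": 30}, {"id": 2, "b": 21}]
s_index_leaves = sorted((row["id"], addr) for addr, row in enumerate(s_file))
sorted_r = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}, {"id": 4, "a": "w"}]
joined = hybrid_merge_join(sorted_r, s_file, s_index_leaves, "id")
```

Because the intermediate result is sorted on addresses before any tuple of the unsorted relation is fetched, the output comes out in the storage order of that relation rather than in join-attribute order.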
