Double-Pipelined Join Algorithm

In the previous section, we looked at pipelining and at how a system builds and runs a pipeline to evaluate multiple operations, using either demand-driven or producer-driven pipelining.

Here, we will discuss an evaluation algorithm for implementing pipelining.

A system uses several operations to access its data. Some of them are inherently blocking operations, while others are not. A blocking operation is one that cannot output any result until all of its input tuples have been examined.

For example, hash join is a blocking operation: before it can output any result, both of its inputs must be fetched entirely and partitioned. On the other hand, the indexed nested-loop join can output result tuples as soon as it receives tuples for the outer relation. It is therefore pipelined on its outer relation and blocking on its indexed input, because the index must be built completely before the indexed nested-loop join starts executing.
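The difference between the blocking input and the pipelined input of an indexed nested-loop join can be sketched in Python. This is only an illustration under stated assumptions: rows are tuples whose first field is the join key, and the names build_index and indexed_nested_loop_join are hypothetical, not taken from any particular system.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Tuple[int, str]  # assumed row layout: (join key, payload)


def build_index(inner: Iterable[Row]) -> Dict[int, List[Row]]:
    """Blocking step: the whole inner input is consumed before this returns."""
    index: Dict[int, List[Row]] = defaultdict(list)
    for row in inner:
        index[row[0]].append(row)
    return index


def indexed_nested_loop_join(
    outer: Iterable[Row], index: Dict[int, List[Row]]
) -> Iterator[Tuple[Row, Row]]:
    """Pipelined on the outer input: matches are yielded per outer row."""
    for outer_row in outer:
        # .get() avoids creating empty index entries for unmatched keys.
        for inner_row in index.get(outer_row[0], []):
            yield (outer_row, inner_row)
```

Because indexed_nested_loop_join is a generator, the first result is available after only the first matching outer row has been read, whereas build_index must finish reading the inner input before any probing can start.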

In some cases, however, we want to perform a join on two inputs and pipeline both of them into the join, but the inputs are not already sorted (so a merge join, which can pipeline both inputs when they are sorted on the join attribute, is not an option). For such cases, we use an alternative approach known as the double-pipelined join technique. It implements the pipeline using an evaluation algorithm known as the double-pipelined join algorithm.

Double-pipelined Join Algorithm

Below we have described the double-pipelined join algorithm:

done_r = false;
done_s = false;
r = Ø;
s = Ø;
result = Ø;
while !done_r or !done_s do
begin
    if queue is empty, wait until it is not empty;
    t = top entry in the queue;
    if t = End_r then
        done_r = true
    else if t = End_s then
        done_s = true
    else if t is from input r then
        begin
            r = r ∪ {t};
            result = result ∪ ({t} ⋈ s);
        end
    else /* t is from input s */
        begin
            s = s ∪ {t};
            result = result ∪ (r ⋈ {t});
        end
end

The algorithm above operates on two input relations, r and s. It assumes that the input tuples of both relations are pipelined. The tuples arriving for r and for s are placed in a single queue for processing. In the algorithm, End_r and End_s are special queue entries that serve as end-of-file markers. They are inserted into the queue only after all the tuples of relation r and relation s, respectively, have been generated. As tuples are added to r and s, appropriate indices should be maintained on both relations. Keeping these indices up to date allows each arriving tuple to be joined efficiently with the tuples already received from the other relation.
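The queue-driven loop and the index maintenance described above can be sketched in Python as follows. This is a minimal sketch, not a definitive implementation, and it rests on several assumptions: the join is an equi-join on the first field of each row, queue entries are (source, row) pairs with source equal to "r" or "s", the indices are plain hash tables, and END_R and END_S are hypothetical sentinel objects standing in for End_r and End_s.

```python
import queue
from collections import defaultdict
from typing import Iterator, Tuple

END_R = object()  # assumed stand-in for the End_r marker
END_S = object()  # assumed stand-in for the End_s marker


def double_pipelined_join(q: "queue.Queue") -> Iterator[Tuple[tuple, tuple]]:
    """Yield (r_row, s_row) pairs as soon as both halves have arrived."""
    r_index = defaultdict(list)  # hash index on r, built incrementally
    s_index = defaultdict(list)  # hash index on s, built incrementally
    done_r = done_s = False
    while not (done_r and done_s):
        entry = q.get()  # blocks while the queue is empty
        if entry is END_R:
            done_r = True
        elif entry is END_S:
            done_s = True
        else:
            source, row = entry
            key = row[0]  # assumption: the join key is the first field
            if source == "r":
                r_index[key].append(row)  # r = r ∪ {t}
                for s_row in s_index.get(key, []):  # {t} ⋈ s
                    yield (row, s_row)
            else:
                s_index[key].append(row)  # s = s ∪ {t}
                for r_row in r_index.get(key, []):  # r ⋈ {t}
                    yield (r_row, row)
```

In a real pipeline, the producers for r and s would typically run concurrently and call q.put(...) as tuples become available; the sketch only shows the consumer side.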

The algorithm also assumes that both inputs fit in memory. However, the double-pipelined join technique can be extended to the case in which the two inputs together are larger than the available memory. The join proceeds as usual until the available memory is full. At that point, the tuples of r and s that have arrived so far are treated as belonging to partitions r₀ and s₀, respectively. Tuples that arrive subsequently for r and s are assigned to partitions r₁ and s₁.

Tuples assigned to r₁ and s₁ are not added to the in-memory index; they are written to disk instead. Before being written to disk, however, r₁ tuples are used to probe s₀ and s₁ tuples are used to probe r₀. As a result, the join of r₁ with s₀ and the join of s₁ with r₀ are also carried out in a pipelined fashion. After both r and s have been processed completely, the r₁ tuples must still be joined with the s₁ tuples to complete the join operation. Any join technique can be used to join partition r₁ with partition s₁.
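The overflow handling can be sketched in the same style. The following is a simplified illustration under explicit assumptions: memory_limit counts tuples rather than bytes, the r₁ and s₁ partitions are ordinary Python lists standing in for disk files, the input is a finite iterable of (source, row) pairs whose end plays the role of both end markers, and the final r₁ with s₁ step uses a simple hash join as one possible choice. The function name is hypothetical.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def overflow_double_pipelined_join(
    entries: Iterable[Tuple[str, tuple]], memory_limit: int
) -> List[Tuple[tuple, tuple]]:
    r0, s0 = defaultdict(list), defaultdict(list)  # indexed in-memory partitions
    r1, s1 = [], []  # stand-ins for the on-disk partitions
    in_memory = 0
    result = []
    for source, row in entries:
        key = row[0]  # assumption: the join key is the first field
        if source == "r":
            mine0, other0, spill = r0, s0, r1
        else:
            mine0, other0, spill = s0, r0, s1
        # Every arriving tuple probes the in-memory partition of the other input.
        for match in other0.get(key, []):
            result.append((row, match) if source == "r" else (match, row))
        if in_memory < memory_limit:
            mine0[key].append(row)  # still room: tuple joins r0 or s0
            in_memory += 1
        else:
            spill.append(row)  # memory full: tuple goes to r1 or s1, unindexed
    # Final phase: join r1 with s1 (here with a simple hash join).
    s1_index = defaultdict(list)
    for row in s1:
        s1_index[row[0]].append(row)
    for row in r1:
        for match in s1_index.get(row[0], []):
            result.append((row, match))
    return result
```

Once memory is full, r₀ and s₀ no longer change, which is why probing them at arrival time is enough to cover the r₁ with s₀ and s₁ with r₀ combinations; only r₁ with s₁ is left for the final phase.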