Friday, March 05, 2010

Storing trees in databases

A tree is a graph which is connected, uni-directed and acyclic. Lets look at different options of storing such trees in databases.

Parent - Child Model

The most common model for storing hierarchical information is storing the reference of the parent node along with the child node.

so for a tree


For this model, the table to be created would be:

create table employee_tree(employee varchar(10) not null primary key, boss varchar(10), salary decimal(10,2) not null);

Some sample operations :
  • Finding the root node : select * from employee_tree where boss is NULL
  • Finding all leaf nodes : select e1.* from employee_tree as e1 left join employee_tree as e2 on e1.employee=e2.boss where e2.employee is null
  • Retrieving a single path : select e1.employee as lev1, e2.employee as lev2, e3.employee as lev3 from employee_tree as e1 left join employee_tree as e2 on e2.boss = e1.employee left join employee_tree as e3 on e3.boss = e2.employee where e1.employee = 'A' and e3.employee='F';

Some problems with this model:
  • Update : if you run "update employee_tree set employee = 'C1' where employee='C'", then all nodes below "C" are left with boss who does notexist. To handle this, you should run a transaction with two queries, one being the above query and another being the query "update employee_tree set boss = 'C1' where employee = 'C'"
  • Insert : if you run a query "Insert into employee_tree (employee, boss) values ('G','H'),('H','G')" you get two employees 'G' & 'H' who are both employees and bosses of each other.
  • Insert : Another problem with insert is "Insert into employee_tree (employee, boss) values ('M','M')". Now 'M' is the boss of 'M'.
  • Delete : Run "Delete from employee_tree where emp = 'C'". Again we create a situation where 'D','E','F' are left with a boss who does not exist. So in effect, deletion would require the tree to be traversed.
  • Descendants : It is difficult to get the number of descendents for a given node in a single query. You will have to write some script and un some recursive function to get that number

Fixing this model: Following checks should be in place to fix this model.
  • An employee cannot be his own boss : Check if ( boss != employee ) or (boss = 0 and employee = 0)
  • On deletion use cascade to remove the employees to that boss
  • To prevent cyclic relations : add unique key (employee, boss)
  • Check for validity of tree (edges = nodes - 1) : count(*) from employee_tree - 1 = count(boss) from employee_tree where boss is not null
  • Check for only single root : select count(*) from employee_tree where boss_id is null = 1

Path Enumeration model

create table employee(emp_id char(1) not null primary key, emp_name varchar(10), emp_path varchar(255) not null);
AA NameA
The sample operations are easy here
  • Finding root node : select * from employee where length(emp_path)=1;
  • Finding all leaf nodes : This is going to be difficult, You will have to write a script to extract this info
  • Prevent cyclic relations : check whether after removing the node from the path string, the path is reduced by only 1 node. Query : CHECK(NOT EXISTS(select * from employee as e1, employee as e2 where length(replace(e1.emp_path,e2.emp_id,'')) < (length(e1.emp_path)-length(e2.emp_id)) ));
  • No path can be longer than the total no of nodes in the tree : CHECK( (select max(length(e1.emp_path)) from employee as e1) <= (select coun t(e2.emp_id) * length(e2.emp_id) from employee as e2) )
  • Depth of the tree is the path divided by the length of emp_id string : CHAR_LENGTH(emp_path) / CHAR_LENGTH(emp_id); If emp_ids are not fixed length then the depth is : CHAR_LENGTH(emp_path) - CHAR_LENGTH(REPLACE(emp_path,'/','')) + 1 [here / is the separator that is path is written as 'A/B/C'. If you use any other separator, you need to change it in the function as well]
  • To find all subtrees for a parent : the immediate solution is "Select * from employee where emp_path like concat('%',<Parent_emp_id>,'%')". But this query does not use indexes. Another way of doing this is "select * from employee where emp_path like concat((select emp_path from employee where emp_id='<Parent_emp_id>'),'%');"
  • To find the immediate subchildren of a parent : "select * from employee where emp_path like concat((select emp_path from employee where emp_id='<Parent_emp_id>'),'_');"
  • To search for superiors : "create table sequence(seq int not null); insert into sequence values (1),(2),(3),(4),(5); select substring(e1.emp_path from (seq * char_length(e1.emp_id)) for char_length(e1.emp_id)) as emp_id from employee e1, sequence s1 where e1.emp_id = '<emp_id_to_search_for>' and s1.seq <= char_length(e1.emp_path)/char_length(e1.emp_id);"
  • To find the relationships among superiors : "select e2.* from employee e1, employee e2 where e1.emp_id = '<emp_id_to_search_for>' and POSITION(e2.emp_path in e1.emp_path) = 1;"
  • To delete a subtree : "DELETE from employee where emp_path like concat(select emp_path from employee where emp_id='<Dead_node_emp_id>','%')" We may need to write the two queries separately;
  • To delete a node : "Delete from employee where emp_id='<emp_id_of_dead_node>'" And "Update employee set emp_path = replace(emp_path,'<emp_id_of_dead_node>','')" Both queries should be in a single transaction.
  • Inserting a node : "select emp_path from employee where emp_id = '<emp_id_of_boss>'; insert into employee values ('<new_emp_id>','<new_emp_name>',concat('<selected_emp_boss_path>','<new_emp_id>'))"

A derivative of the path enumeration model is the edge enumeration model. In Edge enumeration model the path consists of a set of integers that make the path from the root to the node - from left to right.

For example :
Employee nameEdge path
The benefit of the edge enumeration model over the path enumeration model is that you do not have to worry about long strings. The numbers give an implied ordering to siblings that have meaning.

Nested set model

Here each node is assigned a beginning and an ending node hierarchy number based on the total amount of data in the hierarchical tree.

An example table could be :

create table emp(emp_id int not null auto_increment primary key, emp_name varchar(10), lft int not null, rgt int not null);
Emp idEmp NameLftRgt
Operations in this model:
  • Finding the root node : "select * from emp where lft = 1"
  • Finding leaf nodes : "select * from emp where lft = rgt - 1"
  • Finding subtree : select node.* from emp as node, emp as parent where node.lft > parent.lft and node.lft < parent.rgt and parent.emp_id='&l t;parent_employee_id>';"
  • Retrieving a single path : "select parent.* from emp as node, emp as parent where node.lft > parent.lft and node.lft < parent.rgt and node. emp_id='<leaf_node_id>' order by parent.lft;"
  • Finding depth of nodes : "select node.emp_name, (COUNT(parent.emp_name)-1) as depth from emp as node, emp as parent where node.lft BETWEEN parent.lft and parent.rgt group by node.emp_name order by node.lft;"
  • Add a single node as child of node with no existing children : If you want to insert a node 'G' below 'B', then the new node 'G' would have left and right values '3' and '4' and all nodes after it (in sequence) would have to increase their left and right values by 2. The procedure
    for it would be
    lock table emp write;
    select @myleft := lft from emp where emp_name='<add_below_this_node>';
    update emp set rgt = rgt+2 where rgt > @myleft;
    update emp set lft = lft+2 where lft > @myleft;
    insert into emp values ('','G',&myleft+1,@myleft+2);
    unlock tables;
  • As a single node after a node(not as a new child) : modify the earlier procedure a little.
    lock table emp write;
    select @myright := rgt from emp where emp_name='<add_after_this_node>';
    update emp set rgt = rgt+2 where rgt > @myright;
    update emp set lft = lft+2 where lft > @myright;
    insert into emp values ('','G',&myright+1,@myright+2);
    unlock tables;
  • Delete a node and its children : Deletion is just opposite to a adding. After deletion, its width has to be deleted from all nodes to the right.
    lock table emp write;
    select @myleft := lft, @myright := rgt, @mywidth := rgt - lft + 1 from emp where emp_name='<node_to_be_deleted>';
    delete from emp where lft between @myleft and @myright;
    update emp set rgt = rgt - @mywidth where rgt > @myright;
    update emp set lft = lft - @mywidth where lft > @myright;
    unlock tables;
  • Delete only the parent node: In this case the child nodes will have to be moved up to the level of the parent.
    lock table emp write;
    select @myleft := lft, @myright := rgt, @mywidth := rgt - lft + 1 from emp where emp_name='<parent_to_be_deleted>';
    delete from emp where lft = @myleft;
    update emp set rgt = rgt - 1, lft = lft - 1 where lft between @myleft and @myright;
    update emp set rgt = rgt - 2 where rgt > @myright;
    update emp set lft = lft - 2 where lft > @myleft;
    unlock tables;
Hybrid model

A parent child with nested set model would server the purpose much better. (node, parent, left, right). But with hybrid models, inserts would be more expensive as compared to selects. So it is important to evaluate the pros and cons and figure out what would be best for the purpose tha t your application serves.

Nested intervals model.

This model is similar to the nested set model with the use of rational numbers (a/b) instead of numbers. So in our above tree the left and right values would change.
Emp idEmp NameLftRgt

This is a bit complex to implement, but plays well with all the queries.

Proprietary solutions

Oracle, DB2 and Microsoft SQL Server support heirarchial data. In opensource Postgresql 8.4 supports the addition and querying of heirarchial data basically by using a "CONNECT BY" clause. It would be better if their official documentation is referred to for info regarding the syntax for implementing heirarchial data in relational databases.