Friday, August 24, 2018

Grokking the Data Model

© DC Comics
In a previous blog entry, I alluded to tools and techniques for quickly getting into the sweet spot for solving data layer problems.

One Really Useful Skill is being able to quickly grok a data model. When in a rush (which is more often than not :), this does not constitute a formal logical data modeling exercise, but simply querying the data dictionary, data profiling, and intuiting the signal from the noise to build a mental map of the relevant data structures.

How to quickly grok a data model? Spend 10,000 hours of on-the-job practice and I suspect you'll be able to do this quite rapidly. Not really a shortcut, eh? Fair enough :). Let's dig into this a little deeper then and see if An Easier Way can be found.



What is it good for?

https://mashe.hawksey.info/2011/10/live-twitter-data-from-fote-fote11/
typical simple data model
(from MASHe The musing of Martin Hawksey, #Fote11 community)
Why care? Why go to the trouble to learn how to Grok a Data Model?

I'm a science fiction bookworm, so cannot resist the Heinlein reference here:

Grok /ˈɡrɒk/ is a word coined by American writer Robert A. Heinlein for his 1961 science fiction novel Stranger in a Strange Land. [...] The Oxford English Dictionary summarizes the meaning of grok as "to understand intuitively or by empathy, to establish rapport with" and "to empathize or communicate sympathetically (with); also, to experience enjoyment".

Don't you want to establish an empathic, enjoyable rapport with your data structures? If not, maybe you should stop reading now :).

The structure of data (i.e., the data model) is a fundamental complexity at the core of the development, maintenance and understanding of database systems. As modern database systems become larger and more complex, it is increasingly difficult to conceptualize a view of the significant structures within a database. This lack of understanding increases system development and maintenance cost, increases the complexity of system interfaces, and decreases the efficiency of development teams. In general, a comprehensive understanding of the data structures of large, complex database systems is unattainable for all but a few experts who have developed specialized knowledge in this area.

another simple data model
(from https://github.com/CMPUT301W15T09/Team9Project/wiki/UML)


Algorithms and methods that support programmatic analysis of database structures, identification of significant database structures, and organization of complex database systems into more comprehensible structural groups are key enablers in reducing complexity and improving manageability of many tasks in the lifecycle of database systems.


Where to start? (Answer: Functional Dependencies/Third Normal Form)


Instead of trying to boil the ocean, let's apply some necessary discipline and focus on a specific area.

Abstracting a “conceptual skeleton” out of low-level structural details provides a framework for comprehension of complex database systems. Apt analogies for doing without this are trying to grasp what is going on in a program if you only have access to its machine-level description, or trying to read a book a letter at a time.

Data structures in the guts of real-world systems I typically encounter (for good and bad reasons) are littered with redundancies that get in the way of understanding what they really are.

A proposed specific goal is to experiment with programmatically identifying single column functional dependencies in a database table. 

For you database-heads out there, I am really just talking about Third Normal Form: eliminating non-key columns that are dependent on other non-key columns within the same table (if you are not familiar with Third Normal Form, see the Wikipedia entry).

So, let's see if we can codify domain knowledge and human expertise into methods that can be programmatically applied and leveraged.

I'm not sure yet if this is simple, hard or even possible. Let's find out together. This is a journey with an unknown destination. I may get lost, never arrive or change destinations. The old adage: "What is important is the journey, not the destination" might apply here. We will inevitably learn something useful along the way.

Ok, let's get started...


Create a simple table and fill with sample data

create table lab_test_rslt (
   seq_no number,
   subj_no number,
   site_no number,
   site_name varchar2(35),
   test_code varchar2(35),
   test_val varchar2(35),
   test_rslt varchar2(35),
   primary key (seq_no)
);
 
insert into lab_test_rslt values (1,1,10,'Happy Clinic','HR',60,'OK');
insert into lab_test_rslt values (2,2,10,'Happy Clinic','HR',70,'OK');
insert into lab_test_rslt values (3,3,10,'Happy Clinic','HR',80,'OK');
insert into lab_test_rslt values (4,4,10,'Happy Clinic','HR',55,'OK');
insert into lab_test_rslt values (5,1,20,'Regional Med Ctr','HR',0,'DEATH');
insert into lab_test_rslt values (6,2,20,'Regional Med Ctr','HR',170,'HIGH');
insert into lab_test_rslt values (7,1,30,'Central Hospital','HR',65,'OK');
-- whoops! maybe a typo in site_name!
insert into lab_test_rslt values (8,3,30,'Centrel Hospital','HR',20,'LOW');
commit;
select * from lab_test_rslt order by 1;
 
SEQ_NO   SUBJ_NO   SITE_NO SITE_NAME          TEST_CODE   TEST_VAL   TEST_RSLT
    1         1        10 Happy Clinic       HR          60         OK
    2         2        10 Happy Clinic       HR          70         OK
    3         3        10 Happy Clinic       HR          80         OK
    4         4        10 Happy Clinic       HR          55         OK
    5         1        20 Regional Med Ctr   HR          0          DEATH
    6         2        20 Regional Med Ctr   HR          170        HIGH
    7         1        30 Central Hospital   HR          65         OK
    8         3        30 Centrel Hospital   HR          20         LOW
 
 8 rows selected. 

It is easy for a mere human to eyeball this example and quickly observe that site_name looks like it is dependent on site_no. That is a common pattern and will stand out to most practitioners. Let's see if we can write some code that will observe the same.


First attempt at a Functional Dependency algorithm; prototype in SQL

Pseudocode:

For all site_nos in the lab_test_rslt table:
 if the count of distinct site_names for every site_no is 1, then
  site_name is functionally dependent on site_no
 if the count of distinct site_names for any site_no is > 1, then
  site_name is NOT functionally dependent on site_no
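Before reaching for SQL, the pseudocode above can be sketched in a few lines of Python (a hypothetical cross-check helper, not part of the SQL that follows), using the eight sample (site_no, site_name) pairs:

```python
from collections import defaultdict

# the eight (site_no, site_name) pairs from the sample table
rows = [
    (10, 'Happy Clinic'), (10, 'Happy Clinic'),
    (10, 'Happy Clinic'), (10, 'Happy Clinic'),
    (20, 'Regional Med Ctr'), (20, 'Regional Med Ctr'),
    (30, 'Central Hospital'), (30, 'Centrel Hospital'),
]

def is_fd(pairs):
    """True if every first value maps to exactly one second value."""
    seen = defaultdict(set)
    for c1, c2 in pairs:
        seen[c1].add(c2)
    return all(len(v) == 1 for v in seen.values())

print(is_fd(rows))  # False -- the site 30 typo breaks the dependency
```

The "Centrel Hospital" typo at site_no 30 is exactly what trips the strict yes/no test.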

Ok, this sounds like a job for SQL "GROUP BY":

select
  site_no,
  count(distinct site_name)
from lab_test_rslt
group by site_no
order by 1;

  SITE_NO   COUNT(DISTINCTSITE_NAME)
       10                          1
       20                          1
       30                          2


So, is site_name functionally dependent on site_no?  In this example, site_no 30 has two different site_names, so you cannot derive the site_name from just knowing the site_no = 30.  So the answer is NO: site_name is not functionally dependent on site_no.

But, if there was no typo in "Centrel Hospital" there WOULD be a FD. Hmmmm... site_name is "usually" FD on site_no.

Can we improve upon this? Perhaps assign a score instead of just Yes/No?

-- for what fraction of distinct site_no values does site_no determine a single site_name?
select
    sum(case when count(distinct site_name) = 1 then 1 else 0 end)||' / ('||
    sum(case when count(distinct site_name) = 1 then 1 else 0 end)||' + '||
    sum(case when count(distinct site_name) > 1 then 1 else 0 end)||')'
    "fd_cnt / (fd_cnt + no_fd_cnt)",
    round(sum(case when count(distinct site_name) = 1 then 1 else 0 end) /
    (
        sum(case when count(distinct site_name) = 1 then 1 else 0 end) +
        sum(case when count(distinct site_name) > 1 then 1 else 0 end) 
    ), 2) fd_score
from lab_test_rslt
group by site_no;

fd_cnt / (fd_cnt + no_fd_cnt)     FD_SCORE
2 / (2 + 1)                           0.67


For 67% of distinct site_no values, site_no determines a single site_name. That is a little more useful, but it does not really reflect that only a single row breaks the FD.
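The same 2/3 arithmetic can be checked outside the database with a short Python sketch (the `pairs` list below just restates the sample data):

```python
from collections import defaultdict

# the (site_no, site_name) pairs from the sample table
pairs = [(10, 'Happy Clinic')] * 4 + [(20, 'Regional Med Ctr')] * 2 + \
        [(30, 'Central Hospital'), (30, 'Centrel Hospital')]

names = defaultdict(set)
for site_no, site_name in pairs:
    names[site_no].add(site_name)

# determinant values that map to exactly one site_name
fd = sum(1 for v in names.values() if len(v) == 1)
print(f"{fd} / {len(names)} = {fd / len(names):.2f}")  # 2 / 3 = 0.67
```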


Can we further improve by weighting the score by the number of rows? YES

-- for what fraction of rows does site_no determine a single site_name?
select
    sum(case when count(distinct site_name) = 1 then count(site_no) else 0 end)||' / ('||
    sum(case when count(distinct site_name) = 1 then count(site_no) else 0 end)||' + '||
    sum(case when count(distinct site_name) > 1 then count(site_no) else 0 end)||')'
    "fd_cnt / (fd_cnt + no_fd_cnt)",
    sum(case when count(distinct site_name) = 1 then count(site_no) else 0 end) /
    (
        sum(case when count(distinct site_name) = 1 then count(site_no) else 0 end) +
        sum(case when count(distinct site_name) > 1 then count(site_no) else 0 end) 
    ) fd_score
from lab_test_rslt
group by site_no;

fd_cnt / (fd_cnt + no_fd_cnt)     FD_SCORE
6 / (6 + 2)                           0.75

FD_SCORE represents the ratio of rows where C1->C2.

In plain English:
  • C2 is functionally dependent on C1
  • the value of column C2 depends on value of column C1
  • C2 can be derived if you know the value of C1

75% of the rows have site_names that are FD on site_no.  75% is getting fairly close to 100%, so suggests an "almost" FD. This is getting a little more useful.
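The row-weighted score is easy to reproduce in Python as well (a sketch, not the utility itself): group by the determinant, then count the rows in groups where only one dependent value appears.

```python
from collections import defaultdict

# the eight (site_no, site_name) pairs from the sample table
pairs = [(10, 'Happy Clinic')] * 4 + [(20, 'Regional Med Ctr')] * 2 + \
        [(30, 'Central Hospital'), (30, 'Centrel Hospital')]

def fd_score(pairs):
    """Fraction of rows whose c1 value maps to exactly one c2 value."""
    groups = defaultdict(list)
    for c1, c2 in pairs:
        groups[c1].append(c2)
    fd_rows = sum(len(vals) for vals in groups.values()
                  if len(set(vals)) == 1)
    return fd_rows / len(pairs)

print(fd_score(pairs))  # 6 of 8 rows -> 0.75
```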

Parking lot: An interesting possible use case for an FD analyzer: finding "almost patterns" that suggest data errors.(!!?!)



Is it possible to simplify this SQL? YES

-- break out the intermediate results so I can understand what is going on
select
    -- the number of rows where there is a unique site_name for this site_no (a FD!)
    case when count(distinct site_name) = 1 then count(site_no) else 0 end n1,
    -- the total number of rows for this site_no
    count(*) n2,
    -- the ratio of n1:n2
    case when count(distinct site_name) = 1 then count(site_no) else 0 end / 
        count(*) 
    fd_score
from lab_test_rslt
group by site_no;

  N1   N2   FD_SCORE
   0    2          0
   2    2          1
   4    4          1

-- now I can find the ratio of the sums, eg:
--    sum(N1) / sum(N2)
--    (0+2+4) / (2+2+4)
--    6 / 8
--    0.75

-- here is the final SQL for that:
select
    sum(case when count(distinct site_name) = 1 then count(site_no) else 0 end) /
    sum(count(*)) fd_score
from lab_test_rslt
group by site_no;

  FD_SCORE
     .75

Ok, so that does not seem too hard - it is almost a one-liner now!



Is this approach sufficient for non-trivial cases?

The most common scenario for an "FD analyzer" would be to point it at a new mystery table and get it to report out all functional dependencies. The FD analyzer would need to analyze all possible pairs of columns to look for functional dependencies. Analyzing each pair of columns would require scanning the entire set of values for each of the pair of columns. This would typically be a full tablescan in a traditional relational database for EACH PAIR OF COLUMNS, a full filescan for files, or two full column scans in a columnar database. Uggghh.

How many 2-column (order-specific) permutations exist for a table with n columns? Here is the formula and some worked examples:

n!
P(n,k)= ------
(n-k)!

n = number of columns in the table
k = number of columns to analyze at a time (two)

Ex: a table with 4 columns has 12 ordered pairs:

1 2
1 3
1 4
2 3
2 4
3 4
4 3
4 2
3 2
4 1
3 1
2 1

4!
P(4,2)= ------
(4-2)!

4!    24
------ = --- = 12
(4-2)!    2

Ex: 30 columns have 870 ordered pairs

n!
------
(n-k)!

30!
------ = 870
(30-2)!

Ex: 100 columns have 9,900 ordered pairs

100!
------ = 9900
(100-2)!
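The arithmetic is easy to sanity-check with a couple of lines of Python (`ordered_pairs` is just the P(n, k) formula above):

```python
from math import factorial

def ordered_pairs(n, k=2):
    """P(n, k) = n! / (n - k)!, the number of ordered k-column picks."""
    return factorial(n) // factorial(n - k)

print(ordered_pairs(4), ordered_pairs(30), ordered_pairs(100))  # 12 870 9900
```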

The current SQL approach requires scanning the data once for each candidate FD pair, i.e., n!/(n-2)! table scans. Therefore a table with 100 columns would require 9,900 scans. Seems a bit inefficient (understatement). There must be a better way.



Is it possible to calculate multiple FDs in a single pass? (within the same grouping) YES

select 'SITE_NO' c1, name c2, fd_score from (
    select
        sum(case when count(distinct site_name) = 1 then count(site_no) else 0 end) /
        sum(count(*)) site_name,
        sum(case when count(distinct test_rslt) = 1 then count(site_no) else 0 end) /
        sum(count(*)) test_rslt
    from lab_test_rslt
    group by site_no
)
unpivot include nulls (
    (fd_score)
    for name in (
        site_name,
        test_rslt
    )
);

C1        C2              FD_SCORE
SITE_NO   SITE_NAME           0.75
SITE_NO   test_rslt            0.5

I introduced the UNPIVOT clause (an Oracle-ism) to conveniently report the analysis of each column pair on a separate row.

This SQL approach requires one pass over the data for each determinant (grouping) column in the table (ex: the lab_test_rslt table has 7 columns, so this would be 7 table scans). Much better. Can we do more?



Is it possible to calculate multiple FDs in a single pass? (even in different groupings)? YES

with x as (
    select 
        case when count(distinct site_name) over (partition by site_no) = 1 
            then 1 else 0 
        end c1_cnt,
        case when count(distinct test_rslt) over (partition by site_no) = 1 
            then 1 else 0 
        end c2_cnt,
        case when count(distinct test_rslt) over (partition by site_name) = 1 
            then 1 else 0
        end c3_cnt
    from lab_test_rslt
), x2 as (
    select 
        'SITE_NO' c1_name_1,
        'SITE_NAME' c1_name_2,
        sum(c1_cnt) / count(*) c1_fd_score,
        'SITE_NO' c2_name_1,
        'test_rslt' c2_name_2,
        sum(c2_cnt) / count(*) c2_fd_score,
        'SITE_NAME' c3_name_1,
        'test_rslt' c3_name_2,
        sum(c3_cnt) / count(*) c3_fd_score
    from x
)
select c1, c2, fd_score
from x2
unpivot include nulls (
    (c1, c2, fd_score)
    for name in (
        (c1_name_1, c1_name_2, c1_fd_score),
        (c2_name_1, c2_name_2, c2_fd_score),
        (c3_name_1, c3_name_2, c3_fd_score)
    )
);

C1          C2              FD_SCORE
SITE_NO     SITE_NAME           0.75
SITE_NO     test_rslt            0.5
SITE_NAME   test_rslt           0.75

I replaced the GROUP BY with a series of COUNT DISTINCT window functions in the first WITH clause. This allows for the equivalent of multiple GROUP BYs within a single SQL statement. Then I refactored the math into a second WITH clause, which was required anyway to avoid:
ORA-30483: window functions are not allowed here. Window functions are allowed only in the SELECT list of a query. And, window function cannot be an argument to another window or group function.
This SQL approach will require scanning the lab_test_rslt table ONCE for all FD pairs in the table. Therefore a table with 100 columns will require A SINGLE PASS. 9,900 vs. 1  = MUCH BETTER!
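For intuition on the single-pass idea outside SQL, here is a rough Python sketch (hypothetical names, not the utility itself) that accumulates distinct-value sets for every ordered column pair while reading the rows once, reproducing the three scores above:

```python
from collections import defaultdict
from itertools import permutations

# three columns of the sample data as dicts (one dict per row)
rows = [
    {'site_no': 10, 'site_name': 'Happy Clinic',     'test_rslt': 'OK'},
    {'site_no': 10, 'site_name': 'Happy Clinic',     'test_rslt': 'OK'},
    {'site_no': 10, 'site_name': 'Happy Clinic',     'test_rslt': 'OK'},
    {'site_no': 10, 'site_name': 'Happy Clinic',     'test_rslt': 'OK'},
    {'site_no': 20, 'site_name': 'Regional Med Ctr', 'test_rslt': 'DEATH'},
    {'site_no': 20, 'site_name': 'Regional Med Ctr', 'test_rslt': 'HIGH'},
    {'site_no': 30, 'site_name': 'Central Hospital', 'test_rslt': 'OK'},
    {'site_no': 30, 'site_name': 'Centrel Hospital', 'test_rslt': 'LOW'},
]

def all_fd_scores(rows):
    """Score every ordered column pair in a single pass over the rows."""
    cols = list(rows[0])
    seen = {p: defaultdict(set) for p in permutations(cols, 2)}
    counts = {p: defaultdict(int) for p in permutations(cols, 2)}
    for row in rows:                           # ONE pass over the data
        for c1, c2 in permutations(cols, 2):
            seen[(c1, c2)][row[c1]].add(row[c2])
            counts[(c1, c2)][row[c1]] += 1
    # fd_score = rows in groups where c1 determines a single c2, / total rows
    return {
        p: sum(n for v, n in counts[p].items() if len(seen[p][v]) == 1) / len(rows)
        for p in seen
    }

scores = all_fd_scores(rows)
print(scores[('site_no', 'site_name')])    # 0.75
print(scores[('site_no', 'test_rslt')])    # 0.5
print(scores[('site_name', 'test_rslt')])  # 0.75
```

Note the memory trade-off: this keeps a distinct-value set per pair per determinant value, which is essentially what the window functions are doing under the hood.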





Next steps (to be addressed in a future post)


We have now prototyped what appears to be a reasonably efficient approach to programmatically finding single-column functional dependencies within a database table. We have developed an Oracle SQL implementation and run it through some basic tests; it appears to work and is possibly a kernel for building something useful.

The next step is to implement FD Analyzer (pretty uncreative name - I need to come up with something better!): a utility that will accept as input a table name and return a list of one or more single-column functional dependencies.


Using the FD Analyzer utility, mere humans will then be spared part of the manual effort and complexity of querying the data dictionary, data profiling, and intuiting the signal from the noise to build a mental map of the relevant data structures. Using this utility, one should be able to do many wondrous things:
  • Programmatically identify Third Normal Form violations in a physical data model to aid in understanding/producing a logical model.
  • For quick data model grokking, functional dependents, once identified, can be ignored as derivative and not conceptually significant.
  • As we have now automated what was previously a manual, human task, this, along with other similar approaches could be scaled up to analyze entire databases and tied into automated data curation and dictionary facilities.
But... this is just a lot of hot air until implemented, right? Stay tuned.
