Monday, October 29, 2018

Now Hiring! Looking for an experienced DB Code Librarian


We are currently looking to hire a Database Code Librarian. Join our seasoned team of data engineering professionals to take your career to the next level. As a Database Code Librarian, you will face complex and up-to-date challenges by helping our team address our database code management needs.

Job Type: Full-Time, all nights, weekends and holidays (24x7x365), location flexible.

This is a wonderful opportunity for the right candidate! Honestly, it's a pretty boring, repetitive, detail-oriented job. Nevertheless, either we make it someone's full-time job or we continue to deal with the pain and limitations of doing without, and that is becoming simply insufferable.

Job Opportunity and Challenges

Our team has the same challenges dealing with our database code that are found throughout the industry (although perhaps at a larger than average scale), so there should be candidates out there who have relevant experience.

Our team manages about 75 databases, each having a lot of complex DDL -- over 5 million objects each. I'm not talking about five million rows of data (tiny data!), but over five million OBJECTS: tables, views, packages, etc. (large structure!).

Database source code for a complete build of our systems does not exist.  Database changes are coded with incremental scripts for each release. Database scripts for each release depend on the previous release database. There are no baseline database scripts to use to "build from scratch".

We are using shared database servers for development work. Like many conveniences in software development, a shared database is a tar pit waiting to fossilize a project: developers overwrite each other's changes, changes made on the database break the code in the web app, etc. In short, there are a lot of opportunities to accelerate from our current snail's pace.

We are currently stuck with an arcane, time-intensive manual process that leaves much to be desired and offers very little by way of consistency or reproducibility.

Job Duties

The DB Code Librarian's duties will be to keep a complete set of database source code up-to-date and in-sync with what is deployed to our 75 databases, along with revision history as data structures evolve. The Database Source Code needs to be accurate, complete and up-to-date so that it can be used to recreate an entire database system from scratch for standing up new environments. This code will be stored in our Source Code Management System and available to the team for browsing, and using as a baseline for new development.

The Database Code Librarian will also deploy the latest active sets of database source code to build servers to test the DB build scripts. This ideally needs to be done continuously, but maybe once nightly is sufficient.

Additional duties include providing ad-hoc reports to the rest of the team showing database source code differences for a given database between two different points in time and creating demo, training or developer sandbox databases that are exact clones of production systems (except they are customized to have no data, no partitions, sometimes specific subsystems only, other tweaks).

Experience:

10-15 years of demonstrable relevant experience.

M.S. in Data Science with a specialization in Structural Metadata Management preferred.

Encyclopedic knowledge required of Oracle data structure definition, including not only tables, views, packages and triggers, but audits, contexts, VPD policies, directories, grants, roles, functions, indexes, partitions, procedures, comments, materialized views, database links, synonyms, sequences, profiles, jobs, types, identity columns, proxies, tablespaces, rollback segments, users and schemas.

Willingness to learn and grow with the team as we tackle these same structural metadata management problems on HBase, BigQuery, Redshift, Snowflake, MongoDB, Couchbase, MySQL and Postgres.

Speed, attention to detail and organizational skills required to accurately create and maintain about 500k lines of code across 16k scripts for each of 75 databases at a pace to keep up with code changes on a daily (or at least weekly) basis with less than a 0.99999% error rate.

Expertise doing database script Librarian work using ERWin, Toad, Oracle SQL Developer or similar tools for generating and maintaining database source code scripts, although none of these tools will scale up to within 5% of our requirements so you won't be using these tools for this position.

Stay Tuned for our Job Search Results in an upcoming post!


Just in case it is not glaringly obvious, this is a work of satire. This is a work of fiction. Names, characters, businesses, places, events, locales, and incidents are either the products of the author's imagination or used in a fictitious manner. Any resemblance to actual persons, living or dead, or actual events is purely coincidental.

Monday, September 10, 2018

Functional Dependency Hot Air



XKCD: Here To Help



In a previous blog entry, "Grokking the Data Model", I prototyped what appeared to be a reasonably efficient approach to programmatically finding single-column functional dependencies within a database table. I implemented an Oracle SQL prototype and ran it through some basic tests; it appeared to work and might be the tiny kernel of something useful.
This post covers the next step, as I attempt to turn that "hot air" of an idea into a useful utility. First, I'll repeat the example table I am using:


Create a simple table and fill with sample data

create table lab_test_rslt (
   seq_no number,
   subj_no number,
   site_no number,
   site_name varchar2(35),
   test_code varchar2(35),
   test_val varchar2(35),
   test_rslt varchar2(35),
   primary key (seq_no)
);
 
insert into lab_test_rslt values (1,1,10,'Happy Clinic','HR',60,'OK');
insert into lab_test_rslt values (2,2,10,'Happy Clinic','HR',70,'OK');
insert into lab_test_rslt values (3, 3,10,'Happy Clinic','HR',80,'OK');
insert into lab_test_rslt values (4, 4,10,'Happy Clinic','HR',55,'OK');
insert into lab_test_rslt values (5, 1,20,'Regional Med Ctr','HR',0,'DEATH');
insert into lab_test_rslt values (6,2,20,'Regional Med Ctr','HR',170,'HIGH');
insert into lab_test_rslt values (7,1,30,'Central Hospital','HR',65,'OK');
-- whoops! maybe a typo in site_name!
insert into lab_test_rslt values (8,3,30,'Centrel Hospital','HR',20,'LOW');
commit;

select * from lab_test_rslt order by 1;
 
SEQ_NO   SUBJ_NO   SITE_NO SITE_NAME          TEST_CODE   TEST_VAL   TEST_RSLT
    1         1        10 Happy Clinic       HR          60         OK
    2         2        10 Happy Clinic       HR          70         OK
    3         3        10 Happy Clinic       HR          80         OK
    4         4        10 Happy Clinic       HR          55         OK
    5         1        20 Regional Med Ctr   HR          0          DEATH
    6         2        20 Regional Med Ctr   HR          170        HIGH
    7         1        30 Central Hospital   HR          65         OK
    8         3        30 Centrel Hospital   HR          20         LOW
 
 8 rows selected. 

Note, this is a nicely trivial example of a table to analyze, with only seven columns and eight rows of data. Ok, so what would the SQL look like to analyze all columns in our example table?


Time to try our current approach out on all columns in our example table

with x as (
    select 
        case when count(distinct subj_no) over (partition by seq_no) = 1 then 1 else 0 end c1_cnt,
        case when count(distinct site_no) over (partition by seq_no) = 1 then 1 else 0 end c2_cnt,
        case when count(distinct site_name) over (partition by seq_no) = 1 then 1 else 0 end c3_cnt,
        case when count(distinct test_code) over (partition by seq_no) = 1 then 1 else 0 end c4_cnt,
        case when count(distinct test_val) over (partition by seq_no) = 1 then 1 else 0 end c5_cnt,
        case when count(distinct test_rslt) over (partition by seq_no) = 1 then 1 else 0 end c6_cnt,
        --
        case when count(distinct seq_no) over (partition by subj_no) = 1 then 1 else 0 end c7_cnt,
        case when count(distinct site_no) over (partition by subj_no) = 1 then 1 else 0 end c8_cnt,
        case when count(distinct site_name) over (partition by subj_no) = 1 then 1 else 0 end c9_cnt,
        case when count(distinct test_code) over (partition by subj_no) = 1 then 1 else 0 end c10_cnt,
        case when count(distinct test_val) over (partition by subj_no) = 1 then 1 else 0 end c11_cnt,
        case when count(distinct test_rslt) over (partition by subj_no) = 1 then 1 else 0 end c12_cnt,
        --
        case when count(distinct seq_no) over (partition by site_no) = 1 then 1 else 0 end c13_cnt,
        case when count(distinct subj_no) over (partition by site_no) = 1 then 1 else 0 end c14_cnt,
        case when count(distinct site_name) over (partition by site_no) = 1 then 1 else 0 end c15_cnt,
        case when count(distinct test_code) over (partition by site_no) = 1 then 1 else 0 end c16_cnt,
        case when count(distinct test_val) over (partition by site_no) = 1 then 1 else 0 end c17_cnt,
        case when count(distinct test_rslt) over (partition by site_no) = 1 then 1 else 0 end c18_cnt,
        --
        case when count(distinct seq_no) over (partition by site_name) = 1 then 1 else 0 end c19_cnt,
        case when count(distinct subj_no) over (partition by site_name) = 1 then 1 else 0 end c20_cnt,
        case when count(distinct site_no) over (partition by site_name) = 1 then 1 else 0 end c21_cnt,
        case when count(distinct test_code) over (partition by site_name) = 1 then 1 else 0 end c22_cnt,
        case when count(distinct test_val) over (partition by site_name) = 1 then 1 else 0 end c23_cnt,
        case when count(distinct test_rslt) over (partition by site_name) = 1 then 1 else 0 end c24_cnt,
        --
        case when count(distinct seq_no) over (partition by test_code) = 1 then 1 else 0 end c25_cnt,
        case when count(distinct subj_no) over (partition by test_code) = 1 then 1 else 0 end c26_cnt,
        case when count(distinct site_no) over (partition by test_code) = 1 then 1 else 0 end c27_cnt,
        case when count(distinct site_name) over (partition by test_code) = 1 then 1 else 0 end c28_cnt,
        case when count(distinct test_val) over (partition by test_code) = 1 then 1 else 0 end c29_cnt,
        case when count(distinct test_rslt) over (partition by test_code) = 1 then 1 else 0 end c30_cnt,
        --
        case when count(distinct seq_no) over (partition by test_val) = 1 then 1 else 0 end c31_cnt,
        case when count(distinct subj_no) over (partition by test_val) = 1 then 1 else 0 end c32_cnt,
        case when count(distinct site_no) over (partition by test_val) = 1 then 1 else 0 end c33_cnt,
        case when count(distinct site_name) over (partition by test_val) = 1 then 1 else 0 end c34_cnt,
        case when count(distinct test_code) over (partition by test_val) = 1 then 1 else 0 end c35_cnt,
        case when count(distinct test_rslt) over (partition by test_val) = 1 then 1 else 0 end c36_cnt,
        --
        case when count(distinct seq_no) over (partition by test_rslt) = 1 then 1 else 0 end c37_cnt,
        case when count(distinct subj_no) over (partition by test_rslt) = 1 then 1 else 0 end c38_cnt,
        case when count(distinct site_no) over (partition by test_rslt) = 1 then 1 else 0 end c39_cnt,
        case when count(distinct site_name) over (partition by test_rslt) = 1 then 1 else 0 end c40_cnt,
        case when count(distinct test_code) over (partition by test_rslt) = 1 then 1 else 0 end c41_cnt,
        case when count(distinct test_val) over (partition by test_rslt) = 1 then 1 else 0 end c42_cnt
    from lab_test_rslt
), x2 as (
    select 
        'seq_no' c1_name_1, 'subj_no' c1_name_2, sum(c1_cnt) / count(*) c1_fd_score,
        'seq_no' c2_name_1, 'site_no' c2_name_2, sum(c2_cnt) / count(*) c2_fd_score,
        'seq_no' c3_name_1, 'site_name' c3_name_2, sum(c3_cnt) / count(*) c3_fd_score,
        'seq_no' c4_name_1, 'test_code' c4_name_2, sum(c4_cnt) / count(*) c4_fd_score,
        'seq_no' c5_name_1, 'test_val' c5_name_2, sum(c5_cnt) / count(*) c5_fd_score,
        'seq_no' c6_name_1, 'test_rslt' c6_name_2, sum(c6_cnt) / count(*) c6_fd_score,
        --
        'subj_no' c7_name_1, 'seq_no' c7_name_2, sum(c7_cnt) / count(*) c7_fd_score,
        'subj_no' c8_name_1, 'site_no' c8_name_2, sum(c8_cnt) / count(*) c8_fd_score,
        'subj_no' c9_name_1, 'site_name' c9_name_2, sum(c9_cnt) / count(*) c9_fd_score,
        'subj_no' c10_name_1, 'test_code' c10_name_2, sum(c10_cnt) / count(*) c10_fd_score,
        'subj_no' c11_name_1, 'test_val' c11_name_2, sum(c11_cnt) / count(*) c11_fd_score,
        'subj_no' c12_name_1, 'test_rslt' c12_name_2, sum(c12_cnt) / count(*) c12_fd_score,
        --
        'site_no' c13_name_1, 'seq_no' c13_name_2, sum(c13_cnt) / count(*) c13_fd_score,
        'site_no' c14_name_1, 'subj_no' c14_name_2, sum(c14_cnt) / count(*) c14_fd_score,
        'site_no' c15_name_1, 'site_name' c15_name_2, sum(c15_cnt) / count(*) c15_fd_score,
        'site_no' c16_name_1, 'test_code' c16_name_2, sum(c16_cnt) / count(*) c16_fd_score,
        'site_no' c17_name_1, 'test_val' c17_name_2, sum(c17_cnt) / count(*) c17_fd_score,
        'site_no' c18_name_1, 'test_rslt' c18_name_2, sum(c18_cnt) / count(*) c18_fd_score,
        --
        'site_name' c19_name_1, 'seq_no' c19_name_2, sum(c19_cnt) / count(*) c19_fd_score,
        'site_name' c20_name_1, 'subj_no' c20_name_2, sum(c20_cnt) / count(*) c20_fd_score,
        'site_name' c21_name_1, 'site_no' c21_name_2, sum(c21_cnt) / count(*) c21_fd_score,
        'site_name' c22_name_1, 'test_code' c22_name_2, sum(c22_cnt) / count(*) c22_fd_score,
        'site_name' c23_name_1, 'test_val' c23_name_2, sum(c23_cnt) / count(*) c23_fd_score,
        'site_name' c24_name_1, 'test_rslt' c24_name_2, sum(c24_cnt) / count(*) c24_fd_score,
        --
        'test_code' c25_name_1, 'seq_no' c25_name_2, sum(c25_cnt) / count(*) c25_fd_score,
        'test_code' c26_name_1, 'subj_no' c26_name_2, sum(c26_cnt) / count(*) c26_fd_score,
        'test_code' c27_name_1, 'site_no' c27_name_2, sum(c27_cnt) / count(*) c27_fd_score,
        'test_code' c28_name_1, 'site_name' c28_name_2, sum(c28_cnt) / count(*) c28_fd_score,
        'test_code' c29_name_1, 'test_val' c29_name_2, sum(c29_cnt) / count(*) c29_fd_score,
        'test_code' c30_name_1, 'test_rslt' c30_name_2, sum(c30_cnt) / count(*) c30_fd_score,
        --
        'test_val' c31_name_1, 'seq_no' c31_name_2, sum(c31_cnt) / count(*) c31_fd_score,
        'test_val' c32_name_1, 'site_no' c32_name_2, sum(c32_cnt) / count(*) c32_fd_score,
        'test_val' c33_name_1, 'site_name' c33_name_2, sum(c33_cnt) / count(*) c33_fd_score,
        'test_val' c34_name_1, 'test_code' c34_name_2, sum(c34_cnt) / count(*) c34_fd_score,
        'test_val' c35_name_1, 'test_rslt' c35_name_2, sum(c35_cnt) / count(*) c35_fd_score,
        'test_val' c36_name_1, 'subj_no' c36_name_2, sum(c36_cnt) / count(*) c36_fd_score,
        --
        'test_rslt' c37_name_1, 'site_no' c37_name_2, sum(c37_cnt) / count(*) c37_fd_score,
        'test_rslt' c38_name_1, 'site_name' c38_name_2, sum(c38_cnt) / count(*) c38_fd_score,
        'test_rslt' c39_name_1, 'test_code' c39_name_2, sum(c39_cnt) / count(*) c39_fd_score,
        'test_rslt' c40_name_1, 'test_val' c40_name_2, sum(c40_cnt) / count(*) c40_fd_score,
        'test_rslt' c41_name_1, 'seq_no' c41_name_2, sum(c41_cnt) / count(*) c41_fd_score,
        'test_rslt' c42_name_1, 'subj_no' c42_name_2, sum(c42_cnt) / count(*) c42_fd_score
    from x
)
select c1, c2, fd_score
from x2
unpivot include nulls (
    (c1, c2, fd_score)
    for name in (
        (c1_name_1, c1_name_2, c1_fd_score),
        (c2_name_1, c2_name_2, c2_fd_score),
        (c3_name_1, c3_name_2, c3_fd_score),
        (c4_name_1, c4_name_2, c4_fd_score),
        (c5_name_1, c5_name_2, c5_fd_score),
        (c6_name_1, c6_name_2, c6_fd_score),
        --
        (c7_name_1, c7_name_2, c7_fd_score),
        (c8_name_1, c8_name_2, c8_fd_score),
        (c9_name_1, c9_name_2, c9_fd_score),
        (c10_name_1, c10_name_2, c10_fd_score),
        (c11_name_1, c11_name_2, c11_fd_score),
        (c12_name_1, c12_name_2, c12_fd_score),
        --
        (c13_name_1, c13_name_2, c13_fd_score),
        (c14_name_1, c14_name_2, c14_fd_score),
        (c15_name_1, c15_name_2, c15_fd_score),
        (c16_name_1, c16_name_2, c16_fd_score),
        (c17_name_1, c17_name_2, c17_fd_score),
        (c18_name_1, c18_name_2, c18_fd_score),
        --
        (c19_name_1, c19_name_2, c19_fd_score),
        (c20_name_1, c20_name_2, c20_fd_score),
        (c21_name_1, c21_name_2, c21_fd_score),
        (c22_name_1, c22_name_2, c22_fd_score),
        (c23_name_1, c23_name_2, c23_fd_score),
        (c24_name_1, c24_name_2, c24_fd_score),
        --
        (c25_name_1, c25_name_2, c25_fd_score),
        (c26_name_1, c26_name_2, c26_fd_score),
        (c27_name_1, c27_name_2, c27_fd_score),
        (c28_name_1, c28_name_2, c28_fd_score),
        (c29_name_1, c29_name_2, c29_fd_score),
        (c30_name_1, c30_name_2, c30_fd_score),
        --
        (c31_name_1, c31_name_2, c31_fd_score),
        (c32_name_1, c32_name_2, c32_fd_score),
        (c33_name_1, c33_name_2, c33_fd_score),
        (c34_name_1, c34_name_2, c34_fd_score),
        (c35_name_1, c35_name_2, c35_fd_score),
        (c36_name_1, c36_name_2, c36_fd_score),
        --
        (c37_name_1, c37_name_2, c37_fd_score),
        (c38_name_1, c38_name_2, c38_fd_score),
        (c39_name_1, c39_name_2, c39_fd_score),
        (c40_name_1, c40_name_2, c40_fd_score),
        (c41_name_1, c41_name_2, c41_fd_score),
        (c42_name_1, c42_name_2, c42_fd_score)
    )
);

C1         C2         FD_SCORE NOTES(see below)
seq_no     subj_no           1  (2)
seq_no     site_no           1  (2)
seq_no     site_name         1  (2)
seq_no     test_code         1  (2)(5)
seq_no     test_val          1  (3)
seq_no     test_rslt         1  (2)
subj_no    seq_no        0.125  (1)
subj_no    site_no       0.125
subj_no    site_name     0.125
subj_no    test_code         1  (5)
subj_no    test_val      0.125  (1)
subj_no    test_rslt     0.125
site_no    seq_no            0  (1)
site_no    subj_no           0
site_no    site_name      0.75  =====> almost FD
site_no    test_code         1  (5)
site_no    test_val          0  (1)
site_no    test_rslt       0.5
site_name  seq_no         0.25  (1)
site_name  subj_no        0.25
site_name  site_no           1  =====> a FD!
site_name  test_code         1  (5)
site_name  test_val       0.25  (1)
site_name  test_rslt      0.75  =====> yet another almost FD?
test_code  seq_no            0  (1)(4)
test_code  subj_no           0  (4)
test_code  site_no           0  (4)
test_code  site_name         0  (4)
test_code  test_val          0  (1)(6)
test_code  test_rslt         0  (4)
test_val   seq_no            1  (3)(4)
test_val   site_no           1  (2)(4)
test_val   site_name         1  (2)(4)
test_val   test_code         1  (2)(6)
test_val   test_rslt         1  (2)(4)
test_val   subj_no           1  (2)(4)
test_rslt  site_no       0.375
test_rslt  site_name     0.375
test_rslt  test_code     0.375  (5)
test_rslt  test_val      0.375  (1)
test_rslt  seq_no            1  (1)
test_rslt  subj_no       0.375

42 rows selected.


Good Grief!

I spent some time wrapping my head around this. This sure is a lot of SQL for analyzing a modest seven column table! That awkwardness may become an issue later (foreshadowing). Well, hopefully I can mitigate some of the pain of the large/ugly SQL by not writing it by hand and generating it instead.
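
To make "generating it instead" concrete, here is a rough Python sketch, purely for illustration and not the actual generator used later; it just stamps out the pairwise CASE expressions from a column list:

cols = ['seq_no', 'subj_no', 'site_no', 'site_name', 'test_code', 'test_val', 'test_rslt']

case_exprs = []
for c1 in cols:                      # the "determinant" column (partition by)
    for c2 in cols:                  # the candidate dependent column
        if c1 == c2:
            continue
        i = len(case_exprs) + 1
        case_exprs.append(
            "case when count(distinct %s) over (partition by %s) = 1 "
            "then 1 else 0 end c%d_cnt" % (c2, c1, i))

sql = ("with x as (\n    select\n        "
       + ",\n        ".join(case_exprs)
       + "\n    from lab_test_rslt\n)")
print(sql)    # 7 columns -> 42 CASE expressions, matching the query above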

A general-purpose language like Python or Java would be the typical choice to hack away at this with, but I saw a possible opportunity to optimize by bringing the "code to the data" if I could represent a functional dependency algorithm in SQL. Whether Oracle, Postgres, or Spark SQL, a similar approach could be applied to multiple domains.

The above results nicely illustrate a few useful things.

Functional dependencies are not symmetric: C1->C2 does not imply C2->C1. For example, site_no is functionally dependent on site_name, but not vice versa.

Test_rslt is almost functionally dependent on site_no. Happy Clinic sure does seem happier than anywhere else. Someone may need to look into that :).

I also noticed a lot of false positives and other weirdness that can be leveraged to further optimize the approach. I added a "NOTES" column above and detailed these cases below.



Special cases that can ideally be detected beforehand and excluded from the FD analysis

Uniqueness Cases

(1) When C1 is not unique and C2 is unique, then C1!->C2.

If column C1 is not unique and C2 is unique, then C2 cannot be functionally dependent on C1.

ex:
C1=* except seq_no C2=test_val
C1=* except test_val C2=seq_no

(2) When C1 is unique and C2 is not unique, then C1->C2.

C1, being unique, is simply a candidate key for the table. Therefore, C2 is functionally dependent on C1.

ex:
C1=test_val C2=* except seq_no
C1=seq_no C2=* except test_val

(3) When both C1 and C2 are unique, then C1->C2 and C2->C1.

ex:
C1=seq_no C2=test_val
C1=test_val C2=seq_no

Analyzing cases where C1 and/or C2 is unique is wasted effort; it provides no new useful information. A useful approach is to exclude unique columns from FD analysis.

Single-Value Cases

(4) When C1 has a single value for all rows and C2 does not, then C1!->C2.

ex:
C1=test_code C2=* except test_val
C1=test_val C2=* except test_code

(5) When C2 has a single value for all rows and C1 does not, then C1->C2.

C2, always having the same value, can be "derived" from anything/nothing, so C1->C2 will appear as a "false positive" functional dependency.

ex:
C1=* except test_val C2=test_code
C1=* except test_code C2=test_val

(6) When both C1 and C2 have a single value for all rows, then C1->C2 and C2->C1.

ex:
C1=test_val C2=test_code
C1=test_code C2=test_val

Analyzing cases where C1 and/or C2 has a single value for all rows is wasted effort; it provides no new useful information. A useful approach is to exclude single-value columns from FD analysis.
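
As a minimal sketch of that pruning rule (a hypothetical helper, not the one inside fd.sql): given each column's distinct-value count and the table's row count, drop anything unique or single-valued before forming pairs.

def prune_columns(col_stats, row_count):
    """col_stats maps column name -> number of distinct values."""
    keep = []
    for col, num_distinct in col_stats.items():
        if num_distinct == row_count:   # unique column: cases (1), (2), (3)
            continue
        if num_distinct == 1:           # single-value column: cases (4), (5), (6)
            continue
        keep.append(col)
    return keep

# stats for the lab_test_rslt example above (8 rows)
stats = {'seq_no': 8, 'subj_no': 4, 'site_no': 3, 'site_name': 4,
         'test_code': 1, 'test_val': 8, 'test_rslt': 4}
print(prune_columns(stats, 8))   # ['subj_no', 'site_no', 'site_name', 'test_rslt']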



My first attempt at FD Analyzer


My first attempt is a 527-line fd.sql script for an Oracle >=11 database that uses a single anonymous PL/SQL block to accept runtime parameters, generate and execute the analysis SQL, and print the results.

I use the optimized SQL approach I arrived at in my earlier post, which will scan the target table only ONCE for all FD pairs in the table. Therefore, a table with 100 columns will require a single scan (not 100!/(100-2)! = 9,900 scans, or even 100 scans). I generate the onerously long SQL dynamically and optimize for the six special cases above by detecting and pruning columns related to these special cases from the FD analysis.

Here is what the result looks like. I am running the FD analyzer (fd.sql) against my example schema.table (user.LAB_TEST_RSLT) and sampling 100% of the data for analysis:
SQL> @fd user lab_test_rslt 100

SITE_NAME -> SITE_NO

PL/SQL procedure successfully completed.
Elapsed: 00:00:00.050


Short and sweet, but what is really going on? Here is a second execution with DEBUG and VERBOSE messages enabled to illustrate some of what is going on under the hood:
SQL> @fd user lab_test_rslt 100
        >>> VBS: Owner.table: user.LAB_TEST_RSLT
        >>> VBS:Sample factor (0.000001-100): 100
        >>> VBS:FD_Score lower limit (0-1): >=1
        >>> VBS:# of Columns before prune: 7
        >>> VBS:pruning: 1,SEQ_NO,cnt=8,num_distinct=8,num_nulls=0 Unique ...
        >>> VBS:NOT pruning: 2,SUBJ_NO,cnt=8,num_distinct=4,num_nulls=0
        >>> VBS:NOT pruning: 3,SITE_NO,cnt=8,num_distinct=3,num_nulls=0
        >>> VBS:NOT pruning: 4,SITE_NAME,cnt=8,num_distinct=4,num_nulls=0
        >>> VBS:pruning: 5,TEST_CODE,cnt=8,num_distinct=1,num_nulls=0 Only one distinct value ...
        >>> VBS:pruning: 6,TEST_VAL,cnt=8,num_distinct=8,num_nulls=0 Unique ...
        >>> VBS:NOT pruning: 7,TEST_RSLT,cnt=8,num_distinct=4,num_nulls=0
        >>> VBS:# of Columns after prune: 4
        >>> VBS:# of column pairs: 12
        >>> DBG:get_results();
        >>> DBG:1: SUBJ_NO,SITE_NO
        >>> DBG:2: SUBJ_NO,SITE_NAME
        >>> DBG:3: SUBJ_NO,TEST_RSLT
        >>> DBG:4: SITE_NO,SUBJ_NO
        >>> DBG:5: SITE_NO,SITE_NAME
        >>> DBG:6: SITE_NO,TEST_RSLT
        >>> DBG:7: SITE_NAME,SUBJ_NO
        >>> DBG:8: SITE_NAME,SITE_NO
        >>> DBG:9: SITE_NAME,TEST_RSLT
        >>> DBG:10: TEST_RSLT,SUBJ_NO
        >>> DBG:11: TEST_RSLT,SITE_NO
        >>> DBG:12: TEST_RSLT,SITE_NAME
        >>> DBG:with x as (
        select /*+ monitor */
                case when count(distinct SITE_NO) over (partition by SUBJ_NO)=1 then 1 else 0 end c1,
                case when count(distinct SITE_NAME) over (partition by SUBJ_NO)=1 then 1 else 0 end c2,
                case when count(distinct TEST_RSLT) over (partition by SUBJ_NO)=1 then 1 else 0 end c3,
                case when count(distinct SUBJ_NO) over (partition by SITE_NO)=1 then 1 else 0 end c4,
                case when count(distinct SITE_NAME) over (partition by SITE_NO)=1 then 1 else 0 end c5,
                case when count(distinct TEST_RSLT) over (partition by SITE_NO)=1 then 1 else 0 end c6,
                case when count(distinct SUBJ_NO) over (partition by SITE_NAME)=1 then 1 else 0 end c7,
                case when count(distinct SITE_NO) over (partition by SITE_NAME)=1 then 1 else 0 end c8,
                case when count(distinct TEST_RSLT) over (partition by SITE_NAME)=1 then 1 else 0 end c9,
                case when count(distinct SUBJ_NO) over (partition by TEST_RSLT)=1 then 1 else 0 end c10,
                case when count(distinct SITE_NO) over (partition by TEST_RSLT)=1 then 1 else 0 end c11,
                case when count(distinct SITE_NAME) over (partition by TEST_RSLT)=1 then 1 else 0 end c12
        from user.LAB_TEST_RSLT
), x2 as (
        select count(*) c,
                sum(c1)/count(*) p1fd,
                sum(c2)/count(*) p2fd,
                sum(c3)/count(*) p3fd,
                sum(c4)/count(*) p4fd,
                sum(c5)/count(*) p5fd,
                sum(c6)/count(*) p6fd,
                sum(c7)/count(*) p7fd,
                sum(c8)/count(*) p8fd,
                sum(c9)/count(*) p9fd,
                sum(c10)/count(*) p10fd,
                sum(c11)/count(*) p11fd,
                sum(c12)/count(*) p12fd
        from x
)
select c, fd
from x2 unpivot include nulls (
        (fd) for n in (
                p1fd,           p2fd,           p3fd,           p4fd,           p5fd,           p6fd,           p7fd,           p8fd,           p9fd,           p10fd,
                p11fd,          p12fd   )
)
        >>> DBG:sql length: 1646
        >>> DBG:l_results.count:12
        >>> DBG:SUBJ_NO !-> SITE_NO (s=.125 cnt=8)
        >>> DBG:SUBJ_NO !-> SITE_NAME (s=.125 cnt=8)
        >>> DBG:SUBJ_NO !-> TEST_RSLT (s=.125 cnt=8)
        >>> DBG:SITE_NO !-> SUBJ_NO (s=0 cnt=8)
        >>> DBG:SITE_NO !-> SITE_NAME (s=.75 cnt=8)
        >>> DBG:SITE_NO !-> TEST_RSLT (s=.5 cnt=8)
        >>> DBG:SITE_NAME !-> SUBJ_NO (s=.25 cnt=8)
SITE_NAME -> SITE_NO
        >>> VBS:        (s=1 cnt=8)
        >>> DBG:SITE_NAME !-> TEST_RSLT (s=.75 cnt=8)
        >>> DBG:TEST_RSLT !-> SUBJ_NO (s=.375 cnt=8)
        >>> DBG:TEST_RSLT !-> SITE_NO (s=.375 cnt=8)
        >>> DBG:TEST_RSLT !-> SITE_NAME (s=.375 cnt=8)

PL/SQL procedure successfully completed.
Elapsed: 00:00:00.058




Does it scale?


Next, let's try a somewhat more challenging table to analyze for functional dependencies to see if my approach scales.

This table has 50 columns and 50,000 rows of data:
exec dbms_random.seed(0);

drop table t1;

create table t1 as select
        0 c1,
        trunc(2 * dbms_random.normal) c2,
        trunc(2 * dbms_random.normal) c3,
        trunc(10 * dbms_random.normal) c4,
        trunc(10 * dbms_random.normal) c5,
        dbms_random.string('X',2) c6,
        dbms_random.string('X',8) c7,
        rownum c8,
        trunc(100 * dbms_random.normal) c9,
        trunc(100 * dbms_random.normal) c10,
        --
        1 c11,
        trunc(2 * dbms_random.normal) c12,
        trunc(2 * dbms_random.normal) c13,
        trunc(10 * dbms_random.normal) c14,
        trunc(10 * dbms_random.normal) c15,
        dbms_random.string('X',2) c16,
        dbms_random.string('X',8) c17,
        rownum c18,
        trunc(100 * dbms_random.normal) c19,
        trunc(100 * dbms_random.normal) c20,
        --
        2 c21,
        trunc(2 * dbms_random.normal) c22,
        trunc(2 * dbms_random.normal) c23,
        trunc(10 * dbms_random.normal) c24,
        trunc(10 * dbms_random.normal) c25,
        dbms_random.string('X',2) c26,
        dbms_random.string('X',8) c27,
        rownum c28,
        trunc(100 * dbms_random.normal) c29,
        trunc(100 * dbms_random.normal) c30,
        --
        3 c31,
        trunc(2 * dbms_random.normal) c32,
        trunc(2 * dbms_random.normal) c33,
        trunc(10 * dbms_random.normal) c34,
        trunc(10 * dbms_random.normal) c35,
        dbms_random.string('X',2) c36,
        dbms_random.string('X',8) c37,
        rownum c38,
        trunc(100 * dbms_random.normal) c39,
        trunc(100 * dbms_random.normal) c40,
        --
        4 c41,
        trunc(2 * dbms_random.normal) c42,
        trunc(2 * dbms_random.normal) c43,
        trunc(10 * dbms_random.normal) c44,
        trunc(10 * dbms_random.normal) c45,
        dbms_random.string('X',2) c46,
        dbms_random.string('X',8) c47,
        rownum c48,
        trunc(100 * dbms_random.normal) c49,
        trunc(100 * dbms_random.normal) c50
    from dual
connect by level <= 50000;

exec dbms_stats.gather_table_stats(user, 'T1');

The resulting table data looks like 50 columns * 50,000 rows of this:
SQL> select * from t1 where rownum < 11;

  C1   C2   C3    C4    C5 C6   C7           C8     C9    C10   C11   C12   C13   C14   C15 C16   C17          C18   C19    C20   C21   C22   C23   C24   C25 C26   C27          C28   C29    C30   C31   C32   C33   C34   C35 C36   C37          C38    C39    C40   C41   C42
   C43   C44   C45 C46   C47          C48   C49    C50
   0   -1   -1    -1    -6 GY   XTG9HRTG      1    -48   -103     1     1     1    -9     2 41    DY9P0VZ0       1   -26   -100     2     0     0    11    -4 DB    R4OC4JOX       1   -19    -79     3     2     0   -11     4 RG    TXUKDWWX       1     49    -88     4     0
    -3   -21    28 US    Y8K1VNS2       1    93   -141
   0    1   -1     1     7 5N   YQ8N1DIO      2     32    -33     1     0    -3   -17   -18 N1    9KJVSHV2       2   -77     -1     2     0     0    13     0 0P    6V6GXE7B       2    71     34     3    -1     0   -10    -9 CZ    Q4RY4USA       2   -108    -65     4     0
     0     9     2 32    HCTVD58M       2   -22    -58
   0    1   -1    15    13 SY   CT0BQDS3      3    -14    -23     1     3     0    -8     5 5Z    Q41W2V3S       3    -1    -36     2    -3     0     3     7 XT    F9G1YA5Z       3   -32    -33     3     2    -1     5     7 UF    O337E2NW       3     46    -45     4     1
    -1    -6    -8 UZ    2W57U5YC       3   -33    -73
   0   -1   -3     0    16 A0   1PFRZ9E8      4   -123     54     1     0    -1    -4    -6 WP    IXXKNG7W       4   104    -32     2    -3     0    -8     0 Q8    V3T74TF4       4   -44     29     3     3     0    13   -14 BN    PM420DX7       4    -32    -66     4     0
    -1    -2     5 92    C0JY11BK       4    18     58
   0   -5   -1    15    -4 M0   28LRLTA4      5    -69     99     1     0     0     6    -4 79    IOEN7CJL       5    28    154     2     2     0   -12    -5 B9    WXQK7NSI       5    27    -50     3     0    -2     4     4 7Y    552UCMT5       5     42   -231     4    -4
     0     1    22 1U    PSX1WZS8       5    53     61
   0   -1    0     3    17 WU   EKZ5FEB4      6    -38     20     1     0     1   -19   -10 VC    7OT06XYV       6   121      3     2    -2    -3    10    -1 8S    3IRG90W0       6    49    -59     3     0     0    12    12 50    PZRQNVQ9       6    -66    -18     4     0
    -3    11     1 W5    81BMFSCZ       6    88     38
   0    0    0     2   -10 WJ   H52IBY15      7     57    -35     1     0     0     2    13 60    85S3FI0K       7    21     63     2     0    -3    13     3 IK    S6IXB3B2       7   255     68     3     0     1    -6     2 RC    LUQETJA1       7    170     -4     4    -1
     0   -15     1 20    10NGVIAO       7    12     82
   0    3    1   -13     3 I4   S1I5M01B      8     12    -44     1     0     0    -4     1 XR    LX5206RF       8   -36     28     2    -2     0     7    22 OU    J4C13AIB       8    45   -225     3    -4     0    10   -11 JC    NIAJHI6Q       8     57    -54     4     2
     0    11    -9 W0    5FQ1139O       8   -79   -204
   0    3   -1    17     3 DR   AG205VLD      9    -60   -106     1     0     2    15   -15 CO    S3IQES03       9   -71      7     2     0    -2     2     6 KU    6HDHMCUN       9   -13    237     3    -1    -2    -1   -12 64    CEV6D187       9    -19    -94     4     0
     0    -1    -1 OS    VQ5EGDDW       9   -99   -189
   0   -2   -2   -17    -9 XE   A1YT2MLU     10    -29    168     1     5     2    -7   -21 CH    NWJJ0RQ7      10   -92   -389     2    -2     0    12   -15 IR    020QEHLL      10   -71     85     3    -2     1    13   -10 6O    LSI4DOY3      10    238    -43     4     2
     0    -3     5 UY    3SXFEAJL      10   -75     56

10 rows selected.



As is usually the case, I encountered several limitations that needed to be addressed once I started trying to analyze tables that were not trivial textbook examples:

  • I began by using a VARCHAR2 variable for the analysis SQL I was generating, but had to change to a CLOB, as VARCHAR2s are limited to 32,767 characters and I very quickly exceeded that limit with the bulky, repetitive SQL I was generating.
  • I had to limit the number of column pairs analyzed in a single SQL statement to avoid ORA-01792: maximum number of columns in a table or view is 1000. A table with 33 columns results in 1,056 permutations of column pairs, which is TOO MANY. I mitigated this by batching the analysis into 700 column permutations per SQL statement (see the sketch after this list). Unfortunately, more SQL statements require more table scans, slowing down the analysis.
  • I experimented with choosing a random sample of rows, driven by a sample percentage (0.000001 - 100, where 100 = ALL) via Oracle's SAMPLE() clause. This seems to have limited use, as I cannot see how the result from a sample is statistically useful in this case.
  • I started with running additional SQL to determine column cardinality (unique, single-value, etc.), but changed to leveraging existing Oracle table stats instead. This works great as long as I remember to run dbms_stats.gather_table_stats().
  • I also experimented with partial functional dependencies by introducing a g_fd_score_lower_limit variable that can be set to 1.0 to find strict FDs only, or to lower values to find "almost FDs" as well. This does seem like it might be useful.
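
Here is the batching sketch referred to in the second bullet (a simplified illustration, not the actual fd.sql logic): split the ordered column pairs into chunks small enough that each generated SQL statement stays under the 1,000 column limit.

from itertools import permutations

def batched_pairs(cols, batch_size=700):
    """Yield lists of ordered column pairs, one list per generated SQL statement."""
    pairs = list(permutations(cols, 2))      # ordered pairs, since FDs are not symmetric
    for start in range(0, len(pairs), batch_size):
        yield pairs[start:start + batch_size]

cols = ['c%d' % i for i in range(1, 40)]                 # 39 columns -> 1,482 ordered pairs
print([len(batch) for batch in batched_pairs(cols)])     # [700, 700, 82]

In the 50 column test below, 39 columns survive pruning, which is how the analysis ends up split into statements of 700, 700 and 82 pairs.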

My environment was Oracle 12.2.0.1, on Redhat, a Dell PowerEdge R730XD with 16 cores. Here is what the execution profile looks like when I fire up this lights-dimming monster:



For this 50 column table with 50,000 rows, the execution time was an hour and forty-five minutes. That's pretty horrible. I normally would not even wait around that long for a SQL script, but I'm spending time to write up a post about this, so wanted to have something to write about.

I'll spare you the 3,622 lines of debug & verbose stdout here, but if you want to see the whole tamale:
fd_XXX_t1_100.log

Fd.sql pruned 11 of the 50 columns, leaving 39 columns to analyze, resulting in 1,482 column pairs, any of which could potentially be a functional dependency. Without pruning, there would be 2,450 column pairs to analyze (n!/(n-2)! where n = the number of columns), so pruning reduced the work to do by about 40%.
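
Spelled out as quick arithmetic (ordered pairs, since FD direction matters):

pairs_before = 50 * 49          # n * (n-1) = n!/(n-2)! = 2,450 ordered pairs
pairs_after = 39 * 38           # 1,482 ordered pairs after pruning 11 columns
print(round(100 * (1 - pairs_after / pairs_before)))   # ~40 (percent reduction)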

As noted above, the resulting SQL was so large that, to avoid ORA-01792: maximum number of columns in a table or view is 1000, fd.sql split the analysis into three SQL statements: the first two for 700 column pairs each and the third for the final 82 pairs. If you squint at the above SQL execution profile, you can see two large humps for the first two SQL statements (apparently the third statement was too trivial to appear).

The time the database spent doing work was primarily CPU (63%), secondarily I/O (37%). There was 331 GB of I/O, almost evenly split between reads (55%) and writes (45%).



Under the Hood


I also captured an Oracle trace for a different execution against a smaller 10,000 row version of the 50 column table in a sandbox Oracle instance on my laptop. This allowed for a more detailed inspection of what Oracle was doing under the hood.

This execution took about 16 minutes, 5 minutes of which were spent just parsing several absolutely ridiculous multi-thousand-line SQL selects. Looking at the trace, you can tell that most of the CPU is the result of several thousand sorts.

Here is the trace, the tkprof with no sort, and tkprof sorted by execution time:


In summary, my optimizations reduced redundant table scans, but that ended up being the least important part of the overall workload. The real work was the many COUNT(DISTINCT) operations, each requiring a SORT. There are n!/(n-2)! SORTs required, where n = the number of columns.

This algorithm is a textbook example of O(n²) complexity: as the number of columns grows linearly, the time to compute grows quadratically.

I need a better algorithm.

Next Steps (to be addressed in a future post)


We have now taken the original prototype from "Grokking the Data Model" and used it as the basis for a first attempt at an FD Analyzer utility. Everything looked hunky-dory (technical term for groovy) until we tried to use it against a table that was not a trivial textbook example; then the harsh reality of O(n²) algorithms crashed the party.

The next step is to find a better algorithm, or at least understand why a better algorithm cannot be found. I suspect that my favorite utility will be useful (https://lmgtfy.com/).

More to come in a future post.

Friday, August 31, 2018

An Interview Question I Would be Horrible At

XKCD: Strengths and Weaknesses

Have you ever had programming/algorithm interview questions that require you to write code off the top-of-your-head without a keyboard at your fingertips, while in a stressful situation, away from your normal development environment?

I've had a few interviews like that in the past. I usually come up with all sorts of creative answers

... starting right after I finish the interview, as I'm driving away in my car, then hitting a peak the morning after (that's when I really shine!). During the actual interview, my apparent IQ drops by 1/3.

Maybe the reason I suck at "whiteboard coding" in interviews is that I don't practice it. It is nothing like what I actually do for a job, so I would need to suck it up and spend some time practicing "coding quizzes" to do well in such an interview. I would rather skip that and do real work instead.

Also, I have to admit another probable reason as well. I'm just not well-suited for it. I would characterize my general methodology as somewhat measured, thoughtful and methodical to offset my zany, intuitive, artsy tendencies. This is an unlikely teeter-totter balance honed from dauntless decades of determined dedication. Also, there is a certain amount of high-quality coffee required. This juggling act just does not mix well with interviews. Strengths are weaknesses.

Having said all that, I'm going to visit a worst-nightmare example of an interview question for "educational purposes":


THE INTERVIEW QUESTION

Given a list of n+2 numbers, with all numbers in the range of 1..n and all numbers unique except for two numbers which each occur twice, find the duplicate numbers. Use two different methods, each using no more than one loop. Compare and contrast the advantages and disadvantages of each.

To get warmed up, let's start with the easiest algorithm I can think of -- one that would be a FAIL because it uses two loops:

Method #1 Nested Loops


Use two loops to compare every list element to every other element. If two different elements are equal, then you have found a duplicate.

Python:
mylist = [1,2,3,3,4,5,6,6,7]
mylist_size = len(mylist)
# compare each element to every element that comes after it
for x in range(0, mylist_size):
    for y in range(x + 1, mylist_size):
        # equal values at two different positions means a duplicate
        if mylist[x] == mylist[y]:
            print(mylist[x])

This method does not take advantage of all the provided information: that there are only two repeating elements and the elements are in the range 1..n. It is a more generic and flexible approach, which can adapt to an arbitrary number of repeating elements (not just two) and can also handle lists of numbers outside of the 1..n range. Such flexibility is valuable if the needs are likely to evolve over time, which is an important consideration in building real-world systems.

The performance is O(n²), so is directly proportional to the square of the list size (it will gobble up cpu/time). For example, a 1,000 item list only took 6 seconds on my laptop, but a 10,000 item list took over two minutes. I can estimate that a 100,000 item list would take 100 times longer (about 3 1/2 hours).
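
For reference, a minimal timing harness along these lines (illustrative only; the make_list helper is made up here, and timings will obviously vary by machine):

import random
import time

def make_list(n):
    """n+2 values in the range 1..n, with two values duplicated."""
    lst = list(range(1, n + 1)) + [1, 2]
    random.shuffle(lst)
    return lst

mylist = make_list(10000)
mylist_size = len(mylist)
start = time.perf_counter()
for x in range(0, mylist_size):
    for y in range(x + 1, mylist_size):
        if mylist[x] == mylist[y]:
            print(mylist[x])
print("seconds:", time.perf_counter() - start)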

On a positive note, it requires minimal O(1) state (it will not gobble up disk/memory) and is the simplest of the three methods.

This does not scale out well as it is essentially a cartesian product. It would be problematic to distribute this across many compute nodes as you would need to copy/shuffle the list around quite a bit.

This approach is perhaps well-suited for implementations with tiny data (dozens to hundreds of rows tops), where the code-simplicity outweighs all other concerns. Maybe it is also useful as an example of What Not To Do.


Method #2 loop and dictionary


For the second attempt, let's try to actually provide a solution that will meet the requirements.

Use one loop and a dictionary to keep track of all elements in the list. When you see an element a 2nd time, then it is a duplicate.

Python:
mylist = [1,2,3,3,4,5,6,6,7]
mylist_size = len(mylist)

# dictionary of elements seen so far
count_dict = {}
for x in range(0, mylist_size):
    # we have already seen this element, it is a duplicate
    if (mylist[x] in count_dict):
        print(mylist[x])
    else:
        # have not seen this element yet, so add to dictionary
        count_dict[mylist[x]] = 1


Like method #1, this method does not take advantage of all the provided information: that there are only two repeating elements and that the numbers are in the range 1..n. It is likewise a more generic and flexible approach that can handle similar adaptations.

This approach performs O(n) for cpu/time but also requires O(n) state (memory/disk -- whichever) for the dictionary, so is more efficient in terms of cpu/time and state than method #1, but is more complicated. There are variations that reduce state a little by keeping a dictionary of hashes, or modifying the list itself to mark duplicates.
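
For example, here is a rough sketch of the "modify the list itself" variation (it leans on the 1..n range so every value can be used as an index, and it clobbers the original list):

mylist = [1,2,3,3,4,5,6,6,7]
for v in mylist:
    idx = abs(v) - 1                 # values are 1..n, so this is always a valid index
    if mylist[idx] < 0:
        print(abs(v))                # second visit to this value: it is a duplicate
    else:
        mylist[idx] = -mylist[idx]   # first visit: mark the value as seen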

A 100,000 item list took about 6 seconds on my laptop (I WAG this is ~2,000 times faster than method 1 for the same sized list). I did not bother to measure the memory used by the dictionary, but that would be the limit one would hit first using this method.

This approach is straightforward to scale out across many nodes, as you can easily partition out the work and then reassemble the results.

I would generally default to this approach as the best-balanced of the three methods here.


Method #3 system of equations


Create a system of equations to solve for the two repeating values.

given:
    S is sum of all numbers in the list
    P is the product of all numbers in the list
    X and Y are the duplicate numbers in the list
    the sum of integers from 1 to N is N(N+1)/2
    the product of integers from 1 to N is N!

When N(N+1)/2 is subtracted from the sum of the list, we get X + Y, because X and Y each appear one extra time beyond the set [1..N]:

    (1) X + Y = S - N(N+1)/2

Similarly, calculate the product P of the list. When this product of the list is divided by N!, we get X*Y:

    (2) XY = P/N!

Using the above two equations, we can solve for X and Y.  I could use an Algebra refresher, so this could probably be done more simply, but here is an example:

For array = 4 2 4 5 2 3 1, we get S = 21 and P = 960.

    start with (1) from above    X + Y = S - N(N+1)/2
    substitute values for S and N
    X + Y = (4+2+4+5+2+3+1) - 5(5+1)/2
    X + Y = 21 - 15
    X + Y = 6

    start with (2) from above    XY = P/N!
    substitute values for P and N and reduce
    XY = 960/5!
    XY = 8

State (X - Y) in terms of (X + Y) and (XY) to make solving with substitution easier:

    restate using square root of a square identity
    X - Y = sqrt( (X - Y)² )
    expand the squared difference
    X - Y = sqrt( X² + Y² - 2XY )
    rewrite the XY terms
    X - Y = sqrt( X² + Y² + 2XY - 4XY )
    factor X² + Y² + 2XY into a squared sum
    X - Y = sqrt( (X + Y)² - 4XY )
    substitute values for (X + Y) and XY and reduce
    X - Y = sqrt( 6² - 4*8 )
    X - Y = sqrt( 36 - 32)
    X - Y = sqrt(4)
    X - Y = 2

Now, we have a simple system of two equations to solve:

    (3) X + Y = 6
    (4) X - Y = 2

    add (3) and (4) from above to solve for X
    X = ((X - Y) + (X + Y)) / 2
    X = (2 + 6) / 2
    X = 8 / 2
    X = 4

    subtract (4) from (3) above to solve for Y
    Y = ((X + Y) - (X - Y)) / 2
    Y = (6 - 2) / 2
    Y = 4 / 2
    Y = 2


Python:
import math

mylist = [1,2,3,3,4,5,6,6,7]
mylist_size = len(mylist)
n = mylist_size - 2

# Calculate sum and product of all elements in list
s = 0
p = 1
for i in range(0, mylist_size):
    s = s + mylist[i]
    p = p * mylist[i]
  
# x + y
s = s - n * (n + 1) // 2
# x * y
p = p // math.factorial(n)  
# x - y
d = math.sqrt(s * s - 4 * p)
  
# solve
x = (d + s) // 2
y = (s - d) // 2
  
print(int(x))
print(int(y))

Unlike #1 and #2, this method DOES exploit all of the provided information, so is a more specialized and fragile approach. I imagine that this approach could be modified to handle exactly z (1, 3, etc.) repeating elements, but it would quickly become very complicated to handle an arbitrary number of repeating elements (some sort of dynamic application of the binomial theorem?). Methods #1 and #2 can do this already without any code changes.

This is by far the most complex approach -- especially if your Algebra is as rusty as mine.

This approach performs at O(n) for cpu/time (similar to method #2), as it traverses the list once to calculate the sum and product, but requires somewhat less than O(n) state to store the product and sum. There is a relatively large cost to calculating the product that does add up, though. For example, calculating the product of a 100,000 item list in Python took about 6 minutes on my laptop (versus about 6 seconds for method #2).

This approach is vulnerable to overflow in the sum and product calculations. Interestingly, overflow is not an issue for the integer arithmetic in pure Python 3 (ints have arbitrary precision; NumPy is another story), but it is a real issue in many other languages and environments.
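
To get a feel for why, here is a quick back-of-the-envelope estimate of how large the product gets for a 100,000 item list (using the log-gamma function rather than actually computing the factorial):

import math

n = 100_000
# estimated decimal digits in n!, which is roughly the size of the list product
digits = math.floor(math.lgamma(n + 1) / math.log(10)) + 1
print(digits)    # roughly 456,000+ digits -- far beyond any fixed-width integer type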

The Oracle database NUMBER datatype can hold about 124 digits (enough for only about an 83 item list), but has only 38 digits of precision (so is really only enough for about a 43 item list), which is barely sufficient for trivial PL/SQL textbook examples.


Musings on Other Methods


There are SQL analogues to methods #1 and #2: the same declarative SQL would result in either a nested-loop (method #1) or hash-join (method #2) self-join plan, depending on what the optimizer knew about the data. Nuff said about that though, as it is a whole topic unto itself.

Method #3 is effectively a naive hash that has no collisions but suffers from arithmetic overflow. If I spent time on a more thoughtful hash-based approach, it may be possible to eliminate the overflow concern and I suppose the hash collision concern as well. Using a single hash (not a hash table) to detect duplicates for an arbitrary list is maybe not possible, but at the very least is a Research Problem. This also deserves being treated as a separate discussion topic, so let's ignore that here as well.

There is also a class of solutions that rely on first sorting the list, which then simplifies scanning through the list for what are now adjacent duplicates. To me, this could be argued as another variation of method #2 (with extra work up front) where you modify the list to mark duplicates.
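
A quick sketch of that sort-first variation, using the same example list (O(n log n) time for the sort, then one pass):

mylist = [1,2,3,3,4,5,6,6,7]
s = sorted(mylist)              # or mylist.sort() to stay in place
for i in range(1, len(s)):
    if s[i] == s[i - 1]:        # duplicates are now adjacent
        print(s[i])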

I think there are probably other methods that are not just simple variations on the above themes, but I can't think of them, even after scratching my head on and off for a week.

Nope, I did not do this to practice for interviews, but simply for good, clean fun that segued from me playing around with Python, refreshing my Algebra and struggling through Donald Knuth (yes I am that twisted).

Friday, August 24, 2018

Grokking the Data Model

© DC Comics
In a previous blog entry, I alluded to tools and techniques for quickly getting into the sweet spot for solving data layer problems.

One Really Useful Skill is being able to quickly grok a data model. When in a rush (which is more often than not :), this does not constitute a formal logical data modeling exercise, but simply querying the data dictionary, data profiling, and intuiting the signal from the noise to build a mental map of the relevant data structures.

How to quickly grok a data model? Spend 10,000 hours of on-the-job practice and I suspect you'll be able to do this quite rapidly. Not really a shortcut, eh? Fair enough :). Let's dig into this a little deeper then and see if An Easier Way can be found.



What is it good for?

https://mashe.hawksey.info/2011/10/live-twitter-data-from-fote-fote11/
typical simple data model
(from MASHe The musing of Martin Hawksey, #Fote11 community)
Why care? Why go to the trouble to learn how to Grok a Data Model?

I'm a science fiction bookworm, so cannot resist the Heinlein reference here:

Grok /ˈɡrɒk/ is a word coined by American writer Robert A. Heinlein for his 1961 science fiction novel Stranger in a Strange Land. [...] The Oxford English Dictionary summarizes the meaning of grok as "to understand intuitively or by empathy, to establish rapport with" and "to empathize or communicate sympathetically (with); also, to experience enjoyment".

Don't you want to establish an empathic, enjoyable rapport with your data structures? If not, maybe you should stop reading now :).

The structure of data (i.e, the data model) is a fundamental complexity at the core of the development, maintenance and understanding of database systems. As modern database systems are becoming larger and more complex, it is increasingly difficult to conceptualize a view of the significant structures within a database. This lack of understanding increases system development and maintenance cost, increases the complexity of systems interfaces, and decreases the efficiency of development teams. In general, a comprehensive understanding of the data structures of large, complex database systems is unattainable to all but a few experts that have developed specialized knowledge in this area.

another simple data model
(from https://github.com/CMPUT301W15T09/Team9Project/wiki/UML)


Algorithms and methods that support programmatic analysis of database structures, identification of significant database structures, and organization of complex database systems into more comprehensible structural groups are key enablers in reducing complexity and improving manageability of many tasks in the lifecycle of database systems.


Where to start? (Answer: Functional Dependencies/Third Normal Form)


Instead of trying to boil the ocean, let's apply some necessary discipline and focus on a specific area.

Abstracting a “conceptual skeleton” out of low-level structural details provides a framework for comprehension of complex database systems. Apt analogies for doing without this are trying to grasp what is going on in a program when you only have access to its machine-level description, or trying to read a book one letter at a time.

Data structures in the guts of real-world systems I typically encounter (for good and bad reasons) are littered with redundancies that get in the way of understanding what they really are.

A proposed specific goal is to experiment with programmatically identifying single column functional dependencies in a database table. 

For you database-heads out there, I am really just talking about Third Normal Form; eliminating non-key columns that are dependent on other non-key columns within the same table (if you are not familiar with Third Normal Form, see the wikipedia entry).

So, let's see if we can codify domain knowledge and human expertise into methods that can be programmatically applied and leveraged.

I'm not sure yet if this is simple, hard or even possible. Let's find out together. This is a journey with an unknown destination. I may get lost, never arrive or change destinations. The old adage: "What is important is the journey, not the destination" might apply here. We will inevitably learn something useful along the way.

Ok, let's get started...


Create a simple table and fill with sample data

create table lab_test_rslt (
   seq_no number,
   subj_no number,
   site_no number,
   site_name varchar2(35),
   test_code varchar2(35),
   test_val varchar2(35),
   test_rslt varchar2(35),
   primary key (seq_no)
);
 
insert into lab_test_rslt values (1,1,10,'Happy Clinic','HR',60,'OK');
insert into lab_test_rslt values (2,2,10,'Happy Clinic','HR',70,'OK');
insert into lab_test_rslt values (3, 3,10,'Happy Clinic','HR',80,'OK');
insert into lab_test_rslt values (4, 4,10,'Happy Clinic','HR',55,'OK');
insert into lab_test_rslt values (5, 1,20,'Regional Med Ctr','HR',0,'DEATH');
insert into lab_test_rslt values (6,2,20,'Regional Med Ctr','HR',170,'HIGH');
insert into lab_test_rslt values (7,1,30,'Central Hospital','HR',65,'OK');
-- whoops! maybe a typo in site_name!
insert into lab_test_rslt values (8,3,30,'Centrel Hospital','HR',20,'LOW');
commit;
select * from lab_test_rslt order by 1;
 
SEQ_NO   SUBJ_NO   SITE_NO SITE_NAME          TEST_CODE   TEST_VAL   TEST_RSLT
    1         1        10 Happy Clinic       HR          60         OK
    2         2        10 Happy Clinic       HR          70         OK
    3         3        10 Happy Clinic       HR          80         OK
    4         4        10 Happy Clinic       HR          55         OK
    5         1        20 Regional Med Ctr   HR          0          DEATH
    6         2        20 Regional Med Ctr   HR          170        HIGH
    7         1        30 Central Hospital   HR          65         OK
    8         3        30 Centrel Hospital   HR          20         LOW
 
 8 rows selected. 

It is easy for a mere human to eyeball this example and quickly observe that site_name looks like it is dependent on site_no. That is a common pattern and will stand out to most practitioners. Let's see if we can write some code that will observe the same.


First attempt at a Functional Dependency algorithm; prototype in SQL

Pseudocode:

For all site_nos in the lab_test_rslt table:
 if the count of distinct site_names for every site_no is 1, then
  site_name is functionally dependent on site_no
 if the count of distinct site_names for any site_no is >1, then
  site_name is NOT functionally dependent on site_no

Ok, this sounds like a job for SQL "GROUP BY":

select
  site_no,
  count(distinct site_name)
from lab_test_rslt
group by site_no
order by 1;

  SITE_NO   COUNT(DISTINCTSITE_NAME)
       10                          1
       20                          1
       30                          2


So, is site_name functionally dependent on site_no?  In this example, site_no 30 has two different site_names, so you cannot derive the site_name from just knowing the site_no = 30.  So the answer is NO: site_name is not functionally dependent on site_no.

But, if there was no typo in "Centrel Hospital" there WOULD be a FD. Hmmmm... site_name is "usually" FD on site_no.

Can we improve upon this? Perhaps assign a score instead of just Yes/No?

-- what ratio of unique site_no values have a single site_name (i.e., look FD)?
select
    sum(case when count(distinct site_name) = 1 then 1 else 0 end)||' / ('||
    sum(case when count(distinct site_name) = 1 then 1 else 0 end)||' + '||
    sum(case when count(distinct site_name) > 1 then 1 else 0 end)||')'
    "fd_cnt / (fd_cnt + no_fd_cnt)",
    round(sum(case when count(distinct site_name) = 1 then 1 else 0 end) /
    (
        sum(case when count(distinct site_name) = 1 then 1 else 0 end) +
        sum(case when count(distinct site_name) > 1 then 1 else 0 end) 
    ), 2) fd_score
from lab_test_rslt
group by site_no;

fd_cnt / (fd_cnt + no_fd_cnt)     FD_SCORE
2 / (2 + 1)                           0.67


67% of site_no values determine a single site_name, which is a little more useful, but it does not really reflect that only a single row breaks the FD.


Can we further improve by weighting the score by the number of rows? YES

-- what fraction of rows have a site_name that is FD on site_no?
select
    sum(case when count(distinct site_name) = 1 then count(site_no) else 0 end)||' / ('||
    sum(case when count(distinct site_name) = 1 then count(site_no) else 0 end)||' + '||
    sum(case when count(distinct site_name) > 1 then count(site_no) else 0 end)||')'
    "fd_cnt / (fd_cnt + no_fd_cnt)",
    sum(case when count(distinct site_name) = 1 then count(site_no) else 0 end) /
    (
        sum(case when count(distinct site_name) = 1 then count(site_no) else 0 end) +
        sum(case when count(distinct site_name) > 1 then count(site_no) else 0 end) 
    ) fd_score
from lab_test_rslt
group by site_no;

fd_cnt / (fd_cnt + no_fd_cnt)     FD_SCORE
6 / (6 + 2)                           0.75

FD_SCORE represents the ratio of rows where C1->C2.

In plain English:
  • C2 is functionally dependent on C1
  • the value of column C2 depends on value of column C1
  • C2 can be derived if you know the value of C1

75% of the rows have a site_name that is FD on site_no. 75% is getting fairly close to 100%, which suggests an "almost" FD. This is getting a little more useful.

Parking lot: an interesting possible use case for an FD analyzer: finding "almost" patterns that suggest data errors. (!!?!)
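A quick sketch of that idea: list the rows in any group that breaks the near-FD. For site_no -> site_name this surfaces the two site 30 rows, one of which holds the suspected typo:

-- list the rows in groups that violate the near-FD site_no -> site_name,
-- i.e. the likely data errors
select *
from lab_test_rslt
where site_no in (
    select site_no
    from lab_test_rslt
    group by site_no
    having count(distinct site_name) > 1
)
order by site_no, seq_no;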



Is it possible to simplify this SQL? YES

-- break out the intermediate results so I can understand what is going on
select
    -- the number of rows where there is a unique site_name for this site_no (a FD!)
    case when count(distinct site_name) = 1 then count(site_no) else 0 end n1,
    -- the total number of rows for this site_no
    count(*) n2,
    -- the ratio of n1:n2
    case when count(distinct site_name) = 1 then count(site_no) else 0 end / 
        count(*) 
    fd_score
from lab_test_rslt
group by site_no;

  N1   N2   FD_SCORE
   0    2          0
   2    2          1
   4    4          1

-- now I can find the ratio of the sum of the numbers, eg:
--    sum(N1) / sum(N2)
--    0+2+4 / 2+2+4
--    6 / 8
--    0.75

-- here is the final SQL for that:
select
    sum(case when count(distinct site_name) = 1 then count(site_no) else 0 end)
        / sum(count(*)) fd_score
from lab_test_rslt
group by site_no;

  FD_SCORE
     .75

Ok, so that does not seem too hard - it is almost a one-liner now!
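As a quick sanity check, correcting the suspected typo should push the score to exactly 1:

-- fix the suspected typo, re-check the score, then put the data back
update lab_test_rslt set site_name = 'Central Hospital' where seq_no = 8;

select
    sum(case when count(distinct site_name) = 1 then count(site_no) else 0 end)
        / sum(count(*)) fd_score
from lab_test_rslt
group by site_no;

-- FD_SCORE comes back as 1: a full functional dependency

rollback;  -- restore the typo so the rest of the examples still apply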



Is this approach sufficient for non-trivial cases?

The most common scenario for an "FD analyzer" would be to point it at a new mystery table and get it to report out all functional dependencies. The FD analyzer would need to analyze all possible pairs of columns to look for functional dependencies. Analyzing each pair of columns would require scanning the entire set of values for each of the pair of columns. This would typically be a full tablescan in a traditional relational database for EACH PAIR OF COLUMNS, a full filescan for files, or two full column scans in a columnar database. Uggghh.

How many two-column (order-specific) permutations exist for a table with n columns? Here is the formula and some worked examples:

P(n,k) = n! / (n-k)!

n = number of columns in the table
k = number of columns to analyze at a time (two)

Ex: a table with 4 columns has 12 ordered pairs (permutations):

1 2
1 3
1 4
2 3
2 4
3 4
4 3
4 2
3 2
4 1
3 1
2 1

P(4,2) = 4! / (4-2)! = 24 / 2 = 12

Ex: 30 columns have 870 permutations

P(30,2) = 30! / (30-2)! = 30 * 29 = 870

Ex: 100 columns have 9,900 permutations

P(100,2) = 100! / (100-2)! = 100 * 99 = 9,900
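These counts do not have to be worked out by hand; n is sitting in the data dictionary, so the number of ordered column pairs, n * (n-1), can be computed directly. A quick sketch against user_tab_columns:

-- count the ordered column pairs, n * (n-1), for a given table
select count(*) * (count(*) - 1) column_pair_permutations
from user_tab_columns
where table_name = 'LAB_TEST_RSLT';

-- lab_test_rslt has 7 columns, so this returns 42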

The current SQL approach will require scanning the data once for each candidate column pair, requiring n!/(n-2)! table scans. Therefore a table with 100 columns would require 9,900 scans. Seems a bit inefficient (understatement). There must be a better way.



Is it possible to calculate multiple FDs in a single pass? (within the same grouping) YES

select 'SITE_NO' c1, name c2, FD_SCORE from (
    select
        sum(case when count(distinct site_name) = 1 then count(site_no) else 0 end)
            / sum(count(*)) site_name,
        sum(case when count(distinct test_rslt) = 1 then count(site_no) else 0 end)
            / sum(count(*)) test_rslt
    from lab_test_rslt
    group by site_no
)
unpivot include nulls (
    (fd_score)
    for name in (
        site_name,
        test_rslt
    )
);

C1        C2              FD_SCORE
SITE_NO   SITE_NAME           0.75
SITE_NO   test_rslt            0.5

I introduced UNPIVOT (an Oracle-ism) to conveniently report the analysis of each column pair on a separate row.

This SQL approach will require one pass over the data for each column in the table (ex: the lab_test_rslt table has 7 columns, so this would be 7 table scans). Much better. Can we do more?



Is it possible to calculate multiple FDs in a single pass? (even in different groupings)? YES

with x as (
    select 
        case when count(distinct site_name) over (partition by site_no) = 1 
            then 1 else 0 
        end c1_cnt,
        case when count(distinct test_rslt) over (partition by site_no) = 1 
            then 1 else 0 
        end c2_cnt,
        case when count(distinct test_rslt) over (partition by site_name) = 1 
            then 1 else 0
        end c3_cnt
    from lab_test_rslt
), x2 as (
    select 
        'SITE_NO' c1_name_1,
        'SITE_NAME' c1_name_2,
        sum(c1_cnt) / count(*) c1_fd_score,
        'SITE_NO' c2_name_1,
        'test_rslt' c2_name_2,
        sum(c2_cnt) / count(*) c2_fd_score,
        'SITE_NAME' c3_name_1,
        'test_rslt' c3_name_2,
        sum(c3_cnt) / count(*) c3_fd_score
    from x
)
select c1, c2, fd_score
from x2
unpivot include nulls (
    (c1, c2, fd_score)
    for name in (
        (c1_name_1, c1_name_2, c1_fd_score),
        (c2_name_1, c2_name_2, c2_fd_score),
        (c3_name_1, c3_name_2, c3_fd_score)
    )
);

C1          C2              FD_SCORE
SITE_NO     SITE_NAME           0.75
SITE_NO     test_rslt            0.5
SITE_NAME   test_rslt           0.75

I replaced the GROUP BY with a series of COUNT(DISTINCT) window functions in the first WITH clause. This allows the equivalent of multiple GROUP BYs within a single SQL statement. Then I refactored the math into a second WITH clause, which was required anyway to avoid:
ORA-30483: window functions are not allowed here. Window functions are allowed only in the SELECT list of a query. And, window function cannot be an argument to another window or group function.
This SQL approach will require scanning the lab_test_rslt table ONCE for all FD pairs in the table. Therefore a table with 100 columns will require A SINGLE PASS. 9,900 vs. 1  = MUCH BETTER!
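For the curious, here is a minimal example of the disallowed form that forces the refactor: wrapping the window function directly inside an aggregate in the same query block is exactly what triggers the error quoted above.

-- NOT allowed: an aggregate over a window function in the same query block
select
    sum(case when count(distinct site_name) over (partition by site_no) = 1
             then 1 else 0 end) / count(*) fd_score
from lab_test_rslt;
-- ORA-30483: window functions are not allowed here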





Next steps (to be addressed in a future post)


We have now prototyped what appears to be a reasonably efficient approach to programmatically finding single-column functional dependencies within a database table. We have developed an Oracle SQL implementation and run it through some basic tests; it appears to work and could be the kernel of something useful.

The next step is to implement FD Analyzer (pretty uncreative name - I need to come up with something better!): a utility that will accept as input a table name and return a list of one or more single-column functional dependencies.
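As a rough sketch of the shape such a utility might take (nothing final; for simplicity this generates the per-pair GROUP BY form rather than the single-pass version), the pair queries can be generated straight from the data dictionary:

-- sketch: generate an FD-score query for every ordered column pair of a table;
-- prints the SQL rather than executing it
declare
    l_table varchar2(128) := 'LAB_TEST_RSLT';   -- the input table name
begin
    for p in (
        select c1.column_name det, c2.column_name dep
        from user_tab_columns c1
        join user_tab_columns c2
          on c2.table_name = c1.table_name
         and c2.column_name <> c1.column_name
        where c1.table_name = l_table
        order by c1.column_id, c2.column_id
    ) loop
        dbms_output.put_line(
            'select '''||p.det||''' c1, '''||p.dep||''' c2, '||
            'sum(case when count(distinct '||p.dep||') = 1 '||
            'then count(*) else 0 end) / sum(count(*)) fd_score '||
            'from '||l_table||' group by '||p.det||';');
    end loop;
end;
/

Each generated statement is still one scan per pair; folding them into the single-pass window-function form is the interesting part to work out.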


Using the FD Analyzer utility, mere humans will be spared part of the manual effort and complexity of querying the data dictionary, profiling data, and intuiting the signal from the noise to build a mental map of the relevant data structures. With this utility, one should be able to do many wondrous things:
  • Programmatically identify Third Normal Form violations in a physical data model to aid in understanding/producing a logical model.
  • For quick data model grokking, functional dependents, once identified, can be ignored as derivative and not conceptually significant.
  • As this automates what was previously a manual, human task, it (along with other similar approaches) could be scaled up to analyze entire databases and tied into automated data curation and data dictionary facilities.
But... this is just a lot of hot air until implemented, right? Stay tuned.

Wednesday, August 15, 2018

The Sweet Spot for Solving Data Layer Problems

Photo credit: Rakieta tenisowa - @kisspng
There is a sweet spot for solving data layer problems. Some days it feels like that is the only thing that has kept me gainfully employed for decades. It lies somewhere in between (and across) the DBA, Business Analyst and Application Developer roles.

(The following story is fictional and should not be construed as depicting actual persons or events, even though it does.)
Harried Project Manager pings me: Thingamajab X runs waaay too long and breaks half the time. I know you have never heard of this system before, but can you take a quick look?  Formal testing starts in 23 hours and we're getting nowhere fast.
DBA: adding an index on lbrslt_code will provide a 12% reduction in I/O compared to the current concatenated index range scan.
Application Developer: I reviewed the hibernate config and can change the JDBC fetch size from 10 to 100 and expect average elapsed time for this query would decrease by 20%.
Project Analyst: I noticed that the result should represent the most recent 24 hours of activity, but it looks like we are doing a little extra work and also including all the way back to midnight of the previous full day. We should be able to reduce the size of the data on average by 33% and make everything that much smaller and faster by fixing the time filter. This process just calculates the average execution time of Jabberdejibbet X by study by day, but it takes over an hour for the query to run. I don't understand why it runs so long.

Looking across, one can connect the dots:

Taking a look at the SQL plan and table structures: yes, the query could use a more efficient index, but most of the db wait is on a range scan inside a nested loop that scans half a billion rows just to throw most of them away and return a few rows to the client. The final result set returned to the Web App is less than 2k rows, so most of the work (and wait) is in the data layer. Any middle-tier tweaks will be "shaking the short end of the stick".

Knowing that we are concerned with only the last 24 hours of data and looking at the structure and content of these tables, I can convert the outer join to an inner join, rewrite the SQL to completely exclude the "Spamalot" table and filter much earlier in the execution path and on a completely different column.

Thinking through what the SQL specifies and knowing what is in these tables, I can see that it includes failed executions in the average execution time, which skews the result pretty badly. Knowing what the customer is trying to do, that is probably unintentional. I chat with the analyst, then the customer, and confirm that this is indeed the case. I add another predicate to the query to filter out failed executions.
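Purely as an illustration (the table and column names below are as fictional as the story), the shape of the fix is a tighter time filter plus one extra predicate:

-- fictional example: trailing 24 hours only, and failed executions excluded
select study_id, trunc(exec_start) exec_day, avg(exec_secs) avg_exec_secs
from jabberdejibbet_exec_log
where exec_start >= sysdate - 1      -- the most recent 24 hours, not "since midnight of the previous day"
  and status <> 'FAILED'             -- keep failures from skewing the average
group by study_id, trunc(exec_start);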

The new SQL does 20 thousand logical reads instead of 589 million, is 129 times faster, returns in 1.2 seconds, and produces a measure that is no longer skewed by failed executions.

A little voice in the back of my head (the repressed architect) pipes up: what if these tables were restructured to completely avoid this ever happening? And... it would be so much simpler to create a really clean instrumentation layer with an API all Web apps on our platform could leverage to address a whole class of queries instead of this single case. It would establish a pattern that would completely avoid all this unnecessary pain.

Well? Why not? A little voice in the back of the architect's head pipes up "Grant me the serenity to accept the things I cannot change, the courage to change the things I can, and the wisdom to know the difference."



    Well, back to the topic at hand: how does one get into the data layer sweet spot?


    Photo credit: Justbatreviews.com
    • understand the current code
    • understand what the technology is capable of
    • understand the physical data model
    • understand the logical data model
    • understand the business need

    I often get called in to troubleshoot performance or functionality snafus with unfamiliar systems, so I have accumulated a hodgepodge of tools and techniques for quickly getting into the sweet spot. There is no "easy button", but there are some shortcuts. Maybe those would be good topics for future posts.