Monday, August 10, 2020

Producing Inconsistent Read Phantoms for Illustration Purposes Only, then Considering the Real-world Implications


The default transaction isolation level in Oracle (and Snowflake, PostgreSQL, and many other databases) is read committed, which provides statement-level read consistency only, not transaction-level read consistency.

Two cases in point:
(1) A view referencing a function containing a second SQL statement is not read-consistent by default.
(2) A PL/SQL process with two cursors is not read-consistent across the cursors by default.

Let's try to fashion the simplest possible example to demonstrate. This example is for Oracle, but should be adaptable to other databases without too much fuss.

Setup

-- cleanup prior
drop table person_sal;
drop table person;
drop function get_salary_amt;
drop view person_view;

-- create
create table person(
    person_no number,
    person_name varchar2(10),
    job_title varchar2(10),
    primary key(person_no)
);
create table person_sal(
    person_no number,
    salary_amt number,
    primary key (person_no),
    foreign key(person_no) references person(person_no)
);
create function get_salary_amt(n in number) return number is
    x number;
begin
    select salary_amt into x from person_sal where person_no = n;
    return x;
end;
/
create or replace view person_view as (
    select
        person_no,
        person_name,
        job_title,
        get_salary_amt(person_no) salary_amt
    from person
);

Test #1

A view referencing a function containing a second SQL statement is not read-consistent by default.

-- t0 (initialize test data)
delete from person_sal;
delete from person;
insert into person(person_no, person_name, job_title)
    values (101, 'Sue','Trainee');
insert into person_sal(person_no, salary_amt)
    values (101, 17000);
commit;
select * from person_view;

-- session 1 (long-running process with read consistency problems)
set serveroutput on size 999999;
declare
    cursor cur1 is select * from person_view;
    r cur1%rowtype;
begin
    open cur1;
    dbms_lock.sleep(10);
    fetch cur1 into r;
    dbms_output.put_line(
        r.person_no||' '||
        r.person_name||' '||
        r.job_title||' '||
        r.salary_amt);
    close cur1;
end;
/

-- session 2 t2 (2nd process that changes data while 1st process is running)
-- Execute this in a 2nd sqlplus session during the 10 seconds the above
-- process is sleeping.
update person set job_title = 'CEO';
update person_sal set salary_amt = 250000;
commit;

-- tx (final state of data after both processes have ended)
select * from person_view;

 

Test #1 Sample Output

-- t0 (Sue is a trainee with a low salary)
 PERSON_NO PERSON_NAM JOB_TITLE  SALARY_AMT
---------- ---------- ---------- ----------
       101 Sue        Trainee         17000

-- session 1 t1
--  Process 1 starts executing...
--  nothing to show yet

-- session 2 t2
--  Sue gets promoted to CEO with a much higher salary.
--  This happens while the session 1 long-running process is fetching
--  Sue's details.
1 row updated.
1 row updated.
Commit complete.

-- session 1 t3
--  Process 1 finishes and outputs the line below. Note that the output is a
--  mix of two different points in time: the job title is not yet updated, but
--  the salary is. This is commonly called a "phantom" (an inconsistent read).
--  The only two valid consistent states for Sue's details are:
--      101 Sue Trainee 17000    (before the promotion)
--      101 Sue CEO     250000   (after the promotion)
--  The output below is neither. BAD!
101 Sue Trainee 250000


-- Why does this break? The view references a function containing a second SQL
-- statement. That second SQL statement is not read-consistent with the first
-- (the view's SQL).

-- final state of the data, as seen from either session:
 PERSON_NO PERSON_NAM JOB_TITLE  SALARY_AMT
---------- ---------- ---------- ----------
       101 Sue        CEO            250000


Test #1: How to correct?

-- You use the SET TRANSACTION statement to begin a read-only or read-write
-- transaction, establish an isolation level, or assign your current transaction
-- to a specified rollback segment. Read-only transactions are useful for running
-- multiple queries against one or more tables while other users update the same
-- tables.

-- During a read-only transaction, all queries refer to the same snapshot of the
-- database, providing a multi-table, multi-query, read-consistent view. Other
-- users can continue to query or update data as usual.

--  corrected with explicit transaction control
set serveroutput on size 999999;
declare
    cursor cur1 is select * from person_view;
    r cur1%rowtype;
begin
    set transaction read only name 'fetch people';
    open cur1;
    dbms_lock.sleep(10);
    fetch cur1 into r;
    dbms_output.put_line(
        r.person_no||' '||
        r.person_name||' '||
        r.job_title||' '||
        r.salary_amt);
    close cur1;
    commit;
end;
/

-- Another way is to rewrite the view to join the two tables in pure SQL. This
-- eliminates the function and its embedded second SQL statement.

-- corrected with rewritten view
create or replace view person_view as (
    select
        p.person_no,
        p.person_name,
        p.job_title,
        ps.salary_amt
    from person p
        inner join person_sal ps
            on p.person_no = ps.person_no
);

-- There are other ways to correct as well. This is not an exhaustive treatment.
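-- For example (a minimal sketch, not run as part of the tests above): a
-- serializable transaction gives the whole transaction a single snapshot, so
-- the SQL inside the function sees the same point in time as the view's SQL.

-- corrected with a serializable transaction (sketch)
set serveroutput on size 999999;
declare
    cursor cur1 is select * from person_view;
    r cur1%rowtype;
begin
    -- must be the first statement of the transaction
    set transaction isolation level serializable;
    open cur1;
    dbms_lock.sleep(10);
    fetch cur1 into r;
    dbms_output.put_line(
        r.person_no||' '||
        r.person_name||' '||
        r.job_title||' '||
        r.salary_amt);
    close cur1;
    commit;
end;
/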

Test #2

A PL/SQL process with two cursors is not read-consistent across the cursors by default.

-- the rest is the same as above (including the BAD output)

-- session 1 (long-running process with read consistency problems)
set serveroutput on size 999999;
declare
    cursor cur1 is select * from person;
    r1 cur1%rowtype;
    cursor cur2 is select * from person_sal;
    r2 cur2%rowtype;
begin
    open cur1;
    fetch cur1 into r1;
    dbms_lock.sleep(10);
    open cur2;
    fetch cur2 into r2;
    dbms_output.put_line(
        r1.person_no||' '||
        r1.person_name||' '||
        r1.job_title||' '||
        r2.salary_amt);
    close cur2;
    close cur1;
end;
/

I will leave the example output as an exercise for the reader; a sketch of one possible correction follows.
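-- One possible correction (a minimal sketch, mirroring the read-only
-- transaction used to fix Test #1): start a read-only transaction before
-- opening either cursor, so both cursors share the same snapshot.
set serveroutput on size 999999;
declare
    cursor cur1 is select * from person;
    r1 cur1%rowtype;
    cursor cur2 is select * from person_sal;
    r2 cur2%rowtype;
begin
    -- must be the first statement of the transaction
    set transaction read only name 'fetch people';
    open cur1;
    fetch cur1 into r1;
    dbms_lock.sleep(10);
    open cur2;
    fetch cur2 into r2;
    dbms_output.put_line(
        r1.person_no||' '||
        r1.person_name||' '||
        r1.job_title||' '||
        r2.salary_amt);
    close cur2;
    close cur1;
    commit;
end;
/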

NOW CONSIDER THIS


What if the "problem" process ran 3 hours instead of a few seconds?  What if there were not just two, but 18 statements without read consistency in the long-running process?  What if the long-running process with read-consistency issues was processing not just one, but 65 thousand  records?  What if there were not one, but over 20 of these long-running processes?  What if you knew that at least 2 outbound interfaces were doing this and you suspect others are the same?  what if you spot-checked a few of many views, found several that do this, expect most or all do?

I have found this problem in sooooooo many systems. Sometimes, it feels like it can be found in every system if I'm willing to look carefully enough.

Do you feel safe?



Thursday, July 30, 2020

Incremental System Integration Approaches


This is a republish of an article I wrote twenty years ago, before the Agile Manifesto. Re-reading it now in the post-"Lean, Agile, DevOps, Continuous Integration, Continuous Delivery" world, it still makes sense, just without all the latest buzzwords😆.


For the teams that are effectively doing continuous integration today, phased integration may be a quaint reminder of the ways things used to be, but you still have the same incremental integration strategy decisions to make; you just have much better tooling for taking most of the drudgery out of the integration work. For the rest of the software development community who still struggles with quarterly releases, big-bang integration and Fragile methodologies, considering the disadvantages of phased integration is still quite relevant.




How you approach System Integration affects the order in which components are designed, coded, and tested. It also affects the ease with which you can test and debug them. Incremental integration comes in several varieties, and, unless the project is trivial, any one of them is better than phased integration. A good incremental approach assumes that parts of the course are uncharted and that flexibility is important, and it plans to respond to unexpected opportunities.


In engineering fields other than software, the importance of proper integration is well known. The Pacific Northwest saw a dramatic illustration of the hazards of poor integration when the football stadium at the University of Washington collapsed partway through construction. The structure wasn’t strong enough to support itself as it was being built. It doesn’t matter that it would have been strong enough by the time it was done; it needed to be strong enough at each step. If you integrate software in the wrong order, it’s hard to code, hard to test, and hard to debug. If none of it will work until all of it works, it can seem as though it will never be finished. It too can collapse under its own weight during construction – even though the finished product should work.

Because it’s done after project members have finished unit testing and in conjunction with system testing, integration is sometimes thought of as a testing activity. It’s complex enough, however, that it should be viewed as an independent activity. Here are some of the benefits you can expect from careful integration:

  • Easier defect diagnosis
  • Fewer defects
  • Less scaffolding
  • Shorter time to first working product
  • Shorter overall development schedules
  • Better customer relations
  • Improved morale
  • Improved chance of project completion
  • More reliable schedule estimates
  • More accurate status reporting
  • Improved code quality
  • Less documentation


Phased Integration

Until a few years ago, phased integration was the norm. It follows these well-defined steps:

1. Design, code, buy or reuse, test and debug each component. This step is called “unit development”.
2. Combine the components into one whopping-big system. This is called “system integration.”
3. Test and debug the whole system. This is called “system dis-integration.”

One problem with phased integration is that when the components in a system are put together for the first time, new problems inevitably surface and the causes of the problems could be anywhere. Since you have a large number of components that have never worked together before, the culprit might be poorly tested components, an error in the interface between two components, or an error caused by the interaction between two components. All components are suspect.

The uncertainty about the location of any specific problems is compounded by the fact that all the problems suddenly present themselves at once. This forces you to deal not only with the problems caused by interactions between components but with problems that are hard to diagnose because the problems themselves interact. For this reason, another name for phased integration is “big bang integration.”

Phased integration can’t begin until late in the project, after all the components have been unit-tested. When the components are finally combined and errors surface by the score, the project team immediately goes into panicky debugging mode rather than methodical error detection and correction.


Incremental Integration

In incremental integration, you write and test a program in small pieces and then combine the pieces one at a time. In this one-at-a-time approach to integration, you follow these steps:

1. Develop a small, functional part of the system. It can be the smallest functional part, the hardest part, or a key part. Thoroughly test and debug it. It will serve as a skeleton on which to hang the remaining parts of the system.
2. Design, code, test and debug a component.
3. Integrate the new component with the skeleton. Test and debug the combination of skeleton and new component. Make sure the combination works before you add any new components. If work remains to be done, repeat the process starting at step 2.



Benefits of Incremental Integration


Errors are easier to locate

When new problems surface during incremental integration, the new component is obviously at fault. Either its interface to the rest of the system contains an error or its integration with a previously integrated component produces an error. Either way, you know exactly where to look. Moreover, simply because you have fewer problems at once, you reduce the risk that multiple problems will interact or that one problem will mask another. The more interface errors you tend to have, the more this benefit of incremental integration will help in your projects. Since many projects spend up to 50 percent of their time debugging, maximizing debugging effectiveness by making errors easy to locate provides benefits in quality and productivity.

The system succeeds early in the project

When code is integrated and running, even if the system isn’t usable, it’s apparent that it soon will be. With incremental integration, the project team can see early results from their work, so their morale is better than when they suspect that their project will never draw its first breath. Management also has a more tangible sense of progress from seeing a system working with 50 percent of its capability than from hearing that coding is “99 percent complete.”

The units of the system are tested more fully

Integration starts early in the project. You integrate each component as it’s developed, rather than waiting for one magnificent binge of integration at the end. Components are unit-tested in both cases, but each component is exercised as part of the overall system more often with incremental integration than it is with phased integration.

Shorter development schedule

If integration is planned carefully, you can design part of the system while another part is being coded. This doesn’t reduce the total number of work hours required to develop the complete design and code, but it allows some work to be done in parallel, an advantage when calendar time is at a premium.


Incremental Integration Strategies

With phased integration, you don’t have to plan the order in which project components are built. All components are integrated at the same time, so you can build them in any order as long as they’re all ready by D-day.

With incremental integration, you have to plan more carefully. Most systems will call for the integration of some components before the integration of others. Planning for integration thus affects planning for construction; the order in which the components are constructed has to support the order in which they will be integrated.

Integration-order strategies come in a variety of shapes and sizes, and none is best in every case. The best integration approach varies from project to project and the best solution is always one that you create to meet the specific demands of a specific project. Knowing the points on the methodological continuum will give you insight into the possible solutions.

Top-Down Integration

In top-down integration, the component at the top of the hierarchy is written and integrated first. Stubs have to be written to exercise the top component. Then, as components are integrated from top down, stub components are replaced with real ones.

An important aspect of top-down integration is that the interfaces between components must be carefully specified. The most troublesome errors to debug are not the ones that affect single components but those that arise from subtle interactions between components. Careful interface specification can reduce the problem. Interface specification isn’t especially an integration activity, but making sure that the interfaces have been specified well is.

An advantage of top-down integration is that the control logic of the system is tested relatively early. All the components at the top of the hierarchy are exercised a lot, so that big, conceptual design problems are exposed quickly.

Another advantage of top-down integration is that, if you plan it carefully, you can complete a partially working system early in the project. If the user-interface parts are at the top, you can get a basic interface working quickly and flesh out the details later. The morale of both customers and the project team benefits from getting something visible working early.

Top-down incremental integration also allows you to begin coding before the low-level design details are complete. Once the design has been driven down to a fairly-low level of detail in all areas, you can begin implementation and integration of the routines at the higher levels without waiting.

In spite of these advantages, pure top-down integration usually involves disadvantages that are more troublesome than you’ll want to put up with.

Pure top-down integration leaves exercising the tricky low-level interfaces until last. If the low-level interfaces are buggy or pose a performance problem, you’d usually like to get to them long before the end of a project. It’s not unusual for a low-level problem to bubble its way to the top of a system, causing high-level changes and reducing the benefit of earlier integration work. Minimize the bubbling problem through early, careful unit testing and performance analysis of the routines that exercise the low-level interfaces.

Another problem with pure top-down integration is that it takes a lot of stubs to integrate from the top down. Many low-level components have not yet been integrated, which implies that a large number of stubs will be needed during intermediate steps in integration. Stubs are problematic in that, as test code, they are more likely to contain errors than the more carefully designed system code. Errors in the new stubs that support a new component defeat the purpose of incremental integration, which is to restrict the source of new errors to one new component.

Top-down integration is also nearly impossible to implement purely. In top-down integration done by the book, you start at the top level and then integrate all the routines at the second level. When you’ve integrated all the routines from the second level, and not before, you integrate the routines from the third level. The rigidity in pure top-down integration is completely arbitrary. Most people use a hybrid approach such as integrating from the top-down in sections instead.

Finally, you can’t use top-down integration if the collection of components doesn’t have a top.

Even though pure top-down integration isn’t workable, thinking about it will help you decide on a general approach. Some of the benefits and hazards that apply to a pure, top-down approach apply, less obviously, to a looser top-down approach, so keep them in mind.

Bottom-up Integration

In bottom-up integration, you write and integrate components at the bottom of the hierarchy first. You write test drivers to exercise the low-level components initially and add components to the test-driver scaffolding as they’re developed. As you add high-level components, you replace driver components with real ones.

Bottom-up integration provides a limited set of advantages. It restricts the possible source of errors to the single component being integrated, so errors are easy to locate. Integration can start early in the project.

Bottom-up integration also exercises potentially troublesome low-level interfaces early. Since low-level limitations often determine whether you can meet the system’s goals, making sure the low-level interfaces have had a full workout is worth the trouble.

The main problem with bottom-up integration is that it leaves integration of the major, high-level interfaces until last. If the system has conceptual design problems at the higher levels, the implementation activity won’t find them until all the detailed work has been done. If the design must be changed significantly, some of the low-level work might have to be discarded.

Bottom-up integration requires you to complete the design of the whole system before you start integration. If you don’t, assumptions that needn’t have controlled the design might end up deeply embedded in low-level code, giving rise to the awkward situation in which you design the high-level components to work around problems in the low-level ones. Letting low-level details drive the design of the higher-level components contradicts principles of information hiding and structured design. The problems of integrating higher-level components are small compared to the problems you’ll have if you don’t complete the design of high-level components before you begin low-level coding.

As with top-down integration, pure bottom-up integration is rare, and you can use a hybrid approach instead, integrating in sections.

Sandwich Integration

The problems with pure bottom-up and pure top-down integration have led some software integrators to recommend a sandwich approach. You first integrate the control components at the top of the hierarchy. Then you integrate the low-level components and widely used utility components at the bottom. These high-level and low-level components are the bread of the sandwich. You leave the middle-level components until later. These make up the meat, cheese and tomatoes of the sandwich.

This approach avoids the rigidity of pure bottom-up or top-down integration. It integrates the often-troublesome components first and has the potential to minimize the amount of scaffolding you’ll need. It’s a realistic, practical approach.

Risk-Oriented Integration

Risk-oriented integration is also called “hard part first” integration. It’s like sandwich integration in that it seeks to avoid the problems inherent in pure top-down or pure bottom-up integration. Coincidentally, it also tends to integrate the components at the top and bottom first, saving the middle-level components for last. The motivation, however, is different.

In risk-oriented integration, you identify the level of risk associated with each component. You decide which will be the most challenging parts to implement, and you implement them first. Experience indicates that top-level interfaces are risky, so they are often at the top of the risk list. Low-level interfaces are also risky, so they’re also at the top of the risk list. In addition, you might know of components in the middle that will be challenging. Perhaps a component implements a poorly understood algorithm or has ambitious performance goals. Such components can also be identified as high risks and integrated relatively early.

The remainder of the code, the easy stuff, can wait until later. Some of it will probably turn out to be harder than you thought, but that’s unavoidable.

Feature-Oriented Integration

A final kind of incremental integration is integrating one feature at a time. The term “feature” refers to an identifiable function of the system you are integrating.

Many features must be implemented in multiple components, which puts a twist on the one-component-at-a-time aspect of incremental integration. When the feature to be integrated is bigger than a single component, the “increment” in incremental integration is bigger than a single component. This reduces the benefit of incrementalism a little in that it reduces your certainty about the source of new errors, but if you have thoroughly tested the new components that implement the new feature before you integrate them, that’s only a small disadvantage. You can use the incremental integration strategies recursively by integrating small pieces to form features and then incrementally integrating features to form a system.

You’ll usually want to start with a skeleton you’ve chosen for its ability to support other features. You can hang the rest of the features on the feature you integrate first. Components are added in “feature-trees,” hierarchical collections of routines that make up a feature. Integration is easier if a feature is relatively independent, perhaps calling the same low-level component as the routines for other features, but having no calls to middle-level code in common with other features.

Feature-oriented integration offers three main advantages. First, it eliminates scaffolding for virtually everything except low-level library components. The skeleton might need a little scaffolding, or some parts of the skeleton might not be operational until particular features have been added. When each feature has been hung on the structure, however, no additional scaffolding is needed. Since each feature is self-contained, each feature contains all the support code it needs.

The second main advantage is that each newly integrated feature brings out an incremental addition in functionality. This provides evidence that the project is moving steadily forward.

A third advantage is that feature-oriented integration works well with object-oriented design. Objects tend to map well to features, which makes feature-oriented integration a natural choice for object-oriented systems.

Pure feature-oriented integration is as difficult to pursue as pure top-down or pure bottom-up integration. Usually some of the low-level code must be integrated before significant features can be.


Conclusion

None of these approaches are robust procedures that you should follow methodically and then declare yourself to be done. Like software design approaches, they are heuristics more than algorithms, and rather than following any procedure dogmatically, you come out ahead by making up a unique strategy tailored to your specific project.

References

Adapted and summarized from the works below. Most content is from (2), Chapter 7 and (3), Chapter 3.

(1) Hunt, Andrew and Thomas, David. The Pragmatic Programmer: From Journeyman to Master. Addison Wesley, 1999.
(2) Jones, Capers. Software Assessments, Benchmarks and Best Practices. Addison Wesley, 2000.
(3) McConnell, Steve. Code Complete. Microsoft Press, 1993.

Tuesday, July 14, 2020

The Unreasonable Effectiveness* of Pragmatic Automation

* apologies to Eugene Wigner



We all know that computers are great at doing repetitive tasks, but we often fail to realize that there are lots of ways we can use computers to make our own daily work more efficient. We can be a lot more effective in how we use our computers if we use them as tools for ourselves.

The shell is the primary way you interact with a computer once you want to do more than what visual user interfaces allow. Visual user interfaces are inherently limited: you can only do the things there are buttons, sliders, and input fields for.

Textual tools are often built to be composable: providing many different ways to combine, program or automate them. Using the computer as a tool for ourselves starts with understanding a few basic things: how to use your shell, terminal, and a programming editor.

Many software developers are captives of button-pushing repetitive drudgery. The cobbler's children have no shoes.  As they are developing tools for others, they fail to use the computer effectively as a tool for themselves.

Past examples from my own observation:

  • tests are manual, results are Word docs of pasted screenshots
  • code is written in Word or Notepad
  • code is kept in a scattering of "secret" places that only a handful of people have access to, or not kept at all
  • documentation is separate from code, all hand-written, inaccessible/obfuscated in proprietary systems, or just missing
  • deployment is via a runbook Word doc: detailed step-by-step instructions for humans to execute

Here is a wishlist for using the computer as a tool for myself, automating nearly all of the repetitive drudgery that goes into software development and leaving more time for the fun parts:

  • streamlined management of all your source, docs and related artifacts in a distributed fashion to track changes, revision history, rollback breaking changes, collaborate with team, isolate changes from main codebase, troubleshoot changes, and much more...
  • all aspects of which are managed across multiple versions of software
  • within a framework for project packaging and package dependencies
  • and a framework for unit testing, automated across all versions
  • and automated code coverage analysis
  • automated testing, continuous integration and continuous delivery across multiple environments
  • automated lint checks for code style, likely bugs, design flaws, import order
  • documentation of code in-place, automated generation of docs, automated checking for doc standards conformance
  • automated check of dependencies for known security vulnerabilities
  • automated decoration of untyped code with type annotations, then used for static and runtime type checking
  • auto-formatting of code on pre-commit
  • automatic test that in-line docs match code and docstring examples match documented example output
  • status tracking of continuous integration tests and doc tests with dashboard badges
  • automatically create release notes
  • automatically publish to a package repository for easy deployments/updates
  • automatically publish versioned docs to a document repository, auto-rebuilt on project update

Most shops use computers as tools to automate a small fraction of this list. Most of this list is just skipped and not done. The essential bits are handled with repetitive human drudgery.

I've recently been playing with Python frameworks that deliver every single bit of the ENTIRE wishlist above and it has provided much food for thought. I would love to apply the same to an end-to-end data engineering solution. Next step, become familiar with DBT!

get_wp project on GitHub: https://github.com/SteveRutledge/get_wp