← PostgreSQL Course8 / 17

Normalization

Design tables that avoid duplicated, inconsistent data — see the anomalies a flat table creates, reach first, second, and third normal form by splitting into related tables linked by foreign keys, and know when to denormalize on purpose.

Ad 728×90

Why normalize: the cost of duplicated data

Why: normalization is organizing columns into tables so each fact is stored once. The opposite — one wide table — duplicates data and breeds "anomalies": change a customer email and you must fix every order row or they disagree; you cannot add a customer before their first order; deleting their last order erases the customer entirely. Storing each fact once removes all three.

-- Denormalized: customer details repeat on every order row.
-- orders(id, customer_name, customer_email, product, qty)
--   1, 'Ada', 'ada@x.com', 'Keyboard', 1
--   2, 'Ada', 'ada@x.com', 'Mouse',    2   <- email duplicated
-- Change Ada's email and you must update EVERY one of her rows.

First normal form (1NF): one value per cell

Why: 1NF means every column holds a single, atomic value — no comma-separated lists, no repeating columns like product1, product2. A row with "Keyboard, Mouse" jammed into one column cannot be filtered, joined, or counted properly. Give each value its own row instead.

-- NOT 1NF: a list crammed into one column
-- order_products(order_id, products) -> (1, 'Keyboard, Mouse')

-- 1NF: one product per row, each value atomic
CREATE TABLE order_items (
  order_id bigint  NOT NULL,
  product  text    NOT NULL,
  quantity integer NOT NULL
);

Split into related tables (2NF & 3NF)

Why: the higher forms remove data that depends on something other than the whole primary key. Pull each independent thing — customers, products — into its own table with its own primary key, and reference it from the order with a foreign key. Now each fact lives in exactly one place, and a JOIN reassembles the full picture when you need it.

CREATE TABLE customers (
  id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  name  text NOT NULL,
  email text NOT NULL UNIQUE
);

CREATE TABLE orders (
  id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  customer_id bigint NOT NULL REFERENCES customers(id), -- link, not a copy
  created_at  timestamptz NOT NULL DEFAULT now()
);

-- Ada's email is stored once. JOIN to read it back:
SELECT o.id, c.name, c.email
FROM orders o JOIN customers c ON c.id = o.customer_id;

When to denormalize on purpose

Note: normalization is the default because it keeps data correct, but a hot read path sometimes duplicates data deliberately to avoid an expensive JOIN — a cached order total, or a copied value on a reporting table. That is a trade: faster reads for the burden of keeping the copies in sync. Normalize first; denormalize only with a measured reason.

-- A deliberate, documented copy for read speed:
ALTER TABLE orders ADD COLUMN total numeric(10, 2);

-- You now MUST keep "total" in sync whenever items change
-- (a trigger, or application code). Worth it only when the
-- JOIN-and-SUM on every read is provably too slow.

Why normalize: the cost of duplicated data

-- Denormalized: customer details repeat on every order row.
-- orders(id, customer_name, customer_email, product, qty)
--   1, 'Ada', 'ada@x.com', 'Keyboard', 1
--   2, 'Ada', 'ada@x.com', 'Mouse',    2   <- email duplicated
-- Change Ada's email and you must update EVERY one of her rows.

First normal form (1NF): one value per cell

-- NOT 1NF: a list crammed into one column
-- order_products(order_id, products) -> (1, 'Keyboard, Mouse')

-- 1NF: one product per row, each value atomic
CREATE TABLE order_items (
  order_id bigint  NOT NULL,
  product  text    NOT NULL,
  quantity integer NOT NULL
);

When to denormalize on purpose

-- A deliberate, documented copy for read speed:
ALTER TABLE orders ADD COLUMN total numeric(10, 2);

-- You now MUST keep "total" in sync whenever items change
-- (a trigger, or application code). Worth it only when the
-- JOIN-and-SUM on every read is provably too slow.