I self-host analytics with SQLite + Astro middleware — and GA4 is never coming back

The site you’re reading has a “X people visited today” counter on the homepage. It doesn’t call Google Analytics, doesn’t embed Plausible / Umami / Fathom, doesn’t ship a single line of third-party JavaScript. The whole counter runs on:

A ~2 MB app.db file sitting next to the app.
A 60-line Astro middleware.
One INSERT per valid pageview.
One SELECT COUNT(DISTINCT ip) per homepage render.

That’s it.

I switched off Google Analytics 4 and onto this setup 14 months ago for 3 freelance projects. After 14 months, I can’t remember a single reason to go back. Here’s the setup, the real trade-offs, and when you shouldn’t follow this path.

Why I left GA4

Three reasons, in order of how much they bothered me:

1. Cookie banner. GDPR + ePrivacy force you to ask consent before loading gtag.js. You drop in Cookiebot / Termly, spend 3 days making the banner just-pretty-enough, and afterward 70% of users hit “Reject”. The numbers in GA represent 30% of reality. Making business decisions on that 30% is taking placebo for a real headache.

2. Performance budget burned for no payoff. gtag.js is ~50 KB compressed. It loads, then fires 2-3 requests to Google. My mobile Lighthouse dropped from 99 to 87 because of that script alone. A portfolio site can’t live with 87.

3. I don’t need 90% of GA4’s features. No conversion funnel, no e-commerce, no audience segmentation. I just want to know: “how many people read the Chainsaw Man post today?”. GA4 is Photoshop when I need a pencil.

Architecture — 3 pieces

┌──────────────────┐
│   Browser GET    │
│  /blog/foo       │
└────────┬─────────┘
         │ HTTP request
         ▼
┌──────────────────┐
│ Astro middleware │  ──► trackVisit({ ip, ua, path, ... })
│  (server-side)   │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐       ┌──────────────────┐
│  visits.ts logic │ ────► │   SQLite (WAL)   │
│  - bot UA filter │       │   visits table   │
│  - 60s dedupe    │       │                  │
│  - day bucketing │       └──────────────────┘
└──────────────────┘

The browser sees nothing. No request, no cookie, no script. All tracking happens server-side, inside the same request that renders the page.

Part 1: Schema — 1 table, 9 columns

CREATE TABLE visits (
  id         INTEGER PRIMARY KEY AUTOINCREMENT,
  ip         TEXT NOT NULL,
  ua         TEXT,
  path       TEXT,
  referrer   TEXT,
  lang       TEXT,
  country    TEXT,
  day        TEXT NOT NULL,         -- YYYY-MM-DD in Asia/Ho_Chi_Minh
  created_at INTEGER NOT NULL DEFAULT (strftime('%s','now'))
);

CREATE INDEX idx_visits_day     ON visits(day);
CREATE INDEX idx_visits_ip_day  ON visits(ip, day);
CREATE INDEX idx_visits_created ON visits(created_at DESC);

A few intentional choices:

day is TEXT YYYY-MM-DD, not INTEGER timestamp. I GROUP BY day constantly — string equality is faster than date arithmetic, and I avoid timezone math on every query. Bucketing happens at insert time, in Vietnam time:
```
function todayVN(): string {
  return new Date().toLocaleDateString('en-CA', {
    timeZone: 'Asia/Ho_Chi_Minh',
  });
}
```
'en-CA' gives YYYY-MM-DD. Trick works in any runtime with Intl.DateTimeFormat.
IPs stored raw, no hashing. GDPR says IP is PII — true. But this is first-party log on a server I own, never leaves the box, kept for forensic use when abuse hits (form spam, API scraping). I chmod 600 app.db and don’t expose it through any public API.
No user_id, no session_id. Both require cookies or fingerprinting — instant GDPR territory. “Distinct IP per day” is a good-enough proxy for “person” at my traffic level.

Part 2: Middleware — 30 lines

src/middleware.ts:

import { defineMiddleware } from 'astro:middleware';
import { trackVisit } from '@/lib/visits';

export const onRequest = defineMiddleware(async (ctx, next) => {
  if (ctx.request.method === 'GET') {
    const headers = ctx.request.headers;
    let ip: string | null = null;
    try {
      ip = ctx.clientAddress || null;
    } catch {
      ip = null;
    }
    if (!ip) {
      const fwd = headers.get('x-forwarded-for');
      if (fwd) ip = fwd.split(',')[0].trim();
      else ip = headers.get('x-real-ip') || headers.get('cf-connecting-ip');
    }
    trackVisit({
      ip,
      ua: headers.get('user-agent'),
      path: new URL(ctx.request.url).pathname,
      referrer: headers.get('referer'),
      lang: ctx.locals.lang ?? null,
      country:
        headers.get('cf-ipcountry') ||
        headers.get('x-vercel-ip-country') ||
        null,
    });
  }
  return next();
});

Subtleties:

ctx.clientAddress can throw on prerendered routes (Astro hybrid: static pages have no per-request address). Wrap it in try/catch.
IP fallback chain: clientAddress → x-forwarded-for[0] → x-real-ip → cf-connecting-ip. Behind local nginx the first one is fine; behind Cloudflare the last one is canonical.
Country comes from Cloudflare / Vercel header — they tag based on edge location. I don’t run MaxMind, don’t store IPs to self-geolocate.

Part 3: Filter — 3 layers before insert

export function trackVisit(v: TrackInput): void {
  try {
    if (!v.ip) return;
    if (isBotUA(v.ua)) return;
    if (!isTrackablePath(v.path)) return;
    if (!shouldRecord(v.ip, v.path)) return;

    insertStmt.run(v.ip, v.ua, v.path, v.referrer, v.lang, v.country, todayVN());
  } catch {
    // analytics must never throw
  }
}

Bot UA filter:

const BOT_UA = /bot|crawler|spider|slurp|googlebot|yandex|baidu|duckduck|
                facebookexternalhit|twitterbot|whatsapp|telegrambot|
                linkedinbot|pingdom|uptimerobot|ahrefs|semrush|mj12|petalbot|
                applebot|headless|monitor|fetch|preview/i;

One regex covers ~95% of bots I see in nginx access logs. Honest bots (Googlebot Search Indexing) self-identify in their UA — there’s no clean way to fool this regex without trying hard.

Path filter:

function isTrackablePath(p: string): boolean {
  if (p.startsWith('/api/')) return false;
  if (p.startsWith('/admin')) return false;
  if (p.startsWith('/og/')) return false;
  if (p.startsWith('/_astro/')) return false;
  if (p === '/rss.xml') return false;
  // Skip anything with an extension (asset)
  const last = p.split('/').pop() || '';
  if (last.includes('.')) return false;
  return true;
}

I only track real HTML pages, never API calls or asset requests.

60s dedupe:

const recent = new Map<string, number>();
const DEDUPE_WINDOW_MS = 60_000;

function shouldRecord(ip: string, path: string): boolean {
  const key = ip + '|' + path;
  const now = Date.now();
  const last = recent.get(key);
  if (last !== undefined && now - last < DEDUPE_WINDOW_MS) return false;
  recent.set(key, now);
  // Lazy evict if Map > 5000 entries
  if (recent.size > 5000) {
    const cutoff = now - DEDUPE_WINDOW_MS * 5;
    for (const [k, v] of recent) if (v < cutoff) recent.delete(k);
  }
  return true;
}

In-memory Map. A user mashing F5 doesn’t double-count. A user with 5 tabs of the same URL within 60s only counts once. Resets on container restart — fine, dedupe is just noise reduction.

Part 4: Reading numbers — `COUNT(DISTINCT ip)`

For the homepage’s “X people today” counter:

const todayUniqueStmt = db.prepare(
  'SELECT COUNT(DISTINCT ip) AS c FROM visits WHERE day = ?'
);

export function publicTodayStats() {
  const day = todayVN();
  return {
    today: todayUniqueStmt.get(day)?.c ?? 0,
    pageviewsToday: todayPageviewsStmt.get(day)?.c ?? 0,
    totalUnique: totalUniqueStmt.get()?.c ?? 0,
    totalPageviews: totalPageviewsStmt.get()?.c ?? 0,
  };
}

db.prepare runs once at module load — after that .get(day) is just bind + execute. SQLite returns in < 1ms with the idx_visits_day index across ~50k rows. I run this SSR on every homepage request and Lighthouse stays 100.

The admin page has heavier queries (top paths, top countries, 14-day daily series) — all return < 5ms on 6 months of data.

Trade-offs — including what’s NOT here

After 14 months, here’s what I don’t have with this setup, ranked by “do I miss it”:

GA feature	Do I need it?	Replacement
Real-time visitors	No	`tail -f` log
Conversion funnel	No (portfolio)	—
User flow / session replay	No	—
Audience segmentation	No	—
E-commerce events	Yes (1 freelance project)	Separate `events` table
Click heatmap	Sometimes	Microsoft Clarity (banner cost)
Auto email reports	Maybe	Cron + script that emails `top_paths`
Multi-domain tracking	No	—
Goals + custom events	Sometimes	Add `event_name` column

Heatmap is the only thing I occasionally miss. When I need it — I temporarily turn on Microsoft Clarity for a week, look at the heatmap, turn it off. A week’s worth of data answers the question; no need to keep a banner alive year-round.

When you should NOT self-host SQLite analytics

Honest take. Don’t follow this if:

You need to share a dashboard with clients / boss. SQLite + custom UI means you build the admin page yourself. Plausible / Umami have cloud versions that share with one click.
Site has > 1M pageviews/month. SQLite still handles it, but COUNT(DISTINCT) on 10M rows starts to slow down. At that scale consider PostgreSQL or ClickHouse, or just self-host Plausible.
You need geolocation more granular than country. Nothing free replaces the MaxMind GeoIP database. Cloudflare headers only have country, not city.
You deploy to Vercel / Netlify / Cloudflare Pages. Serverless has no persistent disk — SQLite can’t survive there. You’d need Turso / Cloudflare D1 (SQLite-compatible but with quirks) or just go to Postgres.

This setup fits: Astro / Next / Laravel running on a VPS with disk. Site < 500k pageviews / month. Owner can write 200 lines of TypeScript when needed.

Real numbers — this site

I run this for nhatdote.io.vn. After 8 months:

app.db file: 2.3 MB (covers donations, bug_reports, coffee_clicks, visits, issue_comments).
visits table: ~47k rows, average ~180/day.
Lighthouse mobile: 100/100 every metric (vs 87 with GA4).
Backup: cp app.db backup-$(date +%F).db every 3 hours via cron — one 2 MB file. 7 days = 14 MB. A year of rolling window = ~17 GB if I kept everything — I rotate 30 days.
Cost: $0 (VPS already paid for the app, SQLite adds nothing).

GA4 for the same site would be $0 but cost:

A cookie banner I don’t want
50 KB of JS shipped to every visitor
A dashboard I have to log into
Numbers off by 70% because users hit reject

Closing

GA4 is built for enterprise marketing teams measuring $100k campaigns. If you’re not that team — you’re using a butcher knife to cut salad.

200 lines of TypeScript + 1 SQLite file + 1 middleware is the entire analytics stack I need for a portfolio and 3 freelance projects. Privacy-first not because of law — because most of the time I just want to know “how many people read this today”. SELECT COUNT(DISTINCT ip) FROM visits WHERE day = '2026-05-09' answers that faster than the GA4 dashboard loads.

Sometimes the right answer is less, not more.

Sources

The schema, middleware and filter logic above are the actual code running on nhatdote.io.vn. You can read about Astro.locals.lang middleware patterns in the Astro middleware docs.
better-sqlite3 (Node binding for SQLite) — github.com/WiseLibs/better-sqlite3.