← Blog

I self-host analytics with SQLite + Astro middleware — and GA4 is never coming back

One .db file, 200 lines of TypeScript, no client-side JS, no cookie banner. Here's how I count visitors for this portfolio — and when this setup is NOT enough.

The site you’re reading has a “X people visited today” counter on the homepage. It doesn’t call Google Analytics, doesn’t embed Plausible / Umami / Fathom, doesn’t ship a single line of third-party JavaScript. The whole counter runs on:

  • A ~2 MB app.db file sitting next to the app.
  • A 60-line Astro middleware.
  • One INSERT per valid pageview.
  • One SELECT COUNT(DISTINCT ip) per homepage render.

That’s it.

I switched off Google Analytics 4 and onto this setup 14 months ago for 3 freelance projects. After 14 months, I can’t remember a single reason to go back. Here’s the setup, the real trade-offs, and when you shouldn’t follow this path.

SQLite logo

Why I left GA4

Three reasons, in order of how much they bothered me:

1. Cookie banner. GDPR + ePrivacy force you to ask consent before loading gtag.js. You drop in Cookiebot / Termly, spend 3 days making the banner just-pretty-enough, and afterward 70% of users hit “Reject”. The numbers in GA represent 30% of reality. Making business decisions on that 30% is taking placebo for a real headache.

2. Performance budget burned for no payoff. gtag.js is ~50 KB compressed. It loads, then fires 2-3 requests to Google. My mobile Lighthouse dropped from 99 to 87 because of that script alone. A portfolio site can’t live with 87.

3. I don’t need 90% of GA4’s features. No conversion funnel, no e-commerce, no audience segmentation. I just want to know: “how many people read the Chainsaw Man post today?”. GA4 is Photoshop when I need a pencil.

Architecture — 3 pieces

┌──────────────────┐
│   Browser GET    │
│  /blog/foo       │
└────────┬─────────┘
         │ HTTP request

┌──────────────────┐
│ Astro middleware │  ──► trackVisit({ ip, ua, path, ... })
│  (server-side)   │
└────────┬─────────┘


┌──────────────────┐       ┌──────────────────┐
│  visits.ts logic │ ────► │   SQLite (WAL)   │
│  - bot UA filter │       │   visits table   │
│  - 60s dedupe    │       │                  │
│  - day bucketing │       └──────────────────┘
└──────────────────┘

The browser sees nothing. No request, no cookie, no script. All tracking happens server-side, inside the same request that renders the page.

Part 1: Schema — 1 table, 9 columns

CREATE TABLE visits (
  id         INTEGER PRIMARY KEY AUTOINCREMENT,
  ip         TEXT NOT NULL,
  ua         TEXT,
  path       TEXT,
  referrer   TEXT,
  lang       TEXT,
  country    TEXT,
  day        TEXT NOT NULL,         -- YYYY-MM-DD in Asia/Ho_Chi_Minh
  created_at INTEGER NOT NULL DEFAULT (strftime('%s','now'))
);

CREATE INDEX idx_visits_day     ON visits(day);
CREATE INDEX idx_visits_ip_day  ON visits(ip, day);
CREATE INDEX idx_visits_created ON visits(created_at DESC);

A few intentional choices:

  • day is TEXT YYYY-MM-DD, not INTEGER timestamp. I GROUP BY day constantly — string equality is faster than date arithmetic, and I avoid timezone math on every query. Bucketing happens at insert time, in Vietnam time:

    function todayVN(): string {
      return new Date().toLocaleDateString('en-CA', {
        timeZone: 'Asia/Ho_Chi_Minh',
      });
    }
    

    'en-CA' gives YYYY-MM-DD. Trick works in any runtime with Intl.DateTimeFormat.

  • IPs stored raw, no hashing. GDPR says IP is PII — true. But this is first-party log on a server I own, never leaves the box, kept for forensic use when abuse hits (form spam, API scraping). I chmod 600 app.db and don’t expose it through any public API.

  • No user_id, no session_id. Both require cookies or fingerprinting — instant GDPR territory. “Distinct IP per day” is a good-enough proxy for “person” at my traffic level.

Part 2: Middleware — 30 lines

Astro logo — the hybrid SSR framework powering this site

src/middleware.ts:

import { defineMiddleware } from 'astro:middleware';
import { trackVisit } from '@/lib/visits';

export const onRequest = defineMiddleware(async (ctx, next) => {
  if (ctx.request.method === 'GET') {
    const headers = ctx.request.headers;
    let ip: string | null = null;
    try {
      ip = ctx.clientAddress || null;
    } catch {
      ip = null;
    }
    if (!ip) {
      const fwd = headers.get('x-forwarded-for');
      if (fwd) ip = fwd.split(',')[0].trim();
      else ip = headers.get('x-real-ip') || headers.get('cf-connecting-ip');
    }
    trackVisit({
      ip,
      ua: headers.get('user-agent'),
      path: new URL(ctx.request.url).pathname,
      referrer: headers.get('referer'),
      lang: ctx.locals.lang ?? null,
      country:
        headers.get('cf-ipcountry') ||
        headers.get('x-vercel-ip-country') ||
        null,
    });
  }
  return next();
});

Subtleties:

  • ctx.clientAddress can throw on prerendered routes (Astro hybrid: static pages have no per-request address). Wrap it in try/catch.
  • IP fallback chain: clientAddressx-forwarded-for[0]x-real-ipcf-connecting-ip. Behind local nginx the first one is fine; behind Cloudflare the last one is canonical.
  • Country comes from Cloudflare / Vercel header — they tag based on edge location. I don’t run MaxMind, don’t store IPs to self-geolocate.

Part 3: Filter — 3 layers before insert

export function trackVisit(v: TrackInput): void {
  try {
    if (!v.ip) return;
    if (isBotUA(v.ua)) return;
    if (!isTrackablePath(v.path)) return;
    if (!shouldRecord(v.ip, v.path)) return;

    insertStmt.run(v.ip, v.ua, v.path, v.referrer, v.lang, v.country, todayVN());
  } catch {
    // analytics must never throw
  }
}

Bot UA filter:

const BOT_UA = /bot|crawler|spider|slurp|googlebot|yandex|baidu|duckduck|
                facebookexternalhit|twitterbot|whatsapp|telegrambot|
                linkedinbot|pingdom|uptimerobot|ahrefs|semrush|mj12|petalbot|
                applebot|headless|monitor|fetch|preview/i;

One regex covers ~95% of bots I see in nginx access logs. Honest bots (Googlebot Search Indexing) self-identify in their UA — there’s no clean way to fool this regex without trying hard.

Path filter:

function isTrackablePath(p: string): boolean {
  if (p.startsWith('/api/')) return false;
  if (p.startsWith('/admin')) return false;
  if (p.startsWith('/og/')) return false;
  if (p.startsWith('/_astro/')) return false;
  if (p === '/rss.xml') return false;
  // Skip anything with an extension (asset)
  const last = p.split('/').pop() || '';
  if (last.includes('.')) return false;
  return true;
}

I only track real HTML pages, never API calls or asset requests.

60s dedupe:

const recent = new Map<string, number>();
const DEDUPE_WINDOW_MS = 60_000;

function shouldRecord(ip: string, path: string): boolean {
  const key = ip + '|' + path;
  const now = Date.now();
  const last = recent.get(key);
  if (last !== undefined && now - last < DEDUPE_WINDOW_MS) return false;
  recent.set(key, now);
  // Lazy evict if Map > 5000 entries
  if (recent.size > 5000) {
    const cutoff = now - DEDUPE_WINDOW_MS * 5;
    for (const [k, v] of recent) if (v < cutoff) recent.delete(k);
  }
  return true;
}

In-memory Map. A user mashing F5 doesn’t double-count. A user with 5 tabs of the same URL within 60s only counts once. Resets on container restart — fine, dedupe is just noise reduction.

Part 4: Reading numbers — COUNT(DISTINCT ip)

For the homepage’s “X people today” counter:

const todayUniqueStmt = db.prepare(
  'SELECT COUNT(DISTINCT ip) AS c FROM visits WHERE day = ?'
);

export function publicTodayStats() {
  const day = todayVN();
  return {
    today: todayUniqueStmt.get(day)?.c ?? 0,
    pageviewsToday: todayPageviewsStmt.get(day)?.c ?? 0,
    totalUnique: totalUniqueStmt.get()?.c ?? 0,
    totalPageviews: totalPageviewsStmt.get()?.c ?? 0,
  };
}

db.prepare runs once at module load — after that .get(day) is just bind + execute. SQLite returns in < 1ms with the idx_visits_day index across ~50k rows. I run this SSR on every homepage request and Lighthouse stays 100.

The admin page has heavier queries (top paths, top countries, 14-day daily series) — all return < 5ms on 6 months of data.

Trade-offs — including what’s NOT here

After 14 months, here’s what I don’t have with this setup, ranked by “do I miss it”:

GA featureDo I need it?Replacement
Real-time visitorsNotail -f log
Conversion funnelNo (portfolio)
User flow / session replayNo
Audience segmentationNo
E-commerce eventsYes (1 freelance project)Separate events table
Click heatmapSometimesMicrosoft Clarity (banner cost)
Auto email reportsMaybeCron + script that emails top_paths
Multi-domain trackingNo
Goals + custom eventsSometimesAdd event_name column

Heatmap is the only thing I occasionally miss. When I need it — I temporarily turn on Microsoft Clarity for a week, look at the heatmap, turn it off. A week’s worth of data answers the question; no need to keep a banner alive year-round.

When you should NOT self-host SQLite analytics

Honest take. Don’t follow this if:

  • You need to share a dashboard with clients / boss. SQLite + custom UI means you build the admin page yourself. Plausible / Umami have cloud versions that share with one click.
  • Site has > 1M pageviews/month. SQLite still handles it, but COUNT(DISTINCT) on 10M rows starts to slow down. At that scale consider PostgreSQL or ClickHouse, or just self-host Plausible.
  • You need geolocation more granular than country. Nothing free replaces the MaxMind GeoIP database. Cloudflare headers only have country, not city.
  • You deploy to Vercel / Netlify / Cloudflare Pages. Serverless has no persistent disk — SQLite can’t survive there. You’d need Turso / Cloudflare D1 (SQLite-compatible but with quirks) or just go to Postgres.

This setup fits: Astro / Next / Laravel running on a VPS with disk. Site < 500k pageviews / month. Owner can write 200 lines of TypeScript when needed.

Real numbers — this site

I run this for nhatdote.io.vn. After 8 months:

  • app.db file: 2.3 MB (covers donations, bug_reports, coffee_clicks, visits, issue_comments).
  • visits table: ~47k rows, average ~180/day.
  • Lighthouse mobile: 100/100 every metric (vs 87 with GA4).
  • Backup: cp app.db backup-$(date +%F).db every 3 hours via cron — one 2 MB file. 7 days = 14 MB. A year of rolling window = ~17 GB if I kept everything — I rotate 30 days.
  • Cost: $0 (VPS already paid for the app, SQLite adds nothing).

GA4 for the same site would be $0 but cost:

  • A cookie banner I don’t want
  • 50 KB of JS shipped to every visitor
  • A dashboard I have to log into
  • Numbers off by 70% because users hit reject

Closing

GA4 is built for enterprise marketing teams measuring $100k campaigns. If you’re not that team — you’re using a butcher knife to cut salad.

200 lines of TypeScript + 1 SQLite file + 1 middleware is the entire analytics stack I need for a portfolio and 3 freelance projects. Privacy-first not because of law — because most of the time I just want to know “how many people read this today”. SELECT COUNT(DISTINCT ip) FROM visits WHERE day = '2026-05-09' answers that faster than the GA4 dashboard loads.

Sometimes the right answer is less, not more.

Sources