Building a Good Sitemap for Static Sites
Why sitemaps matter more than most developers think, what makes one actually useful, and how to generate and verify one automatically with PowerShell.
Why Sitemaps Matter
A sitemap is not just an SEO checkbox. It is the document you hand to search engines that says: here is everything I want you to index, and here is when each page was last updated. (The protocol also defines a <priority> field for ranking pages relative to each other, but Google has said it ignores it, so <loc> and <lastmod> are the fields that matter in practice.) Without one, crawlers discover your pages by following links, which means pages with few inbound links, recently added pages, and pages buried deep in your navigation may never get indexed at all.
For a static site like Exponanta — HTML files served directly, no server-side rendering, no dynamic routes — the sitemap is especially important. There is no server generating canonical URLs on the fly. If a page exists as a file but is not in the sitemap and not linked from anywhere prominent, it is effectively invisible to Google.
A good sitemap does three concrete things for you. First, it tells Google about pages that exist but are not yet linked from other pages — new blog posts, new industry taxonomy pages, new program pages. Second, it tells Google when pages were last updated via the <lastmod> field, so it knows which pages to re-crawl after you publish changes. Third, it gives you a complete inventory of your own site, which is useful independently of SEO — you can diff it against your navigation to find orphaned pages.
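That last point — diffing the sitemap against your navigation — is easy to script. The sketch below is illustrative, not part of the scripts later in this article: it uses inline sample data where you would normally load your real sitemap.xml and navbar include, and the URLs and file names are hypothetical.

```powershell
# Hypothetical sketch: find pages listed in the sitemap that the
# navigation never links to. In real use, load sitemap.xml and your
# nav include from disk instead of these inline samples.
[xml]$sitemap = @"
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://exponanta.com/blog/post-a.html</loc></url>
  <url><loc>https://exponanta.com/blog/post-b.html</loc></url>
</urlset>
"@

# e.g. $navHtml = Get-Content .\components\navbar.html -Raw
$navHtml = '<a href="/blog/post-a.html">Post A</a>'

# Pull every href path out of the navigation markup
$navPaths = [regex]::Matches($navHtml, 'href="([^"]+)"') |
    ForEach-Object { $_.Groups[1].Value }

# Sitemap URLs whose path never appears in the navigation = candidate orphans
$orphans = $sitemap.urlset.url.loc | Where-Object {
    $navPaths -notcontains ([uri]$_).AbsolutePath
}
$orphans   # → https://exponanta.com/blog/post-b.html
```

Here post-b.html is in the sitemap but has no nav link pointing at it, so it surfaces as a candidate orphan.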
What Makes a Good Sitemap
A sitemap is an XML file at the root of your domain, listed in robots.txt. The structure is simple — a <urlset> wrapper containing one <url> block per page, each with a <loc> (the canonical URL) and ideally a <lastmod> (the last modified date in ISO format).
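A minimal sitemap with two entries looks like this (URLs and dates are illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://exponanta.com/blog/building-sitemap.html</loc>
    <lastmod>2024-05-12</lastmod>
  </url>
  <url>
    <loc>https://exponanta.com/programs/</loc>
    <lastmod>2024-05-01</lastmod>
  </url>
</urlset>
```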
A single sitemap file supports up to 50,000 URLs and must be under 50MB uncompressed. For larger sites, use a sitemap index file that references multiple sitemap files.
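A sitemap index is itself just a small XML file that points at the child sitemaps (the file names here are illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://exponanta.com/sitemap-blog.xml</loc>
    <lastmod>2024-05-12</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://exponanta.com/sitemap-programs.xml</loc>
  </sitemap>
</sitemapindex>
```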
What to include: every page you want indexed. For Exponanta that means all HTML pages — blog posts, program pages, event listings, industry taxonomy pages, ecosystem pages, role pages, and the homepage.
What to exclude: pages you do not want indexed. This includes admin pages, draft pages, style guide and component pages like /blog/styles-tags.html, thank-you pages after form submissions, duplicate content pages, and any page that already has <meta name="robots" content="noindex"> in its head. Including noindex pages in your sitemap sends a contradictory signal to Google and should be avoided.
URL format matters. Use the canonical form — the one that matches your <link rel="canonical"> tags. For a static site that means trailing slashes should be consistent. If your server redirects /blog/building-sitemap to /blog/building-sitemap.html, the sitemap should list the .html version (or whichever is canonical). Mixing formats creates crawl budget waste.
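Concretely, the <loc> in the sitemap should match the canonical tag in the page head character for character (the URL below is illustrative):

```html
<!-- In the page's <head> -->
<link rel="canonical" href="https://exponanta.com/blog/building-sitemap.html">

<!-- Matching entry in sitemap.xml -->
<url>
  <loc>https://exponanta.com/blog/building-sitemap.html</loc>
</url>
```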
Generating the Sitemap with PowerShell
For a static site stored as HTML files, PowerShell is the natural tool on Windows — it can walk the file tree, read last-modified timestamps directly from the filesystem, and write clean XML. No build tool, no Node.js dependency, no configuration file.
The script below walks your project directory recursively, finds every .html file, constructs a clean canonical URL for each one, reads the file's last modified date from the filesystem, and writes a valid sitemap.xml. It handles index.html files correctly — converting them to their directory URL rather than including the filename.
# Generate sitemap.xml from all HTML files in the current directory tree
# Run from your site root: cd C:\Sites\exponanta.com; .\generate-sitemap.ps1
$baseUrl = "https://exponanta.com"
$outputFile = ".\sitemap.xml"

# Pages to exclude from the sitemap
$excludePatterns = @(
    'styles-tags\.html$',
    'brandbook',
    'components',
    'assets',
    '404\.html$'
)

$urls = Get-ChildItem -Recurse -Filter "*.html" | ForEach-Object {
    $relativePath = $_.FullName.Replace((Get-Location).Path, '').Replace('\', '/').TrimStart('/')

    # Skip excluded paths
    foreach ($pattern in $excludePatterns) {
        if ($relativePath -match $pattern) { return }
    }

    $lastmod = $_.LastWriteTime.ToString("yyyy-MM-dd")

    # Build the canonical URL. This branch assumes extension-less,
    # trailing-slash URLs are canonical for non-index pages; if your
    # canonical form keeps .html, drop the '\.html$' -replace below.
    if ($relativePath -match 'index\.html$') {
        $cleanUrl = "$baseUrl/" + ($relativePath -replace 'index\.html$', '')
    } else {
        $cleanUrl = "$baseUrl/" + ($relativePath -replace '\.html$', '/')
    }

    # Collapse a double trailing slash if any
    $cleanUrl = $cleanUrl -replace '//$', '/'

    "  <url>`n    <loc>$cleanUrl</loc>`n    <lastmod>$lastmod</lastmod>`n  </url>"
}

$sitemap = @"
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
$($urls -join "`n")
</urlset>
"@

Set-Content -Path $outputFile -Value $sitemap -Encoding UTF8
Write-Host "Done: $($urls.Count) URLs written to $outputFile" -ForegroundColor Green
Save this as generate-sitemap.ps1 in your site root. Run it from PowerShell with:
# Navigate to your site root first
cd C:\Sites\exponanta.com
# Run the script
.\generate-sitemap.ps1
Verifying the Output
Generating the sitemap is only half the job. Before submitting it to Google Search Console, verify it — check that the URL count is correct, that no excluded pages slipped through, that the XML is well-formed, and that every URL in it actually resolves.
The verification script below reads the generated sitemap.xml, parses it, and runs five checks: XML validity, URL count, excluded pattern detection, lastmod date format, and optionally HTTP status for each URL.
# Verify sitemap.xml after generation
$sitemapFile = ".\sitemap.xml"
$excludePatterns = @('styles-tags', 'brandbook', 'components', 'assets')
$checkHttp = $false  # set to $true to verify each URL returns 200

# 1. Load and parse XML
try {
    [xml]$xml = Get-Content $sitemapFile -Encoding UTF8
    Write-Host "✓ XML is well-formed" -ForegroundColor Green
} catch {
    Write-Host "✗ XML parse error: $_" -ForegroundColor Red
    exit 1
}

# 2. Count URLs
$urls = $xml.urlset.url
Write-Host "✓ Total URLs: $($urls.Count)" -ForegroundColor Green

# 3. Check for accidentally included excluded pages
$violations = $urls | Where-Object {
    $loc = $_.loc
    $excludePatterns | Where-Object { $loc -match $_ }
}
if ($violations) {
    Write-Host "✗ Excluded pages found in sitemap:" -ForegroundColor Red
    $violations | ForEach-Object { Write-Host "  - $($_.loc)" -ForegroundColor Red }
} else {
    Write-Host "✓ No excluded pages detected" -ForegroundColor Green
}

# 4. Check lastmod dates are valid ISO format
$badDates = $urls | Where-Object { $_.lastmod -notmatch '^[0-9]{4}-[0-9]{2}-[0-9]{2}$' }
if ($badDates) {
    Write-Host "✗ Invalid lastmod dates found:" -ForegroundColor Red
    $badDates | ForEach-Object { Write-Host "  - $($_.loc) → $($_.lastmod)" -ForegroundColor Red }
} else {
    Write-Host "✓ All lastmod dates are valid" -ForegroundColor Green
}

# 5. Optional: HTTP check — verify each URL returns 200
if ($checkHttp) {
    Write-Host "`nChecking HTTP status for $($urls.Count) URLs..."
    $errors = @()
    $urls | ForEach-Object {
        try {
            $response = Invoke-WebRequest -Uri $_.loc -Method Head -TimeoutSec 10
            if ($response.StatusCode -ne 200) {
                $errors += "$($_.loc) → $($response.StatusCode)"
            }
        } catch {
            $errors += "$($_.loc) → ERROR: $_"
        }
    }
    if ($errors) {
        Write-Host "✗ URLs with issues:" -ForegroundColor Red
        $errors | ForEach-Object { Write-Host "  $_" -ForegroundColor Red }
    } else {
        Write-Host "✓ All URLs returned 200" -ForegroundColor Green
    }
}
Example output from a clean run (the URL count will vary for your site):

✓ XML is well-formed
✓ Total URLs: 87
✓ No excluded pages detected
✓ All lastmod dates are valid

Example output when something is wrong — here a component page slipped through:

✓ XML is well-formed
✓ Total URLs: 91
✗ Excluded pages found in sitemap:
  - https://exponanta.com/components/navbar/
✓ All lastmod dates are valid
Registering the Sitemap
Two places to register your sitemap once it is generated and verified.
First, add it to robots.txt at your site root. This tells any crawler — not just Google — where to find it:
# robots.txt
User-agent: *
Disallow:
Sitemap: https://exponanta.com/sitemap.xml
Second, submit it directly in Google Search Console under Indexing → Sitemaps. Paste the full URL — https://exponanta.com/sitemap.xml — and click Submit. Google will show you how many URLs it found and how many it has indexed. The gap between those two numbers is your indexing backlog.
Google recrawls your sitemap automatically over time, but after a large batch of new pages — a new taxonomy section, a new blog category — it is worth re-submitting manually in Search Console to speed up discovery.
Keeping It Fresh
The sitemap is only as useful as it is current. For a static site the simplest approach is to run the generation script every time you deploy. If you are using Git, add it as a pre-commit hook or a step in your deployment script so it never goes stale.
# Add to your deploy script (deploy.ps1)
# 1. Generate fresh sitemap
Write-Host "Generating sitemap..."
.\generate-sitemap.ps1
# 2. Verify it — abort the deploy if the sitemap fails to parse
.\verify-sitemap.ps1
if ($LASTEXITCODE -ne 0) { exit 1 }
# 3. Stage and commit
git add sitemap.xml
git commit -m "chore: update sitemap"
git push
If your site uses dynamically loaded components like navbar.html and footer.html stored in a /components/ folder, make sure that folder is in your $excludePatterns. Component files are not pages and should never appear in the sitemap.
Summary
A good sitemap for a static site is three things: complete (every indexable page), clean (no noindex pages, no components, no drafts), and current (lastmod dates reflecting real file changes). The PowerShell scripts above handle all three automatically — generation reads directly from the filesystem, verification catches exclusion violations and malformed dates before you publish, and the deploy hook keeps everything in sync.