You submit a sitemap. Google finds the URLs. Then you check Search Console and realise far fewer pages are actually indexed.
That gap matters. If important pages are not indexed, they cannot rank, attract the right searches, or support leads and sales. Sometimes the gap is harmless. Often, it points to a technical, structural, or quality problem that needs fixing.
For South African businesses, this often shows up after a redesign, a content rollout, a template change, or a period of weak visibility. Do not panic. Diagnose it.
If Google is indexing fewer pages than your sitemap suggests, here is how to separate the likely causes and work out what matters first.

First, separate the problem properly
Before you look at causes, separate four things that are often blurred together.
Crawlability
Crawlability is whether Google can access the URL at all.
A page may be in your sitemap, but if Google cannot crawl it because of blocked resources, server issues, broken paths, or crawl traps, it may never be properly assessed.
Indexability
Indexability is whether the page is allowed to appear in Google’s index.
A URL can be crawlable but still excluded because it carries a noindex directive, conflicting signals, or some other instruction that works against indexation.
Canonical selection
Canonical selection is about which version of a page Google chooses as the main version.
Your submitted URL may be accessible and indexable, but Google may decide another URL is the canonical one. In that case, the submitted page can appear excluded even though Google did assess it.
Quality and prioritisation
Quality and prioritisation are about whether Google believes the page is worth indexing now.
This is where thin pages, near-duplicates, weak landing pages, and low-value parameter URLs usually fall over. The page exists. Google knows about it. It just does not see enough reason to keep it in the index.
Each problem leads to a different fix. A crawl issue needs a different response from a noindex mistake, a canonical conflict, or a weak-page problem.
What common Search Console symptoms usually mean
Search Console labels are useful, but only if you read them as clues, not conclusions.
Submitted URL not selected as canonical
This usually points to canonical confusion or duplicate signals.
Common causes include:
- self-referencing canonicals missing on key pages
- templates pointing multiple pages to one parent URL
- duplicate service or city pages with very little differentiation
- HTTP, HTTPS, trailing slash, or parameter variants competing with the main URL
What to verify first: Check the declared canonical, inspect the URL in Search Console, and compare it with the version Google says it selected.
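Checking the declared canonical by hand across dozens of pages gets tedious. A minimal sketch of that first check, assuming you have already fetched the page HTML (the sample markup and URLs here are hypothetical):

```python
from html.parser import HTMLParser


class CanonicalParser(HTMLParser):
    """Collects the href of every <link rel="canonical"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag == "link":
            d = dict(attrs)
            if d.get("rel", "").lower() == "canonical" and "href" in d:
                self.canonicals.append(d["href"])


def declared_canonical(html_source):
    """Return all canonical URLs declared in the page source.

    More than one entry, or an empty list on a key page, is itself a finding.
    """
    parser = CanonicalParser()
    parser.feed(html_source)
    return parser.canonicals


# Hypothetical page source for illustration
sample = """<html><head>
<link rel="canonical" href="https://example.com/seo-consultant/">
</head><body>...</body></html>"""

print(declared_canonical(sample))
```

Compare the output against both the URL you submitted and the canonical Google reports in URL Inspection; any mismatch between the three is the conflict to resolve.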
Crawled – currently not indexed
This usually means Google reviewed the page and was not convinced it deserved a place in the index yet.
Common causes include:
- thin or low-value content
- near-duplicate pages
What to verify first: Compare the excluded page with the closest indexed alternative on your site and ask whether it has enough unique value to justify its own URL.
Discovered – currently not indexed
This usually means Google knows the page exists but has not treated it as a crawl priority yet.
Common causes include:
- weak site authority
- newly published pages
- poor internal link support
- oversized sitemaps full of non-priority pages
- too many low-value URLs diluting crawl attention
What to verify first: Check whether the page has strong internal links, whether it belongs in the sitemap, and whether the sitemap is bloated with low-value URLs.
Excluded by noindex
This usually means the page is being told not to enter the index.
Common causes include:
- meta robots noindex tags
- plugin settings applied too broadly
- staging rules left in place after launch
- headers sending X-Robots-Tag noindex
What to verify first: Review the live page source, response headers, and relevant CMS or SEO plugin settings.
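Because noindex can arrive via a header or a meta tag, it is worth checking both in one pass. A rough sketch, assuming you already have the response headers and body (the header dict and markup below are illustrative):

```python
import re


def noindex_signals(headers, html_source):
    """Return every noindex signal found in response headers and page source."""
    signals = []
    # X-Robots-Tag header (header names are case-insensitive)
    for name, value in headers.items():
        if name.lower() == "x-robots-tag" and "noindex" in value.lower():
            signals.append(f"header: {name}: {value}")
    # meta robots tag in the HTML source
    for match in re.finditer(
        r'<meta[^>]+name=["\']robots["\'][^>]*>', html_source, re.IGNORECASE
    ):
        tag = match.group(0)
        if "noindex" in tag.lower():
            signals.append(f"meta: {tag}")
    return signals


# Hypothetical response for illustration
headers = {"Content-Type": "text/html", "X-Robots-Tag": "noindex, nofollow"}
source = '<html><head><meta name="robots" content="noindex,follow"></head></html>'
print(noindex_signals(headers, source))
```

An empty list does not prove the page is indexable, but any hit here explains the exclusion on its own.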
Duplicate without user-selected canonical
This usually means Google found multiple versions of roughly the same page and did not get a strong enough preference from the site.
Common causes include:
- duplicate category or tag archives
- parameter URLs
- faceted navigation
- alternate printable or filtered versions
- weakly differentiated location or service pages
What to verify first: Find the duplicate set, choose the version you actually want indexed, and check whether canonicals, internal links, and sitemap inclusion all support that choice.
1. Your sitemap includes URLs that should never have been there
A sitemap should help Google find index-worthy URLs. It should not become a dumping ground for every page a CMS can generate.
A common problem is that sitemaps include thin, duplicate, filtered, paginated, tagged, parameter-based, or utility URLs that do not deserve indexation in the first place. When that happens, the sitemap count looks healthy, but Google ignores a chunk of those URLs because they offer little standalone value.
This is especially common on:
- ecommerce sites with filter and sort URLs
- WordPress setups with unnecessary tag or author archives
- service sites with weak duplicate city pages
- sites that automatically include media or attachment URLs in the sitemap
Google is not required to index every submitted URL. A sitemap is a suggestion, not a guarantee.
Before treating this as a technical failure, review whether the missing URLs are pages you actually want in search.
For example, a WooCommerce store may submit colour-filtered collection URLs alongside core category pages. Those filtered URLs can inflate sitemap counts without adding search value. On the service side, a WordPress site may accidentally include tag archives, media pages, date archives, or thin local pages that should never have been in the sitemap at all.
First check: Pull a sample of excluded URLs and ask the obvious question: should these pages even be indexable?
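That sampling step is easy to script. A sketch that pulls every URL out of a sitemap and flags the patterns that usually do not belong there (the pattern list is an assumption; adjust it to your own CMS and URL structure):

```python
import xml.etree.ElementTree as ET

# Sitemap XML namespace, per the sitemaps.org protocol
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Illustrative patterns for URLs that rarely deserve indexation;
# tune this list for your own site
SUSPECT_PATTERNS = ("?", "/tag/", "/author/", "/attachment/", "/page/")


def sitemap_urls(xml_text):
    """Return every <loc> URL in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]


def suspect_urls(urls):
    """Flag URLs matching any low-value pattern."""
    return [u for u in urls if any(p in u for p in SUSPECT_PATTERNS)]


# Hypothetical sitemap for illustration
sample_sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/plumbing-cape-town/</loc></url>
  <url><loc>https://example.com/shop/?orderby=price</loc></url>
  <url><loc>https://example.com/tag/plumbing/</loc></url>
</urlset>"""

urls = sitemap_urls(sample_sitemap)
print(suspect_urls(urls))
```

If a meaningful share of the sitemap trips these filters, the gap in Search Console may be Google correctly ignoring URLs that should never have been submitted.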
Related reading: Pages not indexed in Google
2. The pages are too weak or too similar to justify indexation
Sometimes the answer is simpler than people want: the page is too weak to earn indexation.
That can include:
- short pages with little unique value
- service pages that repeat the same offer with only minor wording changes
- location pages that swap place names without adding local relevance
- ecommerce pages with almost no useful copy, context, or differentiation
A Cape Town plumbing site, for example, might publish separate pages for every suburb using the same text structure, the same service list, and the same proof points. Even if Google crawls those pages, it may not see enough reason to index all of them.
On the ecommerce side, category pages with almost identical product sets and no unique category copy hit the same wall.
Important pages need a clear purpose, real usefulness, and a reason to exist beyond template expansion.
First check: Put the excluded page beside the closest indexed alternative on your site and look for genuine differences in intent, content, and usefulness.
3. Internal linking is too weak to signal importance
Internal links tell Google which pages matter.
A decent page can still be buried when the rest of the site barely acknowledges it.
This often shows up when:
- service pages sit outside the main navigation
- orphan pages are only accessible through XML sitemaps
- blog posts receive no links from related commercial pages
- new landing pages go live without parent-child support
- priority pages sit too deep in the click path with little contextual linking
A local service site may have a strong “SEO consultant Cape Town” page, but if it is barely linked from the main SEO page, the city hub, or related specialist pages, Google has less reason to treat it as important.
First check: Count how many meaningful internal links point to the page from navigation, parent pages, and closely related content.
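For a handful of pages you can count links by hand; at scale, a small script helps. A sketch that counts which crawled pages link to a target URL, assuming you already have the HTML of each source page (the page set below is hypothetical):

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects every <a href> value in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def internal_links_to(target, pages):
    """Return the source URLs that link to target.

    pages: dict of {source_url: html_source}, e.g. from a site crawl.
    """
    linking = []
    for url, html_source in pages.items():
        collector = LinkCollector()
        collector.feed(html_source)
        if target in collector.hrefs:
            linking.append(url)
    return linking


# Hypothetical crawl output for illustration
pages = {
    "https://example.com/seo/": '<a href="/seo-consultant-cape-town/">Cape Town</a>',
    "https://example.com/blog/post/": '<a href="/contact/">Contact</a>',
}
print(internal_links_to("/seo-consultant-cape-town/", pages))
```

A priority page with one or two internal links, all from footer or sitemap-level templates, is effectively being told it does not matter.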
Related reading: Indexing issues
4. Canonical tags are telling Google to index something else
Some exclusions are not about weak pages at all. They are about the site nominating the wrong version.
A canonical tag tells search engines which version of a page should be treated as the main version. If that tag points to the wrong URL, or if templates apply canonicals badly across sections, Google may ignore the submitted page and favour another version instead.
This often happens when:
- templates reuse the same canonical across multiple pages
- parameter URLs canonically point in the wrong direction
- location or service pages reference a broader parent page
- old CMS logic remains in place after a redesign
In Search Console, this can look like Google ignored the page. Often, it simply chose a different winner.
For example, a city service page may be live at /seo-consultant-cape-town/, but the template still points its canonical to /seo-consultant/. Google then treats the city page as secondary even though it was submitted in the sitemap.
First check: Inspect the live canonical tag and compare it with both the submitted URL and Google’s selected canonical.
5. Robots directives or noindex signals are getting in the way
Sometimes the site is not being subtle. It is explicitly telling Google to keep the page out.
This can happen through:
- a noindex meta tag
- an X-Robots-Tag header
- robots.txt blocking important resources
- plugin settings that apply indexation rules too broadly
- staging rules left behind after launch
These issues often show up after development work, migrations, or content staging. A page can look fine in the browser and still be telling Google to stay out. Robots.txt and noindex are different instructions, and messy launches often leave both in the wrong place.
A common example is a page that looks normal in the browser, sits in the sitemap, and is linked internally, but still carries <meta name="robots" content="noindex,follow"> from an SEO plugin setting left behind after launch.
First check: Review the live source, response headers, and any SEO plugin or CMS rules controlling indexation.
6. Google is seeing duplicate versions or conflicting URL signals
If canonicals tell Google which version should win, duplication is the wider problem of giving Google too many versions in the first place.
Google does not want to keep several versions of the same thing in its index. When your site offers multiple URLs for one intent, Google has to choose one, ignore others, or handle them inconsistently.
This often shows up on sites with:
- HTTP and HTTPS issues
- trailing slash inconsistencies
- parameter duplication
- print or alternate page versions
- overlapping category structures
- duplicate service pages created for the same intent
An ecommerce example is faceted navigation creating multiple crawlable URLs for one core category. A service-site example is having both /seo-consultant-cape-town/ and /cape-town-seo-consultant/ live, indexable, and chasing the same search intent.
This is not a content problem. It is a signal problem.
First check: Search for alternate versions of the same page and check whether one clear version is supported by canonicals, redirects, internal links, and the sitemap.
7. Google knows about the pages but has not prioritised them yet
Sometimes Google knows the pages exist, but has not decided to spend crawl attention on them yet. This tends to happen when:
- the site is still relatively new
- authority is limited
- pages were recently published
- crawl demand is low
- the site has too many non-priority URLs competing for attention
In these cases, the site is usually asking Google to care about too much, too soon.
The answer is usually restraint: focus the sitemap, support the best pages first, and reduce URL noise.
First check: Look at publish timing, site authority, and whether the sitemap is asking Google to process too many low-priority URLs at once.
A simple way to diagnose the gap
By this point, the goal should be clearer: stop treating the sitemap as one problem and start sorting it into patterns.
Group affected URLs into a few buckets:
- pages that never belonged in the sitemap
- pages that are weak or too similar to stronger alternatives
- pages blocked by canonicals or noindex signals
- pages competing with duplicate versions
- pages Google knows about but has not prioritised yet
Patterns matter more than totals. Once you know which pattern you are dealing with, the next step usually becomes obvious.
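The bucketing above maps fairly directly onto the status labels in a Page Indexing export. A sketch of that sorting step, with the caveat that the exact label strings vary between Search Console versions, so treat this mapping as an assumption to adjust against your own export:

```python
# Illustrative mapping from export status labels to diagnostic buckets;
# check the labels in your own Page Indexing export before relying on it
BUCKETS = {
    "Excluded by 'noindex' tag": "blocked by noindex",
    "Duplicate without user-selected canonical": "duplicate versions",
    "Submitted URL not selected as canonical": "canonical conflict",
    "Crawled - currently not indexed": "quality / too similar",
    "Discovered - currently not indexed": "not prioritised yet",
}


def bucket_counts(rows):
    """Count excluded URLs per diagnostic bucket.

    rows: iterable of (url, status) pairs, e.g. from a Page Indexing export.
    """
    counts = {}
    for url, status in rows:
        bucket = BUCKETS.get(status, "other")
        counts[bucket] = counts.get(bucket, 0) + 1
    return counts


# Hypothetical export rows for illustration
rows = [
    ("https://example.com/a/", "Crawled - currently not indexed"),
    ("https://example.com/b/", "Crawled - currently not indexed"),
    ("https://example.com/c/", "Excluded by 'noindex' tag"),
]
print(bucket_counts(rows))
```

A dominant bucket tells you which of the seven causes above to work through first.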
What to check first in Search Console
Start with the affected URLs, not the headline count.
Separate them quickly into three groups: URLs that never belonged in the sitemap, URLs with obvious technical conflicts, and URLs that should rank but are still being left out.
Then use Page Indexing, URL Inspection, canonicals, index directives, sitemap coverage, and internal links to answer one question: is this a quality problem, a technical problem, or a priority problem?
That answer matters more than the raw number of excluded pages.
When several causes appear at once
This is common. A site does not always have one neat problem.
Imagine a service business launches 40 city pages after a redesign. Some pages are thin, some inherit the wrong canonical, and most are barely linked from the main service pages. In that situation, do not start by resubmitting all 40 pages. Start by fixing the canonicals on the most commercially important pages, remove or consolidate weak duplicates, then add internal links from the core service and city hub pages. After that, clean the sitemap so it reflects the pages that actually deserve indexation.
That sequence clears the strongest blockers first and stops Google getting mixed signals from the rest.
Why this matters commercially
And this is where the issue stops being a tidy technical exercise and starts affecting growth.
The highest-risk pages are usually your money pages: core service pages, city landing pages, key category pages, and ecommerce collections.
If those pages are not indexed properly, the site loses visibility where it matters most. That means weaker rankings on commercial terms, fewer qualified visits, fewer enquiries, and a poorer return on SEO work.
That is why submitted-versus-indexed gaps should be reviewed through a commercial lens. The real question is not how many pages are missing. It is which missing pages are costing the business traffic and leads.
Get the right pages indexed
A sitemap gap only matters if the missing pages are the ones that should be driving visibility, enquiries, and sales. The goal is not to force every submitted URL into Google. The goal is to make sure the right URLs are strong, supported, and sending clean signals.
Start there. Clean the sitemap. Fix the obvious technical conflicts. Strengthen the pages that matter most. Then judge the problem again.
If you need a closer diagnostic path, see why Google is skipping important pages. If the issue runs deeper into canonicals, crawl handling, templates, or wider indexation conflicts, a technical SEO audit can show what to fix first.
Good indexation is not about getting more pages into Google. It is about getting the right pages in for the right reasons. Counting URLs is bookkeeping. Choosing the right ones to earn indexation is strategy. And treating every excluded URL as a crisis is usually a mistake.
