You submit a sitemap. Google finds the URLs. Then you check Search Console and realise far fewer pages are actually indexed.
That gap matters. If important pages are not indexed, they cannot rank, attract the right searches, or support leads and sales. Sometimes the gap is harmless. Often, it points to a technical, structural, or quality problem that needs fixing.
For South African businesses, this often shows up after a redesign, a content rollout, a template change, or a period of weak visibility. Do not panic. Diagnose it.
If Google is indexing fewer pages than your sitemap suggests, here is how to separate the likely causes and work out what matters first.

First, separate the problem properly
Before you look at causes, separate four things that are often blurred together.
Crawlability
Crawlability is whether Google can access the URL at all.
A page may be in your sitemap, but if Google cannot crawl it because of blocked resources, server issues, broken paths, or crawl traps, it may never be properly assessed.
Indexability
Indexability is whether the page is allowed to appear in Google’s index.
A URL can be crawlable but still excluded because it carries a noindex directive, conflicting signals, or some other instruction that works against indexation.
Canonical selection
Canonical selection is about which version of a page Google chooses as the main version.
Your submitted URL may be accessible and indexable, but Google may decide another URL is the canonical one. In that case, the submitted page can appear excluded even though Google did assess it.
Quality and prioritisation
Quality and prioritisation are about whether Google believes the page is worth indexing now.
This is where thin pages, near-duplicates, weak landing pages, and low-value parameter URLs usually fall over. The page exists. Google knows about it. It just does not see enough reason to keep it in the index.
Each problem leads to a different fix. A crawl issue needs a different response from a noindex mistake, a canonical conflict, or a weak-page problem.
What common Search Console symptoms usually mean
Search Console labels are useful, but only if you read them as clues, not conclusions.
Submitted URL not selected as canonical
This usually points to canonical confusion or duplicate signals.
Common causes include:
- self-referencing canonicals missing on key pages
- templates pointing multiple pages to one parent URL
- duplicate service or city pages with very little differentiation
- HTTP, HTTPS, trailing slash, or parameter variants competing with the main URL
What to verify first: Check the declared canonical, inspect the URL in Search Console, and compare it with the version Google says it selected.
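Checking the declared canonical by hand across dozens of pages gets tedious. A minimal sketch of that first check, assuming you have already fetched the page HTML (the sample markup and URLs here are hypothetical):

```python
from html.parser import HTMLParser


class CanonicalParser(HTMLParser):
    """Collects the href of every <link rel="canonical"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag == "link":
            d = dict(attrs)
            if d.get("rel", "").lower() == "canonical" and "href" in d:
                self.canonicals.append(d["href"])


def declared_canonical(html_source):
    """Return all canonical URLs declared in the page source.

    More than one entry, or an empty list on a key page, is itself a finding.
    """
    parser = CanonicalParser()
    parser.feed(html_source)
    return parser.canonicals


# Hypothetical page source for illustration
sample = """<html><head>
<link rel="canonical" href="https://example.com/seo-consultant/">
</head><body>...</body></html>"""

print(declared_canonical(sample))
```

Compare the output against both the URL you submitted and the canonical Google reports in URL Inspection; any mismatch between the three is the conflict to resolve.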
Crawled – currently not indexed
This usually means Google reviewed the page and was not convinced it deserved a place in the index yet.
Common causes include:
- thin or low-value content
- near-duplicate pages
What to verify first: Compare the excluded page with the closest indexed alternative on your site and ask whether it has enough unique value to justify its own URL.
Discovered – currently not indexed
This usually means Google knows the page exists but has not treated it as a crawl priority yet.
Common causes include:
- weak site authority
- newly published pages
- poor internal link support
- oversized sitemaps full of non-priority pages
- too many low-value URLs diluting crawl attention
What to verify first: Check whether the page has strong internal links, whether it belongs in the sitemap, and whether the sitemap is bloated with low-value URLs.
Excluded by noindex
This usually means the page is being told not to enter the index.
Common causes include:
- meta robots noindex tags
- plugin settings applied too broadly
- staging rules left in place after launch
- headers sending X-Robots-Tag noindex
What to verify first: Review the live page source, response headers, and relevant CMS or SEO plugin settings.
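Because noindex can arrive via a header or a meta tag, it is worth checking both in one pass. A rough sketch, assuming you already have the response headers and body (the header dict and markup below are illustrative):

```python
import re


def noindex_signals(headers, html_source):
    """Return every noindex signal found in response headers and page source."""
    signals = []
    # X-Robots-Tag header (header names are case-insensitive)
    for name, value in headers.items():
        if name.lower() == "x-robots-tag" and "noindex" in value.lower():
            signals.append(f"header: {name}: {value}")
    # meta robots tag in the HTML source
    for match in re.finditer(
        r'<meta[^>]+name=["\']robots["\'][^>]*>', html_source, re.IGNORECASE
    ):
        tag = match.group(0)
        if "noindex" in tag.lower():
            signals.append(f"meta: {tag}")
    return signals


# Hypothetical response for illustration
headers = {"Content-Type": "text/html", "X-Robots-Tag": "noindex, nofollow"}
source = '<html><head><meta name="robots" content="noindex,follow"></head></html>'
print(noindex_signals(headers, source))
```

An empty list does not prove the page is indexable, but any hit here explains the exclusion on its own.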
Duplicate without user-selected canonical
This usually means Google found multiple versions of roughly the same page and did not get a strong enough preference from the site.
Common causes include:
- duplicate category or tag archives
- parameter URLs
- faceted navigation
- alternate printable or filtered versions
- weakly differentiated location or service pages
What to verify first: Find the duplicate set, choose the version you actually want indexed, and check whether canonicals, internal links, and sitemap inclusion all support that choice.
1. Your sitemap includes URLs that should never have been there
A sitemap should help Google find index-worthy URLs. It should not become a dumping ground for every page a CMS can generate.
A common problem is that sitemaps include thin, duplicate, filtered, paginated, tagged, parameter-based, or utility URLs that do not deserve indexation in the first place. When that happens, the sitemap count looks healthy, but Google ignores a chunk of those URLs because they offer little standalone value.
This is especially common on:
- ecommerce sites with filter and sort URLs
- WordPress setups with unnecessary tag or author archives
- service sites with weak duplicate city pages
- sites that automatically include media or attachment URLs in the sitemap
Google is not required to index every submitted URL. A sitemap is a suggestion, not a guarantee.
Before treating this as a technical failure, review whether the missing URLs are pages you actually want in search.
For example, a WooCommerce store may submit colour-filtered collection URLs alongside core category pages. Those filtered URLs can inflate sitemap counts without adding search value. On the service side, a WordPress site may accidentally include tag archives, media pages, date archives, or thin local pages that should never have been in the sitemap at all.
First check: Pull a sample of excluded URLs and ask the obvious question: should these pages even be indexable?
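That sampling step is easy to script. A sketch that pulls every URL out of a sitemap and flags the patterns that usually do not belong there (the pattern list is an assumption; adjust it to your own CMS and URL structure):

```python
import xml.etree.ElementTree as ET

# Sitemap XML namespace, per the sitemaps.org protocol
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Illustrative patterns for URLs that rarely deserve indexation;
# tune this list for your own site
SUSPECT_PATTERNS = ("?", "/tag/", "/author/", "/attachment/", "/page/")


def sitemap_urls(xml_text):
    """Return every <loc> URL in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]


def suspect_urls(urls):
    """Flag URLs matching any low-value pattern."""
    return [u for u in urls if any(p in u for p in SUSPECT_PATTERNS)]


# Hypothetical sitemap for illustration
sample_sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/plumbing-cape-town/</loc></url>
  <url><loc>https://example.com/shop/?orderby=price</loc></url>
  <url><loc>https://example.com/tag/plumbing/</loc></url>
</urlset>"""

urls = sitemap_urls(sample_sitemap)
print(suspect_urls(urls))
```

If a meaningful share of the sitemap trips these filters, the gap in Search Console may be Google correctly ignoring URLs that should never have been submitted.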
Related reading: Pages not indexed in Google
2. The pages are too weak or too similar to justify indexation
Sometimes the answer is simpler than people want: the page is too weak to earn indexation.
That can include:
- short pages with little unique value
- service pages that repeat the same offer with only minor wording changes
- location pages that swap place names without adding local relevance
- ecommerce pages with almost no useful copy, context, or differentiation
A Cape Town plumbing site, for example, might publish separate pages for every suburb using the same text structure, the same service list, and the same proof points. Even if Google crawls those pages, it may not see enough reason to index all of them.
On the ecommerce side, category pages with almost identical product sets and no unique category copy hit the same wall.
Important pages need a clear purpose, real usefulness, and a reason to exist beyond template expansion.
First check: Put the excluded page beside the closest indexed alternative on your site and look for genuine differences in intent, content, and usefulness.
3. Internal linking is too weak to signal importance
Internal links tell Google which pages matter.
A decent page can still be buried when the rest of the site barely acknowledges it.
This often shows up when:
- service pages sit outside the main navigation
- orphan pages are only accessible through XML sitemaps
- blog posts receive no links from related commercial pages
- new landing pages go live without parent-child support
- priority pages sit too deep in the click path with little contextual linking
A local service site may have a strong “SEO consultant Cape Town” page, but if it is barely linked from the main SEO page, the city hub, or related specialist pages, Google has less reason to treat it as important.
First check: Count how many meaningful internal links point to the page from navigation, parent pages, and closely related content.
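For a handful of pages you can count links by hand; at scale, a small script helps. A sketch that counts which crawled pages link to a target URL, assuming you already have the HTML of each source page (the page set below is hypothetical):

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects every <a href> value in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def internal_links_to(target, pages):
    """Return the source URLs that link to target.

    pages: dict of {source_url: html_source}, e.g. from a site crawl.
    """
    linking = []
    for url, html_source in pages.items():
        collector = LinkCollector()
        collector.feed(html_source)
        if target in collector.hrefs:
            linking.append(url)
    return linking


# Hypothetical crawl output for illustration
pages = {
    "https://example.com/seo/": '<a href="/seo-consultant-cape-town/">Cape Town</a>',
    "https://example.com/blog/post/": '<a href="/contact/">Contact</a>',
}
print(internal_links_to("/seo-consultant-cape-town/", pages))
```

A priority page with one or two internal links, all from footer or sitemap-level templates, is effectively being told it does not matter.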
Related reading: Indexing issues
4. Canonical tags are telling Google to index something else
Some exclusions are not about weak pages at all. They are about the site nominating the wrong version.
A canonical tag tells search engines which version of a page should be treated as the main version. If that tag points to the wrong URL, or if templates apply canonicals badly across sections, Google may ignore the submitted page and favour another version instead.
This often happens when:
- templates reuse the same canonical across multiple pages
- parameter URLs canonically point in the wrong direction
- location or service pages reference a broader parent page
- old CMS logic remains in place after a redesign
In Search Console, this can look like Google ignored the page. Often, it simply chose a different winner.
For example, a city service page may be live at /seo-consultant-cape-town/, but the template still points its canonical to /seo-consultant/. Google then treats the city page as secondary even though it was submitted in the sitemap.
First check: Inspect the live canonical tag and compare it with both the submitted URL and Google’s selected canonical.
5. Robots directives or noindex signals are getting in the way
Sometimes the site is not being subtle. It is explicitly telling Google to keep the page out.
This can happen through:
- a noindex meta tag
- an X-Robots-Tag header
- robots.txt blocking important resources
- plugin settings that apply indexation rules too broadly
- staging rules left behind after launch
These issues often show up after development work, migrations, or content staging. A page can look fine in the browser and still be telling Google to stay out. Robots.txt and noindex are different instructions, and messy launches often leave both in the wrong place.
A common example is a page that looks normal in the browser, sits in the sitemap, and is linked internally, but still carries <meta name="robots" content="noindex,follow"> from an SEO plugin setting left behind after launch.
First check: Review the live source, response headers, and any SEO plugin or CMS rules controlling indexation.
6. Google is seeing duplicate versions or conflicting URL signals
If canonicals tell Google which version should win, duplication is the wider problem of giving Google too many versions in the first place.
Google does not want to keep several versions of the same thing in its index. When your site offers multiple URLs for one intent, Google has to choose one, ignore others, or handle them inconsistently.
This often shows up on sites with:
- HTTP and HTTPS issues
- trailing slash inconsistencies
- parameter duplication
- print or alternate page versions
- overlapping category structures
- duplicate service pages created for the same intent
An ecommerce example is faceted navigation creating multiple crawlable URLs for one core category. A service-site example is having both /seo-consultant-cape-town/ and /cape-town-seo-consultant/ live, indexable, and chasing the same search intent.
This is not a content problem. It is a signal problem.
First check: Search for alternate versions of the same page and check whether one clear version is supported by canonicals, redirects, internal links, and the sitemap.
7. Google knows about the pages but has not prioritised them yet
Sometimes Google knows the pages exist, but has not decided to spend crawl attention on them yet. This tends to happen when:
- the site is still relatively new
- authority is limited
- pages were recently published
- crawl demand is low
- the site has too many non-priority URLs competing for attention
In these cases, the site is usually asking Google to care about too much, too soon.
The answer is usually restraint: focus the sitemap, support the best pages first, and reduce URL noise.
First check: Look at publish timing, site authority, and whether the sitemap is asking Google to process too many low-priority URLs at once.
A simple way to diagnose the gap
By this point, the goal should be clearer: stop treating the sitemap as one problem and start sorting it into patterns.
Group affected URLs into a few buckets:
- pages that never belonged in the sitemap
- pages that are weak or too similar to stronger alternatives
- pages blocked by canonicals or noindex signals
- pages competing with duplicate versions
- pages Google knows about but has not prioritised yet
Patterns matter more than totals. Once you know which pattern you are dealing with, the next step usually becomes obvious.
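The bucketing above maps fairly directly onto the status labels in a Page Indexing export. A sketch of that sorting step, with the caveat that the exact label strings vary between Search Console versions, so treat this mapping as an assumption to adjust against your own export:

```python
# Illustrative mapping from export status labels to diagnostic buckets;
# check the labels in your own Page Indexing export before relying on it
BUCKETS = {
    "Excluded by 'noindex' tag": "blocked by noindex",
    "Duplicate without user-selected canonical": "duplicate versions",
    "Submitted URL not selected as canonical": "canonical conflict",
    "Crawled - currently not indexed": "quality / too similar",
    "Discovered - currently not indexed": "not prioritised yet",
}


def bucket_counts(rows):
    """Count excluded URLs per diagnostic bucket.

    rows: iterable of (url, status) pairs, e.g. from a Page Indexing export.
    """
    counts = {}
    for url, status in rows:
        bucket = BUCKETS.get(status, "other")
        counts[bucket] = counts.get(bucket, 0) + 1
    return counts


# Hypothetical export rows for illustration
rows = [
    ("https://example.com/a/", "Crawled - currently not indexed"),
    ("https://example.com/b/", "Crawled - currently not indexed"),
    ("https://example.com/c/", "Excluded by 'noindex' tag"),
]
print(bucket_counts(rows))
```

A dominant bucket tells you which of the seven causes above to work through first.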
What to check first in Search Console
Start with the affected URLs, not the headline count.
Separate them quickly into three groups: URLs that never belonged in the sitemap, URLs with obvious technical conflicts, and URLs that should rank but are still being left out.
Then use Page Indexing, URL Inspection, canonicals, index directives, sitemap coverage, and internal links to answer one question: is this a quality problem, a technical problem, or a priority problem?
That answer matters more than the raw number of excluded pages.
When several causes appear at once
This is common. A site does not always have one neat problem.
Imagine a service business launches 40 city pages after a redesign. Some pages are thin, some inherit the wrong canonical, and most are barely linked from the main service pages. In that situation, do not start by resubmitting all 40 pages. Start by fixing the canonicals on the most commercially important pages, remove or consolidate weak duplicates, then add internal links from the core service and city hub pages. After that, clean the sitemap so it reflects the pages that actually deserve indexation.
That sequence clears the strongest blockers first and stops Google getting mixed signals from the rest.
Why this matters commercially
And this is where the issue stops being a tidy technical exercise and starts affecting growth.
The highest-risk pages are usually your money pages: core service pages, city landing pages, key category pages, and ecommerce collections.
If those pages are not indexed properly, the site loses visibility where it matters most. That means weaker rankings on commercial terms, fewer qualified visits, fewer enquiries, and a poorer return on SEO work.
That is why submitted-versus-indexed gaps should be reviewed through a commercial lens. The real question is not how many pages are missing. It is which missing pages are costing the business traffic and leads.
Get the right pages indexed
A sitemap gap only matters if the missing pages are the ones that should be driving visibility, enquiries, and sales. The goal is not to force every submitted URL into Google. The goal is to make sure the right URLs are strong, supported, and sending clean signals.
Start there. Clean the sitemap. Fix the obvious technical conflicts. Strengthen the pages that matter most. Then judge the problem again.
If you need a closer diagnostic path, see why Google is skipping important pages. If the issue runs deeper into canonicals, crawl handling, templates, or wider indexation conflicts, a technical SEO audit can show what to fix first.
Good indexation is not about getting more pages into Google. It is about getting the right pages in for the right reasons. Counting URLs is bookkeeping. Choosing the right ones to earn indexation is strategy. And treating every excluded URL as a crisis is usually a mistake.
