How to Find All Pages on a Website (and Why You Need To)

 

Table of contents

Why you need to find all the pages on your site

How your content actually gets to be seen

What is crawling and indexing?

Links

Sitemaps

CMS

What is indexing?

Using robots.txt

Using ‘noindex’

What are orphan pages?

How do orphan pages come about?

How about dead-end pages?

Where do dead-end pages come from?

What are hidden pages?

Should all hidden pages be done away with?

Newsletter sign ups

Pages containing user information

How to find hidden pages

Using robots.txt

Manually finding them

How to find all the pages on your site

Using your sitemap file

Using your CMS

Using a log

Using Google Analytics

Manually typing into Google’s search query

What then do you do with your URL list?

Manual comparison with log data

Using site crawling tools

SEOptimer's SEO Crawl Tool

In conclusion

 

Think about it. Why do you create a website? For your potential customers or audience to easily find you and for you to stand out among the competition, right? How does your content actually get to be seen? Is all the content on your site always seen?

 

Why you need to find all the pages on your website

 

It is possible that pages containing valuable information, information that actually needs to be seen, never get seen at all. If this is the case on your website, you are probably losing significant traffic, and even potential customers.

 

There could also be pages that are rarely seen, and when they are, visitors hit a dead end: they cannot access any other pages and can only leave. This is just as bad as pages that are never seen. Google will begin to note the high bounce rates and question your site's credibility, and your web pages will rank lower and lower.

 

How your content actually gets to be seen

 

search engine bot crawling for webpages

 

For users, visitors or potential customers to see your content, crawling and indexing need to happen, and happen frequently. What is crawling and indexing?

What is crawling and indexing?

For Google to show your content to users, visitors or potential customers, it first needs to know that the content exists. This happens via crawling: the search engine looks for new content and adds it to its database of already existing content.

 

What makes crawling possible?

  • Links
  • Sitemaps
  • Content Management Systems (CMS – Wix, Blogger)

 

Links:

When you add a link from an existing page to a new page, for example via anchor text, search engine bots (or spiders) can follow that link to the new page and add it to Google's 'database' for future reference.

 

Sitemaps:

These are also known as XML sitemaps. Here, the site owner submits a list of all their pages to the search engine. The webmaster can also include details like the date of last modification. The pages are then crawled and added to the 'database'. This is not real-time, however: your new pages or content will not be crawled as soon as you submit your sitemap. Crawling may happen days or weeks later.

 

Most sites using a Content Management System (CMS) auto-generate these, so it’s a bit of a shortcut. The only time a site might not have the sitemap generated is if you created a website from scratch.

example of a sitemap

 

CMS:

If your website is powered by a CMS like Blogger or Wix, the hosting provider (in this case the CMS) can tell search engines to crawl any new pages or content on your website.

 

Here’s some information to help you with the process:

 

Adding a sitemap to WordPress

Viewing the sitemap

Where is sitemap for Wix?

Sitemap for Shopify

What is indexing?

Indexing in simple terms is the adding of the crawled pages and content into Google’s ‘database’, which is actually referred to as Google’s index.

 

Before the content and pages are added to the index, the search engine bots strive to understand the page and the content therein. They even go ahead to catalog files like images and videos.

 

This is why, as a webmaster, on-page SEO comes in handy (page titles, headings, and alt text, among others). When your pages have these elements, it becomes easier for Google to 'understand' your content, catalog it appropriately and index it correctly.

 

Using robots.txt

Sometimes you may not want some pages, or parts of a website, indexed. In that case, you need to give directives to search engine bots. Using such directives also makes crawling and indexing easier, as fewer pages are being crawled. Learn more about robots.txt here.
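As a sketch (the paths and domain below are hypothetical), a robots.txt file that keeps bots out of private areas while pointing them to your sitemap might look like this:

```text
# robots.txt – served at https://example.com/robots.txt
User-agent: *
Disallow: /admin/
Disallow: /thank-you/

Sitemap: https://example.com/sitemap.xml
```

Each `Disallow` line tells compliant crawlers not to fetch URLs under that path, and the `Sitemap` line tells them where to find your full page list.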

robots.txt

 

Using ‘noindex’

You can also use the noindex directive if there are pages that you do not want to appear in the search results. Learn more about noindex here.
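For reference, the standard way to apply noindex is a robots meta tag in the page's head; such a page can still be crawled, but will be kept out of search results:

```html
<!-- Add inside the <head> of any page you want kept out of search results -->
<meta name="robots" content="noindex">
```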

 

Before you start adding noindex, you’ll want to identify all of your pages so you can clean up your site and make it easier for crawlers to crawl and index your site properly.

 

What are some reasons why you need to find all your pages?

 

What are orphan pages?

 

An orphan page can be defined as one that has no links from other pages on your site. This makes it almost impossible for search engine bots, and for users, to find these pages. If the bots cannot find a page, they will not show it in search results, which further reduces the chances of users finding it.

How do orphan pages come about?

Orphan pages may result from an attempt to keep content private, syntax errors, typos, duplicate content or expired content that was not linked. Here are more ways:

 

  • Test pages that were used for A/B testing and that were never deactivated
  • Landing pages that were based on a season, for example, Christmas, Thanksgiving or Easter
  • ‘Forgotten’ pages as a result of site migration

 

How about dead-end pages?

 

Unlike orphan pages, dead-end pages are linked to from other pages on the website but do not link out to any other pages. Examples include thank-you pages, services pages with no calls to action, and "nothing found" pages shown when users search for something via the search option.

 

When you have dead-end pages, people who visit them only have two options: to leave the site or go back to the previous page. That means that you are losing significant traffic, especially if these pages happen to be ‘main pages’ on your website. Worse still, users are left either frustrated, confused or wondering, ‘what’s next’?

 

If users leave your site feeling frustrated, confused or with any negative emotions, they are never likely to come back, just like unhappy customers are never likely to buy from a brand again.

Where do dead-end pages come from?

Dead-end pages are the result of pages with no calls to action. An example would be an about page that alludes to the services your company offers but has no link to those services. Once readers understand what drives your company, the values you uphold, how the company was founded and the services you offer, and they are already excited, you need to tell them what to do next.

 

A simple 'view our services' call-to-action button will do the job. Make sure that the button, when clicked, actually opens the services page. You do not want the user to be served a 404, which will leave them frustrated as well.

dead-end-page

 

What are hidden pages?

 

Hidden pages are those that are not accessible via a menu or navigation. Though a visitor may be able to view them, especially through anchor text or inbound links, they can be difficult to find.

 

Pages that sit in the admin panel, such as category pages, are likely to be hidden pages too. Search engines may never be able to access them, as crawlers do not read information stored in databases.

 

Hidden pages can also result from pages that were never added to the site’s sitemap but exist on the server.

Should all hidden pages be done away with?

Not really. Some hidden pages are absolutely necessary and should never be accessible from your navigation. Let's look at some examples:

 

Newsletter sign ups

You can have a page that breaks down the benefits of signing up for the newsletter, how frequently users should expect to receive it, or a graphic showing a previous newsletter. Remember to include the sign-up link as well.

 

Pages containing user information

Pages that require users to share their information should definitely be hidden. Users need to create accounts before they can access them. Newsletter sign ups can also be categorized here.

 

How to find hidden pages

 

As we mentioned, you can find hidden pages using all the methods used to find orphan or dead-end pages. Let's explore a few more.

Using robots.txt

Hidden pages are highly likely to be hidden from search engines via robots.txt. To access a site's robots.txt, type [domain name]/robots.txt into a browser and press Enter, replacing '[domain name]' with your site's domain name. Look out for entries beginning with 'Disallow'.
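As a quick sketch (the rules below are hypothetical), Python's built-in urllib.robotparser can check which paths a robots.txt blocks, so you can list disallowed areas programmatically instead of reading the file by eye:

```python
from urllib.robotparser import RobotFileParser

# A sample robots.txt (hypothetical rules, for illustration only)
robots_txt = """\
User-agent: *
Disallow: /members/
Disallow: /thank-you/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Paths disallowed here are candidates for hidden pages
for path in ["/", "/members/profile", "/thank-you/"]:
    allowed = rp.can_fetch("*", f"https://example.com{path}")
    print(path, "allowed" if allowed else "disallowed")
```

In practice you would load your live file with `rp.set_url("https://yourdomain.com/robots.txt")` followed by `rp.read()`.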

Manually finding them

If you sell products via your website, for example, and suspect that one of your product categories may be hidden, you can look for it manually. To do this, copy another product's URL and edit it accordingly. If the page loads but cannot be reached through your navigation, you were right: it is hidden.

 

What if you have no idea what the hidden pages could be? If your website is organized in directories, you can type domainname/folder-name into your browser and navigate through the pages and sub-directories.

 

Once you have found your hidden pages (and they do not need to stay hidden, as discussed above), add them to your sitemap and submit a crawl request.

 

How to find all the pages on your site

You need to find all your web pages in order to know which ones are dead-end or orphan. Let’s explore the different ways to achieve this:

Using your sitemap file

We have already looked at sitemaps. Your sitemap comes in handy when analyzing all of your web pages. If you do not have one, you can use a sitemap generator: just enter your domain name and the sitemap will be generated for you.
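If you would rather work with your sitemap programmatically, here is a minimal Python sketch (the sitemap content and URLs are hypothetical) that pulls every page URL out of the sitemap's loc entries; in practice you would fetch the file from yourdomain.com/sitemap.xml:

```python
import xml.etree.ElementTree as ET

# A sample sitemap.xml (hypothetical URLs, for illustration only)
sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/services</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

# The sitemaps.org namespace must be declared to find the elements
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml)

# Every <loc> element holds exactly one page URL
urls = [loc.text for loc in root.findall("sm:url/sm:loc", ns)]
print(urls)
```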

Using your CMS

If your site is powered by a content management system(CMS) like WordPress, and your sitemap does not contain all the links, it is possible to generate the list of all your web pages from the CMS. To do this, use a plugin like Export All URLs.

Using a log

A log of all the pages served to visitors also comes in handy. To access it, log in to your cPanel and find 'raw log files', or ask your hosting provider to share it. This way you get to see the most frequently visited pages, the never-visited pages and those with the highest drop-off rates. Pages with high bounce rates or no visitors could be dead-end or orphan pages.
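Raw access logs are just text, so counting visits per page is straightforward. Here is a hedged Python sketch (the log lines are hypothetical, in the common Apache-style format) that tallies hits per URL path:

```python
import re
from collections import Counter

# Sample raw access-log lines (hypothetical entries)
log_lines = [
    '1.2.3.4 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 512',
    '1.2.3.4 - - [10/Oct/2023:13:56:01 +0000] "GET /services HTTP/1.1" 200 1024',
    '5.6.7.8 - - [10/Oct/2023:14:01:12 +0000] "GET / HTTP/1.1" 200 512',
]

# Pull the request path out of each '"GET /path HTTP/1.1"' segment
path_re = re.compile(r'"[A-Z]+ (\S+) HTTP/')
hits = Counter(
    m.group(1) for line in log_lines if (m := path_re.search(line))
)
print(hits.most_common())
```

Pages from your full URL list that never appear in this tally are the ones to investigate as orphan or dead-end candidates.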

Using Google Analytics

Here are the steps to follow:

 

Step 1: Log in to your Analytics page.

Step 2: Go to ‘behavior’ then ‘site content’

Step 3: Go to ‘all pages’

Step 4: Scroll to the bottom and on the right choose ‘show rows’

Step 5: Select 500 or 1000 depending on how many pages you would estimate your site to have

Step 6: Scroll up and on the top right choose ‘export’

Step 7: Choose ‘export as .xlsx’ (Excel)

Step 8: Once the Excel file is exported, open ‘dataset 1’

Step 9: Sort by ‘unique page views’.

Step 10: Delete all other rows and columns apart from the one with your URLs

Step 11: Use this formula in the second column:

=CONCATENATE("http://domain.com", A1)

Step 12: Replace the domain with your site’s domain. Drag the formula so that it is applied to the other cells as well.

You now have all your URLs.

If you want to convert them to hyperlinks in order to easily click and access them when looking something up, go on to step 13.

Step 13: Use this formula in the third column:

=HYPERLINK(B1)

Drag the formula so that it is applied to the other cells as well.

Manually typing into Google’s search query

You can also type site:www.abc.com into Google's search box, replacing 'abc' with your domain name. You will get search results with all the URLs that Google has crawled and indexed, including images, links to mentions on other sites, and even hashtags your brand can be linked to.

 

You can then manually copy each URL and paste them into an Excel spreadsheet.

how to do a google search query

 

 

What then do you do with your URL list?

 

At this point, you may be wondering what you need to do with your URL list. Let’s look at the available options:

Manual comparison with log data

One option is to manually compare your URL list with the CMS log and identify the pages that appear to have no traffic at all, or the highest bounce rates. You can then use a tool like ours to check the inbound and outbound links for each page you suspect to be orphan or dead-end.

 

Another approach is to download all your URLs as an .xlsx (Excel) file, along with your log. Place them side by side (in two columns, for example) and then use Excel's 'Remove Duplicates' option, following the step-by-step instructions. By the end of the process, only the orphan and dead-end pages will be left.

 

The third comparison approach is copying the two data sets – your log and your URL list – into Google Sheets, with the log data in the first (left) column A and your URL list in column B. You can then use a formula like =VLOOKUP(B1, A:A, 1, FALSE) to look up each URL from your list in the log. URLs that come back as #N/A never appear in the log and should be interpreted as orphan pages.
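The same lookup can be sketched in Python as a simple set difference (the URLs below are hypothetical): any page in your full URL list that never shows up in your log is an orphan candidate.

```python
# URLs from your sitemap or CMS export (hypothetical examples)
all_pages = {
    "https://example.com/",
    "https://example.com/services",
    "https://example.com/old-promo",
}

# URLs that actually appear in your server log (i.e. received visits)
logged_pages = {
    "https://example.com/",
    "https://example.com/services",
}

# Pages that exist but never appear in the log are orphan candidates
orphan_candidates = sorted(all_pages - logged_pages)
print(orphan_candidates)  # → ['https://example.com/old-promo']
```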

Using site crawling tools

The other option is to load your URL list into a tool that can perform site crawls, wait for it to crawl the site, then copy and paste your URLs into a spreadsheet and analyze them one by one, trying to figure out which ones are orphan or dead-end pages.

 

These two options can be time-consuming, especially if you have many pages on your site, right?

 

Well, how about a tool that not only finds all your URLs but also allows you to filter them and shows their status (so that you know which ones are dead-end or orphan)? In other words, if you want a shortcut to finding all of your site's pages, try SEOptimer's SEO Crawl Tool.

SEOptimer’s SEO Crawl Tool

This tool allows you to access all the pages of your site. Start by going to “Website Crawls”, enter your website URL and hit “Crawl”.

enter your website url and hit "crawl" seoptimer tool

 

Once the crawl is finished you can click on “View Report”:

how to view report from seoptimer's crawl tool

 

Our crawl tool will detect all the pages of your website and list them in the “Pages Found” section of the crawl.

pages found section of seoptimer's crawl tool

 

You can identify “404 Error” issues in the “Issues Found” section, just beneath “Pages Found”:

how to find any issues from your seoptimer crawl tool report

 

Our crawlers can also identify other issues, such as pages with missing titles or meta descriptions. Once you have found all of your pages, you can start filtering and working on the issues at hand.

 

In conclusion

 

In this article, we have looked at how to find all the pages on your site and why it is important. We have also explored orphan, dead-end and hidden pages: what sets each type apart and how to identify them among your URLs. There is no better time to find out whether you are losing out due to hidden, orphan or dead-end pages.

 




