What’s in a number? Probably more than you think. As SEOs you get to deal with a lot of things, but what if under your nose was an accident waiting to happen? And developers, what if the advice from an SEO was unclear, would you even notice?
So, ladies and gentlemen, SEOs and developers, I’d like to take you on a back to basics journey through pagination and canonical to hopefully illustrate the sort of sh*t show we can collectively create if we're not paying attention. Please note that examples are in my preferred language, PHP, and have been simplified to keep them readable, but don’t let that put you off as this is all still applicable regardless of your language of choice.
How does Pagination work?
If you’re an SEO you will know what pagination does, but for this I thought it would be important to clarify how pagination works. What we do as developers is count how many results match a criterion, such as all laptops, and then we divide the total by the number of results we wish to show per page. As an example, let’s say that there are 54 laptops and we would like to show 24 per page. This would be 54 total products ÷ 24 products per page = 2.25 total pages. Hang on a minute, I don't think 0.25 of a web page is going to work, so let's round that up to 3 total pages.
Now we have 3 total pages: the first two pages list 24 products each and the last page shows the remaining 6 results. The next step is knowing which set of results to display depending on the current page, and for this we'll use query parameters.
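The rounding-up step above can be sketched in a couple of lines. This is only an illustration: the function name `totalPages` is my own invention, and the figures are the ones from the laptop example.

```php
<?php
// Hypothetical helper: how many pages are needed to show every product.
function totalPages(int $totalProducts, int $perPage): int
{
    // ceil() rounds 2.25 up to 3.0; the cast turns the float into an integer.
    return (int) ceil($totalProducts / $perPage);
}

echo totalPages(54, 24), "\n"; // 3
```

The last page then holds whatever is left over, here 54 − (2 × 24) = 6 products.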
Parameters allow us to create key-value pairs, e.g. login=true, and with this example we're going to stick with tradition and aptly name our parameter 'page'. An important point to note is that parameters are harmless and don’t do much other than get appended to a URL, unless, that is, you're expecting them and GET their value, which is exactly what we're going to do with our 'page' parameter.
What we first need to do here is find out if the page parameter is actually set, and if it is we can then GET its value to use as our current page. In the case that the parameter is not set, e.g. /laptops, we will assume it's page one and set our current page value to 1.
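As a rough sketch of that step, the snippet below reads the page parameter from a query array. The function name `currentPage`, and passing the query in as an argument rather than reading `$_GET` directly, are my own assumptions for illustration. It deliberately does nothing more than the paragraph above describes.

```php
<?php
// Hypothetical helper: return the requested page, or 1 when no page parameter is set.
function currentPage(array $query)
{
    // isset() is false when the key is missing, so /laptops falls back to page 1.
    return isset($query['page']) ? $query['page'] : 1;
}

var_dump(currentPage([]));               // int(1)
var_dump(currentPage(['page' => '2']));  // string(1) "2"
```

In a real request you would pass `$_GET`. Note that a value taken from the URL comes back as a string, not an integer.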
The final piece of pagination is to return the results for the current page. With our laptops example our results are returned as an array from a query, so we’re going to use a little calculation to work out where to start our results from and for this we will need to use an offset. Following on with our example we can now calculate the offset with (current page − 1) × products per page = offset, and with this value we can then request the results starting at this position and limit them by the number of products per page.
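A minimal sketch of the offset calculation, assuming the results are already in an array and the current page is already an integer. The function name `pageOfResults` and the stand-in data are hypothetical.

```php
<?php
// Hypothetical helper: slice one page out of the full result set.
function pageOfResults(array $results, int $currentPage, int $perPage): array
{
    // Page 1 starts at offset 0; later pages skip the results of the earlier pages.
    $offset = $currentPage > 1 ? ($currentPage - 1) * $perPage : 0;

    // Start at the offset and return at most $perPage results.
    return array_slice($results, $offset, $perPage);
}

$laptops = range(1, 54); // stand-in for 54 laptop records

echo count(pageOfResults($laptops, 1, 24)), "\n"; // 24
echo count(pageOfResults($laptops, 3, 24)), "\n"; // 6
```

If the results came straight from a database you would more likely hand the same two numbers to the query as its offset and limit, but the arithmetic is identical.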
How does a Canonical link work?
Every URL is unique by nature, so as we saw with pagination both /laptops and /laptops?page=1 will serve the exact same content on two different URLs. Aside from your own parameters, duplication can also be created by links from third parties; for example, tracking parameters such as utm_source=twitter.com are 'technically' appended to your URL. The problem with this is that it makes it difficult for search engines to know which version to index; they want to index unique pages and not a heap of duplicates.
So, along came the Canonical Link Element, whose target is an Internationalized Resource Identifier (IRI) and whose purpose is to help search engines consolidate multiple versions of the same content into one. The canonical link works by pointing to your preferred version of the page, suggesting to search engines that if the URL does not match the canonical link then it's a duplicate and should be consolidated.
Although the canonical link is commonly used to help consolidate GET parameters, not all parameters are bad; in fact page two, for example, is a unique set of products which we want the search engines to see. Since we know the current page we can decide whether to add the page parameter depending on whether it is page one or not.
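Here is a sketch of that rule, mirroring the simple logic in the paragraph above: leave the parameter off for page one and append it otherwise. The function name `canonicalUrl` and the example.com address are assumptions for illustration only.

```php
<?php
// Hypothetical helper: build the canonical URL for a paginated listing.
function canonicalUrl(string $baseUrl, $currentPage): string
{
    // Page one is canonicalised to the bare URL, without the parameter.
    if ($currentPage == 1) {
        return $baseUrl;
    }

    // Any other value is appended as the page parameter.
    return $baseUrl . '?page=' . urlencode((string) $currentPage);
}

$href = canonicalUrl('https://example.com/laptops', 2);

// Prints: <link rel="canonical" href="https://example.com/laptops?page=2">
echo '<link rel="canonical" href="' . htmlspecialchars($href) . '">', "\n";
```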
I must stress here that a canonical link is a suggestion and not a directive. For directives you'll need to look at other options such as a 301 redirect, a 404 not found, or a noindex sent in the HTTP response header (X-Robots-Tag) or as a meta tag in the HTML head.
How does this make a mess in the search index?
Good question, I've probably bored you with canonical, pagination and a hint of parameters and yet everything looks in order, nothing new, right? You would have thought so. Maybe some of the more tech savvy SEOs and developers have spotted it already, but for those that didn’t: I simply got the page parameter and at no point did I actually validate what the value was, and therein lies the problem. Up until now we have always assumed that the current page value would be either 1, 2 or 3; but what if it wasn’t?
So far it's been a hypothetical range of laptops, but to carry on with this we'll need a website in the wild to explain the next bits. Although this is common to many websites I’ll be taking a look at The New York Times Store and their Best Sellers collection. The website is developed with the popular platform Shopify, so it uses Liquid on Ruby whereas I've been using PHP; I wanted to show that the language and size of platform don’t make a difference.
The thing about paginated pages is they need to expand and contract depending on the number of products. With our fictitious laptop collection today we have 54 products, but what if maybe 6 months ago we had 76? Knowing how pagination works this would have meant a fourth page, so where did it go? One way to find out where our fourth page might be is with a site search, so to test we can target their best sellers section and see what pages Google knows about.
Aha, there it is! But hey don't worry, we all know pages come and go so page four will just drop out of the index, right? Well let's see with a cheeky peek at Google's latest cache of page four.
Oh snap! Thing is search engines are a bit like an elephant as they never seem to forget unless we explicitly tell them to do so. Despite poor little page four being an orphan Google has still found its way back without any links, and even though there are no results to show they have decided to still index the page and cached it again.
Now we could brush this under the carpet, what’s one page? You might be right, one orphan floating around wouldn’t hurt, but that’s not looking at the bigger picture. The fact is a page can exist even when the parameter value is invalid. We also know from the site search that Google can still index and cache these pages; they return a 200 OK header and even a nice canonical link to confirm it's the preferred version. Uh-oh, we're assuming it will be a number; surely you can't paginate a Christmas Pudding?
Now we’re just having a bit of fun here, but the reality check is we can set the page parameter to anything but a 1 and the page will always return 200 OK with a self-referencing canonical link. In theory if we were also using alternate versions for languages and regions, and we added this page parameter, it's not impossible to consider we could take this pudding global … Le pudding de Noël est fantastique!
Some might be thinking who would add a link on their site to a Christmas Pudding, and you’d be right we wouldn’t, but what if the links were not on our site but linked from another? What if those links were linking to our laptop store in a spammy sort of way with exact match anchors for our page=best+cheap+laptops? These are the more sinister theories for sure, but at the end of the day all of these problems and possibilities are because of a single parameter.
WTF just happened?
To explain what’s happening we need to go back to our pagination examples and use a universal rule of coding; you should never trust what you’re given! See with our laptop collection it's safe to rely on how many products we have in the range, how many results we want to show per page and the total number of pages we have. The moment the page parameter is set it's outside our safety zone, the value being the results that they have requested which can literally be anything, to infinity and beyond! This is where we start to fall apart, all of our conditions are based around whether the current page value is 1, no other tests are carried out after we GET our page parameter.
With the offset rule we only test if the current page is greater than 1 which introduces the first problem, we do not check to see if it's within the range of our total number of pages. If we take page 4 as an example it would be (4 − 1) × 24 products per page = offset 72, an impossible point to start when there are only 54 results, and this is why we end up with no results showing on page 4 of 3.
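The page 4 arithmetic can be checked with a few lines. The array of numbers is just a stand-in for the 54 laptops.

```php
<?php
$results = range(1, 54);  // stand-in for the 54 laptops
$perPage = 24;
$currentPage = 4;         // one page past the last real page

// (4 - 1) x 24 = 72, which is beyond the end of the 54 results.
$offset = ($currentPage - 1) * $perPage;
$pageResults = array_slice($results, $offset, $perPage);

echo $offset, "\n";             // 72
echo count($pageResults), "\n"; // 0
```

Nothing errors here. The slice simply comes back empty, which is why the page still renders, just without any products.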
Where it might slightly twist your noodle is why we get products 1-24 showing if we offset our results with a Christmas Pudding, and this is because it's a string and not an integer. When we test to see if a Christmas Pudding is greater than 1 the answer is no, and since it's not, the offset falls back to 0 in the same way as if it was page 1. This would be the same result if the current page was empty or if the value was 0.
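A small sketch of the pudding turning into page one. One caveat worth flagging: the loose comparison described above matches PHP 7 and earlier, where a non-numeric string compared against an integer is treated as 0. As far as I'm aware, PHP 8 changed this so the integer is compared as a string instead, which can give a different answer. To keep the sketch behaving the same on both, it casts the value to an integer explicitly, which is my own assumption about how to show the effect rather than what the original code did.

```php
<?php
$raw = 'Christmas Pudding';  // what arrived in ?page=
$perPage = 24;

// A non-numeric string casts to 0, as do '' and '0'.
$asInt = (int) $raw;

// 0 is not greater than 1, so we land on the same offset as page 1.
$offset = $asInt > 1 ? ($asInt - 1) * $perPage : 0;

echo $asInt, "\n";  // 0
echo $offset, "\n"; // 0
```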
The final nail in the coffin comes from the canonical, in our example we are only testing if the current page value is 1, if not we're simply adding the page parameter regardless of what it is; empty, 0, 2 or Christmas Pudding the outcome is always the same.
What can be done?
The way to remedy this sort of pickle is to introduce extra validation steps to the page parameter before using it. Prevention is key: we can test the current page against different criteria and act accordingly. In addition, if we do this before the page headers are sent we still have a moment to introduce a directive such as a 404, or to redirect with a 301, depending on why the validation failed. This is a golden opportunity for developers and SEOs to work together and discuss which directives to use at each stage.
It's at this point I need to draw your attention to the fact that this PHP example is not exhaustive, nor is it particularly efficient. It's merely here to show common failure points of validation and to provoke ideas you could use in your language of choice. Do not blindly copy and paste this, that's for Stack Overflow, and possibly why stuff like this gets missed.
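What follows is one possible shape for that validation, offered as a sketch under assumptions rather than a definitive listing. The function name `resolvePage`, the returned array, and especially the choice of status for each failure (404 for junk and out-of-range values, 301 for an explicit page=1) are hypothetical, and are exactly the sort of thing to agree with your SEO first.

```php
<?php
// Hypothetical helper: decide how to respond to a raw ?page= value.
// $raw is null when the parameter is absent, otherwise whatever the URL carried.
function resolvePage($raw, int $totalPages): array
{
    // No parameter at all, e.g. /laptops: this is page one.
    if ($raw === null) {
        return ['status' => 200, 'page' => 1];
    }

    // Reject anything that is not a plain positive whole number:
    // arrays (?page[]=2), empty strings, text, 0, negatives, decimals, leading zeros.
    if (!is_string($raw) || preg_match('/^[1-9][0-9]*$/', $raw) !== 1) {
        return ['status' => 404, 'page' => null];
    }

    $page = (int) $raw;

    // ?page=1 duplicates the bare URL; assumed here to redirect to it.
    if ($page === 1) {
        return ['status' => 301, 'page' => 1];
    }

    // Past the last page, e.g. page 4 of 3.
    if ($page > $totalPages) {
        return ['status' => 404, 'page' => null];
    }

    return ['status' => 200, 'page' => $page];
}

echo resolvePage('2', 3)['status'], "\n";                 // 200
echo resolvePage('4', 3)['status'], "\n";                 // 404
echo resolvePage('Christmas Pudding', 3)['status'], "\n"; // 404
```

In a real page you might call it with `$_GET['page'] ?? null` before any output is sent, then set the response code or redirect based on the status that comes back.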
In the case of Shopify I’m a bit stumped, although I can see what’s happening alas what I know about Liquid and Ruby wouldn’t even fill the back of a postage stamp. I would suggest looking through the paginate object documentation and the canonical urls article by Tiffany Tse. If you're a Shopify developer it would no doubt make more sense to you than it does to me.
When I was 15 my sister had just started to learn to drive. Whilst having a cheeky cigarette out of my bedroom window I spotted her pulling into our road after a driving lesson. She turned the car around at the end of our street, paused for a minute to listen to the instructor and then drove straight towards our house, mounting the kerb and stopping with just the front wheels of the car on the pavement. I could hear her apologising and sounding slightly embarrassed as she got out of the car and came into the house. Little brother syndrome kicked in, “Interesting parking” I smirked. My sister replied “It's embarrassing, he told me to park outside the house with two wheels on the kerb so I did, but he never said which two”.
Although funny at the time this is exactly what we’ve shown here. If as SEOs we give recommendations like “the canonical needs to be the same as the current page … oh, unless it’s page 1” then with our original canonical and the Shopify example both do exactly that, nothing more and nothing less. The big takeaway from this for you is to manually check things … oh yeah, you heard me right, I just said the 'M' word. If we rely purely on crawling with tools like Screaming Frog this issue would probably be missed, there are no links to the invalid URLs and no directive is returned to pick up on. You don't need to go crazy, but testing key template pages like sections is a great place to look, and if something wants a number give it a Christmas Pudding to be sure.
Ah developers, my kin, don’t be disheartened. Although the lack of validation falls on us, the canonical implication should have been unearthed by the SEOs. My advice is to always validate what you're given; GET, REQUEST, POST and most importantly SEO advice. If you discover an issue like the one we cover here go find an SEO, raise your concerns and ask them what they would do about it.
So there we have it, the confusion between canonical and pagination in all its Christmas Pudding glory … thanks for reading peeps!