Did you know that you hold more authority than you think when it comes to controlling how web crawlers index your site?
Digital marketing agencies usually love talking about ‘hacks’ that would improve your search rankings and website visibility. And what better way to do it than by taking advantage of the robots.txt file?
This tiny text file is one of the most significant yet overlooked parts of websites. Some SEO experts rarely talk about it, much less know how to properly implement it.
What’s surprising about the robots.txt file is that it’s one of the quicker SEO tasks to carry out. When you get it right, it can support higher organic traffic, better visibility on the SERPs and a wider reach.
That said, editing the robots.txt file calls for care and planning to get the most out of its SEO benefits. Even the smallest mistake can harm your site, so make sure you read this article carefully before making any changes.
In this article, as an addition to our ultimate SEO guide, we will tackle one of the most vital components of your search engine optimisation work: the robots.txt file.
What is a robots.txt File?
The simplest definition we can give you is that a robots.txt file is a guide: a map that tells search engine crawlers which parts of your website they may visit and which parts they must stay away from.
It is also part of the robots exclusion protocol (REP), a group of web standards that regulate how robots crawl the web, index content and serve the most relevant pages to users.
Running a website is similar to managing an art museum. To make sure your visitors see the latest exhibits, you hire a tour guide, who shows them the highlights of the trip and keeps them out of off-limits areas.
You can also imagine that a robots.txt file is like a cardboard sign hung on the walls of a bar or a gym. The sign may have little to no power in enforcing the rules, but well-intentioned people will follow them nonetheless. Meanwhile, the bad ones will break them and end up getting kicked out.
As part of technical SEO, the robots.txt file is integral to optimising your crawl budget.
What is a crawl budget, you ask?
It is the number of pages Googlebot crawls on a website within a specific timeframe. Search engines assign one to every site because they have to divide their attention among millions of websites, and their resources are limited.
There are two factors in how search engines determine crawl budget:
- Crawl demand refers to the popularity of your pages and how often they are updated.
- Crawl rate limit reflects how quickly your pages respond, any crawl limit set for your site and the crawl errors recorded in your Google Search Console.
This matters for SEO because a page that search engines don’t index is not going to rank. However, only certain websites need to pay close attention to the crawl budget. Here are a few cases:
- You manage an e-commerce website that has more than 10,000 pages.
- There are several redirect chains that consume your crawl budget.
- You want to add a new section that contains plenty of pages.
How Does robots.txt Work?
Like any other file on your website, the robots.txt file is hosted on your web server. To be clear, robots.txt is a plain text file and contains no HTML markup, and you can view it by typing the full URL of the homepage and then adding /robots.txt.
It is the first file that web crawlers look at before crawling the rest of the site. Although the robots.txt file hands instructions to search engine crawlers, it can’t enforce them.
For example, good bots check the robots.txt file first and follow its instructions before visiting the other pages of the website. Other spiders will either ignore the file or use it to look for forbidden pages to crawl.
Remember that a well-behaved search engine crawler follows the set of rules addressed to it in your robots.txt file, and if the file contains contradictory commands, the bot goes with the more specific one.
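To make the idea of the more specific rule winning concrete, here is a minimal Python sketch. It assumes the longest-match behaviour that Google documents for its own crawlers (the longest matching path wins, and Allow wins a tie); other crawlers may resolve conflicts differently. The function name is_allowed and the sample paths are made up for illustration, and wildcards are left out to keep the sketch short.

```python
# Sketch of "the most specific rule wins" for plain path prefixes (no wildcards).
# Assumption: longest matching path wins, and Allow wins when two rules tie.

def is_allowed(rules, path):
    """rules: list of (directive, path_prefix) pairs from ONE user-agent group."""
    best_len = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # empty values and non-matching prefixes are skipped
        is_allow = directive.lower() == "allow"
        # A longer prefix is more specific; on equal length, Allow takes priority.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


# Hypothetical, contradictory-looking rules for a single group.
rules = [
    ("Disallow", "/blog/"),
    ("Allow", "/blog/robots-txt-guide"),
]

print(is_allowed(rules, "/blog/robots-txt-guide"))  # True
print(is_allowed(rules, "/blog/another-post"))      # False
print(is_allowed(rules, "/contact"))                # True
```

The order of the rules in the list does not change the result here, which is the point of resolving conflicts by specificity rather than by position.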
The Function of Robots Txt
To better understand how robots.txt works, it helps to look at it by function: as a guide and as a crawl budget optimiser.
Directions for Crawler
For search engines to get to know your site’s content and offer it to the masses, they need a clear and straightforward robots.txt file.
This file directs the bots on where and how to crawl your website. Exploring your content can take up much of these crawlers’ time, especially if you run a large website.
The robots.txt file is a tool that can bring you closer to search engines. As you set guidelines for their spiders to crawl and discover your pages’ content, you also help them work out whether your site matches search intent.
Crawl Budget Optimiser
Aside from giving directions to the bots, another thing that makes the robots.txt file a holy grail is that some web owners can maximise it to optimise their crawl budget.
Optimising your crawl budget for SEO is a crucial move for your website’s overall health. It pays to know which of your pages need the most crawling attention and which need none at the moment.
As covered earlier, crawl budget refers to the number of URLs search engine crawlers can and want to crawl on your website. It matters that crawling activity centres on your valuable pages rather than irrelevant ones.
With that said, ensure that your robots.txt file directs crawlers to the value-adding content of your website.
Preventing Duplicate Content and Non-Public Pages from Appearing on SERPs
Search engines like Google don’t have to crawl all of your pages because some of them don’t need to rank. These include duplicate pages, login pages, internal search results pages and staging sites, to name a few.
Yes, these pages are still important, but you don’t want users to land on them at random. This is one of the cases where you’d use robots.txt to block pages from crawlers. Keep in mind that robots.txt stops crawling rather than indexing, so a blocked URL can still show up in results if other sites link to it.
Specifying Sitemap Location
Google’s documentation states that you can include a line in your robots.txt file that gives the location of your sitemap. The line is optional, but it makes it easier for Googlebot and other bots to find your sitemap.
If you leave this line out, search engines may take longer to discover your sitemap, which can slow down indexing. That can make it harder to rank in the long run!
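As a rough illustration of what a sitemap line looks like and how a crawler-side parser picks it up, the Python sketch below feeds a small robots.txt to the standard library's urllib.robotparser. The domain example.com, the /search/ rule and the sitemap path are placeholders, not values from any real site.

```python
from urllib.robotparser import RobotFileParser

# Placeholder content: example.com and the paths are assumptions for illustration.
robots_txt = """\
User-agent: *
Disallow: /search/

Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# site_maps() (Python 3.8+) returns the listed sitemap URLs, or None if there are none.
print(parser.site_maps())  # ['https://www.example.com/sitemap.xml']
```

The Sitemap line sits outside any user-agent group, so it applies to every crawler that reads the file.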
How to Create Robots Txt
If you decide to create a robots.txt file for your website today, you are taking a step towards better visibility on search engines. The process involves four important steps:
- Create a file named robots.txt.
You can use Notepad, TextEdit, vi or emacs to create your robots.txt file. If you get a prompt while saving, make sure you save the file with UTF-8 encoding. Google may ignore characters that are not part of the UTF-8 range.
Always name the file robots.txt and place it at the root of your website. Keep in mind that you must have only ONE robots.txt file on your website.
Google Search Central provided a gentle reminder – “If you’re unsure about how to access your website root, or need permissions to do so, contact your web hosting service provider. If you can’t access your website root, use an alternative blocking method such as meta tags.”
Content management systems (CMS) like Shopify and WordPress also allow their users to directly edit the robots.txt file and give them more control over their SEO.
In the case of Shopify, store owners can make the following changes to robots.txt:
- Block specific crawlers.
- Add crawl-delay rules for certain crawlers.
- Add sitemap URLs.
- Allow or disallow some pages from being crawled.
The platform highly recommends using the robots.txt.liquid theme template to add or remove directives, so that the file can still be kept up to date automatically.
- Specify the rules you want to include in your robots.txt file.
Adding rules to your robots.txt file is vital for a smooth crawling process.
These instructions are crucial because a mistake here will affect your crawl budget, which is something you wouldn’t want to happen.
It matters that you know your website’s content from top to bottom, as that will shape the guidelines you set for search engine spiders.
Be mindful of the different groups in your robots.txt file. Each group begins with a user-agent line and carries one or more directives, one directive per line.
- Upload your robots.txt file to your website.
After creating your robots.txt file, the next step is to save or upload it to your website.
There is no single tool for uploading a robots.txt file, as the method depends on your site and server architecture. If you are having trouble accessing or uploading it, reach out to your hosting provider.
Once the file is uploaded, check that it is accessible to Google and that Google can parse it.
- Test and submit your robots.txt file.
Testing your website’s robots.txt file is necessary to ensure your efforts don’t go to waste.
Start by checking that your newly uploaded robots.txt file is publicly accessible. Open a private browsing window in your browser and key in the location of your robots.txt file.
Once the file is live, Google’s crawlers will find and start using it automatically.
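The local parts of these steps can be sketched in Python: write the rules, save them as robots.txt with UTF-8 encoding, then read the file back and confirm the rules behave as intended before uploading. The rules, the example.com domain and the temporary folder standing in for the website root are all assumptions for illustration; the real upload and the live check still depend on your host.

```python
import tempfile
from pathlib import Path
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; adjust the paths to your own site.
robots_txt = "\n".join([
    "User-agent: *",
    "Disallow: /login/",
    "Disallow: /search/",
    "",
    "Sitemap: https://www.example.com/sitemap.xml",
    "",
])

# The file must be named robots.txt and saved as UTF-8.
# A temporary folder stands in for the website root here.
site_root = Path(tempfile.mkdtemp())
robots_path = site_root / "robots.txt"
robots_path.write_text(robots_txt, encoding="utf-8")

# Local sanity check: re-read the saved file and confirm the rules parse as intended.
parser = RobotFileParser()
parser.parse(robots_path.read_text(encoding="utf-8").splitlines())

for path in ("/", "/login/", "/blog/post"):
    print(path, parser.can_fetch("Googlebot", "https://www.example.com" + path))
```

A local check like this only catches mistakes in the rules themselves. Whether the uploaded file is publicly reachable still has to be confirmed in a browser, as described above.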
Robots.txt SEO Best Practices
Set your Robots.txt User-Agent
A user agent is the name of a search engine crawler that you want to allow or block on your website. There are hundreds of them online, and these are some we picked that you might find useful for SEO.
- Google: Googlebot
- Google Images: Googlebot-Image
- Bing: Bingbot
- Yahoo Search: Slurp
- Baidu: Baiduspider
- DuckDuckGo: DuckDuckBot
- Yandex: YandexBot
- Sogou: Sogou web
- Facebook: Facebot
- Exalead: Exabot
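For your information, you may set your user agent in three different ways:
- Creating only one user-agent, for example User-agent: Bingbot
- Establishing more than one user-agent, by listing several User-agent lines one after another at the start of a group
- Setting all search engine crawlers as user-agent, with the wildcard User-agent: *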
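The three approaches can sit side by side in one file. The Python sketch below shows one such file and checks it with the standard library's urllib.robotparser; the folder names and the example.com domain are hypothetical, and real crawlers may match user-agent names slightly differently from this parser.

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
# One user-agent
User-agent: Bingbot
Disallow: /drafts/

# More than one user-agent sharing the same rules
User-agent: Googlebot
User-agent: Googlebot-Image
Disallow: /private/

# Every other crawler
User-agent: *
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Print which (hypothetical) folders each crawler may fetch.
base = "https://www.example.com"
for agent in ("Bingbot", "Googlebot", "Googlebot-Image", "DuckDuckBot"):
    results = {
        folder: parser.can_fetch(agent, base + folder + "page")
        for folder in ("/drafts/", "/private/", "/tmp/")
    }
    print(agent, results)
```

Note that a crawler with its own named group follows that group only, so Bingbot in this sketch is not affected by the /tmp/ rule in the wildcard group.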
Directives in robots.txt
As mentioned previously, the robots.txt file is read in groups. Each group specifies which user-agent it addresses and the rules that crawler must follow.
Here are the directives you will often see on your robots.txt file.
A disallow directive tells search engine bots not to access files and pages that fall under a specific path. The value of this directive starts with a forward slash (/) followed by the path of the page or folder, relative to the root of your site, rather than the full URL.
You may have one or more disallow settings per rule. Explicitly disallow the pages you don’t want bots to access, because web crawlers process the groups of your robots.txt file from top to bottom. You don’t want them to spend vast amounts of time crawling pages that don’t hold much value.
Let’s give you an example. Say you want to prevent all search engines from crawling your site. Your block would consist of two lines: User-agent: * followed by Disallow: /
The name of the disallow directive isn’t case-sensitive, so you can write it in capitals or in lowercase. Most users capitalise the first letter because it makes the file easier to read.
However, you have to be careful with the values inside every directive on the file because they are case-sensitive.
For example, /image/ is not the same as /Image/.
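Both points, blocking the whole site and the case-sensitivity of path values, can be checked with a short Python sketch built on the standard library's urllib.robotparser. The helper name build_parser and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def build_parser(text):
    """Parse robots.txt content given as a string (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# Block every crawler from the whole site.
block_all = build_parser("User-agent: *\nDisallow: /\n")
print(block_all.can_fetch("Googlebot", "https://www.example.com/any-page"))  # False

# Directive names are case-insensitive, but the path values are not.
case_demo = build_parser("user-agent: *\nDISALLOW: /image/\n")
print(case_demo.can_fetch("Googlebot", "https://www.example.com/image/logo.png"))  # False
print(case_demo.can_fetch("Googlebot", "https://www.example.com/Image/logo.png"))  # True
```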
Alongside the disallow setting, there is also an allow directive.
Use the Allow directive if you want to override your disallow settings. It works as the exact opposite of the previously cited command.
So, if you want to block Googlebot from accessing your blog posts except for one, you would disallow the blog folder and add an Allow line with the path of that single post.
Just keep in mind that not every search engine recognises this command. Google and Bing both support it.
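Here is a small Python sketch of the one-post exception, again using urllib.robotparser from the standard library. The post slug and the example.com domain are placeholders. Allow is listed first so that parsers which apply the first matching rule, as Python's standard-library parser has traditionally done, give the same answer as crawlers that pick the most specific rule.

```python
from urllib.robotparser import RobotFileParser

# Placeholder slug: /blog/allowed-post stands for the single post you want crawled.
robots_txt = """\
User-agent: Googlebot
Allow: /blog/allowed-post
Disallow: /blog/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

base = "https://www.example.com"
print(parser.can_fetch("Googlebot", base + "/blog/allowed-post"))  # True
print(parser.can_fetch("Googlebot", base + "/blog/other-post"))    # False
# No group addresses Bingbot in this file, so nothing is blocked for it.
print(parser.can_fetch("Bingbot", base + "/blog/other-post"))      # True
```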
An XML sitemap is an XML file that carries a list of all pages on a website that you want robots to access and crawl.
The sitemap directive is an optional element of your website’s robots.txt file.
It gives the location of your website’s sitemap. If you plan to use it, make sure the value is a fully qualified URL to avoid unnecessary trouble in the future.
You will usually find the sitemap directive at the top or bottom of a robots.txt file. It looks like Sitemap: https://www.example.com/sitemap.xml, where the address is a placeholder for your own sitemap URL.
And if you want to speed up the crawling process, you can submit your XML sitemap to every search engine through their webmaster tools.
Robots Txt in WordPress
If your site runs on WordPress, you may already know that WordPress automatically creates a virtual robots.txt file. No sweat for web owners like you.
Unfortunately, because it is a virtual file, you cannot amend or improve it directly. You must create a physical file on your server before you can change it to your liking.
Editing Robots Txt in Rank Math
If you are using Rank Math as your WordPress SEO plugin, this is how you edit your robots.txt file:
- Log in to your WordPress website.
- Switch to Advanced Mode.
- From the WordPress dashboard, go to Rank Math > General Settings > Edit robots.txt.
- Add or edit the code as you see fit. Remember that Rank Math adds an automatic set of rules to your robots.txt file.
- Once you are confident with your code, click Save Changes.
Editing Robots Txt in Yoast
Whether you are using Yoast SEO or Yoast SEO Premium, you’ll have the freedom to edit your robots.txt file.
Check out this 6-step process of editing your robots.txt in Yoast.
- Log in to your WordPress website.
- Click on ‘SEO’.
- Click on ‘Tools’.
- Click on ‘File Editor’.
- Make the changes to your file.
- Save your changes.
Robots Txt in Magento
Magento provides a mechanism for creating your site’s robots.txt file, saving you from the hassle of crafting it from scratch.
To generate a robots.txt file in Magento, you may do the following:
- Log in to your Magento Admin account
- Select STORE, then CONFIGURATION
- Choose GENERAL, then DESIGN
- Look for the SEARCH ENGINE ROBOTS option on the drop-down menu
- You may select the DEFAULT storefront or create a robots.txt yourself.
- Once you have set your options, you may click the RESET to DEFAULT button to add the default robots.txt file directives to the custom instructions field.
- Don’t forget to click SAVE CONFIG to save your modifications.
Robots Txt Generator
There are free online tools that enable you to generate your robots.txt files right away.
You will notice that most online robots.txt generators provide you with numerous options. Those options are not mandatory, so decide what you want in your robots.txt file first and choose wisely from the defaults.
Typically, the first row contains the default values for all robots and an option to set a crawl delay.
The second row is about the sitemap. If you want to add a sitemap directive to your robots.txt file, make sure you don’t miss this one.
Next, you choose whether or not you want search engine bots to crawl certain pages of your website. You will see options for image indexing and for your site’s mobile version.
The last option you will see in an online robots.txt generator is the restriction for crawlers, which is the disallow directive.
After filling out and ticking the options, you may download your generated robots.txt file and have it saved on your computer or upload it directly to your website.
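What such a generator does behind its form can be approximated in a few lines of Python. This is only a sketch with made-up option names (allow_all, crawl_delay, sitemap_url, disallowed_paths), not the logic of any particular online tool. Keep in mind that Crawl-delay is a non-standard directive and Googlebot does not act on it.

```python
def generate_robots_txt(allow_all=True, crawl_delay=None,
                        sitemap_url=None, disallowed_paths=()):
    """Build robots.txt text for all crawlers from a few generator-style options."""
    lines = ["User-agent: *"]
    if not allow_all:
        lines.append("Disallow: /")   # refuse the whole site by default
    elif disallowed_paths:
        lines.extend(f"Disallow: {path}" for path in disallowed_paths)
    else:
        lines.append("Disallow:")     # an empty value blocks nothing
    if crawl_delay is not None:
        lines.append(f"Crawl-delay: {crawl_delay}")
    if sitemap_url:
        lines.extend(["", f"Sitemap: {sitemap_url}"])
    return "\n".join(lines) + "\n"


# Example with placeholder values.
print(generate_robots_txt(
    crawl_delay=10,
    sitemap_url="https://www.example.com/sitemap.xml",
    disallowed_paths=["/cart/", "/search/"],
))
```

Whatever tool produces the file, it is worth reading the generated text line by line before uploading it, since a single stray Disallow: / blocks the entire site.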
A site’s robots.txt file can be overwhelming to understand, as it leans towards the technical side of search engine optimisation.
We understand that these elements can feel intimidating and that you might want to shy away from them as quickly as possible.
Let us tell you that there’s no need to worry if you don’t get it all at once. It is okay. We are all in the process of learning, which is why we have prepared this mini-series for you.
We are providing you with the basic concepts and tips so that you have extra information to draw on when evaluating the SEO campaign for your website.
It is extremely important to understand what makes up your website for SEO success. The robots.txt file is one of the many factors that can help you achieve your optimisation goals.
Frequent modifications are not required, but it helps to test the file and check how it works so that your website performs well and you don’t compromise your crawl budget.
There is a lot that a robots.txt file can do for your technical SEO.
Who knows? Once you put in the necessary work to improve your robots.txt file, you may be surprised by the changes it can bring to your site’s current SEO position.
Roots Digital SEO services can help you not only learn the fascinating pillars of search engine optimisation but also achieve your business’s SEO goals. Our SEO experts can’t wait to serve and collaborate with you. Let us know how we can help.
What is a Robots.Txt File Used for?
A robots.txt file serves as a guide for search engine crawlers visiting your website’s pages. It carries the instructions you set on how bots should crawl your site.
Is a Robots.Txt File Necessary?
Though it is not essential for a successful and competitive website, a robots.txt file can influence your site’s SEO campaign.
Your website can function well with or without one, but as indicated in this article, a robots.txt file is a set of instructions on how you want spiders to crawl your website.
How do I Block a Crawler in Robots Txt?
You can Allow or Disallow crawlers from exploring the pages of your website. To block a specific crawler, name it in a User-agent line and add a Disallow rule beneath it, for example Disallow: / to keep it away from the whole site.
What Happens If You Ignore Robots.Txt?
If you opt to go without a robots.txt file, it won’t pose any trouble, as having one is optional in the first place. You have the freedom to run your website without it, but if you want search engine crawlers to have clear directions on how to explore the content of your site, then add one.