What is Robots.txt and when should businesses care about it?

Thinh Dinh

You have a sitemap, you've submitted it to Google, and your website is starting to be indexed. But one day, you discover that your admin page, internal payment page, or even your website's staging page is appearing on Google. Customers type in your company name and see the unfinished test page.

Or conversely: you publish a new blog post, wait two weeks, and it still doesn't appear on Google. You ask the technical team, and they say, "The robots.txt file is blocking Google from crawling the entire website."

Both situations involve a small file that few web administrators pay attention to: robots.txt.

This article will explain what robots.txt is, how it works, when you need to edit it, and common mistakes businesses should avoid—all in simple language, with practical examples.

What is Robots.txt? An explanation for web administrators.

If a sitemap is like a building diagram – showing Google which rooms exist – then robots.txt is like a "Restricted Areas" sign – telling Google which rooms are off-limits.

In technical terms: robots.txt is a small text file located in the root directory of a website (for example: https://example.com/robots.txt). This file contains rules that tell search engine bots – such as Googlebot – three things:

  • Which pages are allowed to be crawled?
  • Which pages are not allowed to be crawled?
  • Where is the sitemap located?

You can view the robots.txt file of any website by typing: ten-mien.com/robots.txt in your browser.

💡 Important: robots.txt is just a polite request, not an absolute ban. Reputable bots like Googlebot will comply, but malicious bots (spam, scrapers) may ignore it. If you need real security, use a password or firewall – don't rely on robots.txt.

What does a robots.txt file look like?

You don't need to write this file from scratch. But to understand it at a glance, this is a simple robots.txt file:

```
User-agent: *
Disallow: /admin/
Disallow: /thanh-toan/
Disallow: /staging/
Allow: /

Sitemap: https://example.com/sitemap.xml
```

Explanation for each line:

| Line | Meaning |
|---|---|
| `User-agent: *` | Applies to all bots (Google, Bing, etc.) |
| `Disallow: /admin/` | Prevent bots from accessing the /admin/ directory |
| `Disallow: /thanh-toan/` | Prevent bots from accessing the checkout page |
| `Disallow: /staging/` | Do not allow bots into the staging environment |
| `Allow: /` | Allow bots to crawl everything else |
| `Sitemap: https://...` | Show bots where the sitemap is located |
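You can verify rules like these programmatically. The sketch below uses Python's standard-library `urllib.robotparser`, which implements the basic prefix rules (not Google's wildcard extensions), to check the example file above:

```python
import urllib.robotparser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /thanh-toan/
Disallow: /staging/
Allow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Blocked: the path starts with a disallowed prefix
print(parser.can_fetch("*", "https://example.com/admin/settings"))  # False
# Allowed: falls through to "Allow: /"
print(parser.can_fetch("*", "https://example.com/blog/my-post"))    # True
```

The same check works against a live site via `parser.set_url(...)` and `parser.read()`, which fetches the file over HTTP.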

Here's a more complex example – suitable for a business website with a blog, service pages, and an admin area:

```
# Allow all bots to crawl public content
User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
Disallow: /search?
Disallow: /*?ref=
Disallow: /*?utm_

# Allow Googlebot to crawl CSS and JS (needed to render the page)
User-agent: Googlebot
Allow: /wp-content/uploads/
Allow: /wp-includes/

Sitemap: https://example.com/sitemap.xml
```
📝 Note for developers: The `*` character in a path is a wildcard — `Disallow: /*?utm_` blocks all URLs containing the `?utm_` parameter. A `$` at the end of a path anchors the match to the end of the URL. For example: `Disallow: /*.pdf$` will block all PDF files.
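To see how these patterns behave, here is a minimal Python sketch that translates a robots.txt path pattern into a regular expression. This is our own simplified model of the matching rules (the function name `robots_pattern_to_regex` is illustrative, not a standard library API):

```python
import re

def robots_pattern_to_regex(pattern):
    """Convert a robots.txt path pattern into a start-anchored regex.

    '*' matches any sequence of characters; a trailing '$'
    anchors the pattern to the end of the URL.
    """
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"  # restore the end-of-URL anchor
    return re.compile(regex)

# "/*?utm_" matches any URL containing the ?utm_ parameter
print(bool(robots_pattern_to_regex("/*?utm_").match("/page?utm_source=fb")))    # True
# "/*.pdf$" matches PDFs, but not a PDF URL with extra parameters
print(bool(robots_pattern_to_regex("/*.pdf$").match("/files/report.pdf")))      # True
print(bool(robots_pattern_to_regex("/*.pdf$").match("/files/report.pdf?x=1")))  # False
```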

How does Robots.txt work in the SEO process?

To understand the role of robots.txt, let's look back at the process Google uses to rank websites in search results:

Crawl (fetch) → Index (store) → Rank (order results)

Robots.txt operates in the first step - Crawl.

Before Googlebot begins crawling any page on your website, it checks your robots.txt file first. If a URL is listed under Disallow, Googlebot will skip that page—no crawling, no reading of its content.

```
Googlebot wants to crawl https://example.com/admin/settings
→ Checks robots.txt → Sees Disallow: /admin/ → Skips it, no crawl

Googlebot wants to crawl https://example.com/dich-vu/
→ Checks robots.txt → Not blocked → Crawls normally → Indexes
→ Can appear in search results
```

Robots.txt and sitemap: a complementary pair.

| File | Role |
|---|---|
| Sitemap | "This is a list of pages I want Google to know about." |
| Robots.txt | "These are the pages I don't want Google to crawl." |

These two files don't conflict—they work together. The sitemap provides directions, while robots.txt sets the barrier. Combined correctly, you control what Google sees and ignores on your website.

What is Robots.txt used for? 4 common scenarios.

1. Hide the admin and internal pages from Google.

Admin page, CMS backend page, staging page, test page - none of these should appear in Google search results. Robots.txt tells Google: "Don't enter here."

```
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /staging/
```

2. Avoid wasting your "crawl budget".

Google doesn't crawl indefinitely. Each website has a "crawl budget"—the number of pages Googlebot will crawl on each visit. If a website has many unimportant pages (internal search pages, filter pages, pagination pages), Googlebot might be busy crawling these pages instead of important service pages or blog posts.

```
Disallow: /search?
Disallow: /tag/
Disallow: /page/
```
💡 Crawl budget is primarily important for large websites (thousands of pages). Small business websites usually don't need to worry too much, but keeping your robots.txt clean is still a good habit.

3. Block duplicate content

If a website has multiple URLs that lead to the same content (for example, a URL with the tracking parameter ?utm_source=facebook, or a print version ?print=true), you can block these duplicate URLs:

```
Disallow: /*?utm_
Disallow: /*?ref=
Disallow: /*?print=
```

4. Point bots to the sitemap

Robots.txt is the first place Googlebot checks when it comes to a website. Placing your sitemap here helps Google find it faster – even if you haven't submitted it to Search Console yet.

```
Sitemap: https://example.com/sitemap.xml
```

When should businesses pay attention to robots.txt?

You don't always need to edit robots.txt. But there are times when checking this file is mandatory:

When the new website goes live

This is the most critical moment. Many websites are completely blocked from crawling because the development team forgot to remove the line Disallow: / – a line they placed during staging to prevent Google from indexing the unfinished version.

Test when going live:

| Check | How to check |
|---|---|
| The robots.txt file exists | Open https://ten-mien.com/robots.txt in your browser |
| The entire website is not blocked | Make sure there is NO `Disallow: /` line |
| The sitemap is declared | Make sure the line `Sitemap: https://ten-mien.com/sitemap.xml` is present |
| Important pages are not blocked | Check that service pages, the blog, and contact pages are not covered by any `Disallow` rule |
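The two most dangerous checks in this list can be automated. Below is a minimal sketch (the function `audit_robots_txt` and its warning messages are our own illustration, not a standard tool) that flags a blanket block and a missing sitemap declaration:

```python
def audit_robots_txt(text):
    """Return a list of go-live warnings for robots.txt file contents."""
    warnings = []
    directives = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" in line:
            field, _, value = line.partition(":")
            directives.append((field.strip().lower(), value.strip()))

    # Most serious error: "Disallow: /" blocks the entire site
    if ("disallow", "/") in directives:
        warnings.append("Entire site is blocked (Disallow: /)")
    # Missing sitemap declaration
    if not any(field == "sitemap" for field, _ in directives):
        warnings.append("No Sitemap line declared")
    return warnings

blocked = "User-agent: *\nDisallow: /\n"
healthy = "User-agent: *\nDisallow: /admin/\nSitemap: https://example.com/sitemap.xml\n"
print(audit_robots_txt(blocked))  # two warnings
print(audit_robots_txt(healthy))  # []
```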

When a website is not indexed by Google after several weeks

If you already have a sitemap, have submitted it to Search Console, but Google still isn't indexing it, the robots.txt file is the first suspect to check.

When adding areas to be hidden (member page, internal page)

If your website includes account management pages, member areas, or internal pages, please update your robots.txt to block these areas.

When changing website platforms or redesigning

Each platform (WordPress, Webflow, custom code) creates a different URL structure. When migrating, the old robots.txt file may mistakenly block the new page or miss pages that need to be blocked.

When Search Console reports the error "Blocked by robots.txt"

Google Search Console provides an indexing report that shows which pages are being blocked by robots.txt. If you see an important page being blocked, it's time to fix the file immediately.

5 common robots.txt errors and how to fix them.

Error 1: Blocking the entire website - the most serious error.

Symptom: No pages are indexed by Google. Search Console reports numerous pages as "Blocked by robots.txt".

Reason: The robots.txt file contains:

```
User-agent: *
Disallow: /
```

These two lines mean: "Block all bots from accessing any page." This often happens when developers set this rule during staging and forget to remove it before going live.

Solution: Change to:

```
User-agent: *
Disallow:

Sitemap: https://ten-mien.com/sitemap.xml
```

`Disallow:` with nothing after the colon means: allow crawling of everything.

⚠️ This is the #1 error we see on new business websites. After fixing it, Google may take several days to several weeks to crawl it again. Resubmit your sitemap via Search Console to speed up the process.

Error 2: Blocking CSS and JavaScript

Symptom: The website displays normally in the browser, but when using the "URL Inspection" tool in Search Console, Google detects that the page has a broken layout or is blank.

Reason: Robots.txt is blocking the folder containing CSS and JS:

```
Disallow: /wp-content/
Disallow: /wp-includes/
```

Google needs to read CSS and JS to understand what a page looks like (called "rendering"). If this is blocked, Google cannot render the page → it doesn't understand the content → affecting ranking.

How to fix it:

```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Allow: /wp-content/
Allow: /wp-includes/
```
📝 Note for developers: Since 2014, Google has clearly recommended against blocking CSS, JS, and images in robots.txt. Googlebot needs these resources to render pages correctly. Use the URL Inspection tool in Search Console to check how Google renders your page.

Error 3: Blocking an important page by mistake.

Symptom: Service pages, product pages, or blog posts do not appear in Google search results - even though they are listed in the sitemap.

Reason: The rule in robots.txt is too broad. For example:

```
Disallow: /dich-vu
```

This line blocks not only /dich-vu/ but also /dich-vu-thiet-ke-web/, /dich-vu-seo/, and any URL that starts with /dich-vu.

Solution: Add a forward slash / at the end of the path to block the exact directory:

```
Disallow: /dich-vu-noi-bo/
```

Or use Allow to protect the necessary pages:

```
Disallow: /dich-vu-noi-bo/
Allow: /dich-vu/
Allow: /dich-vu-thiet-ke-web/
```
📝 Note for developers: How `Allow` and `Disallow` interact matters. Googlebot applies the most specific rule (the longest matching path), regardless of the order the rules appear in. If a matching `Allow` and `Disallow` are the same length, `Allow` takes precedence. Always test using the [robots.txt testing tools](https://support.google.com/webmasters/answer/6062598) in Search Console before deploying.
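Google's precedence rule can be modeled in a few lines. This is our own simplified sketch (plain prefix matching, ignoring `*` wildcards), not Google's actual implementation:

```python
def googlebot_decision(rules, path):
    """Decide whether `path` may be crawled.

    rules: list of ("allow" | "disallow", path_prefix) pairs.
    The longest matching prefix wins; on a tie, "allow" wins.
    """
    matches = [(kind, prefix) for kind, prefix in rules if path.startswith(prefix)]
    if not matches:
        return True  # no rule applies: crawling is allowed
    kind, _ = max(matches, key=lambda r: (len(r[1]), r[0] == "allow"))
    return kind == "allow"

rules = [("disallow", "/dich-vu-noi-bo/"), ("allow", "/dich-vu/")]
print(googlebot_decision(rules, "/dich-vu/seo"))          # True: only Allow matches
print(googlebot_decision(rules, "/dich-vu-noi-bo/file"))  # False: Disallow matches
```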

Error 4: No robots.txt file

Symptom: Typing ten-mien.com/robots.txt → returns a 404 error.

Reason: The website was built manually and the developer didn't create this file. Or the file was accidentally deleted during deployment.

Impact: Not as serious as error 1 - without robots.txt, Google crawls everything by default. But this means:

  • Google will crawl the admin page, test pages, and internal pages.
  • You have no way to point bots to your sitemap via robots.txt.
  • You lack basic crawl controls.

Solution: Create a robots.txt file in the root directory. Minimum content:

```
User-agent: *
Disallow: /admin/
Disallow: /search?

Sitemap: https://ten-mien.com/sitemap.xml
```

Error 5: Using robots.txt to hide a page from Google (misunderstood purpose)

Symptom: You block a page using Disallow , but that page still appears on Google – even without any content snippets.

Reason: Robots.txt blocks crawling, but not indexing. If the page has already been indexed, or has backlinks from other websites pointing to it, Google may keep the URL in search results – it just won't display the content.

The correct way to fix it:

| Goal | What to use |
|---|---|
| Don't want Google to crawl the page | `Disallow` in robots.txt |
| Don't want Google to index (display) the page | `<meta name="robots" content="noindex">` tag in HTML |
| Don't want either | Use `noindex` in HTML (and don't block the page in robots.txt) |
⚠️ This is the most common misunderstanding: if you both block crawling (robots.txt) and use `noindex` (HTML), Google won't see the noindex tag because it won't crawl that page – and the page may still be indexed. The solution: use `noindex` in HTML and remove the `Disallow` rule for that page in robots.txt.

Robots.txt template for business websites

Below is a sample robots.txt file suitable for most SMB business websites:

```
# =============================================
# Robots.txt for a business website
# Updated: 2026-04-20
# =============================================

# Applies to all bots
User-agent: *

# Block admin and internal areas
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /dashboard/
Disallow: /staging/

# Block internal search pages (avoid wasting crawl budget)
Disallow: /search?
Disallow: /*?s=

# Block URLs with tracking parameters (avoid duplicate content)
Disallow: /*?utm_
Disallow: /*?ref=
Disallow: /*?fbclid=

# Block cart / checkout pages (if any)
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/

# Allow CSS, JS, and images (Google needs them to render pages)
Allow: /wp-content/uploads/
Allow: /wp-content/themes/
Allow: /wp-includes/

# Point to the sitemap
Sitemap: https://ten-mien.com/sitemap.xml
```
📝 Note for developers: The robots.txt file must be located in the root domain — `https://example.com/robots.txt`. Not `/blog/robots.txt` or any other subdirectory. Each subdomain needs its own robots.txt (for example, `blog.example.com/robots.txt` is separate from `example.com/robots.txt`).
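If you want to sanity-check a template like the one above in code, here is a small illustrative parser (our own sketch; the function name `parse_robots` is hypothetical) that collects the rules applying to `User-agent: *` plus any declared sitemaps:

```python
def parse_robots(text):
    """Collect (allow|disallow, path) rules for 'User-agent: *' and sitemap URLs."""
    rules, sitemaps = [], []
    applies = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # strip comments and whitespace
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            applies = (value == "*")       # track whether rules apply to all bots
        elif field in ("allow", "disallow") and applies:
            rules.append((field, value))
        elif field == "sitemap":
            sitemaps.append(value)
    return rules, sitemaps

sample = """User-agent: *
Disallow: /admin/
Allow: /wp-content/uploads/
Sitemap: https://example.com/sitemap.xml
"""
print(parse_robots(sample))
```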

How to check your website's robots.txt file

Method 1: Check directly in the browser.

Type https://ten-mien-cua-ban.com/robots.txt into the address bar. You will see the contents of the file in text format. If you see a 404 error, it means the website does not have a robots.txt file.

Method 2: Using Google Search Console

  1. Log in to Google Search Console
  2. Go to Settings → Crawling → robots.txt
  3. Check the robots.txt file that Google is reading.
  4. Check if the specific URL is blocked.

Method 3: Check in the Indexing report

In Search Console → Pages (or Indexing) → find the entry "Blocked by robots.txt". If there are important pages in this list, you need to edit robots.txt immediately.

💡 You should check your robots.txt at least quarterly or whenever your website undergoes major changes (adding pages, changing structure, migrating to a different platform).

Summary: What should and shouldn't be blocked in Robots.txt?

| ✅ Should block | ❌ Should NOT block |
|---|---|
| Admin pages (/admin/, /wp-admin/) | Homepage, service pages, contact page |
| Staging/test pages | Blog posts, articles |
| Internal search pages (/search?) | CSS and JavaScript files |
| URLs with tracking parameters (?utm_, ?fbclid=) | Images (Google Images also brings traffic) |
| Shopping cart, checkout, account pages | Sitemap |
| Duplicate content pages (filter, sort, pagination) | FAQ pages, case studies |

Frequently Asked Questions about robots.txt

What is the difference between Robots.txt and sitemap?

Sitemap says, "This is the page I want Google to know." Robots.txt says, "This is the page I don't want Google to crawl." The two files complement each other — sitemap provides directions, robots.txt sets the barrier.

Without a robots.txt file, can Google still crawl a website?

Yes. Without a robots.txt file, Google will crawl all pages by default—including pages you don't want. That's why you should have this file.

I use WordPress, where is the robots.txt file located?

WordPress automatically creates a virtual robots.txt file. If you use an SEO plugin like Yoast or Rank Math, you can edit the robots.txt file directly within the plugin without accessing the server.

Does the robots.txt file affect website speed?

No. This file is only a few KB in size. It doesn't affect page load speed.

I blocked the site using robots.txt, so why is it still ranking on Google?

Because robots.txt only blocks crawling, not indexing. If you want the page to disappear completely from Google, use the `<meta name="robots" content="noindex">` tag in HTML – and don't block that page in robots.txt (so Google can read the noindex tag).

After editing the robots.txt file, how long will it take for Google to update?

Google typically checks your robots.txt within 24-48 hours. You can go to Search Console → Settings → Crawling to request that Google check it sooner.

Conclusion

Robots.txt is a small file—usually just a few lines—but it directly affects whether or not Google finds your website.

Things to remember:

  1. Robots.txt is a "Restricted Areas" sign : it tells Google which pages should not be crawled.
  2. Check immediately when the website goes live : the error that blocks the entire website is the most common and serious error.
  3. Don't use robots.txt to hide a page from Google : it blocks crawling, but not indexing.
  4. Always allow CSS and JS : Google needs to render the page to understand the content.
  5. Combine this with a sitemap and Search Console to gain complete control over how Google crawls and indexes your website.

Check your website platform.

Robots.txt is just one of many technical factors affecting SEO. If you're wondering, "Is my website set up correctly?" - the answer lies in the platform you're using.

GTG CRM helps you create a website with a standard robots.txt file, an automatic sitemap, and a technical structure ready for Google – you don't need to worry about editing each file or line of code.
