Advertools for Modern SEO: A Complete, Practical Tutorial
Advertools is a Python library that helps you crawl sites safely, turn XML sitemaps into analytics‑ready tables, collect Google SERPs at scale, and combine these datasets into a repeatable SEO pipeline. This tutorial walks you through each step with copy‑paste code and real scenarios for SMBs, startups, and fintechs.
What you’ll build
- A safe Scrapy‑powered crawl (discovery or list) saved to JSONL
- A sitemap DataFrame to track freshness and coverage
- A SERP snapshot for target queries (title, link, position)
- A joined dataset to prioritize issues and opportunities
Install and verify
pip install advertools pandas
import advertools as adv
print(adv.__version__)
Ethical crawling
- Obey robots.txt (ROBOTSTXT_OBEY=True); you can test the rules up front, as sketched after this list
- Throttle (DOWNLOAD_DELAY of 0.5–1.0s) and enable AUTOTHROTTLE
- Limit scope by folder, depth, and file type; use list mode for precise audits
- Write JSONL; log statuses and timeouts; store redirects and final URLs
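To check the rules before a crawl starts, advertools can download and test robots.txt directly. A minimal sketch; the example.com URLs, paths, and user agent string are placeholders:
import advertools as adv
# Parse robots.txt into a DataFrame of directives
robots_df = adv.robotstxt_to_df("https://www.example.com/robots.txt")
print(robots_df.head())
# Test whether specific paths are fetchable for your crawler's user agent
checks = adv.robotstxt_test(
    robotstxt_url="https://www.example.com/robots.txt",
    user_agents=["PaloSantoSEO"],
    urls=["/", "/pricing", "/blog/python-seo"],
)
print(checks)  # one row per user agent / path pair, with a can_fetch column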
Discovery crawl
import advertools as adv
start_urls = ["https://www.example.com/"]
settings = {
    "LOG_LEVEL": "INFO",
    "DOWNLOAD_DELAY": 0.5,
    "AUTOTHROTTLE_ENABLED": True,
    "AUTOTHROTTLE_START_DELAY": 0.5,
    "AUTOTHROTTLE_MAX_DELAY": 5,
    "ROBOTSTXT_OBEY": True,
    "USER_AGENT": "PaloSantoSEO/1.0 (contact: main@palosanto.ai)",
}
adv.crawl(
    url_list=start_urls,  # adv.crawl's first argument is url_list
    follow_links=True,
    custom_settings=settings,
    output_file="data/crawl_discovery.jsonl",
)
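Once the crawl finishes, the JSONL file loads straight into pandas for a quick health check (status and title are standard columns in advertools crawl output):
import pandas as pd
# Each crawled page is one JSON line; lines=True reads them into a DataFrame
crawl_df = pd.read_json("data/crawl_discovery.jsonl", lines=True)
print(crawl_df["status"].value_counts())   # HTTP status mix
print(crawl_df[["url", "title"]].head())   # quick spot check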
List crawl
import advertools as adv
url_list = [
    "https://www.example.com/pricing",
    "https://www.example.com/blog/python-seo",
]
adv.crawl(
    url_list=url_list,
    follow_links=False,
    custom_settings={"LOG_LEVEL": "INFO", "ROBOTSTXT_OBEY": True},
    output_file="data/crawl_list.jsonl",
)
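For a list-mode audit, a quick self-canonical check is often the first pass. A minimal sketch assuming the default advertools output columns (canonical only appears when at least one page declares it):
import pandas as pd
pages = pd.read_json("data/crawl_list.jsonl", lines=True)
if "canonical" in pages.columns:
    # Pages whose canonical tag points somewhere other than the crawled URL
    mismatched = pages[pages["canonical"].notna() & (pages["canonical"] != pages["url"])]
    print(mismatched[["url", "canonical", "status"]])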
Sitemaps to DataFrame
from advertools import sitemaps
import pandas as pd
df = sitemaps.sitemap_to_df("https://www.example.com/sitemap.xml")
# Stale content (older than 6 months)
stale = df[pd.to_datetime(df["lastmod"], errors="coerce", utc=True) <
           (pd.Timestamp.now(tz="UTC") - pd.Timedelta(days=180))]
print(stale[["loc", "lastmod"]].head())
# Coverage by folder
df["folder1"] = df["loc"].str.extract(r"https?://[^/]+/([^/]+)/")
coverage = df.groupby("folder1").size().sort_values(ascending=False)
print(coverage.head())
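If you pass a sitemap index URL, sitemap_to_df fetches every child sitemap and records its source in the sitemap column, which makes it easy to see where URLs are concentrated (a short sketch on the df built above):
# URLs per child sitemap, useful for spotting bloated or empty sitemaps
per_sitemap = df.groupby("sitemap")["loc"].count().sort_values(ascending=False)
print(per_sitemap.head())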
Collect Google SERPs
from advertools import serp
import pandas as pd
queries = ["python seo crawler", "sitemap to dataframe"]
# serp_goog calls Google's Custom Search JSON API, so it needs your search engine ID (cx)
# and API key; the API returns at most 10 results per request
serps = serp.serp_goog(q=queries, cx="YOUR_CSE_ID", key="YOUR_API_KEY", gl="us", num=10)
print(serps[["searchTerms", "title", "link", "rank"]].head())
domains = serps["link"].str.extract(r"https?://([^/]+)/")[0]
print(domains.value_counts().head())
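To see which domains not only appear most often but also rank best, aggregate the rank column per domain; a small sketch building on the serps DataFrame above:
# Mean rank and result count per domain, best-ranking domains first
visibility = (
    serps.assign(domain=serps["link"].str.extract(r"https?://([^/]+)/", expand=False))
         .groupby("domain")["rank"]
         .agg(["mean", "count"])
         .sort_values("mean")
)
print(visibility.head())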
Join crawl + SERPs
import pandas as pd
crawl = pd.read_json("data/crawl_list.jsonl", lines=True)
crawl["domain"] = crawl["url"].str.extract(r"https?://([^/]+)/")
serps["domain"] = serps["link"].str.extract(r"https?://([^/]+)/")
joined = crawl.merge(serps, on="domain", how="left")
issues = joined[(joined.get("status") == 200) == False]
print(issues[["url", "status", "title", "position"]].head())
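A non-200 page that already ranks is the highest-priority fix, so sorting the issues by rank surfaces those first (still a sketch on the joined data above):
# Ranked-but-broken pages first; unranked broken pages last
priorities = issues.sort_values("rank", na_position="last")
print(priorities[["url", "status", "rank"]].head(10))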
Scenarios
Pre‑launch QA
- List crawl target URL(s)
- Verify status, final URL, canonical, and content type (see the sketch after this list)
- Run SERPs for head + mid‑tail; align title/H1/intro
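A minimal verification pass over the list crawl, assuming default advertools column names (redirect_urls holds the redirect chain, and response headers appear as resp_headers_ columns whose exact casing can vary by version):
import pandas as pd
qa = pd.read_json("data/crawl_list.jsonl", lines=True)
# Columns only appear if at least one page produced them, so select defensively
wanted = ["url", "status", "redirect_urls", "canonical"]
wanted += [c for c in qa.columns if c.lower() == "resp_headers_content-type"]
print(qa[[c for c in wanted if c in qa.columns]])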
Freshness audit
- Use sitemap_to_df to find URLs with an old lastmod
- Stack-rank by value + SERP opportunity (see the sketch after this list)
- Refresh 10–20 URLs per sprint
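One way to stack-rank stale URLs is to join them against the SERP snapshot and surface pages that still rank; a sketch reusing the stale and serps DataFrames built earlier:
# Stale pages that still hold a ranking are usually the best refresh candidates
refresh = stale.merge(serps, left_on="loc", right_on="link", how="inner")
print(refresh.sort_values("rank")[["loc", "lastmod", "searchTerms", "rank"]].head(20))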
Migration checks
- Discovery crawl legacy and target; compare status mix (see the sketch after this list)
- Map redirects (1 hop), verify canonicals/sitemaps
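Comparing the status mix of the legacy and target crawls is a quick aggregation once both JSONL files are loaded; the two file names below are placeholders for your own crawl outputs:
import pandas as pd
legacy = pd.read_json("data/crawl_legacy.jsonl", lines=True)
target = pd.read_json("data/crawl_target.jsonl", lines=True)
# Side-by-side share of each HTTP status code
status_mix = pd.DataFrame({
    "legacy": legacy["status"].value_counts(normalize=True),
    "target": target["status"].value_counts(normalize=True),
}).fillna(0).round(3)
print(status_mix)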
Troubleshooting
- Slow crawls or timeouts: raise DOWNLOAD_TIMEOUT, reduce concurrency, keep AUTOTHROTTLE enabled
- Robots blocks: check robots.txt, request staging access or an allowlist entry
- Large outputs: stick with JSONL, read in chunks, archive to Parquet (see the sketch after this list)
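For very large crawls, reading the JSONL output in chunks and archiving to Parquet keeps memory use flat. A minimal sketch (to_parquet needs pyarrow or fastparquet installed):
import pandas as pd
# Stream the crawl output in chunks instead of loading everything at once
chunks = pd.read_json("data/crawl_discovery.jsonl", lines=True, chunksize=10_000)
df = pd.concat(chunk.reindex(columns=["url", "status", "title"]) for chunk in chunks)
df.to_parquet("data/crawl_discovery.parquet", index=False)  # compact columnar archive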
FAQ
Is Advertools free? Yes, it is open source and available on PyPI and GitHub.
Does it replace paid crawlers? Often, for repeatable audits and pipelines; GUI crawlers remain handy for ad-hoc visual audits.
How do I visualize results? Export to BigQuery or Sheets, build a dashboard in Looker Studio, and schedule weekly runs with cron.