Advertools for Modern SEO: A Complete, Practical Tutorial
Advertools is a Python library that helps you crawl sites safely, turn XML sitemaps into analytics‑ready tables, collect Google SERPs at scale, and combine these datasets into a repeatable SEO pipeline. This tutorial walks you through each step with copy‑paste code and real scenarios for SMBs, startups, and fintechs.
What you’ll build
- A safe Scrapy‑powered crawl (discovery or list) saved to JSONL
- A sitemap DataFrame to track freshness and coverage
- A SERP snapshot for target queries (title, link, position)
- A joined dataset to prioritize issues and opportunities
Install and verify
pip install advertools pandas
import advertools as adv
print(adv.__version__)
Ethical crawling
- Obey robots.txt (ROBOTSTXT_OBEY=True); you can test the rules up front, as sketched after this list
- Throttle (DOWNLOAD_DELAY of 0.5–1.0s) and enable AUTOTHROTTLE
- Limit scope by folder, depth, and file type; use list mode for precise audits
- Write JSONL; log statuses and timeouts; store redirects and final URLs
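To check the rules before a crawl starts, advertools can download and test robots.txt directly. A minimal sketch; the example.com URLs, paths, and user agent string are placeholders:
import advertools as adv
# Parse robots.txt into a DataFrame of directives
robots_df = adv.robotstxt_to_df("https://www.example.com/robots.txt")
print(robots_df.head())
# Test whether specific paths are fetchable for your crawler's user agent
checks = adv.robotstxt_test(
    robotstxt_url="https://www.example.com/robots.txt",
    user_agents=["PaloSantoSEO"],
    urls=["/", "/pricing", "/blog/python-seo"],
)
print(checks)  # one row per user agent / path pair, with a can_fetch column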
Discovery crawl
import advertools as adv
start_urls = ["https://www.example.com/"]
settings = {
    "LOG_LEVEL": "INFO",
    "DOWNLOAD_DELAY": 0.5,
    "AUTOTHROTTLE_ENABLED": True,
    "AUTOTHROTTLE_START_DELAY": 0.5,
    "AUTOTHROTTLE_MAX_DELAY": 5,
    "ROBOTSTXT_OBEY": True,
    "USER_AGENT": "PaloSantoSEO/1.0 (contact: main@palosanto.ai)",
}
adv.crawl(
    url_list=start_urls,  # adv.crawl's first argument is url_list
    follow_links=True,
    custom_settings=settings,
    output_file="data/crawl_discovery.jsonl",
)
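Once the crawl finishes, the JSONL file loads straight into pandas for a quick health check (status and title are standard columns in advertools crawl output):
import pandas as pd
# Each crawled page is one JSON line; lines=True reads them into a DataFrame
crawl_df = pd.read_json("data/crawl_discovery.jsonl", lines=True)
print(crawl_df["status"].value_counts())   # HTTP status mix
print(crawl_df[["url", "title"]].head())   # quick spot check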
List crawl
import advertools as adv
url_list = [
    "https://www.example.com/pricing",
    "https://www.example.com/blog/python-seo",
]
adv.crawl(
    url_list=url_list,
    follow_links=False,
    custom_settings={"LOG_LEVEL": "INFO", "ROBOTSTXT_OBEY": True},
    output_file="data/crawl_list.jsonl",
)
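For a list-mode audit, a quick self-canonical check is often the first pass. A minimal sketch assuming the default advertools output columns (canonical only appears when at least one page declares it):
import pandas as pd
pages = pd.read_json("data/crawl_list.jsonl", lines=True)
if "canonical" in pages.columns:
    # Pages whose canonical tag points somewhere other than the crawled URL
    mismatched = pages[pages["canonical"].notna() & (pages["canonical"] != pages["url"])]
    print(mismatched[["url", "canonical", "status"]])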
Sitemaps to DataFrame
from advertools import sitemaps
import pandas as pd
df = sitemaps.sitemap_to_df("https://www.example.com/sitemap.xml")
# Stale content (older than 6 months)
stale = df[pd.to_datetime(df["lastmod"], errors="coerce", utc=True) <
           (pd.Timestamp.now(tz="UTC") - pd.Timedelta(days=180))]
print(stale[["loc", "lastmod"]].head())
# Coverage by folder
df["folder1"] = df["loc"].str.extract(r"https?://[^/]+/([^/]+)/")
coverage = df.groupby("folder1").size().sort_values(ascending=False)
print(coverage.head())
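If you pass a sitemap index URL, sitemap_to_df fetches every child sitemap and records its source in the sitemap column, which makes it easy to see where URLs are concentrated (a short sketch on the df built above):
# URLs per child sitemap, useful for spotting bloated or empty sitemaps
per_sitemap = df.groupby("sitemap")["loc"].count().sort_values(ascending=False)
print(per_sitemap.head())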
Collect Google SERPs
from advertools import serp
import pandas as pd
queries = ["python seo crawler", "sitemap to dataframe"]
# serp_goog calls Google's Custom Search JSON API, so it needs your search engine ID (cx)
# and API key; the API returns at most 10 results per request
serps = serp.serp_goog(q=queries, cx="YOUR_CSE_ID", key="YOUR_API_KEY", gl="us", num=10)
print(serps[["searchTerms", "title", "link", "rank"]].head())
domains = serps["link"].str.extract(r"https?://([^/]+)/")[0]
print(domains.value_counts().head())
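To see which domains not only appear most often but also rank best, aggregate the rank column per domain; a small sketch building on the serps DataFrame above:
# Mean rank and result count per domain, best-ranking domains first
visibility = (
    serps.assign(domain=serps["link"].str.extract(r"https?://([^/]+)/", expand=False))
         .groupby("domain")["rank"]
         .agg(["mean", "count"])
         .sort_values("mean")
)
print(visibility.head())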
Join crawl + SERPs
import pandas as pd
crawl = pd.read_json("data/crawl_list.jsonl", lines=True)
crawl["domain"] = crawl["url"].str.extract(r"https?://([^/]+)/")
serps["domain"] = serps["link"].str.extract(r"https?://([^/]+)/")
joined = crawl.merge(serps, on="domain", how="left")
issues = joined[(joined.get("status") == 200) == False]
print(issues[["url", "status", "title", "position"]].head())
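A non-200 page that already ranks is the highest-priority fix, so sorting the issues by rank surfaces those first (still a sketch on the joined data above):
# Ranked-but-broken pages first; unranked broken pages last
priorities = issues.sort_values("rank", na_position="last")
print(priorities[["url", "status", "rank"]].head(10))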
Scenarios
Pre‑launch QA
- List crawl target URL(s)
- Verify status, final URL, canonical, and content type (see the sketch after this list)
- Run SERPs for head + mid‑tail; align title/H1/intro
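A minimal verification pass over the list crawl, assuming default advertools column names (redirect_urls holds the redirect chain, and response headers appear as resp_headers_ columns whose exact casing can vary by version):
import pandas as pd
qa = pd.read_json("data/crawl_list.jsonl", lines=True)
# Columns only appear if at least one page produced them, so select defensively
wanted = ["url", "status", "redirect_urls", "canonical"]
wanted += [c for c in qa.columns if c.lower() == "resp_headers_content-type"]
print(qa[[c for c in wanted if c in qa.columns]])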
Freshness audit
- Use sitemap_to_df to find URLs with an old lastmod
- Stack-rank by value + SERP opportunity (see the sketch after this list)
- Refresh 10–20 URLs per sprint
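One way to stack-rank stale URLs is to join them against the SERP snapshot and surface pages that still rank; a sketch reusing the stale and serps DataFrames built earlier:
# Stale pages that still hold a ranking are usually the best refresh candidates
refresh = stale.merge(serps, left_on="loc", right_on="link", how="inner")
print(refresh.sort_values("rank")[["loc", "lastmod", "searchTerms", "rank"]].head(20))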
Migration checks
- Discovery crawl legacy and target; compare status mix (see the sketch after this list)
- Map redirects (1 hop), verify canonicals/sitemaps
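Comparing the status mix of the legacy and target crawls is a quick aggregation once both JSONL files are loaded; the two file names below are placeholders for your own crawl outputs:
import pandas as pd
legacy = pd.read_json("data/crawl_legacy.jsonl", lines=True)
target = pd.read_json("data/crawl_target.jsonl", lines=True)
# Side-by-side share of each HTTP status code
status_mix = pd.DataFrame({
    "legacy": legacy["status"].value_counts(normalize=True),
    "target": target["status"].value_counts(normalize=True),
}).fillna(0).round(3)
print(status_mix)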
Troubleshooting
- Slow crawls or timeouts: raise DOWNLOAD_TIMEOUT, reduce concurrency, keep AUTOTHROTTLE enabled
- Robots blocks: check robots.txt, request staging access or an allowlist entry
- Large outputs: stick with JSONL, read in chunks, archive to Parquet (see the sketch after this list)
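For very large crawls, reading the JSONL output in chunks and archiving to Parquet keeps memory use flat. A minimal sketch (to_parquet needs pyarrow or fastparquet installed):
import pandas as pd
# Stream the crawl output in chunks instead of loading everything at once
chunks = pd.read_json("data/crawl_discovery.jsonl", lines=True, chunksize=10_000)
df = pd.concat(chunk.reindex(columns=["url", "status", "title"]) for chunk in chunks)
df.to_parquet("data/crawl_discovery.parquet", index=False)  # compact columnar archive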
FAQ
Is Advertools free? Yes, it is open source and available on PyPI and GitHub.
Does it replace paid crawlers? Often, for repeatable audits and pipelines; GUI crawlers remain handy for ad-hoc visual audits.
How do I visualize results? Export to BigQuery or Sheets, build a dashboard in Looker Studio, and schedule weekly runs with cron.