Advertools Python Library Tutorial for Entry‑Level Marketers
Published on October 10, 2025 by Palo Santo AI
As search and discovery evolve rapidly, marketers need tools that not only keep pace but also help them build data‑driven systems. The advertools Python library is an open-source Swiss Army knife for online marketing, SEO, SEM, and content analysis. In this tutorial, you’ll learn how to install advertools, crawl websites responsibly, turn XML sitemaps into actionable DataFrames, collect SERPs at scale, generate SEM keywords, and analyze text, emoji and URLs. We’ll also explore brand-new features released in mid‑2025.
TL;DR – What you’ll learn in this guide
- How to install and verify advertools on your machine.
- Crawl a website safely and extract on‑page SEO elements like titles, headings and body text.
- Control which columns are kept or discarded during crawling to reduce file sizes.
- Convert crawled HTML into clean Markdown and partition it into content chunks.
- Download XML sitemaps into DataFrames and identify stale content.
- Collect Google SERPs for multiple queries and geographies in one call.
- Generate SEM keywords programmatically and organize them into campaigns.
- Analyze hashtags, mentions, emojis and word frequencies in social posts.
- Split URLs into meaningful components to enhance your analytics.
- Link out to related resources on Advertools for Modern SEO and Programmatic SEO in 2025 for further reading.
Why advertools matters for marketers in 2025
In a world where growth systems trump one‑off campaigns, marketers need flexible tools for crawling, scraping, parsing, and analyzing digital assets. The advertools library brings together modules for SEO crawling, XML sitemaps, SERP collection, keyword generation, text analysis, emoji search, URL parsing and even social API integrations. It wraps the power of Scrapy, pandas and the Google Custom Search API into easy functions so you can build repeatable workflows instead of clicking through GUIs.
Two special updates landed in 2025: the ability to restrict which columns are stored during crawling and a markdown generator that converts body text and headings into clean Markdown. These features help you reduce crawl file sizes and repurpose content into analyses or generative pipelines.
Installing advertools
You can install advertools from PyPI. It’s recommended to create a virtual environment to keep your marketing scripts isolated from other projects.
pip install advertools pandas
After installation, verify that the library works and check the version you’re running:
import advertools as adv
print(adv.__version__)
At the time of writing, the latest version is 0.17.1, released on 23 September 2025. This guide uses features available in 0.17.x.
Crawling a website safely
Discovery vs list mode
advertools’ crawl() function uses Scrapy under the hood to spider websites. There are two ways to crawl:
- Discovery (spider) mode – you provide one or more starting URLs and the crawler follows links, respecting robots.txt, until the site is exhaustively crawled. This approach is ideal for general audits.
- List mode – you provide a fixed list of URLs and the crawler fetches only those pages without following links. Use this for spot‑checking specific pages, migrations or competitor analyses.
Here’s a basic discovery crawl that saves results to a .jsonl file. Each line contains one page with fields for URL, title, meta description, headings, links, body text and more. The JSON lines format is used because each page’s data is appended as it is crawled, keeping memory usage low, and the file loads cleanly into pandas DataFrames.
import advertools as adv
adv.crawl(
    url_list="https://example.com",
    output_file="crawl_example.jsonl",
    follow_links=True
)
advertools extracts a rich set of on‑page elements. The output file includes columns such as URL, title, meta description, headings (h1…h6), JSON‑LD blocks, Open Graph tags, Twitter card data, link URLs/text, body text, page size, status code, response and request headers, image attributes and more. Each page’s data is appended line by line, which means you should always use a new output file for a fresh crawl to avoid duplication.
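The same function handles list mode: pass the exact URLs you care about and leave link following off. A minimal sketch, assuming you want to spot‑check three specific pages:
import advertools as adv
adv.crawl(
    url_list=[
        "https://example.com/",
        "https://example.com/pricing",
        "https://example.com/blog",
    ],
    output_file="crawl_list_mode.jsonl",
    follow_links=False  # list mode: fetch only these URLs, follow nothing
)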
Controlling columns to reduce file sizes
Very large sites can generate crawl files that consume gigabytes of storage. In version 0.17, advertools.crawl() gained keep_columns and discard_columns parameters, so you can explicitly choose which columns to store. For example, to keep only headings, page size, status code and body text:
adv.crawl(
    url_list="https://example.com",
    output_file="lean_crawl.jsonl",
    follow_links=True,
    keep_columns=["h1", "size", "status", "body_text"]
)
The resulting DataFrame will always include the URL column, plus an errors column if any errors occur. To select groups of columns without knowing their names in advance, use regular expressions. For example, keep_columns=["resp_headers"] keeps all response header columns, while discard_columns=["resp_headers_Vary", "resp_headers_Cache-Control"] excludes specific headers. This regex‑based flexibility lets you craft lean crawl datasets for specific analyses.
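Combining the two ideas, here is a sketch that keeps every response header column except two noisy ones, assuming both parameters accept regex patterns and can be used together as the release notes describe:
adv.crawl(
    url_list="https://example.com",
    output_file="headers_crawl.jsonl",
    follow_links=True,
    keep_columns=["resp_headers"],  # regex: matches every resp_headers_* column
    discard_columns=["resp_headers_Vary", "resp_headers_Cache-Control"]
)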
Using custom settings and respecting robots.txt
Always crawl ethically. Set polite delays, obey robots.txt, identify yourself and throttle requests. Here’s a sample settings dictionary that you can pass to the custom_settings argument:
settings = {
    "LOG_LEVEL": "INFO",
    "DOWNLOAD_DELAY": 0.5,
    "AUTOTHROTTLE_ENABLED": True,
    "AUTOTHROTTLE_START_DELAY": 0.5,
    "AUTOTHROTTLE_MAX_DELAY": 5,
    "ROBOTSTXT_OBEY": True,
    "USER_AGENT": "YourBrandCrawler/1.0 (contact@example.com)"
}
adv.crawl(
    url_list="https://yourdomain.com",
    follow_links=True,
    custom_settings=settings,
    output_file="crawl_audit.jsonl"
)
advertools automatically checks the target site’s robots file and only requests allowed URLs. Provide a descriptive user agent and contact email in case site owners have questions.
Inspecting the crawl output
Once the crawl is complete, load the file into pandas:
import pandas as pd
crawl_df = pd.read_json("crawl_audit.jsonl", lines=True)
crawl_df.head()
You will see columns for SEO elements (title, meta description, headings), body text, size, status code, depth and hundreds of headers. If you need to extract images or custom elements, consider enabling the img_* columns or using CSS/XPath selectors.
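For custom elements, crawl() accepts css_selectors and xpath_selectors dictionaries that map a new column name to a selector. A minimal sketch, where the .post-author and .post-date class names are hypothetical and depend on your site’s markup:
adv.crawl(
    url_list="https://example.com/blog",
    output_file="crawl_custom.jsonl",
    follow_links=True,
    css_selectors={
        "author": ".post-author::text",   # hypothetical class name
        "published": ".post-date::text",  # hypothetical class name
    }
)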
From crawl to Markdown and content chunks
Version 0.17 introduced a generate_markdown() function that converts the body text and headings from your crawl into Markdown. The function scans each body_text string, inserts hash symbols before each heading and returns a list of Markdown strings. This is useful for turning raw HTML into articles you can analyze or feed into generative models.
For example:
import advertools as adv
import pandas as pd
df = pd.read_json("crawl_audit.jsonl", lines=True)
md_list = adv.crawlytics.generate_markdown(df)
print(md_list[0]) # prints the Markdown for the first page
You can then partition the Markdown into chunks. The new adv.partition() function splits a string wherever a regular expression matches. If you pass a regex that matches Markdown headings (e.g., r"^#+ .*" with the re.MULTILINE flag), the function returns a list alternating between headings and paragraphs. This is powerful for segmenting long pages into semantically coherent units for AI summarization, clustering or internal linking.
import re
parts = adv.partition(md_list[0], r"^#+ .*", flags=re.MULTILINE)
for part in parts:
    print(part)
You can further group consecutive items into chunks (heading + text) with a helper like:
def get_markdown_chunks(md_parts):
    """Group alternating headings and paragraphs into heading + text chunks."""
    chunks = []
    current = []
    for item in md_parts:
        if item.strip().startswith("#"):
            # A new heading starts a new chunk; save the previous one first.
            if current:
                chunks.append(current)
            current = [item]
        else:
            current.append(item)
    if current:
        chunks.append(current)
    return chunks
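Running it on the parts produced by adv.partition() gives you heading‑plus‑body units:
chunks = get_markdown_chunks(parts)
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i} starts with: {chunk[0][:60]}")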
This pipeline—crawl → markdown → partition → chunks—lets you evaluate content at the section level, perform similarity analysis, or create targeted summarizations.
Turn XML sitemaps into actionable data
Sitemaps tell search engines which URLs exist and when they were last modified. advertools’ sitemap_to_df() function downloads one or more sitemap URLs and returns a DataFrame with columns for loc, lastmod, the sitemap URL and various metadata. To convert a sitemap:
from advertools import sitemaps
import pandas as pd
df = sitemaps.sitemap_to_df("https://www.example.com/sitemap.xml")
# Identify stale content (older than 180 days)
# lastmod values are timezone-aware, so compare against a UTC timestamp
cutoff = pd.Timestamp.now(tz="UTC") - pd.Timedelta(days=180)
stale = df[pd.to_datetime(df["lastmod"], utc=True, errors="coerce") < cutoff]
print(stale[["loc", "lastmod"]].head())
This approach helps you prioritize updates by focusing on pages that haven’t been modified recently. You can also group by folders to see which sections of a site dominate your index.
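For example, one way to group by folder is to parse the loc column with url_to_df() (covered later in this guide) and count URLs per first‑level directory:
import advertools as adv
url_df = adv.url_to_df(df["loc"].tolist())
print(url_df["dir_1"].value_counts().head(10))  # largest site sections first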
For programmatic SEO workflows, combine sitemap data with crawl or SERP data to identify ranking opportunities, map redirects and catch thin content. See our full guide Advertools for Modern SEO for a deeper dive.
Collect Google SERPs at scale
Instead of manually copying search results, advertools wraps the Google Custom Search JSON API so you can fetch multiple SERPs in one call. The serp_goog() function takes lists of queries, country codes (gl), start positions and other parameters, then builds the Cartesian product and returns a DataFrame of results. For example, three queries × five countries × three start positions produce forty‑five API calls and roughly 450 rows of data.
Here’s how to collect SERPs for a few keywords across the US and Mexico:
from advertools import serp
import pandas as pd
queries = ["python seo crawler", "sitemap to dataframe"]
countries = ["us", "mx"]
positions = [1, 11]
serps = serp.serp_goog(
    q=queries,
    gl=countries,
    start=positions,
    num=10,  # results per page
    cx="YOUR_SEARCH_ENGINE_ID",
    key="YOUR_API_KEY"
)
# serp_goog names the query column searchTerms and the position column rank
print(serps[["searchTerms", "title", "link", "rank"]].head())
Before running this code you’ll need to set up a Google Custom Search Engine, enable the API, generate credentials and possibly activate billing if you exceed 100 queries/day. The resulting DataFrame includes the query (searchTerms), country, result title, snippet, link, position (rank) and meta information. Combine SERP data with your crawl or sitemap data to see how your pages rank or which competitors dominate a topic.
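As a quick sanity check on coverage, you can flag the results that point at your own site; a sketch, where example.com stands in for your domain:
# displayLink holds the result's domain in the serp_goog output
serps["is_ours"] = serps["displayLink"].str.contains("example.com", regex=False)
print(serps.groupby("searchTerms")["is_ours"].any())  # ranking anywhere in the top results?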
Generate SEM keywords programmatically
Keyword research is often a bottleneck for SEM campaigns. Instead of brainstorming every combination of product + modifier, let kw_generate() do the work. advertools defines a keyword as a phrase combining a product and a descriptive word. You supply lists of products and words, and the function returns every permutation in exact, phrase and broad match types.
For example, if you sell courses in engineering, graphic design and marketing, and want to target job‑seeking audiences:
import advertools as adv
products = ["engineering","graphic design","marketing"]
modifiers = ["jobs","careers","vacancies","full time","part time"]
kw_df = adv.kw_generate(products, modifiers)
print(kw_df.head())
The resulting DataFrame contains columns for campaign name, ad group, keyword, match type and descriptive labels for modifiers. You can export this table directly into Google Ads or another platform. Additional helper functions like kw_broad() and kw_exact() convert arbitrary keyword lists into broad or exact match automatically.
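For instance, a quick sketch of the match‑type helpers applied to a couple of raw keywords (the commented outputs show the expected bracket notation for exact match):
import advertools as adv
print(adv.kw_broad(["engineering jobs", "marketing careers"]))
# ['engineering jobs', 'marketing careers']
print(adv.kw_exact(["engineering jobs", "marketing careers"]))
# ['[engineering jobs]', '[marketing careers]']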
After generating keywords, you can use advertools’ ad creation modules to build text ads or feed them into your own templates. Pair this with internal linking strategies from our Programmatic SEO guide to ensure landing pages are properly connected.
Text, hashtags and emoji analysis
Extract structured entities
Social conversations and user‑generated content often contain hashtags, mentions, numbers, questions and exclamation marks. advertools includes a suite of extract_ functions that return dictionaries of entities and useful statistics. For example, extract_hashtags() takes a list of posts and returns the hashtags, counts and frequency distributions.
import advertools as adv
text_list = [
    "Check out our new #python course!",
    "We love #data and #python 🚀",
    "Follow us @palosantoAI for updates",
    "#python #seo and #marketing tips"
]
hashtags = adv.extract_hashtags(text_list)
print(hashtags["overview"])
# {'num_posts': 4, 'num_hashtags': 6, 'hashtags_per_post': 1.5,
#  'unique_hashtags': 4}
Other functions include extract_mentions(), extract_numbers(), extract_questions(), extract_currency() and extract_intense_words(). All are powered by a generic extract() function that accepts any regular expression. Use these tools to understand which topics resonate in comments, tag popular influencers or track campaign codes.
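For example, extract_mentions() should mirror the hashtag extractor’s output structure, so you can reuse the posts from above:
mentions = adv.extract_mentions(text_list)
print(mentions["overview"])      # post counts and mentions-per-post stats
print(mentions["top_mentions"])  # most frequent @handles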
Emoji search and analysis
Emojis drive engagement, but their meaning isn’t always obvious. The emoji_search() function lets you search the full emoji database (version 16.0) by name, group or subgroup. If you search for “vegetable,” you’ll get avocado 🥑, eggplant 🍆, potato 🥔, carrot 🥕 and corn 🌽. Use this to find relevant emojis for social posts or to analyze which emojis appear most in your audience’s comments.
import advertools as adv
veg = adv.emoji_search("vegetable")
print(veg[["emoji","name"]].head())
love = adv.emoji_search("love")
print(love.head())
You can also extract emojis from text using extract_emoji() and get statistics about their usage. Combined with word frequency analysis, emoji data offers insights into sentiment and community tone.
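A quick sketch using the same posts as before:
emoji_summary = adv.extract_emoji(text_list)
print(emoji_summary["overview"])   # counts and emoji-per-post stats
print(emoji_summary["top_emoji"])  # most frequent emojis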
Word frequency and weighted analysis
Counting words is a fundamental step in text mining. advertools’ word_frequency() function counts tokens in a list of documents, supports n‑grams, and can weight words by accompanying numeric metrics. For example, you can weight product names by sales or page titles by pageviews to surface high‑impact terms.
import advertools as adv
titles = [
    "Learn Python fast",
    "Python SEO guide",
    "Advanced Python marketing",
    "Marketing automation with AI",
    "Python and data"
]
pageviews = [1000, 850, 200, 1500, 1200]
freq = adv.word_frequency(text_list=titles, num_list=pageviews, phrase_len=2, rm_words=["and","with"])
print(freq.head())
The resulting DataFrame includes absolute frequency, weighted frequency and relative values, plus optional cumulative percentages. Use this to prioritize topics for content marketing or to identify which phrases drive revenue.
Split and analyze URL structures
URLs contain rich information about category, parameters and hierarchy. Instead of treating them as opaque strings, use url_to_df() to split them into components. Each row of the returned DataFrame includes the scheme, domain, path, fragment, individual directory levels and query parameters. This is invaluable for analytics reports, crawl datasets, SERP lists and extracted social URLs.
import advertools as adv
urls = [
    "https://example.com/products/shoes?color=red&size=10#reviews",
    "https://example.com/products/shoes?color=blue&size=9",
    "https://example.com/blog/python-seo?utm_campaign=summer"
]
url_df = adv.url_to_df(urls)
print(url_df[["netloc", "dir_1", "dir_2", "query_color", "query_size", "query_utm_campaign"]])
You can group by directory or query parameter to see which categories drive most traffic. Combined with log files or SERPs, this helps you identify cannibalization, missing canonical tags or paid campaign performance. The same function can parse thousands of URLs from your sitemap, crawl or social exports.
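For example, a couple of quick pandas aggregations on the DataFrame from above:
print(url_df.groupby("dir_1").size())               # URLs per top-level folder
print(url_df["query_utm_campaign"].value_counts())  # URLs per campaign tag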
Putting it all together
advertools shines when you combine its modules into end‑to‑end workflows:
- Audit a site – perform a discovery crawl while keeping only the fields you need. Convert the results into Markdown, partition the content and analyze clusters to identify thin sections, duplication and opportunities for internal links.
- Refresh and monitor – download and analyze sitemaps regularly to find stale URLs; run periodic SERP collections to track your rankings or identify new competitors. Visualize the overlap between crawl data and SERPs to prioritize fixes.
- Create campaigns – generate SEM keywords from your product inventory; use crawl outputs to craft ad copy; join SERP data to your crawl to see where you rank and where you need paid coverage.
- Understand your community – extract hashtags, mentions, emojis and words from social posts; weight terms by engagement; and feed insights back into your content strategy.
- Scale with programmatic SEO – combine advertools dataframes with templating systems to generate unique landing pages at scale. For guidance on building programmatic SEO systems, see our article Programmatic SEO in 2025.
Used thoughtfully, advertools helps entry‑level marketers move from manual tasks to automated pipelines. You’ll save time, generate deeper insights and build a robust foundation for SEO and SEM strategies that adapt to how people search, discover and decide.
Next steps and resources
- Official advertools documentation – deep dive into every function.
- GitHub repository – source code, issue tracker and contributions.
- PyPI package page – installation instructions and version history.
- v0.17.0 release notes – explore the new features used in this article.
- Advertools for Modern SEO – a hands‑on tutorial combining crawl, sitemap and SERP workflows specifically for SMBs and startups.
- Programmatic SEO in 2025 – learn how to build scalable, high‑quality content systems.
Have questions or want to build tailored marketing systems? Contact Palo Santo AI. We design custom growth engines powered by data and AI.