Learn how to use BeautifulSoup for web scraping in Python. A complete guide with examples.
pip install beautifulsoup4

BeautifulSoup is a Python library for pulling data out of HTML and XML files. It provides Pythonic idioms for iterating over, searching, and modifying the parse tree that it builds from the markup.
Key features of BeautifulSoup include parsing HTML and XML documents, navigating the parse tree, and searching the tree using a variety of methods. It is widely used in web scraping tasks such as extracting data from web pages, processing pages collected by a crawler, and data mining. BeautifulSoup only parses markup; it does not download pages itself.
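Modifying the parse tree is mentioned above but easy to overlook. The following is a minimal sketch of it, assuming a small made-up snippet of markup; the tag names and values are illustrative only.

```python
from bs4 import BeautifulSoup

# Illustrative markup, not taken from any real page.
soup = BeautifulSoup(
    '<p class="old">Hello <b>World</b><span>draft</span></p>', "html.parser"
)

# Change an attribute by assigning to it like a dictionary key.
soup.p["class"] = "new"

# Replace the text inside a tag.
soup.b.string = "Python"

# Remove a tag, together with its contents, from the tree.
soup.span.decompose()

# Serialize the modified tree back to markup.
result = str(soup)
print(result)
```

The changes happen in place on the tree, so converting the soup back to a string reflects all three edits.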
To get started with BeautifulSoup, you need to install it using pip. Once installed, you can parse any HTML or XML document to extract information. Common patterns include loading the document, navigating the parse tree, and extracting data using BeautifulSoup's search methods.
from bs4 import BeautifulSoup

html_doc = "<html><head><title>The Dormouse's story</title></head><body><p class='title'><b>The Dormouse's story</b></p></body></html>"
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.title)

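The load, navigate, and extract pattern described above can be sketched as a single function. This is only an illustration: the helper name `summarize` and the keys of the returned dictionary are hypothetical, and the markup is the same sample used in the first example.

```python
from bs4 import BeautifulSoup

# Same sample markup as in the first example.
html_doc = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p class='title'><b>The Dormouse's story</b></p></body></html>"
)


def summarize(markup):
    """Hypothetical helper: load the markup, then read a few values from the tree."""
    # 1. Load: build the parse tree with Python's built-in HTML parser.
    soup = BeautifulSoup(markup, "html.parser")

    # 2. Navigate: attribute access returns the first tag with that name.
    title_tag = soup.title
    first_paragraph = soup.p

    # 3. Extract: pull out text and attribute values.
    return {
        "title": str(title_tag.string),
        "paragraph_class": first_paragraph.get("class"),
        "bold_text": first_paragraph.b.get_text(),
    }


summary = summarize(html_doc)
print(summary)
```

Note that `class` is a multi-valued attribute in HTML, so BeautifulSoup returns it as a list rather than a single string.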
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p><a href="http://example.com/elsie">Elsie</a></p>', 'html.parser')
links = soup.find_all('a')
for link in links:
    print(link.get('href'))

from bs4 import BeautifulSoup
soup = BeautifulSoup('<div><p>Hello, World!</p></div>', 'html.parser')
print(soup.p.get_text())

from bs4 import BeautifulSoup
soup = BeautifulSoup('<ul><li>Item 1</li><li>Item 2</li></ul>', 'html.parser')
for child in soup.ul.children:
    print(child)

from bs4 import BeautifulSoup
soup = BeautifulSoup('<div class="content"><p>Sample</p></div>', 'html.parser')
content = soup.select('.content')
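for item in content:
    print(item)

find: Returns the first tag that matches the criteria, or None if nothing matches.
find_all: Returns a list of all tags that match the criteria.
get_text: Returns all the text in a document or beneath a tag as a single string.
select: Searches the document using CSS selectors and returns a list of matches.
find_parents: Searches all ancestors of a tag and returns every match.
find_parent: Returns the nearest ancestor of a tag that matches the criteria.
next_sibling: The next node at the same level of the tree, which may be a text node rather than a tag.
previous_sibling: The previous node at the same level of the tree, which may also be a text node.
children: Iterates over the direct children of a tag.
descendants: Iterates recursively over everything beneath a tag: its children, their children, and so on.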
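Several of the entries above (find, find_parent, find_parents, the sibling attributes, and descendants) are not covered by the earlier examples. The sketch below exercises them on a small made-up list; the markup and variable names are assumptions for illustration. The markup deliberately contains no whitespace between tags, because whitespace would show up as text nodes when stepping through siblings.

```python
from bs4 import BeautifulSoup

# Illustrative markup with no whitespace between tags.
markup = '<div id="menu"><ul><li>Item 1</li><li>Item 2</li><li>Item 3</li></ul></div>'
soup = BeautifulSoup(markup, "html.parser")

# find returns the first match, or None when nothing matches.
first_item = soup.find("li")

# find_parent returns the nearest matching ancestor;
# find_parents returns all matching ancestors, nearest first.
nearest_div = first_item.find_parent("div")
ancestor_names = [tag.name for tag in first_item.find_parents()]

# Sibling attributes step sideways within the same parent.
second_item = first_item.next_sibling
back_again = second_item.previous_sibling

# children yields direct children only; descendants also walks
# nested nodes, including the text strings inside each <li>.
child_count = len(list(soup.ul.children))
descendant_count = len(list(soup.ul.descendants))

print(first_item.get_text(), nearest_div["id"], second_item.get_text())
print(child_count, descendant_count)
```

When the markup is pretty-printed, next_sibling often returns a newline string first; find_next_sibling skips straight to the next tag in that case.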