
BeautifulSoup in Python

Learn how to use BeautifulSoup for web scraping in Python. A complete guide with examples.

pip install beautifulsoup4

Overview

BeautifulSoup is a Python library for pulling data out of HTML and XML files, which makes it a common choice for web scraping. It provides Pythonic idioms for iterating over, searching, and modifying the parse tree that it builds from the markup.

Key features of BeautifulSoup include parsing HTML and XML documents, navigating the parse tree, and searching the tree by tag name, attribute, text, or CSS selector. It is widely used in scraping, crawling, and data-mining workflows to extract data from web pages. It does not download pages itself, so it is usually paired with an HTTP client.

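To get started with BeautifulSoup, install the beautifulsoup4 package with pip and import the BeautifulSoup class from the bs4 module. The built-in 'html.parser' handles HTML with no extra dependencies; parsing XML additionally requires the lxml package, selected by passing 'xml' as the parser. The common pattern is to load the document, navigate the parse tree, and extract data using BeautifulSoup's search methods.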
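
The load, navigate, extract pattern described above can be sketched as follows. The markup, the book titles, and the variable names are made up for illustration; in practice the HTML would come from a file or an HTTP response.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, standing in for a downloaded page.
html_doc = """
<html>
  <head><title>Book list</title></head>
  <body>
    <h1>Books</h1>
    <ul id="books">
      <li><a href="/books/1">First Book</a></li>
      <li><a href="/books/2">Second Book</a></li>
    </ul>
  </body>
</html>
"""

# 1. Load the document with the built-in parser.
soup = BeautifulSoup(html_doc, "html.parser")

# 2. Navigate the parse tree by tag name.
page_title = soup.title.string
heading = soup.body.h1.get_text()

# 3. Extract data with a search method.
books = [(a.get_text(), a["href"]) for a in soup.find_all("a")]

print(page_title)  # Book list
print(heading)     # Books
print(books)       # [('First Book', '/books/1'), ('Second Book', '/books/2')]
```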

Code Examples

Parse HTML Document

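from bs4 import BeautifulSoup

html_doc = "<html><head><title>The Dormouse's story</title></head><body><p class='title'><b>The Dormouse's story</b></p></body></html>"
# Build the parse tree with Python's built-in HTML parser
soup = BeautifulSoup(html_doc, 'html.parser')
# Attribute access returns the first tag with that name
print(soup.title)  # <title>The Dormouse's story</title>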

Find All Links

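from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><a href="http://example.com/elsie">Elsie</a></p>', 'html.parser')
# find_all returns every <a> tag in the document
links = soup.find_all('a')
for link in links:
    # .get returns None instead of raising if the attribute is missing
    print(link.get('href'))  # http://example.com/elsie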

Get Text from Tag

from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><p>Hello, World!</p></div>', 'html.parser')
# get_text returns the text beneath the tag, without markup
print(soup.p.get_text())  # Hello, World!

Navigate Using Children

from bs4 import BeautifulSoup

soup = BeautifulSoup('<ul><li>Item 1</li><li>Item 2</li></ul>', 'html.parser')
# .children yields direct children only; here, the two <li> tags
for child in soup.ul.children:
    print(child)

Search with CSS Selectors

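from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="content"><p>Sample</p></div>', 'html.parser')
# select returns a list of every tag matching the CSS selector
content = soup.select('.content')
for item in content:
    print(item)  # <div class="content"><p>Sample</p></div>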

Common Methods

find

Returns the first tag that matches the criteria, or None if nothing matches.
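
Because find can return None, calling a method on its result directly can raise AttributeError. A minimal sketch with made-up markup, showing a name filter, a class filter, and a guard for the no-match case:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><p class="intro">Hi</p><p>Bye</p></div>', "html.parser")

first_p = soup.find("p")                # first <p> in document order
intro = soup.find("p", class_="intro")  # filter by CSS class
missing = soup.find("table")            # no <table> in this markup

print(first_p.get_text())  # Hi
print(intro["class"])      # ['intro']
print(missing)             # None

# Guard before using the result of a search that may not match.
table_text = missing.get_text() if missing is not None else ""
```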

find_all

Returns a list of all tags that match the criteria; the list is empty if nothing matches.
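
The criteria can be more than a tag name. The sketch below, using made-up markup, shows a few of the filters find_all accepts: attribute presence, CSS class, a list of tag names, and a result limit.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = (
    '<a href="/a" class="nav">A</a>'
    '<a href="/b">B</a>'
    '<a name="anchor">C</a>'
    '<b>bold</b>'
)
soup = BeautifulSoup(html, "html.parser")

with_href = soup.find_all("a", href=True)     # <a> tags that have an href
nav_links = soup.find_all("a", class_="nav")  # <a> tags with class "nav"
a_or_b = soup.find_all(["a", "b"])            # any <a> or <b> tag
first_two = soup.find_all("a", limit=2)       # stop after two matches

print([a["href"] for a in with_href])  # ['/a', '/b']
print(len(nav_links), len(a_or_b), len(first_two))  # 1 4 2
```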

get_text

Returns all the text in a document or beneath a tag as a single string.
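
By default get_text concatenates the text pieces exactly as they appear, including surrounding whitespace. A short sketch of the separator and strip arguments, with made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p> Hello, </p><p>World! </p></div>", "html.parser")

# Default: text nodes are joined as-is, whitespace included.
raw = soup.div.get_text()

# strip=True trims each piece; separator is placed between pieces.
clean = soup.div.get_text(separator=" ", strip=True)

print(repr(raw))    # ' Hello, World! '
print(repr(clean))  # 'Hello, World!'
```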

select

Returns a list of all tags that match a CSS selector.

find_parents

Returns all ancestors of a tag that match the criteria, nearest first.

find_parent

Returns the nearest ancestor of a tag that matches the criteria. With no criteria, this is the immediate parent.
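
The difference between the immediate parent and the nearest matching ancestor is easiest to see on nested markup. A minimal sketch; the ids "outer" and "inner" are made up for illustration:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div id="outer"><div id="inner"><p><b>text</b></p></div></div>',
    "html.parser",
)
b = soup.b

immediate = b.parent.name                             # direct parent tag
nearest_div = b.find_parent("div")["id"]              # closest <div> ancestor
all_divs = [d["id"] for d in b.find_parents("div")]   # every <div> ancestor

print(immediate)    # p
print(nearest_div)  # inner
print(all_divs)     # ['inner', 'outer']
```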

next_sibling

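Attribute holding the next node at the same level of the tree. This may be a string (such as whitespace) rather than a tag, and is None if there is no next sibling.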

previous_sibling

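Attribute holding the previous node at the same level of the tree. This may be a string (such as whitespace) rather than a tag, and is None if there is no previous sibling.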
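
In formatted HTML, the sibling of a tag is often a whitespace string rather than the neighbouring tag. The sketch below, on made-up markup, contrasts the sibling attributes with the find_next_sibling and find_previous_sibling methods, which skip to the next matching tag.

```python
from bs4 import BeautifulSoup

html = "<ul>\n<li>One</li>\n<li>Two</li>\n</ul>"
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("li")

# The newline between the two <li> tags is itself a node in the tree.
print(repr(first.next_sibling))       # '\n'
print(repr(second.previous_sibling))  # '\n'

# The find_* variants search for the next or previous matching tag.
next_li = first.find_next_sibling("li")
prev_li = second.find_previous_sibling("li")
print(next_li.get_text(), prev_li.get_text())  # Two One

# A node with no sibling on that side has None.
only = BeautifulSoup("<p>only</p>", "html.parser").p
print(only.next_sibling)  # None
```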

children

Attribute that iterates over the direct children of a tag.

descendants

Attribute that iterates recursively over every node beneath a tag: its children, their children, and so on.
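
A small comparison of children and descendants on the same made-up markup. Note that descendants includes text nodes as well as tags, so the sketch filters on the Tag class to list tag names only.

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup("<div><p>Hello <b>World</b></p></div>", "html.parser")

# Direct children of <div>: only the <p> tag.
direct = list(soup.div.children)

# Everything beneath <div>: <p>, "Hello ", <b>, "World".
nested = list(soup.div.descendants)

# Keep only tags, dropping the text nodes.
tag_names = [node.name for node in nested if isinstance(node, Tag)]

print(len(direct))  # 1
print(len(nested))  # 4
print(tag_names)    # ['p', 'b']
```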
