html.parser Library

The html.parser library provides an HTML/XHTML parser compatible with Python’s html.parser module. This is an extended library that is automatically available as a built-in library.

Import

import html.parser

HTMLParser = html.parser.HTMLParser

Available Methods

Method Description
feed(data) Feed HTML data to the parser
reset() Reset the parser instance
close() Force processing of buffered data
get_starttag_text() Get text of most recent start tag
getpos() Get current (line, offset) position

HTMLParser Class

The HTMLParser class is the main interface for parsing HTML documents. Create a subclass and override handler methods to process HTML elements.

Basic Usage

import html.parser

class MyHTMLParser(html.parser.HTMLParser):
    def __init__(self):
        self.data = []

    def handle_starttag(self, tag, attrs):
        print(f"Start tag: {tag}")
        print(f"Attrs: {attrs}")

    def handle_endtag(self, tag):
        print(f"End tag: {tag}")

    def handle_data(self, data):
        self.data.append(data)

parser = MyHTMLParser()
parser.feed("<html><body><p>Hello World!</p></body></html>")

Instance Methods

feed(data)

Feed HTML data to the parser. Parses the HTML and calls appropriate handler methods.

Parameters:

  • data - String containing HTML to parse

Example:

parser.feed("<h1>Title</h1><p>Paragraph</p>")

reset()

Reset the parser instance. Clears internal buffers and parser state.

parser.reset()

close()

Force processing of all buffered data. Call when done feeding data.

parser.close()

get_starttag_text()

Returns the text of the most recently opened start tag.

class MyParser(html.parser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        text = self.get_starttag_text()
        print(f"Raw tag: {text}")

getpos()

Returns a tuple (line, offset) representing the current position in the source.

pos = parser.getpos()
print(f"Line: {pos[0]}, Offset: {pos[1]}")

Handler Methods

Override these methods in your subclass to handle different HTML elements:

handle_starttag(tag, attrs)

Called when a start tag is encountered.

Parameters:

  • tag - Lowercase tag name (e.g., “div”, “p”)
  • attrs - List of (name, value) tuples for attributes
def handle_starttag(self, tag, attrs):
    print(f"<{tag}>")
    for name, value in attrs:
        print(f"  {name}={value}")

handle_endtag(tag)

Called when an end tag is encountered.

Parameters:

  • tag - Lowercase tag name
def handle_endtag(self, tag):
    print(f"</{tag}>")

handle_startendtag(tag, attrs)

Called for self-closing tags like <br/> or <img/>.

Parameters:

  • tag - Lowercase tag name
  • attrs - List of (name, value) tuples
def handle_startendtag(self, tag, attrs):
    print(f"<{tag}/>")

handle_data(data)

Called for text data between tags.

Parameters:

  • data - Text content
def handle_data(self, data):
    if data.strip():
        print(f"Text: {data}")

handle_comment(data)

Called when an HTML comment is encountered.

Parameters:

  • data - Comment content (without )
def handle_comment(self, data):
    print(f"Comment: {data}")

handle_decl(decl)

Called for DOCTYPE and other declarations.

Parameters:

  • decl - Declaration content
def handle_decl(self, decl):
    print(f"Declaration: {decl}")

handle_pi(data)

Called for processing instructions like <?xml ...?>.

Parameters:

  • data - Processing instruction content
def handle_pi(self, data):
    print(f"PI: {data}")

handle_entityref(name)

Called for named character references like &gt;. Only called when convert_charrefs is False.

Parameters:

  • name - Entity name (without & and ;)

handle_charref(name)

Called for numeric character references like &#62;. Only called when convert_charrefs is False.

Parameters:

  • name - Character code (without &# and ;)

Instance Attributes

convert_charrefs

Boolean indicating whether to automatically convert character references. Default is True.

When True, entities like &amp; are converted to & before being passed to handle_data.

# To handle entities manually:
parser.convert_charrefs = False

Complete Example

import html.parser

class LinkExtractor(html.parser.HTMLParser):
    def __init__(self):
        self.links = []
        self.current_text = ""
        self.in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True
            for name, value in attrs:
                if name == "href":
                    self.links.append({"href": value, "text": ""})

    def handle_endtag(self, tag):
        if tag == "a":
            if self.links and self.in_link:
                self.links[-1]["text"] = self.current_text.strip()
            self.current_text = ""
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.current_text = self.current_text + data

# Extract links from HTML
parser = LinkExtractor()
parser.feed("""
<html>
<body>
    <a href="https://example.com">Example</a>
    <a href="https://google.com">Google</a>
</body>
</html>
""")

for link in parser.links:
    print(f"{link['text']}: {link['href']}")
# Output:
# Example: https://example.com
# Google: https://google.com

Enabling in Go

The html.parser library is automatically available as a built-in library. To explicitly register it:

package main

import (
    "github.com/paularlott/scriptling"
    "github.com/paularlott/scriptling/extlibs"
)

func main() {
    p := scriptling.New()

    // Optionally register explicitly (already available by default)
    p.RegisterLibrary("html.parser", extlibs.HTMLParserLibrary)

    code := `
import html.parser

class MyParser(html.parser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        print(tag)

parser = MyParser()
parser.feed("<div>Hello</div>")
`
    p.Eval(code)
}

Differences from Python

  • super().__init__() is not required (or available) - the parent class init is automatically called
  • The from X import Y syntax is not supported - use import html.parser then html.parser.HTMLParser

See Also

  • html - HTML escaping and unescaping utilities
  • re - Regular expressions for text processing