html.parser Library
The html.parser library provides an HTML/XHTML parser compatible with Python’s html.parser module. This is an extended library that is automatically available as a built-in library.
Import
import html.parser
HTMLParser = html.parser.HTMLParserAvailable Methods
| Method | Description |
|---|---|
feed(data) |
Feed HTML data to the parser |
reset() |
Reset the parser instance |
close() |
Force processing of buffered data |
get_starttag_text() |
Get text of most recent start tag |
getpos() |
Get current (line, offset) position |
HTMLParser Class
The HTMLParser class is the main interface for parsing HTML documents. Create a subclass and override handler methods to process HTML elements.
Basic Usage
import html.parser
class MyHTMLParser(html.parser.HTMLParser):
def __init__(self):
self.data = []
def handle_starttag(self, tag, attrs):
print(f"Start tag: {tag}")
print(f"Attrs: {attrs}")
def handle_endtag(self, tag):
print(f"End tag: {tag}")
def handle_data(self, data):
self.data.append(data)
parser = MyHTMLParser()
parser.feed("<html><body><p>Hello World!</p></body></html>")Instance Methods
feed(data)
Feed HTML data to the parser. Parses the HTML and calls appropriate handler methods.
Parameters:
data- String containing HTML to parse
Example:
parser.feed("<h1>Title</h1><p>Paragraph</p>")reset()
Reset the parser instance. Clears internal buffers and parser state.
parser.reset()close()
Force processing of all buffered data. Call when done feeding data.
parser.close()get_starttag_text()
Returns the text of the most recently opened start tag.
class MyParser(html.parser.HTMLParser):
def handle_starttag(self, tag, attrs):
text = self.get_starttag_text()
print(f"Raw tag: {text}")getpos()
Returns a tuple (line, offset) representing the current position in the source.
pos = parser.getpos()
print(f"Line: {pos[0]}, Offset: {pos[1]}")Handler Methods
Override these methods in your subclass to handle different HTML elements:
handle_starttag(tag, attrs)
Called when a start tag is encountered.
Parameters:
tag- Lowercase tag name (e.g., “div”, “p”)attrs- List of (name, value) tuples for attributes
def handle_starttag(self, tag, attrs):
print(f"<{tag}>")
for name, value in attrs:
print(f" {name}={value}")handle_endtag(tag)
Called when an end tag is encountered.
Parameters:
tag- Lowercase tag name
def handle_endtag(self, tag):
print(f"</{tag}>")handle_startendtag(tag, attrs)
Called for self-closing tags like <br/> or <img/>.
Parameters:
tag- Lowercase tag nameattrs- List of (name, value) tuples
def handle_startendtag(self, tag, attrs):
print(f"<{tag}/>")handle_data(data)
Called for text data between tags.
Parameters:
data- Text content
def handle_data(self, data):
if data.strip():
print(f"Text: {data}")handle_comment(data)
Called when an HTML comment is encountered.
Parameters:
data- Comment content (without )
def handle_comment(self, data):
print(f"Comment: {data}")handle_decl(decl)
Called for DOCTYPE and other declarations.
Parameters:
decl- Declaration content
def handle_decl(self, decl):
print(f"Declaration: {decl}")handle_pi(data)
Called for processing instructions like <?xml ...?>.
Parameters:
data- Processing instruction content
def handle_pi(self, data):
print(f"PI: {data}")handle_entityref(name)
Called for named character references like >. Only called when convert_charrefs is False.
Parameters:
name- Entity name (without & and ;)
handle_charref(name)
Called for numeric character references like >. Only called when convert_charrefs is False.
Parameters:
name- Character code (without &# and ;)
Instance Attributes
convert_charrefs
Boolean indicating whether to automatically convert character references. Default is True.
When True, entities like & are converted to & before being passed to handle_data.
# To handle entities manually:
parser.convert_charrefs = FalseComplete Example
import html.parser
class LinkExtractor(html.parser.HTMLParser):
def __init__(self):
self.links = []
self.current_text = ""
self.in_link = False
def handle_starttag(self, tag, attrs):
if tag == "a":
self.in_link = True
for name, value in attrs:
if name == "href":
self.links.append({"href": value, "text": ""})
def handle_endtag(self, tag):
if tag == "a":
if self.links and self.in_link:
self.links[-1]["text"] = self.current_text.strip()
self.current_text = ""
self.in_link = False
def handle_data(self, data):
if self.in_link:
self.current_text = self.current_text + data
# Extract links from HTML
parser = LinkExtractor()
parser.feed("""
<html>
<body>
<a href="https://example.com">Example</a>
<a href="https://google.com">Google</a>
</body>
</html>
""")
for link in parser.links:
print(f"{link['text']}: {link['href']}")
# Output:
# Example: https://example.com
# Google: https://google.comEnabling in Go
The html.parser library is automatically available as a built-in library. To explicitly register it:
package main
import (
"github.com/paularlott/scriptling"
"github.com/paularlott/scriptling/extlibs"
)
func main() {
p := scriptling.New()
// Optionally register explicitly (already available by default)
p.RegisterLibrary("html.parser", extlibs.HTMLParserLibrary)
code := `
import html.parser
class MyParser(html.parser.HTMLParser):
def handle_starttag(self, tag, attrs):
print(tag)
parser = MyParser()
parser.feed("<div>Hello</div>")
`
p.Eval(code)
}Differences from Python
super().__init__()is not required (or available) - the parent class init is automatically called- The
from X import Ysyntax is not supported - useimport html.parserthenhtml.parser.HTMLParser