CLI Tools
Overview
The SCP Python reference implementation includes three command-line tools for working with SCP collections.
| Tool | Purpose |
|---|---|
| scp-validate | Validate SCP collections against JSON schemas |
| scp-inspect | Inspect and view SCP collections in human-readable format |
| scp-benchmark | Compare HTML vs SCP file sizes and parse performance |
scp-validate
Validate SCP collection files against the specification.
Basic Usage
# Validate a collection
scp-validate collection.scp.gz
# Validate multiple collections
scp-validate snapshot.scp.gz delta.scp.gz
Options
# Verbose output (show detailed validation info)
scp-validate -v collection.scp.gz
scp-validate --verbose collection.scp.gz
# Strict mode (fail on warnings, not just errors)
scp-validate --strict collection.scp.gz
# Quiet mode (only show errors)
scp-validate -q collection.scp.gz
scp-validate --quiet collection.scp.gz
What Gets Validated
- Collection metadata: Required fields, version, type, timestamps
- Page objects: Required fields, URL format, date formats
- Content blocks: Block types, required fields per type
- Checksums: SHA-256 validation if present
- Compression: Decompression ratio (max 100:1)
- Size limits: 50 GB compressed max, 500 GB decompressed max
- JSON format: Valid JSON Lines format
scp-inspect
Inspect SCP collections and display contents in human-readable format.
Basic Usage
# Show collection metadata only
scp-inspect collection.scp.gz
# Show collection metadata and all pages
scp-inspect --pages collection.scp.gz
# Show everything (metadata, pages, content blocks)
scp-inspect --content collection.scp.gz
Options
# Limit number of pages shown
scp-inspect --pages --limit 10 collection.scp.gz
# JSON output (machine-readable)
scp-inspect --json collection.scp.gz > output.json
# Show specific page by URL
scp-inspect --url "https://example.com/page" collection.scp.gz
# Show only pages modified after date
scp-inspect --since "2000-01-15T00:00:00Z" collection.scp.gz
Use Cases
Check collection contents:
scp-inspect collection.scp.gz
Find specific page:
scp-inspect --url "https://example.com/blog/my-post" collection.scp.gz
Export to JSON:
scp-inspect --json collection.scp.gz > data.json
View recent changes (delta):
scp-inspect --pages --since "2000-01-15T00:00:00Z" collection.scp.gz
Debug content blocks:
scp-inspect --content --limit 1 collection.scp.gz
scp-benchmark
Compare HTML files against SCP collections to measure bandwidth savings and performance improvements.
Basic Usage
# Compare SCP collection with original HTML files
scp-benchmark collection.scp.gz page1.html page2.html page3.html
# Works with any number of HTML files
scp-benchmark collection.scp.gz *.html
Arguments
scp-benchmark <scp_file> <html_file1> [html_file2] ...
Arguments:
scp_file SCP collection file (.scp.gz or .scp.zst)
html_files One or more HTML files to compare against
Note:
The HTML files should be the same pages that were converted to the SCP file
to ensure a fair comparison.
Metrics Explained
- Number of files: HTML requires separate requests per page; SCP bundles all pages in one file
- Size (raw): Uncompressed size comparison
- Size (compressed): Compressed size comparison (gzip level 6 for both)
- Parse time: Time to parse and extract content from HTML vs SCP
- Compression ratio: How much the SCP file compresses
Use Cases
Validate bandwidth savings claim:
scp-benchmark blog-snapshot.scp.gz /path/to/html/*.html
Compare different sections:
scp-benchmark docs.scp.gz docs/*.html
scp-benchmark blog.scp.gz blog/*.html
Benchmark your website:
# Generate both SCP collection and HTML from your backend data
python generate_collection.py # Creates collection.scp.gz
python generate_html.py # Creates HTML files for users
# Compare the two formats
scp-benchmark collection.scp.gz output/page1.html output/page2.html
Installation
All three tools are included when you install the SCP Python package:
# Clone the repository
git clone https://github.com/crawlcore/scp-protocol.git
cd scp-protocol/reference-impl
# Install the package
pip install -e .