Reader includes a powerful CLI for scraping and crawling from the terminal.

Installation

The CLI is included with the Reader package:
npm install -g @vakra-dev/reader
Or use with npx:
npx @vakra-dev/reader scrape https://example.com

Scrape Command

Basic Usage

# Scrape a single URL
reader scrape https://example.com

# Scrape multiple URLs
reader scrape https://example.com https://example.org

# Short alias
reader s https://example.com

Output Formats

# Markdown only (default)
reader scrape https://example.com

# Multiple formats
reader scrape https://example.com -f markdown,html

# Save to file
reader scrape https://example.com -o output.json

Concurrency

# Scrape multiple URLs concurrently
reader scrape url1 url2 url3 url4 url5 -c 3

Timeouts

# Set per-page timeout
reader scrape https://example.com -t 60000

# Set batch timeout
reader scrape url1 url2 url3 --batch-timeout 300000

Content Extraction

# Disable main content extraction (full page)
reader scrape https://example.com --no-main-content

# Include specific elements
reader scrape https://example.com --include-tags ".article,.content"

# Exclude specific elements
reader scrape https://example.com --exclude-tags ".comments,.sidebar"

Proxy

reader scrape https://example.com --proxy http://user:pass@proxy.example.com:8080

Debugging

# Verbose logging
reader scrape https://example.com -v

# Show browser window
reader scrape https://example.com --show-chrome

All Options

Option             Short   Default    Description
--format           -f      markdown   Output formats (comma-separated)
--output           -o      stdout     Output file path
--concurrency      -c      1          Parallel requests
--timeout          -t      30000      Per-page timeout (ms)
--batch-timeout            300000     Total batch timeout (ms)
--proxy                               Proxy URL
--user-agent                          Custom user agent
--no-main-content                     Include full page
--include-tags                        CSS selectors to include
--exclude-tags                        CSS selectors to exclude
--show-chrome                         Show browser window
--verbose          -v                 Enable logging
--standalone                          Bypass daemon
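
These options can be combined in a single invocation. A sketch using only flags from the table above (the URLs, selectors, and timeout are illustrative values, not defaults):

# Scrape three URLs in parallel, drop comments and sidebars,
# allow 60s per page, and write both formats to a file
reader scrape https://example.com https://example.org https://example.net \
  -c 3 -t 60000 --exclude-tags ".comments,.sidebar" -f markdown,html -o results.json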

Crawl Command

Basic Usage

# Crawl a website
reader crawl https://example.com

# Short alias
reader c https://example.com

Depth and Limits

# Set crawl depth
reader crawl https://example.com -d 3

# Limit pages
reader crawl https://example.com -m 100

# Both
reader crawl https://example.com -d 3 -m 100

Scrape Content

# Crawl and scrape content
reader crawl https://example.com -d 2 --scrape

# With format
reader crawl https://example.com --scrape -f markdown

URL Filtering

# Include patterns
reader crawl https://example.com --include "blog/*,docs/*"

# Exclude patterns
reader crawl https://example.com --exclude "admin/*,api/*"

Rate Limiting

# Set delay between requests
reader crawl https://example.com --delay 2000

All Options

Option             Short   Default    Description
--depth            -d      1          Maximum crawl depth
--max-pages        -m      20         Maximum pages to discover
--scrape           -s                 Scrape content
--format           -f      markdown   Output formats
--output           -o      stdout     Output file path
--delay                    1000       Delay between requests (ms)
--timeout          -t                 Total crawl timeout (ms)
--include                             URL patterns to include
--exclude                             URL patterns to exclude
--proxy                               Proxy URL
--user-agent                          Custom user agent
--show-chrome                         Show browser window
--verbose          -v                 Enable logging
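
These options also combine. A sketch using only flags from the table above (the depth, page limit, delay, and pattern are illustrative choices):

# Crawl the docs section three levels deep, up to 200 pages,
# scraping each page with a 2s delay and saving everything to one file
reader crawl https://example.com -d 3 -m 200 --scrape \
  --include "docs/*" --delay 2000 -o crawl.json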

Daemon Mode

For multiple requests, use daemon mode to keep the browser pool warm:

Start Daemon

# Start with default settings
reader start

# Custom pool size
reader start --pool-size 5

# Custom port
reader start -p 4000
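
The start flags can be combined; a sketch with illustrative values:

# Larger pool on a custom port
reader start --pool-size 5 -p 4000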

Check Status

reader status

Stop Daemon

reader stop

Auto-Connect

When a daemon is running, CLI commands automatically connect to it:
# Start daemon
reader start --pool-size 5

# These commands use the daemon's browser pool
reader scrape https://example.com
reader scrape https://example.org
reader crawl https://example.net

# Bypass daemon (standalone mode)
reader scrape https://example.com --standalone

# Stop daemon when done
reader stop

Output Format

CLI output is always JSON with the following structure:

Scrape Output

{
  "data": [
    {
      "markdown": "# Page Title\n\nContent...",
      "html": "<h1>Page Title</h1>...",
      "metadata": {
        "baseUrl": "https://example.com",
        "scrapedAt": "2024-01-15T10:30:00Z",
        "duration": 1234,
        "website": {
          "title": "Page Title",
          "description": "Page description"
        }
      }
    }
  ],
  "batchMetadata": {
    "totalUrls": 1,
    "successfulUrls": 1,
    "failedUrls": 0,
    "totalDuration": 1234
  }
}

Crawl Output

{
  "urls": [
    { "url": "https://example.com/", "title": "Home" },
    { "url": "https://example.com/about", "title": "About" }
  ],
  "metadata": {
    "totalUrls": 2,
    "maxDepth": 1,
    "totalDuration": 2345,
    "seedUrl": "https://example.com"
  }
}
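
Crawl output pipes to jq the same way as scrape output; a minimal sketch based on the structure above:

# List only the discovered URLs
reader crawl https://example.com -d 2 | jq -r '.urls[].url'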

Examples

Scrape and process with jq

# Extract just the markdown
reader scrape https://example.com | jq -r '.data[0].markdown'

# Get all titles from batch
reader scrape url1 url2 url3 | jq -r '.data[].metadata.website.title'

Save crawl results

reader crawl https://docs.example.com -d 3 --scrape -o docs.json

Batch scrape from file

cat urls.txt | xargs reader scrape -c 5 -o results.json
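
If you want one output file per URL instead of a single batch result, a small shell loop works too. A sketch assuming urls.txt holds one URL per line (the filename derivation is illustrative):

# Scrape each URL into its own JSON file
while read -r url; do
  # Derive a safe filename from the URL (strip scheme, replace separators)
  name=$(printf '%s' "$url" | sed -e 's|^[a-z]*://||' -e 's|[/?&]|_|g')
  reader scrape "$url" -o "${name}.json"
done < urls.txt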
