A simple CLI tool to retrieve latest jnovel releases as a Markdown table.
Find a file
Alexey Skobkin 7d5a794894
Some checks failed
continuous-integration/drone/push Build is passing
CI / build (push) Has been cancelled
CI fixes.
2025-11-05 07:21:52 +03:00
.github/workflows CI fixes. 2025-11-05 07:21:52 +03:00
cmd/jnovels-scrape package name changed. 2025-11-05 06:48:43 +03:00
internal More test coverage. 2025-11-05 07:06:26 +03:00
.drone.yml CI fixes. 2025-11-05 07:21:52 +03:00
.gitignore initial commit. 2025-11-05 06:43:32 +03:00
go.mod package name changed. 2025-11-05 06:48:43 +03:00
README.md README.md examples fixed and extended. 2025-11-05 07:05:18 +03:00

jnovels-scrape

Build Status

jnovels-scrape is a Go CLI that gathers the latest posts from jnovels.com, normalises their metadata, and generates a Markdown table of releases. The tool prefers the WordPress REST API and transparently falls back to HTML scraping when the API is unavailable.

Features

  • WordPress REST API pagination with automatic stop once posts fall below the cutoff date
  • HTML fallback that mirrors the site archive when the API cannot be reached
  • Type (EPUB, PDF, MANGA, UNKNOWN) and volume inference with warnings on partial data
  • Filters on type, title substring, and exact volume match
  • Client rate limiting and respectful handling of server-side throttling (Retry-After / backoff)
  • Markdown output sorted by date (desc) then title (asc)

Install

go build ./cmd/jnovels-scrape

This produces the jnovels-scrape binary in the current directory.

Usage

The cutoff date (--until) is required and uses YYYY-MM-DD.

# Simple run
./jnovels-scrape --until 2024-12-01 --out releases.md

# More advanced
./jnovels-scrape --until 2025-01-01 --group title --type EPUB --out result.md

Common flags:

Flag Description
--until <date> Required cutoff date (YYYY-MM-DD). Only posts on/after this date are kept.
--type, -t <list> Comma-separated subset of epub,pdf,manga,unknown (case-insensitive).
--title, --name, -n <substr> Case-insensitive title substring filter.
--volume, -v <num> Filter by exact volume (integer or decimal). Posts without a recognised volume are dropped.
--out <path> Output file. Default is stdout.
--max-pages <int> Safety limit when paging (default 2000).
--concurrency <int> Detail fetch concurrency for HTML fallback (default 4).
--req-interval <duration> Minimum interval between every HTTP request (default 600ms).
--limit-wait <duration> Wait time when the server rate limits without Retry-After (default 60s).
--group <none|title> Group output rows before sorting (default none).
--group-sort <asc|desc> Sort order inside title groups when --group=title (default asc).
--mode <auto|api|html> Select fetch strategy. auto (default) tries the API before falling back to HTML.

Example

./jnovels-scrape \
  --until 2024-11-01 \
  --type epub,pdf \
  --title "mercenary" \
  --req-interval 1s \
  --out mercenary.md

Modes & Fallback

  • auto (default): Try the WordPress REST API first; on failure, fall back to HTML crawling.
  • api: Force API-only mode. The command exits with an error if the API is unreachable.
  • html: Force HTML-only scraping (never hitting the API).

API mode uses wp-json/wp/v2/posts with per_page=100, orderby=date, and an after parameter derived from --until. Taxonomies are fetched once to improve type inference. HTML mode mirrors the /page/{n}/ archives, extracts titles/links, and loads each post to read the authoritative publish date, categories, and tags.

Warnings are emitted for partial records (e.g., blank volumes, UNKNOWN type, skipped posts without publish dates). These appear on stderr prefixed with WARN.

Rate Limiting

  • Client-side throttle (--req-interval): Enforced for every HTTP request including taxonomy lookups and post detail fetches.
  • Server-side limits (--limit-wait): Applied when a 429/503 response lacks a Retry-After header. When the header is present, the tool sleeps for the provided duration.

A small jitter (±10%) is added to retry waits to avoid thundering herds.

Output format

The generated Markdown begins with a header noting the cutoff date followed by a table:

Generated from jnovels.com (cutoff: 2024-12-01)

| Title | Volume | Type | Date | Link |
|---|---:|---|---|---|
| Example Title | 4 | PDF | 2024-12-02 | [link](https://jnovels.com/example-title-volume-4-pdf/) |

Titles are HTML-stripped, entities are unescaped, pipes are escaped, and dates are normalised to YYYY-MM-DD (UTC). Volume cells may be blank when no numeric volume is present.

Grouping

Use --group=title to cluster releases that share the same cleaned title (e.g. EPUB/PDF pairs or different volume parts). Title groups are sorted alphabetically, and the rows inside each group are ordered by volume number (--group-sort=asc|desc, default ascending). Entries without a parsed volume stay within their title group but follow the numbered volumes.

Development

go test ./...

The repository contains unit tests for filters, type/volume parsing, and time parsing.

CI (GitHub Actions) runs formatting checks (go fmt), go vet, and the unit test suite on every push.