web_to_ebook

dteviot/WebToEpub: A simple Chrome (and Firefox) Extension that converts Web Novels (and other web pages) into an EPUB.

  • the most straightforward style to use

Converting a website to an ebook — Austin’s Thought Basin: a w3m-like script, seems legit

yuratomo/w3m.vim: w3m plugin for vim (w3m in vim)

readability-cli - npm / gardenappl/readability-cli · GitLab

Read HTML from a file and output the result to the console:

readable index.html

Fetch a random Wikipedia article, get its title and an excerpt:

readable https://en.wikipedia.org/wiki/Special:Random -p title,excerpt

Fetch a web page and read it in W3M:

readable https://www.nytimes.com/2020/01/18/technology/clearview-privacy-facial-recognition.html | w3m -T text/html
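
To save the rendered text instead of browsing it, w3m's -dump flag should work (a variation on the above; the output filename is arbitrary):

readable https://en.wikipedia.org/wiki/Special:Random | w3m -T text/html -dump > article.txt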

Download a web page using cURL, parse it and output as JSON:

curl https://github.com/mozilla/readability | readable --base=https://github.com/mozilla/readability --json
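
The JSON can then be picked apart with jq; a sketch, assuming the output object includes a title field:

curl -s https://github.com/mozilla/readability | readable --base=https://github.com/mozilla/readability --json | jq -r '.title'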
 
#!/bin/sh
 
set -eu
 
# Point this at whatever file your URLs are stored in.
urls="urls"
 
# Make the directory where we'll store the clean HTML for each post.
mkdir -p posts
 
# Iterate over the URLs, download them and clean them up with the readability-cli
# We use a count here to ensure that we organize the output posts in the same order that they are specified
# in the input file. This is helpful as you can lay out the full order of your book by just editing the URLs file.
count=1
cat "$urls" | while read url
do
    output=$(printf "posts/%03d.html" $count)
    readable -q --low-confidence force "$url" -o "$output" 2>&1 > /dev/null
    count=$((count+1))
done
 
# Take all of the posts and put them into a book.
pandoc -o TheBook.epub posts/*.html
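
pandoc warns when the EPUB has no title; a variant of that last step that sets one via --metadata (the title text is arbitrary):

pandoc -o TheBook.epub --metadata title="The Book" posts/*.html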
 

GPT-suggested way:

wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://zmk.dev/docs
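
To go from that mirror to an EPUB, one sketch (the entry-point filename is an assumption; check what wget actually saved under the host directory):

ebook-convert zmk.dev/docs/index.html zmk-docs.epub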

Script version

#!/bin/bash
 
# Check if URL is provided
if [ $# -eq 0 ]; then
    echo "Please provide a URL as an argument."
    exit 1
fi
 
URL=$1
DOMAIN=$(echo "$URL" | awk -F'[/:]' '{print $4}')
 
# Download the webpage and its assets
wget --level=inf --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --no-host-directories --directory-prefix="$DOMAIN" "$URL"
 
# Find the main HTML file
HTML_FILE=$(find "$DOMAIN" -maxdepth 1 -type f \( -name "*.html" -o -name "*.htm" \) | head -n 1)
 
if [ -z "$HTML_FILE" ]; then
    echo "No HTML file found in the downloaded content."
    exit 1
fi
 
# Create a temporary TOC file
TOC_FILE="toc.html"
echo "<h1>Table of Contents</h1>" > $TOC_FILE
echo "<ul>" >> $TOC_FILE
 
# Extract headers and create TOC
grep -n "<h[1-3]" "$HTML_FILE" | while IFS=: read -r line_number line_content; do
    header_level=$(echo $line_content | sed -n 's/.*<h\([1-3]\).*/\1/p')
    header_text=$(echo $line_content | sed -n 's/.*<h[1-3][^>]*>\(.*\)<\/h[1-3].*/\1/p')
    echo "<li><a href=\"$HTML_FILE#line$line_number\">$header_text</a></li>" >> $TOC_FILE
done
 
echo "</ul>" >> $TOC_FILE
 
# Add id attributes to the headers first, so the ids line up with the
# line numbers grep reported above (inserting the TOC first would shift them)
awk '/<h[1-3]/ { sub(/<h[1-3]/, "& id=\"line" NR "\"") } { print }' "$HTML_FILE" > "$HTML_FILE.tmp" \
    && mv "$HTML_FILE.tmp" "$HTML_FILE"

# Insert TOC into the main HTML file just after the <body> tag
sed -i "/<body/r $TOC_FILE" "$HTML_FILE"
 
# Convert to EPUB
ebook-convert "$HTML_FILE" "${DOMAIN}.epub" --output-profile tablet --toc-title "Table of Contents" --level1-toc "//h:h1" --level2-toc "//h:h2" --level3-toc "//h:h3"
 
# Clean up
rm $TOC_FILE
 
# echo "Conversion complete. Output file: ${DOMAIN}.epub"
# Here's what this script does:
#
# It checks if a URL is provided as an argument.
# It uses wget to download the webpage and its assets, similar to the example you provided.
# It finds the main HTML file in the downloaded content.
# It creates a temporary Table of Contents (TOC) file by extracting h1, h2, and h3 headers from the main HTML file.
# It adds id attributes to the headers in the main HTML file for internal linking.
# It inserts the TOC into the main HTML file just after the <body> tag.
# It uses Calibre's ebook-convert to convert the HTML file to EPUB format, specifying the TOC structure.
# Finally, it cleans up the temporary TOC file.

And another shell script, from:

Convert Web Pages to Ebooks in MOBI Format using Wget and Calibre | Geeksta

#!/bin/bash
# Convert web pages to ebooks in MOBI format using wget and calibre.
set -euo pipefail
 
# Assign and check that URL argument is provided
URL=${1:-}
if [ -z "$URL" ]; then
  echo "Usage: $0 <URL>"
  exit 1
fi
 
# Download the webpage
# --level=inf: follows links to an unlimited depth (useful for downloading all linked assets)
# --no-clobber: don't overwrite any existing files, so you can run the script multiple times without re-downloading everything
# --page-requisites: downloads all necessary files to display the page, including CSS files, images, and JavaScript files
# --html-extension: save the downloaded HTML file with a .html extension, even if the original URL didn't have one
# --convert-links: converts all links in the downloaded files to relative links so they work offline
# --restrict-file-names=windows: replace any characters in filenames that are illegal on Windows (such as : or ?) with underscores
wget --level=inf --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows "$URL"
 
# Parse the host name from the URL
host=$(echo "$URL" | cut -d"/" -f3)
 
# Determine name of downloaded HTML file
html_file=$(find "$host" -name "*.html" -o -name "*.htm" | head -1)
if [ -z "$html_file" ]; then
  echo "No HTML file found in $subdir"
  exit 1
fi
 
# Convert HTML file to EPUB using calibre
ebook-convert "$html_file" "$(basename "$URL" .html).mobi" --output-profile kindle_pw
 
 

extract toc from ebook 2024-07-22

#!/bin/bash
 
# Define where to find the EPUB files (current directory in this case)
FILES=*.epub
for f in $FILES
do
  # Extract the base name of the file for naming the toc files
  basename=$(basename "$f" .epub)
 
  # Get all possible toc.ncx paths within the EPUB file
  tocfiles=$(unzip -l "$f" | grep 'toc.ncx' | awk '{print $4}')
 
  for toc in $tocfiles
  do
    # Extract toc.ncx contents directly into sed to strip tags and save unique lines
    unzip -p "$f" "$toc" | sed 's/<[^>]*>//g' | awk '!seen[$0]++' > "${basename}_$(basename "$toc" .ncx)_unique_toc.txt"
  done
 
  # Echo progress
  echo "Processed TOC for $f"
done
 
echo "All .epub files have been processed."
 
 

w3m

  • its single-file save is pretty good!

fc-list: after each filename, the remaining columns describe the font (family, then style)
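
For example, to print just the file and family name, one font per line:

fc-list : file family | sort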

FILES=*.epub
for f in $FILES
do
  # extension="${f##*.}"
  filename="${f%.*}"
  echo "Converting $f to $filename.pdf"
  ebook-convert "$f" "$filename.pdf" \
  --pdf-serif-family "Victor Mono" \
  --pdf-sans-family "Victor Mono" \
  --pdf-mono-family "Victor Mono" \
  --pretty-print
  #ebook-convert "$f" "$filename.pdf" --pdf-serif-family "Reddit Mono"
 
done
 

From calibre:

--embed-font-family

Embed the specified font family into the book. This specifies the "base" font used for the book. If the input document specifies its own fonts, they may override this base font. You can use the filter style information option to remove fonts from the input document. Note that font embedding only works with some output formats, principally EPUB, AZW3 and DOCX.

--embed-all-fonts

    Embed every font that is referenced in the input document but not already embedded. This will search your system for the fonts, and if found, they will be embedded. Embedding will only work if the format you are converting to supports embedded fonts, such as EPUB, AZW3, DOCX or PDF. Please ensure that you have the proper license for embedding the fonts used in this document.
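
A hedged sketch combining the base-font option with the "filter style information" option mentioned above (--filter-css; "Victor Mono" is just the family used elsewhere in these notes, match it against fc-list output):

ebook-convert input.html output.epub \
  --embed-font-family "Victor Mono" \
  --filter-css "font-family"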


Good script, but it does not download images:

#!/bin/sh
set -eu
 
# Point this at whatever file your URLs are stored in.
urls="./url.txt"
 
# Make the directory where we'll store the clean HTML for each post.
mkdir -p posts
 
# Iterate over the URLs, download them and clean them up with the readability-cli
# We use a count here to ensure that we organize the output posts in the same order that they are specified
# in the input file. This is helpful as you can lay out the full order of your book by just editing the URLs file.
count=1
cat "$urls" | while read url
do
 
    # Generate output filename with zero-padded count (e.g., 001.html)
    output=$(printf "posts/%03d.html" $count)
 
    # Download the URL using singlefile
    echo "Downloading $url to $output"
    single-file "$url" "$output"
 
    # Increment the counter
    count=$((count+1))
done
 
# Workable script option:
#     readable -q --low-confidence keep -C "$url" -o "$output" > /dev/null 2>&1
 

pandoc embed-font-family:

pandoc -o data2.epub posts/*.html --epub-subdirectory=VictorMono --epub-embed-font='VictorMono/VictorMono.ttf'

  • it requires the font folder to exist (sketch below)
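
A sketch of preparing that folder; the source path is an assumption, locate your copy first:

# find the installed font file: fc-list | grep -i "victor mono"
mkdir -p VictorMono
cp ~/.local/share/fonts/VictorMono-Regular.ttf VictorMono/VictorMono.ttf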

pdf to toc

fd -e pdf -x sh -c 'pdfcpu bookmarks list "$1" > "${1%.pdf}.md"' _ {}
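
The same thing as a plain loop, in the style of the other scripts here (assuming pdfcpu's bookmarks subcommand):

for f in *.pdf
do
  pdfcpu bookmarks list "$f" > "${f%.pdf}.md"
done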


7z and epub

To update the EPUB file using 7-Zip, you can use the following command line:

7z u sicp.epub mimetype html META-INF content.opf index.xhtml LICENSE toc.xhtml

This command will update the sicp.epub file with the listed files and directories[1][5]. Here’s a breakdown of the command:

  • 7z: Invokes the 7-Zip command-line tool
  • u: Update command, which adds new files and updates existing ones in the archive
  • sicp.epub: The name of your EPUB file
  • The remaining items are the files and directories to be updated or added

Note that the mimetype file should be the first file in the EPUB archive and should not be compressed[2]. To ensure this, you may want to use a two-step process:

  1. Add the mimetype file without compression:
7z a -tzip sicp.epub mimetype -mx0
  2. Then add or update the remaining files:
7z u sicp.epub html META-INF content.opf index.xhtml LICENSE toc.xhtml

This approach ensures that the mimetype file is the first in the archive and uncompressed, which is important for EPUB validation[2][7].
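
To check the result, list the archive; mimetype should be the first entry:

unzip -l sicp.epub | head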

Citations:
[1] https://gist.github.com/spajak/9b8b8a46f7ebf8390f5943c3fe73195e
[2] https://stackoverflow.com/questions/18824773/zip-an-epub-using-a-batch-file-and-7zip
[3] https://superuser.com/questions/908184/zip-epub-using-7zip-and-issue-with-exlude-file
[4] https://www.baeldung.com/linux/7z-tutorial
[5] https://www.tecmint.com/7zip-command-examples-in-linux/
[6] https://www.7-zip.org/faq.html
[7] https://kdpcommunity.com/s/question/0D5f400000FHiZDCA1/anybody-know-how-to-unzip-an-epub-and-i-assume-zip-it-back-up
[8] https://py7zr.readthedocs.io/en/latest/user_guide.html


pptx to pdf with slide transitions

maxonthegit/PPspliT: A PowerPoint add-in that splits slides according to slideshow-time animation effects


replacing pdf font

PyMuPDF-Utilities/font-replacement at master · pymupdf/PyMuPDF-Utilities

extract pdf font

PDF processor api & cli | pdfcpu
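
A minimal sketch using pdfcpu's extract command in font mode (the output directory name is arbitrary):

pdfcpu extract -mode font input.pdf fonts/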