web_to_ebook
- most straightforward using style
- Converting a website to an ebook — Austin's Thought Basin (w3m-like script, seems legit)
- yuratomo/w3m.vim: w3m plugin for vim (w3m in vim)
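A minimal sketch of the w3m route (assuming w3m and pandoc are installed; URL and filenames are placeholders): dump the rendered page to plain text, then let pandoc build the EPUB.
w3m -dump https://example.com/some-article > article.txt
pandoc article.txt -o article.epub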
readability-cli (npm; gardenappl/readability-cli · GitLab)
Read HTML from a file and output the result to the console:
readable index.html
Fetch a random Wikipedia article, get its title and an excerpt:
readable https://en.wikipedia.org/wiki/Special:Random -p title,excerpt
Fetch a web page and read it in W3M:
readable https://www.nytimes.com/2020/01/18/technology/clearview-privacy-facial-recognition.html | w3m -T text/html
Download a web page using cURL, parse it and output as JSON:
curl https://github.com/mozilla/readability | readable --base=https://github.com/mozilla/readability --json
#!/bin/sh
set -eu
# Point this at whatever file your URLs are stored in.
urls="urls"
# Make the directory where we'll store the clean HTML for each post.
mkdir -p posts
# Iterate over the URLs, download them and clean them up with the readability-cli
# We use a count here to ensure that we organize the output posts in the same order that they are specified
# in the input file. This is helpful as you can lay out the full order of your book by just editing the URLs file.
count=1
cat "$urls" | while read url
do
output=$(printf "posts/%03d.html" $count)
readable -q --low-confidence force "$url" -o "$output" 2>&1 > /dev/null
count=$((count+1))
done
# Take all of the posts and put them into a book.
pandoc -o TheBook.epub posts/*.html
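The urls file here is assumed to be plain text with one URL per line, for example:
https://example.com/post-one
https://example.com/post-two
Running the script then yields posts/001.html, posts/002.html, ... in that order, plus TheBook.epub.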
gpt suggested way:
wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://zmk.dev/docsanaznazv
Script version
#!/bin/bash
# Check if a URL is provided
if [ $# -eq 0 ]; then
    echo "Please provide a URL as an argument."
    exit 1
fi
URL=$1
DOMAIN=$(echo "$URL" | awk -F'[/:]' '{print $4}')
# Download the webpage and its assets
wget --level=inf --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --no-host-directories --directory-prefix="$DOMAIN" "$URL"
# Find the main HTML file
HTML_FILE=$(find "$DOMAIN" -maxdepth 1 -type f \( -name "*.html" -o -name "*.htm" \) | head -n 1)
if [ -z "$HTML_FILE" ]; then
    echo "No HTML file found in the downloaded content."
    exit 1
fi
# Create a temporary TOC file
TOC_FILE="toc.html"
echo "<h1>Table of Contents</h1>" > "$TOC_FILE"
echo "<ul>" >> "$TOC_FILE"
# Extract headers and create TOC entries linking to line-number anchors
grep -n "<h[1-3]" "$HTML_FILE" | while IFS=: read -r line_number line_content; do
    header_text=$(echo "$line_content" | sed -n 's/.*<h[1-3][^>]*>\(.*\)<\/h[1-3].*/\1/p')
    echo "<li><a href=\"#line$line_number\">$header_text</a></li>" >> "$TOC_FILE"
done
echo "</ul>" >> "$TOC_FILE"
# Add matching id attributes to the headers (perl's $. is the current line number).
# This must happen before the TOC is inserted, otherwise the line numbers would shift
# and the anchors generated above would no longer match.
perl -i -pe 's/<h([1-3])([^>]*)>/<h$1$2 id="line$.">/g' "$HTML_FILE"
# Insert TOC into the main HTML file just after the <body> tag
sed -i "/<body/r $TOC_FILE" "$HTML_FILE"
# Convert to EPUB
ebook-convert "$HTML_FILE" "${DOMAIN}.epub" --output-profile tablet --toc-title "Table of Contents" --level1-toc "//h:h1" --level2-toc "//h:h2" --level3-toc "//h:h3"
# Clean up
rm "$TOC_FILE"
echo "Conversion complete. Output file: ${DOMAIN}.epub"
# Here's what this script does:
#
# It checks if a URL is provided as an argument.
# It uses wget to download the webpage and its assets, similar to the wget mirror example above.
# It finds the main HTML file in the downloaded content.
# It creates a temporary Table of Contents (TOC) file by extracting h1, h2, and h3 headers from the main HTML file.
# It adds id attributes to the headers in the main HTML file for internal linking.
# It inserts the TOC into the main HTML file just after the <body> tag.
# It uses Calibre's ebook-convert to convert the HTML file to EPUB format, specifying the TOC structure.
# Finally, it cleans up the temporary TOC file.
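Example usage, assuming the script is saved as site2epub.sh (the name is made up here):
chmod +x site2epub.sh
./site2epub.sh https://example.com/docs/index.html   # produces example.com.epub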
and another shell script, from:
Convert Web Pages to Ebooks in MOBI Format using Wget and Calibre | Geeksta
#!/bin/bash
# Convert web pages to ebooks in MOBI format using wget and calibre.
set -euo pipefail
# Assign and check that URL argument is provided
URL=${1:-}
if [ -z "$URL" ]; then
    echo "Usage: $0 <URL>"
    exit 1
fi
# Download the webpage
# --level=inf: follows links to an unlimited depth (useful for downloading all linked assets)
# --no-clobber: don't overwrite any existing files, so you can run the script multiple times without re-downloading everything
# --page-requisites: downloads all necessary files to display the page, including CSS files, images, and JavaScript files
# --html-extension: save the downloaded HTML file with a .html extension, even if the original URL didn't have one
# --convert-links: converts all links in the downloaded files to relative links so they work offline
# --restrict-file-names=windows: replace any characters in filenames that are illegal on Windows (such as : or ?) with underscores
wget --level=inf --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows "$URL"
# Parse the host name from the URL
host=$(echo "$URL" | cut -d"/" -f3)
# Determine name of downloaded HTML file
html_file=$(find "$host" -name "*.html" -o -name "*.htm" | head -1)
if [ -z "$html_file" ]; then
    echo "No HTML file found in $host"
    exit 1
fi
# Convert HTML file to EPUB using calibre
ebook-convert "$html_file" "$(basename "$URL" .html).mobi" --output-profile kindle_pw
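Usage would look something like this (script name assumed; the .mobi name comes from the last path component of the URL):
./web2mobi.sh https://example.com/article.html   # produces article.mobi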
extract toc from ebook 2024-07-22
#!/bin/bash
# Define where to find the EPUB files (current directory in this case)
FILES=*.epub
for f in $FILES
do
    # Extract the base name of the file for naming the toc files
    basename=$(basename "$f" .epub)
    # Get all possible toc.ncx paths within the EPUB file
    tocfiles=$(unzip -l "$f" | grep 'toc.ncx' | awk '{print $4}')
    for toc in $tocfiles
    do
        # Extract toc.ncx contents directly into sed to strip tags and save unique lines
        unzip -p "$f" "$toc" | sed 's/<[^>]*>//g' | awk '!seen[$0]++' > "${basename}_$(basename "$toc" .ncx)_unique_toc.txt"
    done
    # Echo progress
    echo "Processed TOC for $f"
done
echo "All .epub files have been processed."
- SingleFile !! is pretty good
fc-list: after the filenames, the remaining columns describe the font (family and style), which is where the family names used below come from.
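For example, to find the exact family name to pass to ebook-convert (the font path in the sample output line is illustrative):
fc-list | grep -i "victor mono"
# e.g. /usr/share/fonts/truetype/victor-mono/VictorMono-Regular.ttf: Victor Mono:style=Regular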
FILES=*.epub
for f in $FILES
do
    # extension="${f##*.}"
    filename="${f%.*}"
    echo "Converting $f to $filename.pdf"
    ebook-convert "$f" "$filename.pdf" \
        --pdf-serif-family "Victor Mono" \
        --pdf-sans-family "Victor Mono" \
        --pdf-mono-family "Victor Mono" \
        --pretty-print
    # ebook-convert "$f" "$filename.pdf" --pdf-serif-family "Reddit Mono"
done
from calibre:
--embed-font-family
Embed the specified font family into the book. This specifies the "base" font used for the book. If the input document specifies its own fonts, they may override this base font. You can use the filter style information option to remove fonts from the input document. Note that font embedding only works with some output formats, principally EPUB, AZW3 and DOCX.
--embed-all-fonts
Embed every font that is referenced in the input document but not already embedded. This will search your system for the fonts, and if found, they will be embedded. Embedding will only work if the format you are converting to supports embedded fonts, such as EPUB, AZW3, DOCX or PDF. Please ensure that you have the proper license for embedding the fonts used in this document.
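A short example of both options with ebook-convert (input/output filenames are placeholders):
ebook-convert input.html output.epub --embed-font-family "Victor Mono"
ebook-convert input.html output.epub --embed-all-fonts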
Good script, but without image downloading:
#!/bin/sh
set -eu
# Point this at whatever file your URLs are stored in.
urls="./url.txt"
# Make the directory where we'll store the clean HTML for each post.
mkdir -p posts
# Iterate over the URLs and download each one with single-file
# We use a count here to ensure that we organize the output posts in the same order that they are specified
# in the input file. This is helpful as you can lay out the full order of your book by just editing the URLs file.
count=1
cat "$urls" | while read url
do
# Generate output filename with zero-padded count (e.g., 001.html)
output=$(printf "posts/%03d.html" $count)
# Download the URL using singlefile
echo "Downloading $url to $output"
single-file "$url" "$output"
# Increment the counter
count=$((count+1))
done
# Workable script option:
# readable -q --low-confidence keep -C "$url" -o "$output" 2>&1 > /dev/null
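single-file above is the SingleFile CLI; if it is not installed, it can usually be added via npm (package name taken from the SingleFile project):
npm install -g single-file-cli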
pandoc font embedding (--epub-embed-font):
pandoc -o data2.epub posts/*.html --epub-subdirectory=VictorMono --epub-embed-font='VictorMono/VictorMono.ttf'
- it requires the font folder to exist
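A sketch of preparing that folder (the source path for the .ttf is an assumption; adjust to wherever the font lives):
mkdir -p VictorMono
cp ~/.local/share/fonts/VictorMono-Regular.ttf VictorMono/VictorMono.ttf   # source path is hypothetical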
pdf to toc
fd -e pdf -x sh -c 'pdfcpu bookmark list "$1" > "${1%.pdf}.md"' _ {}
7z and epub
To update the EPUB file using 7-Zip, you can use the following command line:
7z u sicp.epub mimetype html META-INF content.opf index.xhtml LICENSE toc.xhtml
This command will update the sicp.epub file with the listed files and directories[1][5]. Here's a breakdown of the command:
- 7z: invokes the 7-Zip command-line tool
- u: the update command, which adds new files and updates existing ones in the archive
- sicp.epub: the name of your EPUB file
- The remaining items are the files and directories to be updated or added
Note that the mimetype file should be the first file in the EPUB archive and should not be compressed[2]. To ensure this, you may want to use a two-step process:
- Add the mimetype file without compression:
7z a -tzip sicp.epub mimetype -mx0
- Then add or update the remaining files:
7z u sicp.epub html META-INF content.opf index.xhtml LICENSE toc.xhtml
This approach ensures that the mimetype file is the first in the archive and uncompressed, which is important for EPUB validation[2][7].
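To confirm the repacked file is still valid, epubcheck can be run on it (assuming epubcheck is installed):
epubcheck sicp.epub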
Citations:
[1] https://gist.github.com/spajak/9b8b8a46f7ebf8390f5943c3fe73195e
[2] https://stackoverflow.com/questions/18824773/zip-an-epub-using-a-batch-file-and-7zip
[3] https://superuser.com/questions/908184/zip-epub-using-7zip-and-issue-with-exlude-file
[4] https://www.baeldung.com/linux/7z-tutorial
[5] https://www.tecmint.com/7zip-command-examples-in-linux/
[6] https://www.7-zip.org/faq.html
[7] https://kdpcommunity.com/s/question/0D5f400000FHiZDCA1/anybody-know-how-to-unzip-an-epub-and-i-assume-zip-it-back-up
[8] https://py7zr.readthedocs.io/en/latest/user_guide.html
pptx to pdf with slide transitions
replacing pdf font
PyMuPDF-Utilities/font-replacement at master · pymupdf/PyMuPDF-Utilities