I don't like browsing huge single HTML pages of documentation. Does
anyone know of a program or script (preferably for Linux) that can
scan a big software manual's single HTML page and automatically
break it up according to the contents section and the corresponding
anchor links?
Basically I want something to turn this: http://www.gnu.org/software/coreutils/manual/coreutils.html
into this:
http://www.gnu.org/software/coreutils/manual/html_node/index.html
But without the Texinfo source like GNU software (usually) uses.
Just from the HTML itself. I also want it to output static HTML, so
no solutions using JavaScript or browser add-ons.
One option might be to use csplit to break it up at common section
separator patterns, then a simple script renames the new files
according to their heading text. But I'd like to have HTML
navigation links, ideally including converting existing anchor
links inside the document.
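As a rough illustration of that split-and-rename idea, here is a minimal Python sketch. It uses re.split rather than csplit, assumes each section starts at an <h2> tag, and the names slugify and split_at_headings are made up for the example. It adds no navigation links and does not touch anchor links, so it only covers the simple half of the problem.

```python
import re


def slugify(text):
    """Turn heading text into a safe file name stem."""
    slug = re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
    return slug or "section"


def split_at_headings(html, tag="h2"):
    """Split html before each <tag> and return (filename, chunk) pairs."""
    # Zero-width lookahead: the heading stays at the start of its chunk.
    chunks = re.split(rf"(?=<{tag}\b)", html, flags=re.IGNORECASE)
    heading_re = re.compile(
        rf"<{tag}\b[^>]*>(.*?)</{tag}>", re.IGNORECASE | re.DOTALL
    )
    result = []
    for number, chunk in enumerate(chunks):
        if not chunk:
            continue  # empty piece when the page starts with a heading
        match = heading_re.match(chunk)
        if match:
            title = re.sub(r"<[^>]+>", "", match.group(1))  # drop inline tags
        else:
            title = "index"  # text before the first heading
        result.append((f"{number:02d}-{slugify(title)}.html", chunk))
    return result
```

Each returned chunk would still need a proper <html> wrapper before being written to disk.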
A prime target would be the Raspberry Pi configuration
documentation, which has convinced me of the merit of multi-page
docs by how confusing it has become for me since they switched to a single-page layout: https://www.raspberrypi.com/documentation/computers/configuration.html
> A prime target would be the Raspberry Pi configuration
> documentation, which has convinced me of the merit of multi-page
> docs by how confusing it has become for me since they switched to a
> single-page layout:
> https://www.raspberrypi.com/documentation/computers/configuration.html
I've never used it myself, but I seem to remember talking to someone in
the past who used htmldoc to do this kind of thing?
https://www.msweet.org/htmldoc/
In comp.os.linux.misc Bud Frede <frede@mouse-potato.com> wrote:
>> A prime target would be the Raspberry Pi configuration
>> documentation, which has convinced me of the merit of multi-page
>> docs by how confusing it has become for me since they switched to a
>> single-page layout:
>> https://www.raspberrypi.com/documentation/computers/configuration.html
> I've never used it myself, but I seem to remember talking to someone in
> the past who used htmldoc to do this kind of thing?
> https://www.msweet.org/htmldoc/
Thanks for that. I've been to that website before, but it's only in
the README that it mentions HTML as an output format as well as an
input format. In fact the "htmlsep" format option does exactly what
I wanted:
mkdir rpi_config
htmldoc -t htmlsep -d ./rpi_config 'https://www.raspberrypi.com/documentation/computers/configuration.html'
It did pass through a few broken relative links that pointed to
other pages at the website, but I think I can live with that.
> A prime target would be the Raspberry Pi configuration
> documentation, which has convinced me of the merit of multi-page
> docs by how confusing it has become for me since they switched to a
> single-page layout:
> https://www.raspberrypi.com/documentation/computers/configuration.html
No, sorry, I do not know a tool to do the splitting automatically. I
would simply use an HTML editor and save chunks manually. LO could
possibly do it. Otherwise, I would try Composer in SeaMonkey.
Interesting. I find multi-page docs confusing; I prefer single page. I
can do searches on the whole thing.
Consequently, the proposed tool HTMLdoc excludes virtually everything:
While it currently does not support many things in "the modern web"
such as Cascading Style Sheets (CSS), forms, full Unicode, and Emoji
characters, ...
In comp.infosystems.www.misc Helmut Richter <hr.usenet@email.de> wrote:
> Consequently, the proposed tool HTMLdoc excludes virtually everything:
> While it currently does not support many things in "the modern web"
> such as Cascading Style Sheets (CSS), forms, full Unicode, and Emoji
> characters, ...
Well no, it doesn't exclude most software documentation because for
whatever reason the HTML in much of that has remained relatively
sane. Some CSS is creeping in, but a tool that ignores it still
produces clear text with some formatting. So far I've tested it
with two documentation pages published in 2023 and it's understood
the HTML fine (except for that out-of-page link problem in the RPi
doc). Plus often I'm looking at docs published 10-20 years ago
anyway.
Out-of-page links are trivial: replace each link "#xyz" by "subpage#xyz".
It is known to which subpage each link belongs, at least if you go over
the text in two passes. This is a procedure which I apply to all my web
pages, which are written as one document, and split into pieces later.
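The two-pass procedure described above could be sketched in Python roughly as follows. This is not the script actually used for those pages; the function names and the crude regex matching of id/name attributes are assumptions for illustration, and `pages` is assumed to be a dict mapping subpage file names to HTML that has already been split.

```python
import re

# Crude attribute matching; a real HTML parser would be more robust.
ID_RE = re.compile(r'\b(?:id|name)="([^"]+)"')
LOCAL_HREF_RE = re.compile(r'href="#([^"]+)"')


def build_anchor_map(pages):
    """Pass 1: record which subpage defines each anchor."""
    anchor_to_page = {}
    for filename, html in pages.items():
        for anchor in ID_RE.findall(html):
            anchor_to_page.setdefault(anchor, filename)
    return anchor_to_page


def rewrite_local_links(pages):
    """Pass 2: turn href="#xyz" into href="subpage#xyz" where needed."""
    anchor_to_page = build_anchor_map(pages)

    def fix(match, current):
        anchor = match.group(1)
        target = anchor_to_page.get(anchor)
        if target is None or target == current:
            return match.group(0)  # unknown anchor or same page: keep as is
        return f'href="{target}#{anchor}"'

    return {
        filename: LOCAL_HREF_RE.sub(lambda m, cur=filename: fix(m, cur), html)
        for filename, html in pages.items()
    }
```

Links whose target sits on the same subpage are left as plain "#xyz", so they keep working whatever the file ends up being called.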
On Wed, 2 Aug 2023, Helmut Richter wrote:
> Out-of-page links are trivial: replace each link "#xyz" by "subpage#xyz".
> It is known to which subpage each link belongs, at least if you go over
> the text in two passes. This is a procedure which I apply to all my web
> pages, which are written as one document, and split into pieces later.
It might be interesting to see an example of the TOC (table of contents)
of such a split article (https://hhr-m.de/sw-fibel/contents.html). It
contains all anchors in the whole article, which are possible but not
necessarily used (except from the TOC, of course) link targets. The link
structure might be still better visible if you look into the source code
of that web page which is fairly readable.
In comp.os.linux.misc Helmut Richter <hr.usenet@email.de> wrote:
> On Wed, 2 Aug 2023, Helmut Richter wrote:
>> Out-of-page links are trivial: replace each link "#xyz" by "subpage#xyz".
>> It is known to which subpage each link belongs, at least if you go over
>> the text in two passes. This is a procedure which I apply to all my web
>> pages, which are written as one document, and split into pieces later.
> It might be interesting to see an example of the TOC (table of contents)
> of such a split article (https://hhr-m.de/sw-fibel/contents.html). It
> contains all anchors in the whole article, which are possible but not
> necessarily used (except from the TOC, of course) link targets. The link
> structure might be still better visible if you look into the source code
> of that web page which is fairly readable.
I think you misunderstood the problem. Perhaps I should have
explained that I would prefer it to rewrite relative links to other
webpages as absolute links.
As it is, a link like this:
<a href="/documentation/computers/processors.html#bcm2835">BCM2835</a>
From here: https://www.raspberrypi.com/documentation/computers/configuration.html
Doesn't work when converted unless the processors.html page is also
saved locally. Seeing as the program saw the source URL, I would
have liked it to be smart enough to turn such relative links into
absolute links when the link destination is another webpage.
This has fixed many of those relative links which had a directory
path:
for page in *.html; do sed -i \
's/<a href="\//<a href="https:\/\/www.raspberrypi.com\//g' $page; done
Pre-processing the page to rewrite relative links to other pages in
the same directory when the path isn't in the href, before running
HTMLDOC, would fix the rest.
Such as this:
<a href="config_txt.html#video-options">
It's not a major complaint.
On 8/4/23 8:07 AM, Computer Nerd Kev wrote:
> This has fixed many of those relative links which had a directory
> path:
> for page in *.html; do sed -i \
> 's/<a href="\//<a href="https:\/\/www.raspberrypi.com\//g' $page; done
> Pre-processing the page to rewrite relative links to other pages in
> the same directory when the path isn't in the href, before running
> HTMLDOC, would fix the rest.
> Such as this:
> <a href="config_txt.html#video-options">
> It's not a major complaint.
Ummm ... are you trying to do this STATICALLY, on
pre-existing HTML, or DYNAMICALLY, as users actually
access the pages ???
For the first case, a little Python will do wonders.
Identify tags, where you need to insert the absolute
parts of the paths, do it. Python is great with text
strings.
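A minimal sketch of that kind of Python, assuming the page was saved from the configuration.html URL given earlier in the thread and that every href is double-quoted. It relies on urllib.parse.urljoin from the standard library; the name absolutize_links is made up for the example. In-page "#..." links are deliberately left alone, on the assumption that this runs on the single page before it is split.

```python
import re
from urllib.parse import urljoin

# Assumption: the URL the single page was originally fetched from.
BASE_URL = "https://www.raspberrypi.com/documentation/computers/configuration.html"

# Only matches double-quoted href attributes inside <a ...> tags.
HREF_RE = re.compile(r'(<a\b[^>]*?\bhref=")([^"]*)(")', re.IGNORECASE)


def absolutize_links(html, base_url=BASE_URL):
    """Rewrite relative hrefs to absolute URLs, leaving in-page anchors alone."""

    def fix(match):
        prefix, href, suffix = match.groups()
        if not href or href.startswith("#"):
            return match.group(0)  # in-page anchor or empty href: keep as is
        # urljoin handles both "/dir/page.html" and "page.html" forms and
        # returns already-absolute URLs unchanged.
        return prefix + urljoin(base_url, href) + suffix

    return HREF_RE.sub(fix, html)


print(absolutize_links('<a href="config_txt.html#video-options">'))
# prints <a href="https://www.raspberrypi.com/documentation/computers/config_txt.html#video-options">
```

Because urljoin resolves against the page's own URL, this covers both the links with a directory path and the same-directory ones in a single pass.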
In comp.os.linux.misc 23k.304 <23k304@bfxw9.net> wrote:
> On 8/4/23 8:07 AM, Computer Nerd Kev wrote:
>> This has fixed many of those relative links which had a directory
>> path:
>> for page in *.html; do sed -i \
>> 's/<a href="\//<a href="https:\/\/www.raspberrypi.com\//g' $page; done
>> Pre-processing the page to rewrite relative links to other pages in
>> the same directory when the path isn't in the href, before running
>> HTMLDOC, would fix the rest.
>> Such as this:
>> <a href="config_txt.html#video-options">
>> It's not a major complaint.
> Ummm ... are you trying to do this STATICALLY, on
> pre-existing HTML, or DYNAMICALLY, as users actually
> access the pages ???
Statically, as stated in the first post.
Also no users besides me, browsing locally with file:// URLs.
> For the first case, a little Python will do wonders.
> Identify tags, where you need to insert the absolute
> parts of the paths, do it. Python is great with text
> strings.
Well Sed will do that too with a smarter regex (using selection
brackets), and I don't like Python for reasons we've already
argued about. As it happens there are so few links without an
absolute path in that doc (I thought there were none until browsing
around after that first try) that I figured it wasn't worth
bothering with. Other docs may sufficiently motivate me eventually.
On Sat, 5 Aug 2023, Computer Nerd Kev wrote:
> In comp.os.linux.misc 23k.304 <23k304@bfxw9.net> wrote:
>> On 8/4/23 8:07 AM, Computer Nerd Kev wrote:
>>> This has fixed many of those relative links which had a directory
>>> path:
>>> for page in *.html; do sed -i \
>>> 's/<a href="\//<a href="https:\/\/www.raspberrypi.com\//g' $page; done
>>> Pre-processing the page to rewrite relative links to other pages in
>>> the same directory when the path isn't in the href, before running
>>> HTMLDOC, would fix the rest.
>>> Such as this:
>>> <a href="config_txt.html#video-options">
>>> It's not a major complaint.
>> Ummm ... are you trying to do this STATICALLY, on
>> pre-existing HTML, or DYNAMICALLY, as users actually
>> access the pages ???
> Statically, as stated in the first post.
Yes, there is nothing dynamic in it.
I had still another idea: leave the links as is, and map them to the
correct URLs by a rewrite in the server. The table of necessary rewrite
statements is static but can be modified if necessary. I discarded that
because it is too complex for too little benefit, if any.
> Also no users besides me, browsing locally with file:// URLs.
Of course, the rewrite is not an option in this case as there is no
server.
In comp.infosystems.www.misc Computer Nerd Kev <not@telling.you.invalid> wrote:
> A prime target would be the Raspberry Pi configuration
> documentation, which has convinced me of the merit of multi-page
> docs by how confusing it has become for me since they switched to a
> single-page layout:
> https://www.raspberrypi.com/documentation/computers/configuration.html
It doesn't answer the general question, but note that the
documentation is generated from AsciiDoc, and the source files can be
found in their repo: https://github.com/raspberrypi/documentation
Presumably there's a way to build AsciiDoc to generate multi-page HTML,
but if not you could just read the .adoc files - GitHub makes a fair
stab at rendering them.