• Re: :keywords metadata item?

    From Julien ÉLIE@21:1/5 to All on Sat Sep 10 12:19:28 2022
    XPost: news.software.nntp

    Hello Franck,

    I am half-tempted to advertise ":keywords" instead of Keywords in the
    next release so as to comply with the protocol (the keywords are not
    present in the article itself), and to properly distinguish
    "HDR :keywords" from "HDR Keywords" results, the same way "HDR Lines"
    returns the real header field if present.
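
The distinction between a header name and a metadata item in HDR can be sketched as follows. This is only an illustration, not INN's code: `hdr_value` and its two dictionaries are hypothetical names, and the rule that a leading colon selects server-computed metadata follows the NNTP convention discussed in this thread.

```python
from typing import Dict, Optional


def hdr_value(field: str,
              headers: Dict[str, str],
              metadata: Dict[str, str]) -> Optional[str]:
    """Return what a hypothetical server would report for HDR <field>."""
    if field.startswith(":"):
        # Metadata item: computed by the server, never read from the article.
        # None stands for "item not supported" in this sketch.
        return metadata.get(field.lower())
    # Header name: looked up case-insensitively in the article itself.
    wanted = field.lower()
    for name, value in headers.items():
        if name.lower() == wanted:
            return value
    return ""  # header field absent from the article


# Hypothetical article: it has a real Lines header but no Keywords header.
article_headers = {"Subject": "test", "Lines": "12"}
server_metadata = {":lines": "14", ":keywords": "nntp,overview"}

real_lines = hdr_value("Lines", article_headers, server_metadata)
computed_lines = hdr_value(":lines", article_headers, server_metadata)
real_keywords = hdr_value("Keywords", article_headers, server_metadata)
generated_keywords = hdr_value(":keywords", article_headers, server_metadata)
```

With this split, "HDR Keywords" only ever reflects what the poster wrote, while "HDR :keywords" reflects what the server generated.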

    I think it's the right choice even if I don't see how this header can be
    useful in any way (because the words are totally unusable).

    Indeed, there are either too many words or truncated words (in
    languages that are not fully ASCII).
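
A small sketch of why ASCII-only word extraction produces truncated and extra words. This is not INN's keyword algorithm; `ascii_only_words` and `unicode_words` are hypothetical helpers that only illustrate the effect described above.

```python
import re
from typing import List


def ascii_only_words(text: str) -> List[str]:
    # Any non-ASCII letter acts as a separator, so accented words are split.
    return re.findall(r"[A-Za-z]+", text)


def unicode_words(text: str) -> List[str]:
    # [^\W\d_] matches any Unicode letter (word characters minus digits
    # and underscore), so accented words stay whole.
    return re.findall(r"[^\W\d_]+", text)


sample = "r\u00e9sultats plus justes"      # "résultats plus justes"
broken = ascii_only_words(sample)           # ['r', 'sultats', 'plus', 'justes']
whole = unicode_words(sample)               # ['résultats', 'plus', 'justes']
```

One accented letter turns a single word into two useless fragments, which gives both symptoms at once: more words, and truncated ones.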


    Perhaps it would be better to encode the words rather than remove the non-ASCII characters?

    Having MIME-encoded words in this overview field could indeed be a
    solution, or a UTF-8 encoding. However, it would imply extra complexity
    in the server code to handle that encoding: find out the encoding of the
    word (using the right Content-Type in headers or multipart messages...),
    and convert it for the overview field.
    It is a bit of work, and besides I am unsure whether any clients
    currently use Keywords when present; otherwise I guess the problem of
    internationalized messages would already have popped up!
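
The extra work described above (find the charset, decode, then re-encode for the overview field) might look roughly like this. It is a sketch using Python's standard `email` package, not a proposal for INN's C code; `words_from_article` and `mime_encode_word` are hypothetical names, multipart messages are not handled, and how encoded words would be laid out in the field is left open.

```python
import re
from email import message_from_bytes
from email.header import Header, decode_header
from typing import List


def words_from_article(raw: bytes, limit: int = 5) -> List[str]:
    msg = message_from_bytes(raw)
    # Charset comes from Content-Type; fall back to US-ASCII if absent.
    charset = msg.get_content_charset() or "us-ascii"
    # decode=True undoes base64 / quoted-printable; None for multipart.
    body = msg.get_payload(decode=True) or b""
    text = body.decode(charset, errors="replace").lower()
    words: List[str] = []
    for word in re.findall(r"[^\W\d_]{3,}", text):
        if word not in words:
            words.append(word)
    return words[:limit]


def mime_encode_word(word: str) -> str:
    # ASCII words pass through; others become an RFC 2047 encoded word.
    return word if word.isascii() else Header(word, "utf-8").encode()


raw_article = (b"Content-Type: text/plain; charset=iso-8859-1\r\n"
               b"Content-Transfer-Encoding: 8bit\r\n"
               b"\r\n"
               b"Des r\xe9sultats plus justes\r\n")
words = words_from_article(raw_article)
utf8_field = ",".join(words).encode("utf-8")      # direct UTF-8 option
encoded = [mime_encode_word(w) for w in words]    # MIME-encoded option
```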


    As a side note, only having ASCII chars as is currently done in the
    keywords generation is compatible with a possible use of future
    MIME-encoded words or direct UTF-8, if we ever do that in a
    standardized :keywords metadata item.
    So, in order to comply with the NNTP protocol, :keywords would already
    be a better choice (instead of Keywords), and I could just leave a note
    in the INN documentation of keywords generation that it is still
    experimental code, essentially usable for messages using only ASCII
    characters as other characters are stripped by the algorithm.

    --
    Julien ÉLIE

    "I like wrong calculations because they give more accurate results."
    (Jean Arp)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Russ Allbery@21:1/5 to iulius@nom-de-mon-site.com.invalid on Mon Sep 19 15:32:56 2022
    XPost: news.software.nntp

    Julien ÉLIE <iulius@nom-de-mon-site.com.invalid> writes:

    INN (and perhaps other servers) has the possibility to provide keywords
    in overview data. It advertises "Keywords:full" in response to LIST
    OVERVIEW.FMT and then adds "Keywords: a,b,c,d" in OVER responses. No
    Keywords header field is added to the articles, and the contents of an
    existing one are kept at the beginning of the generated one in overview.

    I'm wondering whether:
    - it shouldn't be advertised as ":keywords" instead of "Keywords:full" as
    the header field is not in the original article.

    I believe that's correct. Keywords:full would imply that it's a copy of
    a header in the article named Keywords.
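
How a client could tell the two advertisements apart when reading LIST OVERVIEW.FMT can be sketched like this. The parser below is a minimal illustration with hypothetical names (`OverviewField`, `parse_overview_fmt_line`); it only encodes the convention discussed here: a leading colon marks a metadata item, and a ":full" suffix marks a header field copied with its name.

```python
from typing import NamedTuple


class OverviewField(NamedTuple):
    name: str          # header name, or metadata item including the colon
    is_metadata: bool  # True when the server computes the value itself
    full: bool         # True when the field is sent as "Name: value"


def parse_overview_fmt_line(line: str) -> OverviewField:
    line = line.strip()
    if line.startswith(":"):
        # Metadata item such as ":bytes" or ":keywords".
        return OverviewField(line.lower(), True, False)
    # Header field such as "Subject:" or "Xref:full".
    name, _, suffix = line.partition(":")
    return OverviewField(name, False, suffix.lower() == "full")


as_header = parse_overview_fmt_line("Keywords:full")
as_metadata = parse_overview_fmt_line(":keywords")
```

A client seeing `as_header` is entitled to assume the article contains a Keywords header field; `as_metadata` makes no such promise.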

    Astonishingly, we don't seem to have set up an IANA registry for metadata
    names in LIST OVERVIEW.FMT, which would have been the normal way of doing
    it, so I think we can just use :keywords without telling anybody.

    I am unsure though if such a change would break implementations that look
    for it in overview (but is there any such news client? ...)

    My guess is that no one uses this. It's been in INN for eons, but I think
    it was added in the early days of more open development by one person who
    was enthused about it. It tends to go untouched for long periods of time
    until someone else finds it, thinks it might solve some problems for them,
    and sends in a few fixes. My subjective impression is that most of the
    people who try it end up not continuing to use it. I've periodically
    unbroken it or done some refactoring at various points, but just because
    the code was there, not because anyone was asking for it.

    It's kind of an interesting idea, but text tokenization is a lot more
    complicated than that code, as you're discovering with its total lack of
    understanding of anything other than English. If the body is
    base64-encoded (or even quoted-printable), I suspect it will similarly
    collapse like a house of cards, since I doubt it understands MIME
    structure. And let's not even mention trying to tokenize languages that
    are farther afield from English.
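
The base64 concern can be illustrated with a short sketch. It makes no claim about what INN's code actually does with such a body; `naive_words` and `decoded_words` are hypothetical helpers comparing tokenization of the raw body with tokenization after undoing the Content-Transfer-Encoding.

```python
import base64
import re
from email import message_from_string
from typing import List

WORD = re.compile(r"[A-Za-z]{3,}")


def naive_words(raw_article: str) -> List[str]:
    # Tokenize whatever follows the header block, ignoring MIME entirely.
    _, _, body = raw_article.partition("\n\n")
    return WORD.findall(body)


def decoded_words(raw_article: str) -> List[str]:
    msg = message_from_string(raw_article)
    # decode=True undoes base64 or quoted-printable (None for multipart).
    payload = msg.get_payload(decode=True) or b""
    charset = msg.get_content_charset() or "us-ascii"
    return WORD.findall(payload.decode(charset, errors="replace"))


body_text = "Keyword generation needs real text"
sample_article = (
    "Content-Type: text/plain; charset=us-ascii\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    + base64.b64encode(body_text.encode("ascii")).decode("ascii")
    + "\n"
)
junk = naive_words(sample_article)      # fragments of the base64 alphabet
useful = decoded_words(sample_article)  # the words actually written
```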

    I'm honestly not sure it's worth the effort of trying to fix, although of
    course now that we've talked about it someone will probably wonder if it
    will solve their problems and experiment with it again. :)

    --
    Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

    Please post questions rather than mailing me directly.
    <https://www.eyrie.org/~eagle/faqs/questions.html> explains why.
