Forum: >>> Magnum BBS <<<

Re: UTF_16 question

From Stefan Ram@21:1/5 to jak on Sat Apr 27 19:13:15 2024

jak <nospam@please.ty> wrote or quoted:

I read it, both with encoding='utf_16_be' and
with 'utf_16_le' without errors but in the last case the bytes are
inverted.

I think the order of the octets (bytes) is exactly the difference
between these two encodings, so your observation isn't really
surprising. The computer can't report an error here since it
can't infer the correct encoding from the file data. It's like
that koan: "A bit has the value 1. What does that mean?".
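
A minimal sketch of that point, assuming Python 3 and a two-letter sample string of my own choosing: the same BOM-less bytes decode without an exception under both byte orders, and only one of the two results is the intended text.

```python
# "Hi" encoded as UTF-16 little-endian, without a BOM: bytes 48 00 69 00.
data = "Hi".encode("utf_16_le")

as_le = data.decode("utf_16_le")   # the right guess
as_be = data.decode("utf_16_be")   # the wrong guess, but still no exception

print(repr(as_le))
print([hex(ord(c)) for c in as_be])   # code points with the two bytes swapped
```

Both calls succeed because 0x0048 and 0x4800 are each valid code points, so the decoder has nothing to complain about.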

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jak@21:1/5 to All on Sat Apr 27 20:45:35 2024

Hi everyone,
one thing that I do not understand is happening to me: I have some text
files in different encodings, namely utf_32_le, utf_32_be, utf_16_le and
utf_16_be, all of them without a BOM. With the utf_32_xx files I have no
problem, but with the utf_16_xx ones I do. If I read a utf_16_le encoded
file with encoding='utf_16_le', it works as expected; with
encoding='utf_16_be' I can also read it without any error, even though
the data I receive has its bytes inverted. The same thing happens with
the utf_16_be encoded file: I can read it with both encoding='utf_16_be'
and 'utf_16_le' without errors, but in the latter case the bytes are
inverted. What did I not understand? What am I doing wrong?

thanks in advance

From jak@21:1/5 to All on Sun Apr 28 02:50:08 2024

Stefan Ram wrote:

jak <nospam@please.ty> wrote or quoted:

I read it, both with encoding='utf_16_be' and
with 'utf_16_le' without errors but in the last case the bytes are
inverted.

I think the order of the octets (bytes) is exactly the difference
between these two encodings, so your observation isn't really
surprising. The computer can't report an error here since it
can't infer the correct encoding from the file data. It's like
that koan: "A bit has the value 1. What does that mean?".

Understood. They are just 2 bytes and there is no difference between
them.

Thank you.

From Richard Damon@21:1/5 to All on Mon Apr 29 12:41:48 2024

That is why the BOM was created. A lot of files can be “correctly” read as either UTF-16-LE or UTF-16-BE, because most 16-bit code units are valid in either byte order. Unless the wrong encoding happens to hit something invalid (most likely something that looks like a surrogate without its matching partner), there isn’t an error in reading the file. The BOM character was specifically designed to be an invalid code if read with the wrong byte order (if you ignore the possibility of the file having a NUL right after the BOM).
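
As a rough illustration of both points, here is a small Python 3 sketch. The sample strings are my own, and it only shows what CPython's standard codecs do with them: a BOM lets the generic "utf_16" codec pick the byte order, a byte-swapped BOM comes out as the noncharacter U+FFFE (Python does not raise for it, so a caller would have to check for it), and a wrong guess can fail when a swapped code unit lands in the surrogate range.

```python
import codecs

text = "log entry"

# With a BOM in front, the generic "utf_16" codec works out the byte order.
le_with_bom = codecs.BOM_UTF16_LE + text.encode("utf_16_le")
be_with_bom = codecs.BOM_UTF16_BE + text.encode("utf_16_be")
decoded_le = le_with_bom.decode("utf_16")
decoded_be = be_with_bom.decode("utf_16")

# Read with the wrong byte order, the BOM U+FEFF shows up as U+FFFE,
# which Unicode reserves as a noncharacter.
swapped_bom = codecs.BOM_UTF16_LE.decode("utf_16_be")

# U+00D8 is stored as 00 D8 in big-endian. Read as little-endian that is
# 0xD800, a high surrogate with no partner, so decoding fails.
try:
    "\u00d8l".encode("utf_16_be").decode("utf_16_le")
    wrong_order_failed = False
except UnicodeDecodeError:
    wrong_order_failed = True

print(decoded_le == text, decoded_be == text)
print(hex(ord(swapped_bom)), wrong_order_failed)
```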

If you know the files likely contain a lot of “ASCII” characters, then you might be able to detect that you got it wrong from seeing a lot of 0xXX00 characters and few 0x00XX characters, but that doesn’t normally create an “error”.
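
One way to sketch that heuristic in Python 3 is to count where the zero bytes fall. The function name `guess_utf16_byte_order` is made up for this example, and the guess is only meaningful for BOM-less data that is mostly ASCII.

```python
def guess_utf16_byte_order(data: bytes) -> str:
    """Guess the codec name for BOM-less, mostly-ASCII UTF-16 data."""
    # ASCII characters are 00 XX in big-endian and XX 00 in little-endian,
    # so the zero bytes pile up at even or at odd offsets respectively.
    zeros_at_even = data[0::2].count(0)
    zeros_at_odd = data[1::2].count(0)
    # A tie (for example empty input) falls back to little-endian,
    # which is an arbitrary choice made for this sketch.
    return "utf_16_be" if zeros_at_even > zeros_at_odd else "utf_16_le"


sample = "Hello, world"
print(guess_utf16_byte_order(sample.encode("utf_16_be")))
print(guess_utf16_byte_order(sample.encode("utf_16_le")))
```

This never raises; it just reports which byte order makes the text look more like ASCII, so it can be wrong for text that is mostly non-Latin.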

From jak@21:1/5 to All on Wed May 1 19:07:02 2024

Richard Damon wrote:

That is why the BOM was created. A lot of files can be “correctly” read as either UTF-16-LE or UTF-16-BE, because most 16-bit code units are valid in either byte order. Unless the wrong encoding happens to hit something invalid (most likely something that looks like a surrogate without its matching partner), there isn’t an error in reading the file. The BOM character was specifically designed to be an invalid code if read with the wrong byte order (if you ignore the possibility of the file having a NUL right after the BOM).

If you know the files likely contain a lot of “ASCII” characters, then you might be able to detect that you got it wrong from seeing a lot of 0xXX00 characters and few 0x00XX characters, but that doesn’t normally create an “error”.

Thanks to you too for the reply. I was actually looking for a way to
distinguish "utf16le" texts from "utf16be" ones. Unfortunately, whoever
created this log file archive thought that the BOM was not important and
so omitted it. Now they want to switch to "utf8" and keep the earlier
files as well. Fortunately I can be sure that the text of the log files
is in some European language, so after converting a file to "utf8" I
check that most of the bytes are below 0x7F; if they are not, I convert
the file again with the other byte order ("le" instead of "be", or vice
versa). The strategy seems to be working. In the future, by writing the
files in "utf8", they will no longer have problems like this.
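
For anyone reading later, a rough Python 3 sketch of that strategy. This is not the actual script; the function name `utf16_without_bom_to_utf8`, the 0.5 threshold and the `first_guess` parameter are assumptions made for the example, and it tests for bytes below 0x80 (the ASCII range).

```python
def utf16_without_bom_to_utf8(data: bytes,
                              first_guess: str = "utf_16_le",
                              threshold: float = 0.5) -> bytes:
    """Convert BOM-less UTF-16 text to UTF-8, trying both byte orders."""
    if not data:
        return b""
    other = "utf_16_be" if first_guess == "utf_16_le" else "utf_16_le"
    for encoding in (first_guess, other):
        try:
            utf8 = data.decode(encoding).encode("utf-8")
        except UnicodeError:
            continue  # this byte order hit an invalid sequence
        # European-language text should come out as mostly ASCII bytes.
        ascii_share = sum(b < 0x80 for b in utf8) / len(utf8)
        if ascii_share >= threshold:
            return utf8
    raise ValueError("neither byte order produced mostly-ASCII text")


raw = "Error: disk full".encode("utf_16_be")
print(utf16_without_bom_to_utf8(raw))
```

The check relies on the assumption stated above that the logs are in a European language; text in other scripts would need a different test.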
