[FIX] base: correctly parse utf8 html module descriptions

Apparently `lxml.html.document_fromstring` (and possibly other `lxml.html` loaders) parses byte-strings as latin1 regardless of their actual encoding, maybe because python2, maybe because there's a super legacy html4 parser underlying it. Either way that means ever since loading `static/description/index.html` files was added 10 years ago (4bf6a7ea) `_get_desc` has been loading these files in latin1 rather than the utf8 most people would expect. Add an explicit decoding phase to try and load html description files in UTF8. Fall back to latin1 in case there are description files which are genuinely in latin1, or even just some random-ass broken stuff which very much isn't utf8 (the extended-ascii encodings -- of which latin1 is one -- will happily accept and mangle any input as every byte value is valid, utf8 is a lot more structured). Closes #127846 closes odoo/odoo#133708 Signed-off-by: Xavier Morel (xmo) <xmo@odoo.com>

[FIX] base: correctly parse utf8 html module descriptions
4dbc3b00 · Xavier Morel · 038354cb · 4dbc3b00
Commit 4dbc3b00 authored 1 year ago by Xavier Morel
--- a/odoo/addons/base/models/ir_module.py
+++ b/odoo/addons/base/models/ir_module.py
@@ -178,7 +178,11 @@ class Module(models.Model):
            if path:
                with tools.file_open(path, 'rb') as desc_file:
                    doc = desc_file.read()
-                    html = lxml.html.document_fromstring(doc)
+                    try:
+                        contents = doc.decode('utf-8')
+                    except UnicodeDecodeError:
+                        contents = doc
+                    html = lxml.html.document_fromstring(contents)
                    for element, attribute, link, pos in html.iterlinks():
                        if element.get('src') and not '//' in element.get('src') and not 'static/' in element.get('src'):
                            element.set('src', "/%s/static/description/%s" % (module.name, element.get('src')))