Commit 4dbc3b00 authored by Xavier Morel

[FIX] base: correctly parse utf8 html module descriptions


Apparently `lxml.html.document_fromstring` (and possibly other
`lxml.html` loaders) parses byte-strings as latin1 regardless of their
actual encoding, maybe because python2, maybe because there's a super
legacy html4 parser underlying it.

Either way, that means that ever since loading
`static/description/index.html` files was added 10 years
ago (4bf6a7ea), `_get_desc` has been
loading these files as latin1 rather than the utf8 most people would
expect.

Add an explicit decoding phase to try and load html description files
as utf8. Fall back to latin1 in case there are description files which
are genuinely in latin1, or even just some random-ass broken stuff
which very much isn't utf8 (the extended-ascii encodings, of which
latin1 is one, will happily accept and mangle any input as every
byte value is valid, whereas utf8 is a lot more structured).
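
As a rough illustration of the reasoning above, here is a minimal standalone sketch of the "try utf8, fall back to latin1" idea. The helper name `decode_description` is hypothetical and not part of the patch. It also differs from the actual change in one respect: on failure the patch hands the raw bytes back to lxml, while this sketch decodes them as latin1 explicitly so it can run without lxml.

```python
def decode_description(raw: bytes) -> str:
    """Decode description bytes as utf8, falling back to latin1.

    Hypothetical helper for illustration only; not the function
    touched by this commit.
    """
    try:
        # utf8 is structured, so invalid byte sequences raise an error.
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # latin1 maps every byte value to a code point, so this
        # never fails, but it may mangle non-latin1 input.
        return raw.decode('latin-1')


utf8_bytes = 'é'.encode('utf-8')      # b'\xc3\xa9'
latin1_bytes = 'é'.encode('latin-1')  # b'\xe9'

# Decoding utf8 bytes as latin1 is what produces the mangled text.
mangled = utf8_bytes.decode('latin-1')

print(decode_description(utf8_bytes))    # é
print(decode_description(latin1_bytes))  # é
print(mangled)                           # Ã©
```

The order matters: trying latin1 first would always succeed, so the utf8 branch would never be reached.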

Closes #127846

closes odoo/odoo#133708

Signed-off-by: Xavier Morel (xmo) <xmo@odoo.com>
parent 038354cb
@@ -178,7 +178,11 @@ class Module(models.Model):
 if path:
     with tools.file_open(path, 'rb') as desc_file:
         doc = desc_file.read()
-        html = lxml.html.document_fromstring(doc)
+        try:
+            contents = doc.decode('utf-8')
+        except UnicodeDecodeError:
+            contents = doc
+        html = lxml.html.document_fromstring(contents)
         for element, attribute, link, pos in html.iterlinks():
             if element.get('src') and not '//' in element.get('src') and not 'static/' in element.get('src'):
                 element.set('src', "/%s/static/description/%s" % (module.name, element.get('src')))