The HTML escape character &nbsp; represents a non-breaking space

Time:2020-7-31

1. Reference

Beautiful Soup and Unicode Problems

Detailed explanation

What does unicodedata.normalize('NFKD', string) actually do?
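As a rough illustration of the question above, here is a minimal sketch (Python 3, standard library only; the sample string is only an example) of what NFKD normalization does to a non-breaking space. Compatibility decomposition maps U+00A0 to an ordinary space:

```python
import unicodedata

sample = "No (64\xa0KB)"  # contains U+00A0 NO-BREAK SPACE

# NFKD applies compatibility decomposition; U+00A0 decomposes to U+0020.
normalized = unicodedata.normalize("NFKD", sample)

print(repr(sample))      # 'No (64\xa0KB)'
print(repr(normalized))  # 'No (64 KB)'
```

Note that NFKD also changes other characters (for example, it splits accented letters into a base letter plus a combining mark), so it is a blunter tool than a targeted replace.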

Scrapy: Select tag with non-breaking space with xpath


>>> selector.xpath(u'''
...     //p[normalize-space()]
...        [not(contains(normalize-space(), "\u00a0"))]
... ''')

What does the normalize-space() function do?

In [244]: sel.css('.content')
Out[244]: [<Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' content ')]" data=u'<p ...'>]
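To make the behavior of normalize-space() concrete, the following sketch approximates the XPath 1.0 definition in plain Python 3. The helper name normalize_space is hypothetical and this is not how Scrapy or lxml implement it. XPath 1.0 treats only space, tab, carriage return, and line feed as whitespace, so a non-breaking space survives normalization:

```python
import re


def normalize_space(text):
    """Approximate XPath 1.0 normalize-space() for a Python string.

    Only space, tab, CR and LF count as whitespace here; U+00A0 does not.
    """
    collapsed = re.sub("[ \t\r\n]+", " ", text)  # collapse whitespace runs
    return collapsed.strip(" ")                  # trim leading/trailing space


print(repr(normalize_space("  a \n\t b  ")))  # 'a b'
print(repr(normalize_space(" \xa0 ")))        # '\xa0'
```

This would explain why a predicate such as //p[normalize-space()] still matches a paragraph that contains nothing but non-breaking spaces, and why the extra not(contains(normalize-space(), "\u00a0")) test is used in the query above.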

Inspecting the element in the browser shows it as &nbsp;

In the page source it is written as &#160;:


<tr>
<td style="background: #FFD; color: black; vertical-align: middle; text-align: center;">memory</td>
<td>= Limited by available memory &#160;&#160;</td>
<td style="background:#F99;vertical-align:middle;text-align:center;">No (64&#160;KB)</td>
<td>= Some limit less than available memory (give max size if known)</td>
</tr>
</table>
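The named form &nbsp; and the numeric form &#160; are two spellings of the same character. A small sketch (Python 3, standard library only; the sample strings are illustrative) that decodes both and shows the resulting code point and its UTF-8 bytes:

```python
import html

named = html.unescape("No (64&nbsp;KB)")
numeric = html.unescape("No (64&#160;KB)")

print(repr(named))       # 'No (64\xa0KB)'
print(named == numeric)  # True

nbsp = html.unescape("&nbsp;")
code_point = "%04X" % ord(nbsp)         # Unicode code point in hex
utf8_hex = nbsp.encode("utf-8").hex()   # UTF-8 bytes in hex

print(code_point)  # 00A0
print(utf8_hex)    # c2a0
```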

The actual bytes transmitted, in hex:

The Unicode representation of a non-breaking space is u'\xa0'; when saved as UTF-8, it is encoded as '\xc2\xa0'.

In [211]: for tr in response.xpath('//table[8]/tr[2]'):
     ...:     print [u''.join(i.xpath('.//text()').extract()) for i in tr.xpath('./*')]
     ...:

[u'memory', u'= Limited by available memory \xa0\xa0', u'No (64\xa0KB)', u'= Some limit less than available memory (give max size if known)']

In [212]: u'No (64\xa0KB)'.encode('utf-8')
Out[212]: 'No (64\xc2\xa0KB)'

In [213]: u'No (64\xa0KB)'.encode('utf-8').decode('utf-8')
Out[213]: u'No (64\xa0KB)'

If you save the data as CSV and open the file directly in Excel, the text is garbled, apparently because Excel opens it as ANSI (GBK) by default, and u'\xa0' is outside the range GBK can encode. Notepad and Notepad++ detect UTF-8 automatically and display the file correctly.

As a workaround, open the CSV file in Notepad and save it with ANSI encoding; Excel then opens it normally. Characters outside the GBK range are replaced with '?'.
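The two effects described above can be reproduced in a short sketch (Python 3), assuming that "ANSI" on the machine in question means the GBK code page, as the text suggests. Reading UTF-8 bytes as GBK yields a wrong character, and encoding U+00A0 to GBK fails unless a replacement policy is given:

```python
text = "No (64\xa0KB)"
utf8_bytes = text.encode("utf-8")

# Reading UTF-8 bytes as GBK does not raise; it silently yields wrong text.
garbled = utf8_bytes.decode("gbk")
print(repr(garbled))

# U+00A0 has no GBK encoding, so a strict encode raises an error.
try:
    text.encode("gbk")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)  # False

# With errors="replace" the character becomes '?', as when saving as ANSI.
replaced = text.encode("gbk", errors="replace")
print(replaced)  # b'No (64?KB)'
```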

3. How to deal with it

.extract_first().replace(u'\xa0', u' ').strip().encode('utf-8', 'replace')
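The same idea as a standalone sketch in Python 3, applied to the cell values extracted earlier. The helper name clean_text is hypothetical, and the final GBK step is only an assumption about where the data might end up:

```python
def clean_text(raw):
    """Replace non-breaking spaces with ordinary spaces, then trim both ends."""
    return raw.replace("\xa0", " ").strip()


# Sample cells copied from the extracted table row shown earlier.
cells = ["memory", "= Limited by available memory \xa0\xa0", "No (64\xa0KB)"]
cleaned = [clean_text(cell) for cell in cells]
print(cleaned)

# After cleaning, these cells encode to GBK without needing the '?' fallback.
gbk_bytes = [cell.encode("gbk") for cell in cleaned]
print(gbk_bytes)
```

Replacing the character before encoding keeps a readable space in the output instead of a '?'.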
