Common data cleaning methods

Time:2021-12-30

1、 Common data problems

Including but not limited to:
Completeness of data – required attributes are missing, for example a person record without gender, native place, or age
Uniqueness of data – the same record appears more than once, for example when data from different sources overlaps
Authority of data – several sources report the same indicator, and the values differ
Legitimacy of data – values contradict common sense, for example an age over 150
Consistency of data – indicators from different sources have different names but the same meaning, or the same name but different meanings

2、 Missing value processing

Detection method: df.isnull()

  • Supplement from other information, for example deriving gender, birthplace, date of birth, and age from an ID number.
  • Interpolate. For example, if a few points are missing from a time series, use the mean of the values before and after; if many points are missing, use smoothing. Common interpolation methods include mean interpolation and median interpolation.
  • Records that are truly incomplete have to be excluded from the analysis. Do not delete them outright, though; they may be useful in the future.
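
The detection and interpolation steps above can be sketched in pandas as follows. This is a minimal illustration, not a prescribed method: the DataFrame, its column name `value`, and the sample numbers are made up, and linear interpolation stands in for "the mean of the values before and after" (the two are equivalent only for a single-point gap between evenly spaced observations).

```python
import numpy as np
import pandas as pd

# Hypothetical daily time series with two single-point gaps
df = pd.DataFrame(
    {"value": [10.0, np.nan, 14.0, 15.0, np.nan, 19.0]},
    index=pd.date_range("2021-01-01", periods=6, freq="D"),
)

# Detect: count the missing values per column
n_missing = int(df["value"].isnull().sum())

# Fill each gap from its neighbours (linear interpolation)
filled_neighbors = df["value"].interpolate(method="linear")

# Alternative: fill with the column mean or median
filled_mean = df["value"].fillna(df["value"].mean())
filled_median = df["value"].fillna(df["value"].median())

# Set incomplete rows aside instead of discarding them
incomplete = df[df["value"].isnull()]
complete = df.dropna(subset=["value"])
```

Keeping `incomplete` as a separate frame follows the advice above: the rows are excluded from the analysis but remain available later.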

3、 Duplicate value processing

Detection method: judge by the primary key
df.duplicated()
Parameters of drop_duplicates:
Parameter subset
subset specifies the columns used to identify duplicates. All columns are used by default
Parameter keep
keep can be "first" or "last", indicating whether the first or the last occurrence is kept. The default is "first". Passing False drops every duplicated row
Parameter inplace
inplace controls whether the DataFrame is modified directly (True) or left unchanged while a de-duplicated copy is returned (False). The default is False
De-duplication method: df.drop_duplicates()

  • De-duplicate by primary key
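
A short sketch of de-duplicating by primary key, assuming a hypothetical table whose key column is `id` and whose rows come from two made-up sources, A and B:

```python
import pandas as pd

# Hypothetical records; id 2 arrives from both sources
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "name": ["Ann", "Bob", "Bob", "Cy"],
    "source": ["A", "A", "B", "B"],
})

# Detect: True marks the second and later occurrences of each id
dup_mask = df.duplicated(subset=["id"])

# Remove: keep the first or the last occurrence of each id
deduped_first = df.drop_duplicates(subset=["id"], keep="first")
deduped_last = df.drop_duplicates(subset=["id"], keep="last")
```

Because inplace is left at its default, `df` itself is unchanged and each call returns a new DataFrame.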

4、 Legitimacy issues

The data may contain clearly implausible values, such as an age over 150

  • First, assess the reliability of the data source
  • Remove values that are unreasonable
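
One way to apply such a rule is a simple range check. The sketch below assumes a hypothetical `age` column; the upper bound of 150 comes from the example above, while the lower bound of 0 is an added assumption.

```python
import pandas as pd

# Hypothetical records with two implausible ages
df = pd.DataFrame({"name": ["Ann", "Bob", "Cy"], "age": [34, 180, -2]})

MAX_AGE = 150  # upper bound taken from the example in the text
MIN_AGE = 0    # assumed lower bound, not stated in the text

# between() is inclusive on both ends by default
valid_mask = df["age"].between(MIN_AGE, MAX_AGE)

valid = df[valid_mask]      # rows that pass the check
rejected = df[~valid_mask]  # rows set aside for review
```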

5、 Authority issues

Different units across data sources: normalize to a common unit
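
As an illustration of bringing values to a common unit, the sketch below assumes a hypothetical table in which each row records a height together with its unit; the column names and the conversion table are made up for the example.

```python
import pandas as pd

# Hypothetical heights reported in different units by different sources
df = pd.DataFrame({
    "height": [172.0, 1.5, 165.0],
    "unit": ["cm", "m", "cm"],
})

# Conversion factors to the common unit (centimetres)
TO_CM = {"cm": 1.0, "m": 100.0}

# Map each row's unit to its factor, then convert
df["height_cm"] = df["height"] * df["unit"].map(TO_CM)
```

A unit missing from `TO_CM` maps to NaN, so unknown units surface as missing values rather than being converted silently.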
