Pandas tutorial

Time:2021-8-3

By gon ç alo guimar ã es Gomes
Compile VK
Source: towards Data Science

As one of the most popular and used tools familiar and used by every data scientist, pandas library is excellent in data operation, analysis and visualization

To help you accomplish this task and be more confident in Python coding, I created this tutorial with some of the most commonly used functions and methods on pandas. I really hope it works for you.

catalogue

  1. Import library

  2. Import / export data

  3. Display data

  4. Basic information: quick view data

  5. Basic statistics

  6. Adjust data

  7. Boolean index: loc

  8. Boolean index: iloc

  9. Basic processing data

We will study the data set of Titanic for two main reasons: (1) it is likely that you are already familiar with it( 2) It’s very small and simple

The Titanic dataset can be downloaded here:https://bit.ly/33tOJ2S

Import library

For our purposes, the “Pandas” library must be imported

import pandas as pd

Import / export data

The Titanic dataset is specified as data.

a) Using read_ CSV import the CSV file. You should add a data separator to the file.

data = pd.read_csv("file_name.csv", sep=';')

b) Using read_ Excel reads data from Excel files.

data = pd.read_excel('file_name.xls')

c) Export the data frame to a CSV file using to_ csv

data.to_csv("file_name.csv", sep=';', index=False)

d) Use “to”_ Excel “exports the data frame to an excel file.

data.to_excel("file_name.xls´)

Display data

a) Printing the first n lines. If it is not given, 5 lines are displayed by default.

data.head()

b) Print the last “n” line. Next, the last 7 lines are displayed.

data.tail(7)

Basic information: quick view data

a) Displays the dimensions of the dataset: total rows and columns.

data.shape

(891,12)

b) Displays the variable type.

data.dtypes
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

c) Displays the variable types in ascending order.

data.dtypes.sort_values(ascending=True)
PassengerId      int64
Survived         int64
Pclass           int64
SibSp            int64
Parch            int64
Age            float64
Fare           float64
Name            object
Sex             object
Ticket          object
Cabin           object
Embarked        object
dtype: object

d) Count variables by type.

data.dtypes.value_counts()
object     5
int64      5
float64    2
dtype: int64

e) Count each type in ascending order.

data.dtypes.value_counts(ascending=True)
float64    2
int64      5
object     5
dtype: int64

f) Check the number of survivors and non survivors in absolute values.

data.Survived.value_counts()
0    549
1    342
Name: Survived, dtype: int64

g) The scale of the inspection feature, expressed as a percentage.

data.Survived.value_counts() / data.Survived.value_counts().sum()

Same as the following:

data.Survived.value_counts(normalize=True)
0    0.616162
1    0.383838
Name: Survived, dtype: float64

h) Check the scale of the feature, expressed as a percentage and rounded.

data.Survived.value_counts(normalize=True).round(decimals=4) * 100
0    61.62
1    38.38
Name: Survived, dtype: float64

i) Evaluate the dataset for missing values.

data.isnull().values.any()
True

j) Use isnull() to get the number of missing values.

data.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

k) Use notnull() to get the number of existing values.

data.notnull().sum()
PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64

l) Percentage (%) of missing values by variable.

data.isnull().sum() / data.isnull().shape[0] * 100

Equivalent to

data.isnull().mean() * 100
PassengerId     0.000000
Survived        0.000000
Pclass          0.000000
Name            0.000000
Sex             0.000000
Age            19.865320
SibSp           0.000000
Parch           0.000000
Ticket          0.000000
Fare            0.000000
Cabin          77.104377
Embarked        0.224467
dtype: float64

m) Round off (2 in the example).

(data.isnull().sum() / data.isnull().shape[0] * 100).round(decimals=2)

Equivalent to

(data.isnull().mean() * 100).round(decimals=2)
PassengerId     0.00
Survived        0.00
Pclass          0.00
Name            0.00
Sex             0.00
Age            19.87
SibSp           0.00
Parch           0.00
Ticket          0.00
Fare            0.00
Cabin          77.10
Embarked        0.22
dtype: float64

n) In addition: print the results with combined text.

print("The percentage of 'Age' is missing values:",(data.Age.isnull().sum() / data.Age.isnull().shape[0] * 100).round(decimals=2), "%")
The percentage of 'Age' is missing values: 19.87 %
print(f"The feature 'Age' has {data.Age.isnull().sum()} missing values")
The feature 'Age' has 177 missing values
print("'Age' has {} and 'Cabin' has {} missing values".format(data.Age.isnull().sum(), data.Cabin.isnull().sum()))
'Age' has 177 and 'Cabin' has 687 missing values

o) Information about shapes, variable types, and missing values.

data.info()

p) Overview of specific characteristics (in the following example, “gender” and “age”).

data[['Sex','Age']].info()

Basic statistics

a) The describe method only gives the basic statistical information of the data. By default, it only calculates the master statistics of numerical data. The results are represented by pandas data frames.

data.describe()

b) Add other non-standard values, such as variance.

describe = data.describe()

describe.append(pd.Series(data.var(), name='variance'))

c) Display classification data.

data.describe(include=["O"])

Equivalent to

data.describe(exclude=['float64','int64'])

Equivalent to

data.describe(include=[np.object])

d) By passing the parameter include =’All ‘, both numeric and non numeric data will be displayed.

data.describe(include='all')

e) Don’t forget to transpose the data frame by adding. T at the end. This is also a very useful technique

data.describe(include='all').T

f) Percentile 1%, 5%, 95%, 99%. As expected, it will only calculate the statistics of digital features.

data.quantile(q=[.01, .05, .95, .99])

g) Summary statistics

  • Displays unique values for some features.
data.Embarked.unique()
array(['S', 'C', 'Q', nan], dtype=object)
  • Calculates the sum of unique values for a feature.
data.Sex.nunique()
2
  • Calculate total value
data.count()
PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64
  • Maximum value of some characteristics
data.Age.max()
80.0
  • Minimum value of some characteristics
data.Age.min()
0.42
  • Average value of some characteristics
data.Age.mean()
29.69911764705882
  • Median value of some features
data.Age.median()
28.0
  • The 99th quantile of some characteristics
data.Age.quantile(q=[.99])
0.99    65.87
Name: Age, dtype: float64
  • Standard deviation of some features
data.Age.std()
14.526497332334044
  • Variance of some characteristics
data.Age.var()
211.0191247463081

h) Extra

Question 1 – show the two most common values of the classification feature “embanked”.

data[‘Embarked’].value_counts().head(2)
S    644
C    168
Name: Embarked, dtype: int64

Question 2 – what is the highest percentage of “embanked”?

top_unique = data['Embarked'].value_counts(normalize=True)[0]

print(f'{top_unique:.2%}')
72.44%

i) Correlation between variables. As expected, it will only display statistics for numerical data.

Data. Corr() Pearson correlation by default

J) The correlation between the selected variable (in the example, “survived”) and other variables.

correlation = data.corr()

correlation.Survived.sort_ values().sort_ Values (ascending = false) # ordered values
Survived       1.000000
Fare           0.257307
Parch          0.081629
PassengerId   -0.005007
SibSp         -0.035322
Age           -0.077221
Pclass        -0.338481
Name: Survived, dtype: float64

Adjust data

a) Lists the name of the column.

data.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype='object')

b) Rename some columns (in the following example, change “passengerid” to “Id”).

data.rename(columns = {data.columns[0]:'id'}, inplace=True)

Equivalent to

data.rename(columns = {'PassengerId':'id'}, inplace=True)
Index(['id', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype='object')

c) Rename multiple columns (passengerid, pclass, and sibsp).

data.rename(columns = {'PassengerId':'Passenger_Id', 'Pclass':'P_Class', 'SibSp':'Sib_Sp'}, inplace=True)
Index(['Passenger_Id', 'Survived', 'P_Class', 'Name', 'Sex', 'Age', 'Sib_Sp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype='object')

d) Replace underscores with dots by list generation (only applicable to “passenger. ID”, “p.class” and “SIB. SP”).

data.columns = [x.lower().replace('.', '') for x in data.columns]
Index(['passengerid', 'survived', 'pclass', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare', 'cabin', 'embarked'], dtype='object')

e) Lowercase characters and delete points (applied to ‘passenger. ID’, ‘p.class’ and’ SIB. SP ‘).

data.columns = [x.lower().replace('.', '') for x in data.columns]
Index(['passengerid', 'survived', 'pclass', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare', 'cabin', 'embarked'], dtype='object')

f) Capitalize column names.

data.columns = [x.capitalize() for x in data.columns]
Index(['Passengerid', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'Sibsp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype='object')

Boolean index: loc

data.loc[, ], select by column name

a) Select the row.

data.loc[[3]]

b) Select the row array.

data.loc[6:8]

c) Select a few lines.

data.loc[[7,28,39]]

d) Select a row from the name, age, sex, and survived columns.

data.loc[[7], ['Name', 'Age', 'Sex', 'Survived']]

e) Select multiple rows from multiple columns.

data.loc[[7,28,39], ['Name', 'Age', 'Sex','Survived']]

f) Under certain conditions, use LOC to select a specific value. In this case, passengers older than or equal to 10 years old are selected from line 4 to line 10.

data.loc[4:10, ['Age']] >= 10

g) Under certain conditions, use LOC to select a specific value. In this case, select passengers in c123 cabin from the first 5 rows.

data.loc[:4, ['Cabin']] == 'C123'

Boolean index: iloc

data.iloc[, ]Select rows and columns by number

a) Select row 4 of the dataset.

data.iloc[3]

b) Select a row array from all columns.

data.iloc[6:12]

c) Select several rows from all columns.

data.iloc[[7,28,39],:]

d) Select a row from the name, age, sex, and survived columns.

data.iloc[[7], [3,5,4,1]]

e) Select multiple rows from multiple columns.

data.iloc[[7,28,39], [3,5,4,1]]

f) Select multiple rows to form a column sequence.

data.iloc[[7,28,39], 3:10]

g) Select a different value.

  • From line 6 to line 12, the last column.
data.iloc[6:13, -1]
  • All rows in columns 3 and 6.
data.iloc[:, [3,6]]
  • 7. Lines 28 and 39, from column 3 to column 6.
data.iloc[[7,28,39], 3:7]
  • The last 20 rows of the last column.
data.iloc[-20:, -1:]

Basic processing data

  • Axis = 0 indicates the row. If it is not specified, the default is axis = 0.

  • Axis = 1 indicates the column.

a) (delete Nan value).

data.isnull().values.any()Is there any missing data?

True

If it is not assigned to a (New) variable, you should specify inplace = true for the change to take effect.

Data.dropna (axis = 0, inplace = true) # deletes Nan from the row

Data. Isnull(). Values. Any() # is there any missing data?
False

b) Delete column

data.drop(columns=['PassengerId', 'Name'], axis=1).head()

c) value_ Counts can also display Nan values.

data.Age.value_counts(dropna=False)
NaN      177
24.00     30
22.00     27
18.00     26
28.00     25
        ... 
36.50      1
55.50      1
66.00      1
23.50      1
0.42       1
Name: Age, Length: 89, dtype: int64

d) Replace missing values

  • Create a new data frame and copy the data to maintain the integrity of the original data.
new_df = data.copy()

Calculate the average age:

new_df.Age.mean()
29.69911764705882
  • Populate the Nan with the average of the data and assign the result to a new column.
new_df['Age_mean'] = new_df.Age.fillna(new_df.Age.mean())

Median age

new_df.Age.median()
28.0

Fill any Nan with the median of the data and assign the result to a new column.

new_df['Age_median'] = new_df.Age.fillna(new_df.Age.median())
  • Verification results
new_df[['Age', 'Age_mean', 'Age_median']].isnull().sum()
Age           177
Age_mean        0
Age_median      0
dtype: int64

Lines 106 to 110 are displayed to verify the interpolation of the two Nan examples (lines 107 and 109).

new_df[['Age', 'Age_mean', 'Age_median']][106:110]

It’s over

conclusion

I sincerely hope you find this tutorial useful because it can help you write code development. I will update it in the future and link it to other Python tutorials.

Original link:https://towardsdatascience.com/pandas-made-easy-the-guide-i-81834f075893

Welcome to panchuang AI blog:
http://panchuang.net/

Official Chinese document of sklearn machine learning:
http://sklearn123.com/

Welcome to panchuang blog resources summary station:
http://docs.panchuang.net/