Introduction to Python Data Analysis: pandas Basics Summary (I)


I. Series

Series: pandas' spear (a column or row of a data table, an observation vector, a one-dimensional array...)

import numpy as np

import pandas as pd

Series1 = pd.Series(np.random.randn(4))

print Series1,type(Series1) 

print Series1.index

print Series1.values

Output results:

0   -0.676256

1    0.533014

2   -0.935212

3   -0.940822

dtype: float64 <class 'pandas.core.series.Series'>

Int64Index([0, 1, 2, 3], dtype='int64')

[-0.67625578  0.53301431 -0.93521212 -0.94082195]
  • np.random.randn() draws samples from the standard normal distribution.

Series supports Boolean filtering, just like NumPy arrays.

print Series1>0 

print Series1[Series1>0]

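The output is as follows (this listing comes from a separate run, so the random values differ from those above; it shows the Series itself, its index, and its values):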

0 0.030480

1 0.072746

2 -0.186607

3 -1.412244

dtype: float64 <class 'pandas.core.series.Series'>

Int64Index([0, 1, 2, 3], dtype='int64')

[ 0.03048042 0.07274621 -0.18660749 -1.41224432]

Note that the logical expression itself evaluates to True or False for each element. To get the actual values, index with it in the form X[y].

Broadcasting is also supported, of course.

I am not yet sure exactly what broadcasting is, so let's look at an example.
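The two steps can be separated to make them visible. This is a minimal sketch assuming Python 3 and a recent pandas (unlike the Python 2 listings in this post); the fixed values are made up so that the result is reproducible.

```python
import pandas as pd

# Hypothetical fixed values instead of random ones.
s = pd.Series([0.03, 0.07, -0.19, -1.41])

mask = s > 0          # a Boolean Series with the same index as s
positive = s[mask]    # keeps only the rows where mask is True

print(mask.tolist())      # [True, True, False, False]
print(positive.tolist())  # [0.03, 0.07]
```

The filtered Series keeps the original labels (0 and 1 here) rather than renumbering them.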

print Series1*2 

print Series1+5

The output results are as follows:

0 0.06096

1 0.145492

2 -0.373215 

3 -2.824489 

dtype: float64 

0 5.030480 

1 5.072746 

2 4.813393 

3 3.587756 

dtype: float64
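In this context, broadcasting means that the scalar is applied to every element of the Series. A small sketch with made-up values, assuming Python 3 and a recent pandas:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])

doubled = s * 2   # the scalar 2 is applied to every element
shifted = s + 5   # likewise for addition

print(doubled.tolist())  # [2.0, 4.0, 6.0]
print(shifted.tolist())  # [6.0, 7.0, 8.0]
```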

Universal functions are supported as well.

numpy.frompyfunc(func, nin, nout) returns a universal function: func is an ordinary Python function, nin is the number of input arguments, and nout is the number of objects that func returns.

By using row labels on the Series itself, instead of creating a two-column data table, we can easily tell which part is data and which part is metadata.

What I take this to mean is that a Series should stay a single column where possible: rather than adding a second column, the data is identified through the index.
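A short sketch of both kinds of universal function, assuming Python 3 and a recent NumPy and pandas. The helper clip_at_five is a hypothetical function invented for this illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0])

# A built-in universal function applied to a Series keeps the index.
roots = np.sqrt(s)

# clip_at_five is a hypothetical helper used only for illustration.
def clip_at_five(x):
    return min(x, 5.0)

# Wrap the Python function: one input argument, one returned object.
clip_ufunc = np.frompyfunc(clip_at_five, 1, 1)
clipped = clip_ufunc(s.values)   # the result is an object-dtype array

print(roots.tolist())
print(clipped.tolist())
```

Functions built with frompyfunc return object arrays, so a conversion such as astype(float) is usually needed before further numeric work.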

Series2 = pd.Series(Series1.values,index=['norm_'+unicode(i) for i in xrange(4)])

print Series2,type(Series2)

print Series2.index

print type(Series2.index)

print Series2.values

The output is as follows. As you can see, only the style of the index values has changed; no second column was created.

norm_0   -0.676256

norm_1    0.533014

norm_2   -0.935212

norm_3   -0.940822

dtype: float64 <class 'pandas.core.series.Series'>

Index([u'norm_0', u'norm_1', u'norm_2', u'norm_3'], dtype='object')

<class 'pandas.core.index.Index'>

[-0.67625578  0.53301431 -0.93521212 -0.94082195]

Although the rows are ordered, the data can still be accessed by row label:

(Of course this is not exactly like an OrderedDict, because index labels may repeat. Duplicate row labels are discouraged, but that does not mean they cannot be used.)

print Series2[['norm_0','norm_3']]

As you can see, reading data really does use the X[y] form. Here it is X[[y]] because two values are being read: the requested index labels are stored in a list, and the list is then used for the lookup. The output is as follows:

norm_0   -0.676256

norm_3   -0.940822

dtype: float64

Another example:

print 'norm_0' in Series2

print 'norm_6' in Series2

Output results:

True

False

A membership test is a logical expression, so the output is a Boolean value.
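Label lookup and the membership test can be tried together. A minimal sketch assuming Python 3 and a recent pandas, with made-up values:

```python
import pandas as pd

s = pd.Series([-0.68, 0.53, -0.94, -0.94],
              index=['norm_' + str(i) for i in range(4)])

picked = s[['norm_0', 'norm_3']]   # a list of labels returns a sub-Series
print(picked.index.tolist())

has_first = 'norm_0' in s   # membership is tested against the index labels
has_sixth = 'norm_6' in s
print(has_first, has_sixth)
```

Note that `in` checks the index labels, not the values, much like `in` on a dict checks the keys.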

Defining a Series from a dict, or from an OrderedDict whose keys do not repeat, means there is no need to worry about duplicate row labels:

Series3_Dict = {"Japan":"Tokyo","S.Korea":"Seoul","China":"Beijing"}

Series3_pdSeries = pd.Series(Series3_Dict)

print Series3_pdSeries

print Series3_pdSeries.values

print Series3_pdSeries.index

Output results:

China Beijing

Japan Tokyo

S.Korea Seoul

dtype: object

['Beijing' 'Tokyo' 'Seoul']

Index([u'China', u'Japan', u'S.Korea'], dtype='object')

As the output above shows, the entries come out in a different order from the input.

Want the Series stored in an order you specify? Pass an index list; missing values are no problem.

Series4_IndexList = ["Japan","China","Singapore","S.Korea"]

Series4_pdSeries = pd.Series( Series3_Dict ,index = Series4_IndexList)

print Series4_pdSeries

print Series4_pdSeries.values

print Series4_pdSeries.index

print Series4_pdSeries.isnull()

print Series4_pdSeries.notnull()

The output follows the order defined in the list.
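A compact version of the same idea, assuming Python 3 and a recent pandas. A label that is missing from the dict ("Singapore" here) simply becomes a missing value.

```python
import pandas as pd

capitals = {"Japan": "Tokyo", "S.Korea": "Seoul", "China": "Beijing"}
order = ["Japan", "China", "Singapore", "S.Korea"]

# The index list decides both the order and which labels exist.
s = pd.Series(capitals, index=order)

print(s.index.tolist())
print(s.isnull().tolist())
```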

Metadata at the level of the whole Series: name

When the Series and its index each have a name, later data joins become more convenient!

To me this feels like the role a column name plays. An example:

print Series4_pdSeries.name

print Series4_pdSeries.index.name

Obviously, both outputs are None, because we haven't specified any name yet!

Series4_pdSeries.name = "Capital Series"

Series4_pdSeries.index.name = "Nation"

print Series4_pdSeries

Output results:


Japan Tokyo

China Beijing

Singapore NaN

S.Korea Seoul

Name: Capital Series, dtype: object
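One place where the names pay off is when the Series is turned into a table. A small sketch assuming Python 3 and a recent pandas; the use of reset_index is my own choice of illustration, not something the source material shows.

```python
import pandas as pd

s = pd.Series({"Japan": "Tokyo", "China": "Beijing"})
print(s.name, s.index.name)   # both are None until assigned

s.name = "Capital Series"
s.index.name = "Nation"

# The two names become the column names of the resulting DataFrame.
df = s.reset_index()
print(df.columns.tolist())
```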

A "dictionary"? Not quite: index labels can repeat, although that is not recommended.

Series5_IndexList = ['A','B','B','C']

Series5 = pd.Series(Series1.values,index = Series5_IndexList)

print Series5

print Series5[['B','A']]

Output results:

A 0.030480

B 0.072746

B -0.186607

C -1.412244

dtype: float64

B 0.072746

B -0.186607

A 0.030480

dtype: float64

We can see that Series5['B'] returns two values, so try not to repeat index labels.
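The practical problem with repeated labels is that the return type depends on the label. A sketch with made-up values, assuming Python 3 and a recent pandas:

```python
import pandas as pd

s = pd.Series([0.03, 0.07, -0.19, -1.41], index=['A', 'B', 'B', 'C'])

unique_flag = s.index.is_unique   # a quick check for repeated labels
single = s['A']                   # a label that occurs once gives a scalar
repeated = s['B']                 # a repeated label gives a Series

print(unique_flag)
print(single)
print(repeated.tolist())
```

Checking is_unique before label-based lookups is a cheap way to avoid this surprise.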

II. DataFrame

DataFrame: pandas' war hammer (a data table, a two-dimensional array)

An ordered collection of Series, as convenient as R's data.frame.

If you think about it, most data formats can be represented as a DataFrame.

Defining one from a NumPy two-dimensional array, a file, or a database: the data is there, just don't forget the column names.

dataNumPy = np.asarray([('Japan','Tokyo',4000),('S.Korea','Seoul',1300),('China','Beijing',9100)])

DF1 = pd.DataFrame(dataNumPy,columns=['nation','capital','GDP'])


Here, columns in DataFrame presumably means the column names. Now look at the printed result: isn't it comfortable to read? Excel style.

Equal-length column data stored in a dictionary (JSON-like): unfortunately, dictionary keys are unordered.

dataDict = {'nation':['Japan','S.Korea','China'],'capital':['Tokyo','Seoul','Beijing'],'GDP':[4900,1300,9100]}

DF2 = pd.DataFrame(dataDict)


The output shows that the columns are out of order!

GDP    capital    nation

0 4900 Tokyo Japan

1 1300 Seoul S.Korea

2 9100 Beijing China

PS: I was too lazy to take a screenshot, so the table has no borders here.

Defining a DataFrame from another DataFrame: ah, my obsessive-compulsive streak is acting up!

DF21 = pd.DataFrame(DF2,columns=['nation','capital','GDP'])


Obviously, this uses DF2 to define DF21, and changes the order of the columns by specifying columns.

DF22 = pd.DataFrame(DF2,columns=['nation','capital','GDP'],index = [2,0,1])


Obviously, this defines both the order of columns and the order of index.

nation capital GDP

2 China Beijing 9100

0 Japan Tokyo 4900

1 S.Korea Seoul 1300
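The whole sequence (dict, then explicit column and row order) fits in a few lines. This sketch assumes Python 3 and a recent pandas, where a dict already keeps its insertion order, so the explicit columns argument matters less than it did in the listings above.

```python
import pandas as pd

data = {'nation': ['Japan', 'S.Korea', 'China'],
        'capital': ['Tokyo', 'Seoul', 'Beijing'],
        'GDP': [4900, 1300, 9100]}

df = pd.DataFrame(data)

# Re-deriving with explicit columns and index fixes both orders at once.
df_ordered = pd.DataFrame(df, columns=['nation', 'capital', 'GDP'],
                          index=[2, 0, 1])

print(df_ordered)
```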

Selecting a column from a DataFrame? Two approaches (exactly the same as in JavaScript!)

OMG, I had almost forgotten JS syntax. Now I remember: an object's attribute can be written obj.x, and obj[x] works too.

  • The '.' form easily conflicts with other reserved keywords.

  • The '[]' form is the safest way to write it.
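A sketch of the conflict, assuming Python 3 and a recent pandas. The column named 'size' is a made-up example chosen because DataFrame already has an attribute with that name.

```python
import pandas as pd

df = pd.DataFrame({'nation': ['Japan', 'China'], 'size': [1, 2]})

# Both forms return the same column when the name is free.
same = df.nation.equals(df['nation'])

# 'size' is also a DataFrame attribute, so '.' does not return the column.
attr = df.size      # total number of cells, an integer
col = df['size']    # the actual column

print(same, attr, col.tolist())
```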

Selecting rows from a DataFrame? There are at least two methods:

  • Methods 1 and 2:

print DF22[0:1] # slicing returns a DataFrame

print DF22.ix[0] # ix selects by the corresponding index label and returns a Series

Output results:

 nation  capital   GDP

2  China  Beijing  9100

nation     Japan

capital    Tokyo

GDP         4900

Name: 0, dtype: object
  • Method 3: the ultimate approach, like NumPy slicing: iloc

print DF22.iloc[0,:] # the first argument is the row, the second is the column; here, row 0 and all columns

print DF22.iloc[:,0] # following the description above: all rows, column 0

Output the results and verify:

nation       China

capital    Beijing

GDP           9100

Name: 2, dtype: object

2      China

0      Japan

1    S.Korea

Name: nation, dtype: object
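For readers on a current installation: ix was removed in pandas 1.0, and loc is the label-based replacement. The sketch below assumes Python 3 and a recent pandas and rebuilds a frame shaped like DF22 to contrast label-based and position-based access.

```python
import pandas as pd

df = pd.DataFrame({'nation': ['China', 'Japan', 'S.Korea'],
                   'capital': ['Beijing', 'Tokyo', 'Seoul'],
                   'GDP': [9100, 4900, 1300]},
                  index=[2, 0, 1])

first_row_frame = df[0:1]      # positional slice, returns a DataFrame
by_label = df.loc[0]           # the row whose index label is 0, a Series
by_position = df.iloc[0, :]    # the row at position 0, all columns
first_column = df.iloc[:, 0]   # all rows, the column at position 0

print(by_label['nation'], by_position['nation'])
```

Because the index is [2, 0, 1], label 0 and position 0 point at different rows, which is exactly why the two accessors are kept separate.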

Columns can be added dynamically, but not with the '.' form, only with '[]'.

An example to illustrate:

DF22['population'] = [1600,130,55]


Output results:

nation    capital    GDP    population

2    China    Beijing    9100    1600

0    Japan    Tokyo    4900    130

1    S.Korea    Seoul    1300    55
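To see why the '.' form fails, here is a sketch assuming Python 3 and a recent pandas. The attribute name area and its values are made up for the demonstration.

```python
import warnings

import pandas as pd

df = pd.DataFrame({'nation': ['China', 'Japan'], 'GDP': [9100, 4900]})

df['population'] = [1600, 130]   # '[]' creates a real column

# Assigning to a new attribute name only sets a Python attribute;
# pandas emits a UserWarning and no column is created.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    df.area = [960, 38]          # 'area' is a hypothetical name

print(df.columns.tolist())
```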

Index: row-level index

Index: pandas' joker card for data manipulation (row-level index)

A row-level index:

  • is metadata

  • may be generated from real data, so it can also be regarded as data

  • can be composed of multiple indexes, that is, multiple columns

  • can be swapped with the column names, or stacked and unstacked, to achieve the effect of an Excel PivotTable

Index comes in four... no wait, many varieties. Some important index types include:

  • pd.Index (ordinary)

  • Int64Index

  • MultiIndex

  • DatetimeIndex (indexed by timestamps)

  • PeriodIndex (indexed by time periods)

Define an ordinary index directly; it looks just like an ordinary Series.

index_names = ['a','b','c']

Series_for_Index = pd.Series(index_names)

print pd.Index(index_names)

print pd.Index(Series_for_Index)

Output results:

Index([u'a', u'b', u'c'], dtype='object')

Index([u'a', u'b', u'c'], dtype='object')

Unfortunately, an Index is immutable. Remember: immutable! An example follows (I am leaving a gap here, since I don't fully understand this yet...).

index_names = ['a','b','c'] 

index0 = pd.Index(index_names) 

print index0.get_values() 

index0[2] = 'd'

The output results are as follows:

['a' 'b' 'c']


TypeError                                 Traceback (most recent call last)

<ipython-input-36-f34da0a8623c> in <module>()

      2 index0 = pd.Index(index_names)

      3 print index0.get_values()

----> 4 index0[2] = 'd'

C:\Anaconda\lib\site-packages\pandas\core\index.pyc in __setitem__(self, key, value)


   1056     def __setitem__(self, key, value):

-> 1057         raise TypeError("Indexes does not support mutable operations")


   1059     def __getitem__(self, key):

TypeError: Indexes does not support mutable operations
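Since in-place assignment is rejected, the usual way out is to build a new Index. A minimal sketch assuming Python 3 and a recent pandas; combining delete and insert is just one possible way to do it.

```python
import pandas as pd

index0 = pd.Index(['a', 'b', 'c'])

try:
    index0[2] = 'd'              # in-place assignment is rejected
    assignment_failed = False
except TypeError as err:
    assignment_failed = True
    print('TypeError:', err)

# Build a new Index instead of modifying the old one.
index1 = index0.delete(2).insert(2, 'd')

print(index0.tolist())
print(index1.tolist())
```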

Throw in a list of tuples, and you get a MultiIndex.

Unfortunately, if this list comprehension is changed to use parentheses, it no longer works.

multi1 = pd.Index([('Row_'+str(x+1),'Col_'+str(y+1)) for x in xrange(4) for y in xrange(4)])

multi1.name = ['index1','index2']

print multi1

Output results:

MultiIndex(levels=[[u'Row_1', u'Row_2', u'Row_3', u'Row_4'], [u'Col_1', u'Col_2', u'Col_3', u'Col_4']],

           labels=[[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3], [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]])

For a Series, once you have a MultiIndex, the data can be reshaped!

The following code illustrates:

  • A Series with a two-level MultiIndex can unstack() into a DataFrame

  • A DataFrame can stack() into a Series with a MultiIndex

data_for_multi1 = pd.Series(xrange(0,16),index=multi1)


Output results:

Row_1  Col_1     0

       Col_2     1

       Col_3     2

       Col_4     3

Row_2  Col_1     4

       Col_2     5

       Col_3     6

       Col_4     7

Row_3  Col_1     8

       Col_2     9

       Col_3    10

       Col_4    11

Row_4  Col_1    12

       Col_2    13

       Col_3    14

       Col_4    15

dtype: int32

Looking at the output, I think I understand a little: it resembles a grouped summary in Excel. Still, I will need to look up more material on this later.

A Series with a two-level MultiIndex can unstack() into a DataFrame:

data_for_multi1.unstack()

A DataFrame can stack() into a Series with a MultiIndex:

data_for_multi1.unstack().stack()


Output results:

Row_1  Col_1     0

       Col_2     1

       Col_3     2

       Col_4     3

Row_2  Col_1     4

       Col_2     5

       Col_3     6

       Col_4     7

Row_3  Col_1     8

       Col_2     9

       Col_3    10

       Col_4    11

Row_4  Col_1    12

       Col_2    13

       Col_3    14

       Col_4    15

dtype: int32
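The round trip is easier to follow on a smaller grid. This sketch assumes Python 3 and a recent pandas, and uses pd.MultiIndex.from_tuples with explicit level names, which is my choice rather than the exact call used in the listings above.

```python
import pandas as pd

tuples = [('Row_' + str(x + 1), 'Col_' + str(y + 1))
          for x in range(2) for y in range(2)]
multi = pd.MultiIndex.from_tuples(tuples, names=['index1', 'index2'])
s = pd.Series(range(4), index=multi)

table = s.unstack()    # the inner level becomes the columns
back = table.stack()   # the columns go back into the inner index level

print(table)
print(back.tolist())
```

After unstack(), the rows are the Row_ labels and the columns are the Col_ labels, much like a pivot table.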

An example with unbalanced data:

multi2 = pd.Index([('Row_'+str(x+1),'Col_'+str(y+1)) for x in xrange(5) for y in xrange(x)])


Output results:

MultiIndex(levels=[[u'Row_2', u'Row_3', u'Row_4', u'Row_5'], [u'Col_1', u'Col_2', u'Col_3', u'Col_4']],

           labels=[[0, 1, 1, 2, 2, 2, 3, 3, 3, 3], [0, 0, 1, 0, 1, 2, 0, 1, 2, 3]])

data_for_multi2 = pd.Series(np.arange(10),index = multi2)

data_for_multi2

Output results:

Row_2  Col_1    0

Row_3  Col_1    1

       Col_2    2

Row_4  Col_1    3

       Col_2    4

       Col_3    5

Row_5  Col_1    6

       Col_2    7

       Col_3    8

       Col_4    9

dtype: int32

The datetime standard library is so good, you deserve to have it.

import datetime

dates = [datetime.datetime(2015,1,1),datetime.datetime(2015,1,8),datetime.datetime(2015,1,30)]


Output results:

DatetimeIndex(['2015-01-01', '2015-01-08', '2015-01-30'], dtype='datetime64[ns]', freq=None, tz=None)
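The listing above does not show the call that produced the DatetimeIndex. One way to get that result, assuming Python 3 and a recent pandas, is to pass the list to pd.DatetimeIndex; the Series values below are made up.

```python
import datetime

import pandas as pd

dates = [datetime.datetime(2015, 1, 1),
         datetime.datetime(2015, 1, 8),
         datetime.datetime(2015, 1, 30)]

dt_index = pd.DatetimeIndex(dates)          # one possible constructor
ts = pd.Series([1, 2, 3], index=dt_index)   # hypothetical values

# Rows can then be looked up by timestamp.
value = ts.loc[pd.Timestamp('2015-01-08')]
print(dt_index)
print(value)
```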

What if you need not only a uniform time format but also a uniform frequency? Use a PeriodIndex:

periodindex1 = pd.period_range('2015-01','2015-04',freq='M')

print periodindex1

Output results:

PeriodIndex(['2015-01', '2015-02', '2015-03', '2015-04'], dtype='int64', freq='M')

How do you convert between monthly and daily precision?

Some companies use the 1st to represent a month, while others use the last day. Converting by hand is a nuisance, so use asfreq:

print periodindex1.asfreq('D',how='start')

print periodindex1.asfreq('D',how='end')

Output results:

PeriodIndex(['2015-01-01', '2015-02-01', '2015-03-01', '2015-04-01'], dtype='int64', freq='D')

PeriodIndex(['2015-01-31', '2015-02-28', '2015-03-31', '2015-04-30'], dtype='int64', freq='D')
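The same conversion in Python 3 form, as a sketch assuming a recent pandas. Printing the periods as strings makes the start and end conventions easy to compare.

```python
import pandas as pd

months = pd.period_range('2015-01', '2015-04', freq='M')

starts = months.asfreq('D', how='start')   # first day of each month
ends = months.asfreq('D', how='end')       # last day of each month

start_labels = [str(p) for p in starts]
end_labels = [str(p) for p in ends]
print(start_labels)
print(end_labels)
```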

Finally, what if you really want to align the precision of two different frequencies?

periodindex_mon = pd.period_range('2015-01','2015-03',freq='M').asfreq('D',how='start')

periodindex_day = pd.period_range('2015-01-01','2015-03-31',freq='D')

print periodindex_mon

print periodindex_day

Output results:

PeriodIndex(['2015-01-01', '2015-02-01', '2015-03-01'], dtype='int64', freq='D')

PeriodIndex(['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04',

             '2015-01-05', '2015-01-06', '2015-01-07', '2015-01-08',

             '2015-01-09', '2015-01-10', '2015-01-11', '2015-01-12',

             '2015-01-13', '2015-01-14', '2015-01-15', '2015-01-16',

             '2015-01-17', '2015-01-18', '2015-01-19', '2015-01-20',

             '2015-01-21', '2015-01-22', '2015-01-23', '2015-01-24',

             '2015-01-25', '2015-01-26', '2015-01-27', '2015-01-28',

             '2015-01-29', '2015-01-30', '2015-01-31', '2015-02-01',

             '2015-02-02', '2015-02-03', '2015-02-04', '2015-02-05',

             '2015-02-06', '2015-02-07', '2015-02-08', '2015-02-09',

             '2015-02-10', '2015-02-11', '2015-02-12', '2015-02-13',

             '2015-02-14', '2015-02-15', '2015-02-16', '2015-02-17',

             '2015-02-18', '2015-02-19', '2015-02-20', '2015-02-21',

             '2015-02-22', '2015-02-23', '2015-02-24', '2015-02-25',

             '2015-02-26', '2015-02-27', '2015-02-28', '2015-03-01',

             '2015-03-02', '2015-03-03', '2015-03-04', '2015-03-05',

             '2015-03-06', '2015-03-07', '2015-03-08', '2015-03-09',

             '2015-03-10', '2015-03-11', '2015-03-12', '2015-03-13',

             '2015-03-14', '2015-03-15', '2015-03-16', '2015-03-17',

             '2015-03-18', '2015-03-19', '2015-03-20', '2015-03-21',

             '2015-03-22', '2015-03-23', '2015-03-24', '2015-03-25',

             '2015-03-26', '2015-03-27', '2015-03-28', '2015-03-29',

             '2015-03-30', '2015-03-31'],

            dtype='int64', freq='D')

Coarse-grained data + reindex + ffill/bfill

full_ts = pd.Series(periodindex_mon,index=periodindex_mon).reindex(periodindex_day,method='ffill')
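To make the effect of the forward fill visible, the sketch below uses made-up monthly figures instead of the periods themselves as values. It assumes Python 3 and a recent pandas.

```python
import pandas as pd

month_starts = pd.period_range('2015-01', '2015-03', freq='M').asfreq('D', how='start')
days = pd.period_range('2015-01-01', '2015-03-31', freq='D')

# Hypothetical monthly figures, one per month start.
monthly = pd.Series([100, 200, 300], index=month_starts)

# Forward fill carries each month's value to the following days of that month.
daily = monthly.reindex(days, method='ffill')

mid_february = daily.loc[pd.Period('2015-02-14', freq='D')]
print(len(daily))
print(mid_february)
```

Using method='bfill' instead would fill each day from the next month start, leaving the days after the last month start empty.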



What convenient operations does an index offer?

As described earlier, an index is ordered and may contain repeats, but to some extent it can still be accessed by key, which means some set operations are still supported.

index1 = pd.Index(['A','B','B','C','C'])

index2 = pd.Index(['C','D','E','E','F'])

index3 = pd.Index(['B','C','A'])

print index1.append(index2)

print index1.difference(index2)

print index1.intersection(index2)

print index1.union(index2) # Support unique-value Index well

print index1.isin(index2)

print index1.delete(2)

print index1.insert(0,'K') # Not suggested

print index3.drop('A') # Support unique-value Index well

print index1.is_monotonic,index2.is_monotonic,index3.is_monotonic

print index1.is_unique,index2.is_unique,index3.is_unique

Output results:

Index([u'A', u'B', u'B', u'C', u'C', u'C', u'D', u'E', u'E', u'F'], dtype='object')

Index([u'A', u'B'], dtype='object')

Index([u'C', u'C'], dtype='object')

Index([u'A', u'B', u'B', u'C', u'C', u'D', u'E', u'E', u'F'], dtype='object')

[False False False  True  True]

Index([u'A', u'B', u'C', u'C'], dtype='object')

Index([u'K', u'A', u'B', u'B', u'C', u'C'], dtype='object')

Index([u'B', u'C'], dtype='object')

True True False

False False True
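A Python 3 version of a few of these calls, as a sketch assuming a recent pandas. Recent versions use is_monotonic_increasing in place of is_monotonic, and the exact handling of repeated labels in set operations may differ from the output above, so the sketch compares sets of labels.

```python
import pandas as pd

index1 = pd.Index(['A', 'B', 'B', 'C', 'C'])
index2 = pd.Index(['C', 'D', 'E', 'E', 'F'])

appended = index1.append(index2)        # plain concatenation, repeats kept
only_in_1 = index1.difference(index2)   # labels of index1 absent from index2
mask = index1.isin(index2)              # element-wise membership, a Boolean array

print(len(appended))
print(sorted(set(only_in_1)))
print(mask.tolist())
print(index1.is_unique, index1.is_monotonic_increasing)
```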

References:

  • S1EP3_Pandas.pdf: material I saved on my computer at some point and rediscovered today. Thanks to its author.

  • Introduction to Python Data Analysis: pandas Basics Summary (II)

Visit Michael Xiang's blog to read the complete version.