Time：2019-9-17

# I. Series

Series: pandas' spear (a column or row of a data table, an observation vector, a one-dimensional array…)

``````
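# Python 2 syntax (print statement), as used in the snippets of this post
import numpy as np
import pandas as pd

Series1 = pd.Series(np.random.randn(4))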

print Series1,type(Series1)

print Series1.index

print Series1.values
``````

Output results:

``````
0   -0.676256

1    0.533014

2   -0.935212

3   -0.940822

dtype: float64 <class 'pandas.core.series.Series'>

Int64Index([0, 1, 2, 3], dtype='int64')

[-0.67625578  0.53301431 -0.93521212 -0.94082195]
``````
• `np.random.randn()` draws samples from the standard normal distribution; see its function documentation for details.

## Series supports boolean filtering, just like NumPy

``````
print Series1>0

print Series1[Series1>0]
``````

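The block below shows `Series1` itself from a separate run (so the values differ from those above). Rows 0 and 1 are positive, so `Series1>0` is True only for them, and `Series1[Series1>0]` keeps just those two rows: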

``````
0 0.030480

1 0.072746

2 -0.186607

3 -1.412244

dtype: float64 <class 'pandas.core.series.Series'>

Int64Index([0, 1, 2, 3], dtype='int64')

[ 0.03048042 0.07274621 -0.18660749 -1.41224432]
``````

Note that the logical expression by itself evaluates to True/False values. To get the actual values, you still need the X[y] form.
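A minimal sketch of the two steps, written in Python 3 syntax and using the fixed values printed above instead of random draws (the variable names are mine):

```python
import pandas as pd

# Fixed values (the ones printed above) so the result is reproducible
s = pd.Series([0.030480, 0.072746, -0.186607, -1.412244])

mask = s > 0          # a boolean Series aligned with s
positives = s[mask]   # keeps only the rows where mask is True

print(mask.tolist())             # [True, True, False, False]
print(positives.index.tolist())  # [0, 1]
```

The mask keeps the original row labels, which is why the filtered result still carries index 0 and 1.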

## Broadcasting is also supported, of course.

What is `broadcasting`? I'm not sure yet. Let's look at an example.

``````
print Series1*2

print Series1+5
``````

The output results are as follows:

``````
0 0.06096

1 0.145492

2 -0.373215

3 -2.824489

dtype: float64

0 5.030480

1 5.072746

2 4.813393

3 3.587756

dtype: float64
``````
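Judging by the output, the scalar is applied to every element. A small sketch in Python 3 syntax, assuming the same fixed values as above, that compares the Series result with the plain NumPy array result:

```python
import numpy as np
import pandas as pd

s = pd.Series([0.030480, 0.072746, -0.186607, -1.412244])

doubled = s * 2   # the scalar 2 is applied to every element
shifted = s + 5   # likewise for the scalar 5

# The underlying NumPy array follows the same rule
arr = s.values
print(np.allclose(doubled.values, arr * 2))  # True
print(np.allclose(shifted.values, arr + 5))  # True
print(shifted.index.tolist())                # [0, 1, 2, 3]
```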

## And universal functions

`numpy.frompyfunc(func, nin, nout)` takes an ordinary Python function and returns a universal function (ufunc); `nin` is the number of input arguments and `nout` is the number of objects the function returns.
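A hedged sketch in Python 3 syntax: a built-in ufunc applied to a Series, and `frompyfunc` wrapping a small helper. The helper `sign_label` and the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.030480, 0.072746, -0.186607, -1.412244])

# A built-in ufunc works element-wise and returns a Series with the same index
abs_s = np.abs(s)

# Hypothetical helper, for illustration only: 1 input, 1 output
def sign_label(x):
    return 'pos' if x > 0 else 'non-pos'

label_ufunc = np.frompyfunc(sign_label, 1, 1)
labels = label_ufunc(s.values)  # applied to the NumPy array; the result has dtype object

print(abs_s.index.tolist())  # [0, 1, 2, 3]
print(labels.tolist())       # ['pos', 'pos', 'non-pos', 'non-pos']
```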

## With row labels in a Series, there is no need for a two-column table: it is easy to tell which part is data and which is metadata

What I mean is that a Series should stay a single column where possible; instead of creating a second column, use the index to label the data.

``````
Series2 = pd.Series(Series1.values,index=['norm_'+unicode(i) for i in xrange(4)])

print Series2,type(Series2)

print Series2.index

print type(Series2.index)

print Series2.values
``````

The output is as follows. As you can see, only the style of the `index` values has changed; no second column was created.

``````
norm_0   -0.676256

norm_1    0.533014

norm_2   -0.935212

norm_3   -0.940822

dtype: float64 <class 'pandas.core.series.Series'>

Index([u'norm_0', u'norm_1', u'norm_2', u'norm_3'], dtype='object')

<class 'pandas.core.index.Index'>

[-0.67625578  0.53301431 -0.93521212 -0.94082195]
``````

Although rows are ordered, data can still be accessed through row-level index:

(Of course this is not exactly like an OrderedDict, because row labels may repeat. Duplicate row labels are discouraged, but that does not mean they cannot be used.)

``````
print Series2[['norm_0','norm_3']]
``````

As you can see, reading data really does use the X[y] form. It is X[[y]] here because two items are read: the requested `index` values are put in a `list`, and that list is used for the lookup. The output is as follows:

``````
norm_0   -0.676256

norm_3   -0.940822

dtype: float64
``````

Another example:

``````
print 'norm_0' in Series2

print 'norm_6' in Series2
``````

Output results:

``````
True

False
``````

The `in` expressions evaluate to Boolean values.

## Defining a Series from a Dict or OrderedDict, whose keys cannot repeat, means no worrying about duplicate row labels:

``````
Series3_Dict = {"Japan":"Tokyo","S.Korea":"Seoul","China":"Beijing"}

Series3_pdSeries = pd.Series(Series3_Dict)

print Series3_pdSeries

print Series3_pdSeries.values

print Series3_pdSeries.index
``````

Output results:

``````
China Beijing

Japan Tokyo

S.Korea Seoul

dtype: object

['Beijing' 'Tokyo' 'Seoul']

Index([u'China', u'Japan', u'S.Korea'], dtype='object')
``````

As the output above shows, the result does not follow the input order (here the keys come out sorted alphabetically).

Want the Series kept in your own order? Pass an explicit index list. Keys with no matching value are not a problem: they simply become missing values.

``````
Series4_IndexList = ["Japan","China","Singapore","S.Korea"]

Series4_pdSeries = pd.Series( Series3_Dict ,index = Series4_IndexList)

print Series4_pdSeries

print Series4_pdSeries.values

print Series4_pdSeries.index

print Series4_pdSeries.isnull()

print Series4_pdSeries.notnull()
``````

The output follows the order defined in the `list`.
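Since no output is shown for this step, here is a small Python 3 sketch of the same idea with the data from above (variable names are mine), checking the resulting order and which entry is missing:

```python
import pandas as pd

capitals = {"Japan": "Tokyo", "S.Korea": "Seoul", "China": "Beijing"}
order = ["Japan", "China", "Singapore", "S.Korea"]

s = pd.Series(capitals, index=order)

print(s.index.tolist())     # ['Japan', 'China', 'Singapore', 'S.Korea']
print(s.isnull().tolist())  # [False, False, True, False]
```

"Singapore" has no entry in the dictionary, so it is the only missing value.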

## Metadata at the level of the whole Series: name

When the Series and its index each have a name, later joins with other data become more convenient!

To me this feels like the role a column name plays. For example:

``````
print Series4_pdSeries.name

print Series4_pdSeries.index.name
``````

Obviously, both outputs are `None`, because we haven't specified a `name` yet!

``````
Series4_pdSeries.name = "Capital Series"

Series4_pdSeries.index.name = "Nation"

print Series4_pdSeries
``````

Output results:

``````
Nation

Japan Tokyo

China Beijing

Singapore NaN

S.Korea Seoul

Name: Capital Series, dtype: object
``````

A “dictionary”? Not quite: index labels can repeat, although that is not recommended.

``````
Series5_IndexList = ['A','B','B','C']

Series5 = pd.Series(Series1.values,index = Series5_IndexList)

print Series5

print Series5[['B','A']]
``````

Output results:

``````
A 0.030480

B 0.072746

B -0.186607

C -1.412244

dtype: float64

B 0.072746

B -0.186607

A 0.030480

dtype: float64
``````

We can see that `Series5['B']` returns two values, so try not to repeat index labels.
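A short Python 3 sketch of why duplicates are awkward, using the values printed above: the same lookup syntax returns a scalar for a unique label but a two-row Series for a repeated one.

```python
import pandas as pd

s = pd.Series([0.030480, 0.072746, -0.186607, -1.412244],
              index=['A', 'B', 'B', 'C'])

print(s.index.is_unique)  # False
print(s['A'])             # a single scalar
print(s['B'].tolist())    # [0.072746, -0.186607], a Series with two rows
```

Checking `index.is_unique` up front is one way to catch this before relying on key lookups.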

# II. DataFrame

DataFrame: pandas' hammer (a data table, a two-dimensional array)

An ordered collection of Series, as convenient as R's data.frame.

If you think about it, most data formats can be represented as a DataFrame.

## Defining from a NumPy two-dimensional array, a file, or a database: the data is there, just don't forget the column names

``````
dataNumPy = np.asarray([('Japan','Tokyo',4000),('S.Korea','Seoul',1300),('China','Beijing',9100)])

DF1 = pd.DataFrame(dataNumPy,columns=['nation','capital','GDP'])

DF1
``````

Here the DataFrame's `columns` argument should mean the column names. Now look at the printed result: pleasant to read, isn't it? Excel style.

## Equal-length column data stored in a dictionary (JSON): unfortunately, dictionary keys are unordered

``````

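DF2 = pd.DataFrame({'nation':['Japan','S.Korea','China'],'capital':['Tokyo','Seoul','Beijing'],'GDP':[4900,1300,9100]}) # assumed definition, inferred from the output below

DF2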
``````

The output shows it: out of order!

``````
    GDP  capital   nation
0  4900    Tokyo    Japan
1  1300    Seoul  S.Korea
2  9100  Beijing    China
``````

PS: I was too lazy to take a screenshot, so there is no table border here.

## Define a DataFrame from another DataFrame: Ah, obsessive-compulsive disorder!

``````
DF21 = pd.DataFrame(DF2,columns=['nation','capital','GDP'])

DF21
``````

Obviously, this uses `DF2` to define `DF21`, and specifying `columns` changes the order of the columns.

``````
DF22 = pd.DataFrame(DF2,columns=['nation','capital','GDP'],index = [2,0,1])

DF22
``````

Obviously, this defines both the order of `columns` and the order of `index`.

``````
nation capital GDP

2 China Beijing 9100

0 Japan Tokyo 4900

1 S.Korea Seoul 1300
``````
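The same reordering can also be sketched with `reindex`, which is my substitution for the constructor form used above. Python 3 syntax; the data is taken from the printed table.

```python
import pandas as pd

df2 = pd.DataFrame({
    'nation': ['Japan', 'S.Korea', 'China'],
    'capital': ['Tokyo', 'Seoul', 'Beijing'],
    'GDP': [4900, 1300, 9100],
})

# Reorder rows by label and columns by name in one call
df22 = df2.reindex(index=[2, 0, 1], columns=['nation', 'capital', 'GDP'])

print(df22.index.tolist())    # [2, 0, 1]
print(df22.columns.tolist())  # ['nation', 'capital', 'GDP']
```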

## Getting a column out of the DataFrame? Two approaches (exactly the same as JavaScript!)

OMG, I had almost forgotten JS syntax. Now I remember: an object's attribute can be accessed as `obj.x`, and `obj[x]` works too.

• The `.` form easily conflicts with reserved keywords and with names the object already defines.

• The `[]` form is the safest way to write it.
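A small Python 3 sketch of the conflict. The column named `count` is a made-up example chosen because `DataFrame.count` is an existing method.

```python
import pandas as pd

df = pd.DataFrame({
    'nation': ['Japan', 'S.Korea', 'China'],
    'count': [1, 2, 3],   # hypothetical column whose name clashes with DataFrame.count()
})

# Both spellings return the same column when the name is "safe"
same = df.nation.equals(df['nation'])

# With a clashing name, the dot form returns the method, not the column
dot_is_method = callable(df.count)
bracket_values = df['count'].tolist()

print(same, dot_is_method, bracket_values)  # True True [1, 2, 3]
```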

## Getting a row out of the DataFrame? There are at least two ways:

• Methods 1 and 2:

``````
print DF22[0:1] # a positional slice; gives an actual DataFrame

print DF22.ix[0] # looks the row up by its index label; gives a Series (.ix was later removed from pandas, .loc[0] is the label-based equivalent)
``````

Output results:

``````
nation  capital   GDP

2  China  Beijing  9100

nation     Japan

capital    Tokyo

GDP         4900

Name: 0, dtype: object
``````
• Method 3: the ultimate approach, like NumPy slicing: iloc

``````
print DF22.iloc[0,:] # the first argument is the row, the second is the column: here row 0, all columns

print DF22.iloc[:,0] # by the same rule: all rows, column 0
``````

Output the results and verify:

``````
nation       China

capital    Beijing

GDP           9100

Name: 2, dtype: object

2      China

0      Japan

1    S.Korea

Name: nation, dtype: object
``````
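The difference between label-based and position-based access is easy to mix up, so here is a Python 3 sketch on the same table, using `.loc` in place of `.ix` (which newer pandas versions no longer provide):

```python
import pandas as pd

df = pd.DataFrame(
    {'nation': ['China', 'Japan', 'S.Korea'],
     'capital': ['Beijing', 'Tokyo', 'Seoul'],
     'GDP': [9100, 4900, 1300]},
    index=[2, 0, 1],
)

by_slice = df[0:1]            # positional slice, returns a one-row DataFrame
by_label = df.loc[0]          # row whose index label is 0 (Japan)
by_position = df.iloc[0]      # first row regardless of its label (China)
first_column = df.iloc[:, 0]  # all rows, first column

print(by_label['nation'], by_position['nation'])  # Japan China
```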

## Columns can be added dynamically, but only with “[]”, not with “.”

An example to illustrate:

``````
DF22['population'] = [1600,130,55]

DF22
``````

Output results:

``````
nation    capital    GDP    population

2    China    Beijing    9100    1600

0    Japan    Tokyo    4900    130

1    S.Korea    Seoul    1300    55
``````

# III. Index: row-level index

Index: pandas' joker card for data manipulation (row-level index)

A row-level index:

• May be derived from real data, so it can itself be treated as data.

• Can be composed of multiple levels, that is, multiple columns.

• Can be swapped with column names, or stacked and unstacked, to achieve the effect of an Excel PivotTable.

Index comes in four… oh no, many flavors. Some important index types include:

• pd.Index (ordinary)

• Int64Index

• MultiIndex

• DatetimeIndex (indexed by timestamps)

• PeriodIndex (indexed by time periods)

## Define an ordinary Index directly; it looks just like an ordinary Series.

``````
index_names = ['a','b','c']

Series_for_Index = pd.Series(index_names)

print pd.Index(index_names)

print pd.Index(Series_for_Index)
``````

Output results:

``````
Index([u'a', u'b', u'c'], dtype='object')

Index([u'a', u'b', u'c'], dtype='object')
``````

Unfortunately, an Index is immutable. Remember: immutable! An example follows. (A note to myself: I don't fully understand this part yet…)

``````
index_names = ['a','b','c']

index0 = pd.Index(index_names)

print index0.get_values()

index0[2] = 'd'
``````

The output results are as follows:

``````
['a' 'b' 'c']

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-36-f34da0a8623c> in <module>()

2 index0 = pd.Index(index_names)

3 print index0.get_values()

----> 4 index0[2] = 'd'

C:\Anaconda\lib\site-packages\pandas\core\index.pyc in __setitem__(self, key, value)

1055

1056     def __setitem__(self, key, value):

-> 1057         raise TypeError("Indexes does not support mutable operations")

1058

1059     def __getitem__(self, key):

TypeError: Indexes does not support mutable operations
``````
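A Python 3 sketch of the same experiment that catches the error, plus one possible workaround: building a new Index with `delete` and `insert` instead of modifying the old one.

```python
import pandas as pd

index0 = pd.Index(['a', 'b', 'c'])

try:
    index0[2] = 'd'   # in-place assignment is rejected
    raised = False
except TypeError:
    raised = True

# Build a new Index instead of modifying the old one
index1 = index0.delete(2).insert(2, 'd')

print(raised)           # True
print(index1.tolist())  # ['a', 'b', 'd']
print(index0.tolist())  # ['a', 'b', 'c'], unchanged
```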

## Throw in a list of tuples, and you get a MultiIndex

Unfortunately, if the square brackets of this list comprehension are changed to parentheses (which makes it a generator expression), it does not work.

``````
multi1 = pd.Index([('Row_'+str(x+1),'Col_'+str(y+1)) for x in xrange(4) for y in xrange(4)])

multi1.name = ['index1','index2']

print multi1
``````

Output results:

``````
MultiIndex(levels=[[u'Row_1', u'Row_2', u'Row_3', u'Row_4'], [u'Col_1', u'Col_2', u'Col_3', u'Col_4']],

labels=[[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3], [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]])
``````

## For a Series with a MultiIndex: data, transform!

The following code shows that:

• A Series with a two-level MultiIndex can `unstack()` into a DataFrame

• A DataFrame can `stack()` into a Series with a MultiIndex

``````
data_for_multi1 = pd.Series(xrange(0,16),index=multi1)

data_for_multi1
``````

Output results:

``````
Row_1  Col_1     0

Col_2     1

Col_3     2

Col_4     3

Row_2  Col_1     4

Col_2     5

Col_3     6

Col_4     7

Row_3  Col_1     8

Col_2     9

Col_3    10

Col_4    11

Row_4  Col_1    12

Col_2    13

Col_3    14

Col_4    15

dtype: int32
``````

Seeing the output, I think I understand a little: it looks a bit like a grouped summary in Excel. I will still need to read up on this later.

### A Series with a two-level MultiIndex can `unstack()` into a DataFrame

``````
data_for_multi1.unstack()
``````

### A DataFrame can `stack()` into a Series with a MultiIndex

``````
data_for_multi1.unstack().stack()
``````

Output results:

``````
Row_1  Col_1     0

Col_2     1

Col_3     2

Col_4     3

Row_2  Col_1     4

Col_2     5

Col_3     6

Col_4     7

Row_3  Col_1     8

Col_2     9

Col_3    10

Col_4    11

Row_4  Col_1    12

Col_2    13

Col_3    14

Col_4    15

dtype: int32
``````
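The round trip can be checked in a short Python 3 sketch. It builds the index with `pd.MultiIndex.from_product`, which is my shortcut for the list of tuples used above.

```python
import pandas as pd

rows = ['Row_%d' % (x + 1) for x in range(4)]
cols = ['Col_%d' % (y + 1) for y in range(4)]
multi = pd.MultiIndex.from_product([rows, cols])

long_form = pd.Series(range(16), index=multi)
wide = long_form.unstack()   # inner level becomes the columns: a 4x4 DataFrame
back = wide.stack()          # columns fold back into the inner index level

print(wide.shape)                        # (4, 4)
print(wide.loc['Row_2', 'Col_3'])        # 6
print(back.tolist() == list(range(16)))  # True
```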

## Examples of unbalanced data:

``````
multi2 = pd.Index([('Row_'+str(x+1),'Col_'+str(y+1)) for x in xrange(5) for y in xrange(x)])

multi2
``````

Output results:

``````
MultiIndex(levels=[[u'Row_2', u'Row_3', u'Row_4', u'Row_5'], [u'Col_1', u'Col_2', u'Col_3', u'Col_4']],

labels=[[0, 1, 1, 2, 2, 2, 3, 3, 3, 3], [0, 0, 1, 0, 1, 2, 0, 1, 2, 3]])
``````
``````
data_for_multi2 = pd.Series(np.arange(10),index = multi2)

data_for_multi2
``````

Output results:

``````
Row_2  Col_1    0

Row_3  Col_1    1

Col_2    2

Row_4  Col_1    3

Col_2    4

Col_3    5

Row_5  Col_1    6

Col_2    7

Col_3    8

Col_4    9

dtype: int32
``````

## The datetime standard library is so handy, you deserve it.

``````
import datetime

dates = [datetime.datetime(2015,1,1),datetime.datetime(2015,1,8),datetime.datetime(2015,1,30)]

pd.DatetimeIndex(dates)
``````

Output results:

``````
DatetimeIndex(['2015-01-01', '2015-01-08', '2015-01-30'], dtype='datetime64[ns]', freq=None, tz=None)
``````

### What if you need not only a uniform time format, but also a uniform frequency?

``````
periodindex1 = pd.period_range('2015-01','2015-04',freq='M')

print periodindex1
``````

Output results:

``````
PeriodIndex(['2015-01', '2015-02', '2015-03', '2015-04'], dtype='int64', freq='M')
``````

### How to convert between monthly and daily precision?

Some companies use the 1st to represent a month, while others use the last day. Converting between the two is a hassle; `asfreq` handles it.

``````
print periodindex1.asfreq('D',how='start')

print periodindex1.asfreq('D',how='end')
``````

Output results:

``````
PeriodIndex(['2015-01-01', '2015-02-01', '2015-03-01', '2015-04-01'], dtype='int64', freq='D')

PeriodIndex(['2015-01-31', '2015-02-28', '2015-03-31', '2015-04-30'], dtype='int64', freq='D')
``````
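The same conversion as a Python 3 sketch, with the periods turned into strings so both conventions are easy to compare:

```python
import pandas as pd

months = pd.period_range('2015-01', '2015-04', freq='M')

first_days = months.asfreq('D', how='start')  # each month represented by its first day
last_days = months.asfreq('D', how='end')     # each month represented by its last day

print([str(p) for p in first_days])
print([str(p) for p in last_days])
```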

### Finally, what if you really need to align data at two different frequencies?

``````
periodindex_mon = pd.period_range('2015-01','2015-03',freq='M').asfreq('D',how='start')

periodindex_day = pd.period_range('2015-01-01','2015-03-31',freq='D')

print periodindex_mon

print periodindex_day
``````

Output results:

``````
PeriodIndex(['2015-01-01', '2015-02-01', '2015-03-01'], dtype='int64', freq='D')

PeriodIndex(['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04',

'2015-01-05', '2015-01-06', '2015-01-07', '2015-01-08',

'2015-01-09', '2015-01-10', '2015-01-11', '2015-01-12',

'2015-01-13', '2015-01-14', '2015-01-15', '2015-01-16',

'2015-01-17', '2015-01-18', '2015-01-19', '2015-01-20',

'2015-01-21', '2015-01-22', '2015-01-23', '2015-01-24',

'2015-01-25', '2015-01-26', '2015-01-27', '2015-01-28',

'2015-01-29', '2015-01-30', '2015-01-31', '2015-02-01',

'2015-02-02', '2015-02-03', '2015-02-04', '2015-02-05',

'2015-02-06', '2015-02-07', '2015-02-08', '2015-02-09',

'2015-02-10', '2015-02-11', '2015-02-12', '2015-02-13',

'2015-02-14', '2015-02-15', '2015-02-16', '2015-02-17',

'2015-02-18', '2015-02-19', '2015-02-20', '2015-02-21',

'2015-02-22', '2015-02-23', '2015-02-24', '2015-02-25',

'2015-02-26', '2015-02-27', '2015-02-28', '2015-03-01',

'2015-03-02', '2015-03-03', '2015-03-04', '2015-03-05',

'2015-03-06', '2015-03-07', '2015-03-08', '2015-03-09',

'2015-03-10', '2015-03-11', '2015-03-12', '2015-03-13',

'2015-03-14', '2015-03-15', '2015-03-16', '2015-03-17',

'2015-03-18', '2015-03-19', '2015-03-20', '2015-03-21',

'2015-03-22', '2015-03-23', '2015-03-24', '2015-03-25',

'2015-03-26', '2015-03-27', '2015-03-28', '2015-03-29',

'2015-03-30', '2015-03-31'],

dtype='int64', freq='D')
``````

### Coarse-grained data + `reindex` + `ffill/bfill`

``````
full_ts = pd.Series(periodindex_mon,index=periodindex_mon).reindex(periodindex_day,method='ffill')

full_ts
``````
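No output is shown for this step, so here is a Python 3 sketch of the idea with made-up monthly figures (100, 200, 300): each month's value is carried forward to every day until the next month starts.

```python
import pandas as pd

month_starts = pd.period_range('2015-01', '2015-03', freq='M').asfreq('D', how='start')
days = pd.period_range('2015-01-01', '2015-03-31', freq='D')

# Hypothetical monthly figures, one value per month start
monthly = pd.Series([100, 200, 300], index=month_starts)

# Expand to daily precision; ffill copies the last known value forward
daily = monthly.reindex(days, method='ffill')

print(len(daily))                      # 90
print(daily.iloc[30], daily.iloc[31])  # 100 200 (Jan 31, then Feb 1)
```

With `method='bfill'` the values would be filled backward from the next known entry instead.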

## What convenient operations does an Index offer?

As described earlier, an index is ordered and may contain duplicates, yet to some extent it can still be accessed by key, which means some set operations are supported.

``````
index1 = pd.Index(['A','B','B','C','C'])

index2 = pd.Index(['C','D','E','E','F'])

index3 = pd.Index(['B','C','A'])

print index1.append(index2)

print index1.difference(index2)

print index1.intersection(index2)

print index1.union(index2) # Support unique-value Index well

print index1.isin(index2)

print index1.delete(2)

print index1.insert(0,'K') # Not suggested

print index3.drop('A') # Support unique-value Index well

print index1.is_monotonic,index2.is_monotonic,index3.is_monotonic

print index1.is_unique,index2.is_unique,index3.is_unique
``````

Output results:

``````
Index([u'A', u'B', u'B', u'C', u'C', u'C', u'D', u'E', u'E', u'F'], dtype='object')

Index([u'A', u'B'], dtype='object')

Index([u'C', u'C'], dtype='object')

Index([u'A', u'B', u'B', u'C', u'C', u'D', u'E', u'E', u'F'], dtype='object')

[False False False  True  True]

Index([u'A', u'B', u'C', u'C'], dtype='object')

Index([u'K', u'A', u'B', u'B', u'C', u'C'], dtype='object')

Index([u'B', u'C'], dtype='object')

True True False

False False True
``````
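As the comments in the code note, these operations behave best on an Index with unique values. A Python 3 sketch with two small unique indexes (my own sample labels):

```python
import pandas as pd

a = pd.Index(['A', 'B', 'C'])
b = pd.Index(['C', 'D', 'E'])

print(a.union(b).tolist())         # ['A', 'B', 'C', 'D', 'E']
print(a.intersection(b).tolist())  # ['C']
print(a.difference(b).tolist())    # ['A', 'B']
print(a.isin(b).tolist())          # [False, False, True]
print(a.is_unique, a.is_monotonic_increasing)  # True True
```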

# Reference resources:

• S1EP3_Pandas.pdf: material I saved on my computer at some point and rediscovered today. Thanks to its author.

• Pandas Summary Basis for Introduction to Python Data Analysis (2)

Welcome to Michael Xiang’s blog to view the completed version.
