Pandas Tutorial - 526互联

1. Overview https://zhuanlan.zhihu.com/p/370471321
2. Pandas-related libraries
3. Importing the Pandas library
4. Pandas data structures

Series
DataFrame

5. Loading/reading data

CSV
Excel
Others (JSON, SQL, HTML)

6. Saving data
7. Creating test objects
8. Summary statistics functions

df.info()
df.shape
df.index
df.columns
df.sum()
df.min()
df.max()
idxmin()
idxmax()
df.describe()
df.mean()
df.median()
df.quantile([0.25, 0.75])
df.var()
df.std()
df.cummax()
df.cummin()
df['columnName'].cumprod()
len(df)
df.isnull()
df.corr()

9. Selecting and filtering in Pandas

series['index']
df.loc[m:n]
df['columnName']
df['columnName'][n]
df['columnName'].nunique()
df['columnName'].unique()
df.columnName
df['columnName'].value_counts(dropna =False)
df.head(n)
df.tail(n)
df.sample(n)
df.sample(frac=0.5)
df.nlargest(n,'columnName')
df.nsmallest(n,'columnName')
df[df.columnName < n]
df[['columnName','columnName']]
df.loc[:,"columnName1":"columnName2"]
Create Filter
df.filter(regex = 'code')
np.logical_and
Filtering with &

10.Sort Data

df.sort_values('columnName')
df.sort_values('columnName', ascending=False)
df.sort_index()

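11. Renaming and defining new/modified columns

df.rename(columns= {'columnName' : 'newColumnName'})
Defining a new column
Changing the index name
Converting all column names to lowercase
Converting all column names to uppercase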

12.Drop Data

df.drop(columns=['columnName'])
Series.drop(['index'])
Dropping specific rows
Dropping a variable

13. Converting data types

df.dtypes
df['columnName'] = df['columnName'].astype('dataType')
pd.melt(frame=dataFrameName,id_vars = 'columnName', value_vars= ['columnName'])

14. The apply function

Method1
Method2

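15. Utilities and test code

Overview

This article walks through commonly used Pandas functions and how to use them. Pandas is an open-source, BSD-licensed library that provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language. The goal is for you to become fluent in the everyday pandas techniques below.

Importing the Pandas library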

import numpy as np # linear algebra

import pandas as pd # import in pandas

import os

print(os.listdir("../input"))

Pandas data structures

Pandas has two main data structures: Series and DataFrame.

Series

A Series is a one-dimensional labeled array. It can hold data of any type.

mySeries = pd.Series([3,-5,7,4], index=['a','b','c','d'])

type(mySeries)
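Because a Series carries labels, an element can be reached either by its label or by its position. The following is a minimal sketch using the same mySeries as above; the variable names by_label, by_position, and subset are only illustrative.

```python
import pandas as pd

# Same Series as in the text: four integers labeled 'a' through 'd'.
mySeries = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])

by_label = mySeries['b']        # label-based lookup
by_position = mySeries.iloc[1]  # position-based lookup of the same element
subset = mySeries[['a', 'c']]   # a list of labels returns a smaller Series

print(by_label, by_position)
print(subset.tolist())
```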

DataFrame

A DataFrame is a two-dimensional labeled data structure made up of columns.

data = {'Country': ['Belgium', 'India', 'Brazil'],
        'Capital': ['Brussels', 'New Delhi', 'Brasilia'],
        'Population': [1234, 1234, 1234]}

datas = pd.DataFrame(data, columns=['Country','Capital','Population'])

print(type(data))

print(type(datas))

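Loading/reading data

With Pandas we can read CSV files, Excel files, and SQL databases, among other sources. The next two subsections show how to do this for CSV and Excel files.

CSV (comma-separated values)

# Read a CSV file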

df = pd.read_csv('data.csv')

type(df)
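If you do not have a data.csv at hand, the same call can be tried on an in-memory string. This is a sketch under assumptions: the CSV text below is made up to resemble the stock data used later in this tutorial, and parse_dates is shown as one common option rather than something the original code uses.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for data.csv.
csv_text = (
    "Date,Open,Close\n"
    "2016-07-01,17924.24,17949.37\n"
    "2016-06-30,17712.76,17929.99\n"
)

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# parse_dates converts the named column to datetime64 while reading.
df_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'])

print(df.dtypes)
print(df_dates.dtypes)
```

Without parse_dates the Date column stays a column of strings, which is why df.info() later in this tutorial reports it as object.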

Excel

pd.read_excel('filename')

df.to_excel('dir/dataFrame.xlsx', sheet_name='Sheet1')

Others (JSON, SQL, delimited text, HTML)

# Reads from a SQL table/database

pd.read_sql(query,connection_object)

# From a delimited txt file(like TSV)

pd.read_table(filename)

# Reads from a json formatted string, URL or file

pd.read_json(json_string)

# Parses an html URL, string or file and extracts tables to a list of dataframes

pd.read_html(url)

# Takes the contents of your clipboard and passes it to read_table()

pd.read_clipboard()

# From a dict, keys for columns names, values for data as lists

pd.DataFrame(dict)

Saving data

# -> Writes to a CSV file

df.to_csv(filename)

# -> Writes to an Excel file

df.to_excel(filename)

# -> Writes to a SQL table

df.to_sql(table_name, connection_object)

# -> Writes to a file in JSON format

df.to_json(filename)

# -> Saves as an HTML table

df.to_html(filename)

# -> Writes to the clipboard

df.to_clipboard()

Creating test objects

Create a DataFrame of random numbers with 20 rows and 5 columns:

pd.DataFrame(np.random.rand(20,5)) # 5 columns and 20 rows of random floats
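For test data that is the same on every run, the random generator can be seeded. This is a small sketch; the seed value 0 and the column names A to E are arbitrary choices, not part of the original example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # the seed value 0 is arbitrary

# 20 rows and 5 columns of floats in [0, 1), with named columns.
test_df = pd.DataFrame(rng.random((20, 5)), columns=list('ABCDE'))

print(test_df.shape)
print(test_df.head())
```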

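Summary statistics functions

df.info()

This method prints a summary of the DataFrame:

RangeIndex: the number of rows and the range of the index.
Data columns: the total number of columns.
Columns: each column's name, non-null count, and data type.
dtypes: which data types occur and how many columns have each.
Memory usage: the approximate amount of memory used.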

df.info()

RangeIndex: 1989 entries, 0 to 1988

Data columns (total 7 columns):

Date 1989 non-null object

Open 1989 non-null float64

High 1989 non-null float64

Low 1989 non-null float64

Close 1989 non-null float64

Volume 1989 non-null int64

Adj Close 1989 non-null float64

dtypes: float64(5), int64(1), object(1)

memory usage: 108.9+ KB

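df.shape

This attribute gives the number of rows and columns as a tuple. It is an attribute, not a method, so it is written without parentheses.

df.shape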

(1989, 7)

df.index

This attribute gives the row index of the DataFrame.

df.index

RangeIndex(start=0, stop=1989, step=1)

df.columns

This attribute lists all column labels of the DataFrame.

df.columns

Index(['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close'], dtype='object')
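shape, index, and columns are attributes, so calling them with parentheses fails. The sketch below illustrates this on a small made-up DataFrame; the names n_rows, n_cols, and error_message are only for illustration.

```python
import pandas as pd

# Small stand-in DataFrame; the values are made up for illustration.
df = pd.DataFrame({'Open': [1.0, 2.0, 3.0], 'Close': [1.5, 2.5, 3.5]})

n_rows, n_cols = df.shape        # attribute: a (rows, columns) tuple
column_names = list(df.columns)  # attribute: an Index of column labels
row_labels = list(df.index)      # attribute: here a RangeIndex 0..2

# Calling an attribute as if it were a method raises a TypeError.
try:
    df.shape()
except TypeError as exc:
    error_message = str(exc)

print(n_rows, n_cols, column_names, row_labels)
print(error_message)
```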

df.count()

Returns the number of non-null values in each column.

df.count()

Date 1989

Open 1989

High 1989

Low 1989

Close 1989

Volume 1989

Adj Close 1989

dtype: int64

df.sum()

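Returns the sum of each column. For a string column such as Date, the strings are concatenated, as the output shows.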

df.sum()

Date 2016-07-012016-06-302016-06-292016-06-282016-0...

Open 2.67702e+07

High 2.69337e+07

Low 2.65988e+07

Close 2.6778e+07

Volume 323831020000

Adj Close 2.6778e+07

dtype: object

df.cumsum()

Returns the cumulative sum of each column: row i holds the sum of rows 0 through i.

df.cumsum().head()

df.min()

Returns the minimum value of each column.

df.min()

Date 2008-08-08

Open 6547.01

High 6709.61

Low 6469.95

Close 6547.05

Volume 8410000

Adj Close 6547.05

dtype: object

df.max()

Likewise, returns the maximum value of each column.

idxmin()

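Returns the index label of the minimum value, not the value itself. It behaves differently on a Series and on a DataFrame: on a Series it returns a single label, while on a DataFrame it returns one label per column.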

print("df: ",df['Open'].idxmin())

print("series", mySeries.idxmin())

df: 1842

series b
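The label returned by idxmin() is typically fed back into loc to fetch the whole row. Here is a minimal sketch of that pattern; the small DataFrame is made up and the variable names are illustrative.

```python
import pandas as pd

mySeries = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])
df = pd.DataFrame({'Open': [3.0, 1.0, 2.0], 'Close': [2.0, 5.0, 0.5]})

series_label = mySeries.idxmin()  # label of the smallest value in the Series
column_labels = df.idxmin()       # one label per column

# Use the label to pull the full row where Open is smallest.
min_row = df.loc[df['Open'].idxmin()]

print(series_label)
print(column_labels)
print(min_row)
```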

idxmax()

Likewise, returns the index label of the maximum value.

df.describe()

Provides basic summary statistics, computed per column: count, mean, std, min, 25%, 50%, 75%, max.

df.describe()

df.mean()

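Returns the mean of each column.

df.mean()

df.median()

Likewise, returns the median of each column.

df.median()

df.quantile([0.25, 0.75])

Likewise, returns the 25% and 75% quantiles of each column.

df.quantile([0.25, 0.75])

df.var()

Likewise, returns the variance of each column.

df.var()

df.std()

Likewise, returns the standard deviation of each column.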

df.std()
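One caveat worth checking against your own pandas version: to our knowledge, from pandas 2.0 onward these reductions no longer skip a string column such as Date silently and raise a TypeError instead. Passing numeric_only=True, or selecting the numeric columns first, avoids this. The sketch below uses a small made-up DataFrame.

```python
import pandas as pd

# Made-up data with one string column and two numeric columns.
df = pd.DataFrame({
    'Date': ['2016-07-01', '2016-06-30', '2016-06-29'],
    'Open': [1.0, 2.0, 3.0],
    'Close': [2.0, 4.0, 6.0],
})

# Option 1: ask the reduction to ignore non-numeric columns.
means = df.mean(numeric_only=True)

# Option 2: select the numeric columns explicitly.
quartiles = df[['Open', 'Close']].quantile([0.25, 0.75])

print(means)
print(quartiles)
```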

df.cummax()

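Returns the cumulative maximum of each column: row i holds the largest value seen in rows 0 through i.

df.cummax()

df.cummin()

Returns the cumulative minimum of each column: row i holds the smallest value seen in rows 0 through i.

df.cummin()

df['columnName'].cumprod()

Returns the cumulative product of the column.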

df['Open'].cumprod().head()

0 1.792424e+04

1 3.174878e+08

2 5.542073e+12

3 9.527105e+16

4 1.653449e+21

Name: Open, dtype: float64
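The four cumulative functions are easiest to compare side by side on a tiny Series. This is an illustrative sketch with made-up values.

```python
import pandas as pd

s = pd.Series([2, 1, 3, 2])

running_sum = s.cumsum()       # 2, 2+1, 2+1+3, 2+1+3+2
running_max = s.cummax()       # largest value seen so far
running_min = s.cummin()       # smallest value seen so far
running_product = s.cumprod()  # 2, 2*1, 2*1*3, 2*1*3*2

print(pd.DataFrame({
    'value': s,
    'cumsum': running_sum,
    'cummax': running_max,
    'cummin': running_min,
    'cumprod': running_product,
}))
```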

len(df)

Returns the number of rows in the DataFrame.

len(df)

1989

df.isnull()

Returns a DataFrame of booleans with the same shape, marking which values are null.

df.isnull().head()
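In practice the boolean result is usually summarized rather than read cell by cell. A minimal sketch, on made-up data, of counting nulls per column and checking whether any exist at all:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Open': [1.0, np.nan, 3.0],
    'Close': [np.nan, np.nan, 1.0],
})

# True counts as 1 when summed, so this gives nulls per column.
null_counts = df.isnull().sum()

# A single yes/no answer for the whole DataFrame.
has_any_null = bool(df.isnull().values.any())

print(null_counts)
print(has_any_null)
```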

df.corr()

Returns the pairwise correlation coefficients between columns.

df.corr()

Selecting and filtering in Pandas

mySeries['b']

Returns the value whose index label is 'b'.

mySeries['b']

-5

df[n:N]

Returns rows n through N-1 (a positional slice, so the end is excluded).

df[1982:]

df.iloc[[m],[n]]

Returns a DataFrame containing the value at row position m and column position n.

df.iloc[[0],[3]]

df.loc[m:n]

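Returns the rows whose index labels run from m through n. Unlike positional slicing, label slicing with loc includes both endpoints.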

df.loc[5:7]
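The difference between positional and label-based slicing is easy to miss when the index is the default 0, 1, 2, ... sequence. The following sketch, on a made-up single-column DataFrame, shows that the same 5:7 bounds select different numbers of rows.

```python
import pandas as pd

# Default RangeIndex, so labels and positions coincide (0..7).
df = pd.DataFrame({'Open': [10, 11, 12, 13, 14, 15, 16, 17]})

by_slice = df[5:7]         # positional: rows 5 and 6
by_iloc = df.iloc[5:7]     # positional: rows 5 and 6
by_loc = df.loc[5:7]       # label-based: rows 5, 6 and 7
cell = df.iloc[[0], [0]]   # lists of positions give a 1x1 DataFrame

print(len(by_slice), len(by_iloc), len(by_loc))
print(cell)
```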

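df['columnName'] or df.columnName

Selects the given column. Attribute access (df.columnName) only works when the name is a valid Python identifier and does not clash with an existing DataFrame attribute or method.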

df['Open'].head()

df.Open.head()

0 17924.240234

1 17712.759766

2 17456.019531

3 17190.509766

4 17355.210938

Name: Open, dtype: float64

df['columnName'][n]

Selects one element of the given column, looked up by index label n.

df['Open'][0]

17924.240234

df['columnName'].nunique()

Returns the number of distinct values in the given column.

df['Open'].nunique()

1980

df['columnName'].unique()

Returns the distinct values themselves as an array.

df['Open'].unique()

array([17924.240234, 17712.759766, 17456.019531, ..., 11781.700195,

11729.669922, 11432.089844])

df['columnName'].value_counts()

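Counts how many times each distinct value occurs. Null/None values are excluded by default (dropna=True); pass dropna=False to count them as well.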

print(df.Open.value_counts(dropna =True).head())

17374.779297 2

18033.330078 2

10309.389648 2

17711.119141 2

17812.250000 2

Name: Open, dtype: int64
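The effect of dropna only shows up when the column actually contains missing values. A small sketch with a made-up Series containing one NaN:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 1.0, 2.0, np.nan])

counts_default = s.value_counts()               # NaN is left out
counts_with_nan = s.value_counts(dropna=False)  # NaN gets its own row
distinct = s.nunique()                          # NaN is not counted either

print(counts_default)
print(counts_with_nan)
print(distinct)
```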

df.head(n)

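Returns the first n rows of the DataFrame; n defaults to 5.

df.tail(n)

Returns the last n rows of the DataFrame; n defaults to 5.

df.sample(n)

Returns n randomly sampled rows (sampling by count).

df.sample(frac = a)

Returns a random sample containing the fraction a of the rows, where a is between 0 and 1 (sampling by proportion).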

df.nlargest(n,'columnName')

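Returns the n rows with the largest values in the given column.

df.nlargest(5,'Open')

df.nsmallest(n,'columnName')

Likewise, returns the n rows with the smallest values in the given column.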

df[df.columnName < n]

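Returns the rows where the given column's value is less than n.

df[['columnName1','columnName2']]

Returns several columns at once: columnName1, columnName2, ..., columnNamen.

df.loc[:,"columnName1":"columnName2"]

Returns all rows for the columns from columnName1 through columnName2 (both included).

df.loc[m:n,"columnName1":"columnName2"]

Returns rows m through n for the columns from columnName1 through columnName2.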

Create Filter

A filter condition can be created ahead of time and then used to index the DataFrame:

filters = df.Date > '2016-06-27'

df[filters]

df.filter(regex = 'code')

Selects the columns whose names match a regular expression. Note that filter matches on labels, not on the data values.

df.filter(regex='^L').head()
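Since df.filter works on column names, it is a convenient way to pick groups of similarly named columns. The sketch below uses made-up column names; the like= parameter shown alongside regex= does a plain substring match.

```python
import pandas as pd

df = pd.DataFrame({'Low': [1], 'High': [2], 'Close': [3], 'Last': [4]})

# Columns whose name starts with 'L'.
starts_with_l = df.filter(regex='^L')

# Columns whose name contains the substring 'o'.
contains_o = df.filter(like='o')

print(list(starts_with_l.columns))
print(list(contains_o.columns))
```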

np.logical_and

Combines several filter conditions so that all of them must hold.

df[np.logical_and(df['Open']>18281.949219, df['Date']>'2015-05-20' )]

Filtering with &

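The & operator can also combine several filter conditions. Each condition needs its own parentheses, because & binds more tightly than the comparison operators.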

df[(df['Open']>18281.949219) & (df['Date']>'2015-05-20')]
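The two filtering styles give the same result, and | can be used in the same way when either condition is enough. A minimal sketch on three made-up rows (the threshold and date are taken from the examples above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Date': ['2015-05-19', '2015-05-21', '2015-05-22'],
    'Open': [18300.0, 18200.0, 18290.0],
})

high_open = df['Open'] > 18281.949219
recent = df['Date'] > '2015-05-20'   # string comparison on ISO-formatted dates

both = df[high_open & recent]                         # AND
same_as_both = df[np.logical_and(high_open, recent)]  # AND via NumPy
either = df[high_open | recent]                       # OR

print(both)
print(either)
```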

Sort Data

df.sort_values('columnName', ascending=True)

Sorts by the values of the given column, in ascending order by default.

df.sort_values('Open', ascending= False).head()

df.sort_index()

Sorts by the index, in ascending order by default.

df.sort_index().head()
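Both sorting methods return a new DataFrame and leave the original untouched, and sort_index can undo a sort_values. A short sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({'Open': [2.0, 3.0, 1.0]})

descending = df.sort_values('Open', ascending=False)  # new DataFrame
restored = descending.sort_index()                    # back to index order

print(descending.index.tolist())
print(restored.equals(df))
```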

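Renaming and defining new/modified columns

df.rename(columns= {'columnName' : 'newColumnName'})

Renames a column: columnName is the existing name and newColumnName is the new name. rename returns a new DataFrame and leaves df unchanged unless the result is assigned back.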

df.rename(columns= {'Adj Close' : 'Adjclose'}).head()

Defining a new column

For example, this creates a new column named Difference:

df["Difference"] = df.High - df.Low

df.head()

Changing the index name

For example, this sets the name of df's index to index_name:

print(df.index.name)

df.index.name = "index_name"

df.head()

Converting all column names to lowercase

df.columns = df.columns.str.lower()

Converting all column names to uppercase

df.columns = df.columns.str.upper()
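The renaming operations in this section differ in whether they modify df in place. The sketch below, on a made-up two-column DataFrame, shows that rename returns a copy while assigning to df.columns changes the DataFrame itself; the .str accessor on the column Index is one way to change case.

```python
import pandas as pd

df = pd.DataFrame({'Adj Close': [1.0], 'Volume': [10]})

# rename returns a new DataFrame; df keeps its original column names.
renamed = df.rename(columns={'Adj Close': 'Adjclose'})

# Assigning to .columns modifies the DataFrame it is assigned on.
lowered = df.copy()
lowered.columns = lowered.columns.str.lower()

print(list(df.columns))
print(list(renamed.columns))
print(list(lowered.columns))
```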

Drop Data

df.drop(columns=['columnName'])

Drops the given column.

mySeries.drop(['a'])

Removes the entries with the given index labels from a Series.

mySeries.drop(['a'])

b -5

c 7

d 4

dtype: int64

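Dropping specific rows

Rows are dropped by index label, so this example assumes the dates are the index labels (for instance after df.set_index('Date')); with the default RangeIndex it would raise a KeyError.

df.drop(['2016-07-01', '2016-06-27'])

Dropping a variable (a column)

axis=1 means the label refers to a column.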

df.drop('Volume', axis=1)
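Here is a small end-to-end sketch of dropping rows and columns. It assumes, as an illustration only, that the dates are made the index with set_index before rows are dropped by date; the data is made up.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2016-07-01', '2016-06-30', '2016-06-27'],
    'Open': [1.0, 2.0, 3.0],
    'Volume': [10, 20, 30],
})

# Rows are dropped by index label, so make Date the index first.
by_date = df.set_index('Date')
remaining = by_date.drop(['2016-07-01', '2016-06-27'])

# Two equivalent ways to drop a column.
no_volume = df.drop(columns=['Volume'])
same_result = df.drop('Volume', axis=1)

print(remaining)
print(list(no_volume.columns))
```

As with rename, drop returns a new DataFrame, so df still has its Volume column afterwards.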

Converting data types

df.dtypes

This attribute shows the data type of each column.

df.dtypes

Date object

Open float64

High float64

Low float64

Close float64

Volume int64

Adj Close float64

Difference float64

dtype: object

df['columnName'] = df['columnName'].astype('dataType')

Converts the given column to the given data type.

df.Date.astype('category').dtypes
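A few typical conversions, sketched on made-up data. pd.to_datetime is included as an assumption about what one would usually want for a Date column stored as strings; the original example only converts it to category.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2016-07-01', '2016-06-30'],
    'Volume': [10, 20],
})

# astype returns a new Series; assign it back to change the column.
df['Volume'] = df['Volume'].astype('float64')

as_category = df['Date'].astype('category')  # repeated labels stored once
as_datetime = pd.to_datetime(df['Date'])     # real timestamps

print(df.dtypes)
print(as_category.dtype)
print(as_datetime.dtype)
```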

pd.melt(frame=dataFrameName,id_vars = 'columnName', value_vars= ['columnName'])

Reshapes wide data into long format. id_vars names the columns to keep as identifiers, and value_vars names the columns to unpivot into rows.

df_new = df.head()

melted = pd.melt(frame=df_new,id_vars = 'Date', value_vars= ['Low'])

melted
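With more than one column in value_vars the wide-to-long reshaping becomes easier to see. This sketch uses a made-up two-row DataFrame; the output column names variable and value are the pandas defaults.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2016-07-01', '2016-06-30'],
    'Low': [1.0, 2.0],
    'High': [3.0, 4.0],
})

# Each (Date, column) pair becomes its own row.
melted = pd.melt(frame=df, id_vars='Date', value_vars=['Low', 'High'])

print(melted)
```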

The apply function

Method1

A function defined with def

def examples(x):  # create a function
    return x*2

df.Open.apply(examples).head() #use the function with apply()

Method2

A lambda function

df.Open.apply(lambda x: x*2).head()
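For simple arithmetic like doubling, apply is not strictly needed: the same result can be written as a vectorized expression on the whole column, which is usually the faster choice. A sketch on a made-up Series comparing the three forms:

```python
import pandas as pd

open_prices = pd.Series([1.0, 2.5, 4.0], name='Open')

def examples(x):
    """Double a single value."""
    return x * 2

via_def = open_prices.apply(examples)
via_lambda = open_prices.apply(lambda x: x * 2)
vectorized = open_prices * 2   # no Python-level function call per element

print(vectorized.tolist())
```

apply remains useful when the per-element logic cannot be expressed with built-in column operations.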

Utilities and test code

# pd.get_option OR pd.set_option

# pd.reset_option("^display")

# pd.reset_option("display.max_rows")

# pd.get_option("display.max_rows")

# pd.set_option("max_r",102) -> specifies the maximum number of rows to display.

# pd.options.display.max_rows = 999 -> specifies the maximum number of rows to display.

# pd.get_option("display.max_columns")

# pd.options.display.max_columns = 999 -> specifies the maximum number of columns to display.

# pd.set_option('display.width', 300)

# pd.set_option('display.max_columns', 300) -> specifies the maximum number of columns to display.

# pd.set_option('display.max_colwidth', 500) -> specifies the maximum width, in characters, of a column when displayed.

# pd.get_option('max_colwidth')

# pd.set_option('max_colwidth',40)

# pd.reset_option('max_colwidth')

# pd.get_option('max_info_columns')

# pd.set_option('max_info_columns', 11)

# pd.reset_option('max_info_columns')

# pd.get_option('max_info_rows')

# pd.set_option('max_info_rows', 11)

# pd.reset_option('max_info_rows')

# pd.set_option('precision',7) -> sets the output display precision in terms of decimal places. This is only a suggestion.

# OR

# pd.set_option('display.precision',3)

# pd.set_option('chop_threshold', 0) -> sets at what level pandas rounds to zero when it displays a Series or DataFrame. This setting does not change the precision at which the number is stored.

# pd.reset_option('chop_threshold')
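The options above stay in effect until they are reset. When a setting is only needed for one piece of output, pd.option_context applies it temporarily. A minimal sketch; the values 999 and 10 are arbitrary.

```python
import pandas as pd

pd.set_option('display.max_rows', 999)   # stays until changed or reset

# Inside the with-block the option is overridden; afterwards it is restored.
with pd.option_context('display.max_rows', 10):
    inside = pd.get_option('display.max_rows')

after_block = pd.get_option('display.max_rows')

pd.reset_option('display.max_rows')      # back to the pandas default

print(inside, after_block)
```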
