pandas

发布时间 2023-05-05 20:48:31作者: 抓水母的派大星

Table of Contents

Set-up

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"  # 如果不需要,则修改为none即可。删除该段代码不起作用。

Intro

The Pandas library is built on NumPy
provides data structures and data analysis tools.

import pandas as pd

Pandas切片左闭右开


Pandas Data Structures - 2 Types

Notice:
当同时对s和df进行操作的时候:

  1. s会直接在原始的Series对象上进行更改,而不会创建新的对象 (Series是可变对象)
  2. 而df并不会修改原始数据,除非进行赋值操作,或者设置了参数inplace=True(非破坏操作)

Series

  1. (my understanding: ordinary/normal data) )
  2. A one-dimensional labeled array
  3. capable of holding any data type
 s = pd.Series([1, 2, 3, 4],
              index = ['a', 'b', 'c', 'd'])
print(s)
a    1
b    2
c    3
d    4
dtype: int64

DataFrame

  1. two-dimensional labeled data structure
  2. columns of different types
  3. 大多的DataFrame操作都会自动返回一个新的 DataFrame 对象
# Series
data = {
    'Country': ['Belgium', 'India', 'Brazil'],
    'Capital': ['Brussels', 'New Delhi', 'Brasília'],
    'Population': [11190846, 1303171035, 207847528]
    }

print(data)

# series -> DataFrame
# index默认从0开始递增
# 修改默认行索引:index=['foo', 'bar', 'baz']
df = pd.DataFrame(data,
                columns = ['Country', 'Capital', 'Population'])
df # 原始的 DataFrame 对象

# **注意**
# 对df.__任何操作都不会修改它本身数据,除非进行了赋值
# eg. df = df.drop(0)
{'Country': ['Belgium', 'India', 'Brazil'], 'Capital': ['Brussels', 'New Delhi', 'Brasília'], 'Population': [11190846, 1303171035, 207847528]}
Country Capital Population
0 Belgium Brussels 11190846
1 India New Delhi 1303171035
2 Brazil Brasília 207847528

Series VS DataFrame

image.png

Dropping

# drop rows from series: index = 'a' and 'c' 
# 删除指定行
s.drop( ['a', 'c'] ) 
s

df

# drop columns
# 删除指定列 axis=1删除列,axis=0删除行
df.drop('Country', axis=1)
df

# 删除第一行数据, index=0 (不是标题行)
df.drop(0)
df

# Notice
# 看以下输出,数据本身并没有改变
# drop方法会自动返回一个 经过删除操作的 Series/DataFrame
# 如果想改变原有数据,参数中添加:inplace=True
b    2
d    4
dtype: int64






a    1
b    2
c    3
d    4
dtype: int64
Country Capital Population
0 Belgium Brussels 11190846
1 India New Delhi 1303171035
2 Brazil Brasília 207847528
Capital Population
0 Brussels 11190846
1 New Delhi 1303171035
2 Brasília 207847528
Country Capital Population
0 Belgium Brussels 11190846
1 India New Delhi 1303171035
2 Brazil Brasília 207847528
Country Capital Population
1 India New Delhi 1303171035
2 Brazil Brasília 207847528
Country Capital Population
0 Belgium Brussels 11190846
1 India New Delhi 1303171035
2 Brazil Brasília 207847528

Sort & Rank

df
df.sort_index() # 按照index排序,index默认从0开始递增,所以这里看不出变化
df.sort_values( by='Country' ) # sory by 'values'
Country Capital Population
0 Belgium Brussels 11190846
1 India New Delhi 1303171035
2 Brazil Brasília 207847528
Country Capital Population
0 Belgium Brussels 11190846
1 India New Delhi 1303171035
2 Brazil Brasília 207847528
Country Capital Population
0 Belgium Brussels 11190846
2 Brazil Brasília 207847528
1 India New Delhi 1303171035
df
# 给每个元素在该列进行排序
df.rank() # assign ranks to entires
Country Capital Population
0 Belgium Brussels 11190846
1 India New Delhi 1303171035
2 Brazil Brasília 207847528
Country Capital Population
0 1.0 2.0 1.0
1 3.0 3.0 3.0
2 2.0 1.0 2.0

I/O

Notice:

  1. 对于CSV和Excel来说:使用pandas读取到的文件为DataFrame
  2. 对于SQL来说:在Pandas的pd.read_sql()函数中,结果可以是不同的数据结构,取决于查询和它的执行方式. eg. 一个单一的值(比如计数),一组值(比如名字的列表),一个类似表的结构(比如SELECT语句的结果集)。

CSV

Aim:
读取1.csv文件,经过DataFrame处理后,以2.csv的形式输出

# Read

# nrows = number of rows

# 用pandas读取csv -> 存入DataFrame中
# 因为pd.read_csv()方法产生的数据结构是一个DataFrame

# header=0或者None: 文件第一行作为标题行

df = pd.read_csv('1.csv', header = None, nrows = 3)

df
# Write
# 将DataFrame写入CSV文件中
df.to_csv('2.csv')
0 1 2 3 4 5 6 7 8 9 ... 26 27 28 29 30 31 32 33 34 35
0 respondent_id h1n1_concern h1n1_knowledge behavioral_antiviral_meds behavioral_avoidance behavioral_face_mask behavioral_wash_hands behavioral_large_gatherings behavioral_outside_home behavioral_touch_face ... income_poverty marital_status rent_or_own employment_status hhs_geo_region census_msa household_adults household_children employment_industry employment_occupation
1 26707 2 2 0 1 0 1 1 0 1 ... > $75,000 Not Married Rent Employed mlyzmhmf MSA, Not Principle City 1 0 atmlpfrs hfxkjkmi
2 26708 1 1 0 0 0 0 0 0 0 ... Below Poverty Not Married Rent Employed bhuqouqj Non-MSA 3 0 atmlpfrs xqwwgdyp

3 rows × 36 columns

Excel

  1. 如果一个xlsx文件 - 只有一个sheet
# Read
df = pd.read_excel('3.xlsx')

# Write
df.to_excel('4.xlsx', sheet_name = 'new')
  1. 如果一个xlsx文件 - 有多个sheet
# 从一个excel中 - 读取所有sheet
sheets = pd.ExcelFile('5.xls')

# Read
df1 = pd.read_excel(sheets, 'sheet_1')
df2 = pd.read_excel(sheets, 'sheet_2')

# Write
merged_df = pd.concat([df1, df2], axis=0) # Merge the dataframes
merged_df.to_excel('merged_file.xlsx', index=False) # Write to new xlsx file

SQL Query/Database Table

存在问题,需要不断安装sqlalchemy,可能是虚拟环境的问题
jupyter kernelspec list
Available kernels:
pycharm-616d70e3 /Users/huang/Library/Jupyter/kernels/pycharm-616d70e3
python3 /Users/huang/opt/anaconda3/envs/WebAppFlask3_9_MachineLearning/share/jupyter/kernels/python3

Databricks也无法运行

官方文档:https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html

image.png

# !pip install sqlalchemy
!conda install sqlalchemy

# >>> from sqlalchemy import create_engine

# >>> engine = create_engine( )

# >>> pd.read_sql( , engine)

# >>> pd.read_sql_table( , engine)

# >>> pd.read_sql_query( , engine)
# read_sql() is a convenience wrapper around read_sql_table() andread_sql_query()
# >>> df.to_sql( , en 'myDf' gine)
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /Users/huang/opt/anaconda3

  added / updated specs:
    - sqlalchemy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    boltons-23.0.0             |   py39hecd8cb5_0         423 KB
    conda-23.3.1               |   py39hecd8cb5_0         962 KB
    jsonpatch-1.32             |     pyhd3eb1b0_0          15 KB
    jsonpointer-2.1            |     pyhd3eb1b0_0           9 KB
    ------------------------------------------------------------
                                           Total:         1.4 MB

The following NEW packages will be INSTALLED:

  boltons            pkgs/main/osx-64::boltons-23.0.0-py39hecd8cb5_0 
  jsonpatch          pkgs/main/noarch::jsonpatch-1.32-pyhd3eb1b0_0 
  jsonpointer        pkgs/main/noarch::jsonpointer-2.1-pyhd3eb1b0_0 

The following packages will be UPDATED:

  conda                               23.1.0-py39hecd8cb5_0 --> 23.3.1-py39hecd8cb5_0 


Proceed ([y]/n)? 

Selection

Getting

print(s)
s['b']
a    1
b    2
c    3
d    4
dtype: int64





2
df
df[1:] # get subset
Country Capital Population
0 Belgium Brussels 11190846
1 India New Delhi 1303171035
2 Brazil Brasília 207847528
Country Capital Population
1 India New Delhi 1303171035
2 Brazil Brasília 207847528

Selecting, Boolean Indexing & Setting

# by position
# 选择(0,0)的元素
df
df.iloc[0,0]
df.iat[0,0]
Country Capital Population
0 Belgium Brussels 11190846
1 India New Delhi 1303171035
2 Brazil Brasília 207847528
'Belgium'






'Belgium'
# By Label
df.loc[0, 'Country'] #选择第0行Country列
df.at[0, 'Country']

# Notice:
# 在访问单个标量值时,.at比.loc快,但它只适用于一次访问一个值。

'Belgium'






'Belgium'
#  Boolean Indexing

s

# 选择Series对象s中所有<=1的元素
# ~ 取反操作
s[~(s>1)]

s[ (s<-1) | (s>2) ]

df['Population']>1200000000  # -> 返回的是一个bool类型的Series对象
df[ df['Population']>1200000000 ]

a    1
b    2
c    3
d    4
dtype: int64






a    1
dtype: int64






c    3
d    4
dtype: int64






0    False
1     True
2    False
Name: Population, dtype: bool
Country Capital Population
1 India New Delhi 1303171035
# 修改索引a的值为6
s
s['a'] = 6 #  结果会被修改 - 会直接在原始的Series对象上进行更改,而不会创建新的对象
s

a    1
b    2
c    3
d    4
dtype: int64






a    6
b    2
c    3
d    4
dtype: int64
s[0:2] # 左闭右开
a    6
b    2
dtype: int64

loc和iloc区别

  1. loc: location - df.loc[0, 'Country']
  2. iloc: integer location - df.iloc[0,0]

image.png

## Retrieving Series/DataFrame Information
df
df.shape
df.index
df.columns
df.info()
df.count()
Country Capital Population
0 Belgium Brussels 11190846
1 India New Delhi 1303171035
2 Brazil Brasília 207847528
(3, 3)






RangeIndex(start=0, stop=3, step=1)






Index(['Country', 'Capital', 'Population'], dtype='object')



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Country     3 non-null      object
 1   Capital     3 non-null      object
 2   Population  3 non-null      int64 
dtypes: int64(1), object(2)
memory usage: 200.0+ bytes





Country       3
Capital       3
Population    3
dtype: int64
### Basic Information


### Summary


## Applying Functions

## Data Alignment
### Internal Data Alignment

### Arithmetic Operations with Fill Methods

https://www.datacamp.com/search?q=pandas