pandas-526互联

1 Set-up
2 Intro
3 Pandas Data Structures - 2 Types
4 Dropping
5 Sort & Rank
6 I/O
7 Selection
8 Learning Link

Set-up

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"  # 如果不需要，则修改为none即可。删除该段代码不起作用。

Intro

The Pandas library is built on NumPy
provides data structures and data analysis tools.

import pandas as pd

Pandas切片左闭右开

Pandas Data Structures - 2 Types

Notice:
当同时对s和df进行操作的时候:

s会直接在原始的Series对象上进行更改，而不会创建新的对象 (Series是可变对象)
而df并不会修改原始数据，除非进行赋值操作，或者设置了参数inplace=True（非破坏操作)

Series

(my understanding: ordinary/normal data) )

A one-dimensional labeled array

capable of holding any data type

 s = pd.Series([1, 2, 3, 4],
              index = ['a', 'b', 'c', 'd'])
print(s)

a    1
b    2
c    3
d    4
dtype: int64

DataFrame

two-dimensional labeled data structure

columns of different types

大多的DataFrame操作都会自动返回一个新的 DataFrame 对象

# Series
data = {
    'Country': ['Belgium', 'India', 'Brazil'],
    'Capital': ['Brussels', 'New Delhi', 'Brasília'],
    'Population': [11190846, 1303171035, 207847528]
    }

print(data)

# series -> DataFrame
# index默认从0开始递增
# 修改默认行索引：index=['foo', 'bar', 'baz']
df = pd.DataFrame(data,
                columns = ['Country', 'Capital', 'Population'])
df # 原始的 DataFrame 对象

# **注意**
# 对df.__任何操作都不会修改它本身数据，除非进行了赋值
# eg. df = df.drop(0)

{'Country': ['Belgium', 'India', 'Brazil'], 'Capital': ['Brussels', 'New Delhi', 'Brasília'], 'Population': [11190846, 1303171035, 207847528]}

	Country	Capital	Population
0	Belgium	Brussels	11190846
1	India	New Delhi	1303171035
2	Brazil	Brasília	207847528

Series VS DataFrame

Dropping

# drop rows from series: index = 'a' and 'c' 
# 删除指定行
s.drop( ['a', 'c'] ) 
s

df

# drop columns
# 删除指定列 axis=1删除列，axis=0删除行
df.drop('Country', axis=1)
df

# 删除第一行数据, index=0 (不是标题行)
df.drop(0)
df

# Notice
# 看以下输出，数据本身并没有改变
# drop方法会自动返回一个 经过删除操作的 Series/DataFrame
# 如果想改变原有数据，参数中添加：inplace=True

b    2
d    4
dtype: int64






a    1
b    2
c    3
d    4
dtype: int64

	Country	Capital	Population
0	Belgium	Brussels	11190846
1	India	New Delhi	1303171035
2	Brazil	Brasília	207847528

	Capital	Population
0	Brussels	11190846
1	New Delhi	1303171035
2	Brasília	207847528

	Country	Capital	Population
0	Belgium	Brussels	11190846
1	India	New Delhi	1303171035
2	Brazil	Brasília	207847528

	Country	Capital	Population
1	India	New Delhi	1303171035
2	Brazil	Brasília	207847528

	Country	Capital	Population
0	Belgium	Brussels	11190846
1	India	New Delhi	1303171035
2	Brazil	Brasília	207847528

Sort & Rank

df
df.sort_index() # 按照index排序，index默认从0开始递增，所以这里看不出变化
df.sort_values( by='Country' ) # sory by 'values'

	Country	Capital	Population
0	Belgium	Brussels	11190846
1	India	New Delhi	1303171035
2	Brazil	Brasília	207847528

	Country	Capital	Population
0	Belgium	Brussels	11190846
1	India	New Delhi	1303171035
2	Brazil	Brasília	207847528

	Country	Capital	Population
0	Belgium	Brussels	11190846
2	Brazil	Brasília	207847528
1	India	New Delhi	1303171035

df
# 给每个元素在该列进行排序
df.rank() # assign ranks to entires

	Country	Capital	Population
0	Belgium	Brussels	11190846
1	India	New Delhi	1303171035
2	Brazil	Brasília	207847528

	Country	Capital	Population
0	1.0	2.0	1.0
1	3.0	3.0	3.0
2	2.0	1.0	2.0

I/O

Notice:

对于CSV和Excel来说：使用pandas读取到的文件为DataFrame
对于SQL来说：在Pandas的pd.read_sql()函数中，结果可以是不同的数据结构，取决于查询和它的执行方式. eg. 一个单一的值（比如计数），一组值（比如名字的列表），一个类似表的结构（比如SELECT语句的结果集）。

CSV

Aim:
读取1.csv文件，经过DataFrame处理后，以2.csv的形式输出

# Read

# nrows = number of rows

# 用pandas读取csv -> 存入DataFrame中
# 因为pd.read_csv()方法产生的数据结构是一个DataFrame

# header=0或者None: 文件第一行作为标题行

df = pd.read_csv('1.csv', header = None, nrows = 3)

df
# Write
# 将DataFrame写入CSV文件中
df.to_csv('2.csv')

	0	1	2	3	4	5	6	7	8	9	...	26	27	28	29	30	31	32	33	34	35
0	respondent_id	h1n1_concern	h1n1_knowledge	behavioral_antiviral_meds	behavioral_avoidance	behavioral_face_mask	behavioral_wash_hands	behavioral_large_gatherings	behavioral_outside_home	behavioral_touch_face	...	income_poverty	marital_status	rent_or_own	employment_status	hhs_geo_region	census_msa	household_adults	household_children	employment_industry	employment_occupation
1	26707	2	2	0	1	0	1	1	0	1	...	> $75,000	Not Married	Rent	Employed	mlyzmhmf	MSA, Not Principle City	1	0	atmlpfrs	hfxkjkmi
2	26708	1	1	0	0	0	0	0	0	0	...	Below Poverty	Not Married	Rent	Employed	bhuqouqj	Non-MSA	3	0	atmlpfrs	xqwwgdyp

3 rows × 36 columns

Excel

如果一个xlsx文件 - 只有一个sheet

# Read
df = pd.read_excel('3.xlsx')

# Write
df.to_excel('4.xlsx', sheet_name = 'new')

如果一个xlsx文件 - 有多个sheet

# 从一个excel中 - 读取所有sheet
sheets = pd.ExcelFile('5.xls')

# Read
df1 = pd.read_excel(sheets, 'sheet_1')
df2 = pd.read_excel(sheets, 'sheet_2')

# Write
merged_df = pd.concat([df1, df2], axis=0) # Merge the dataframes
merged_df.to_excel('merged_file.xlsx', index=False) # Write to new xlsx file

SQL Query/Database Table

存在问题，需要不断安装sqlalchemy，可能是虚拟环境的问题
jupyter kernelspec list
Available kernels:
pycharm-616d70e3 /Users/huang/Library/Jupyter/kernels/pycharm-616d70e3
python3 /Users/huang/opt/anaconda3/envs/WebAppFlask3_9_MachineLearning/share/jupyter/kernels/python3

Databricks也无法运行

官方文档：https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html

# !pip install sqlalchemy
!conda install sqlalchemy

# >>> from sqlalchemy import create_engine

# >>> engine = create_engine( )

# >>> pd.read_sql( , engine)

# >>> pd.read_sql_table( , engine)

# >>> pd.read_sql_query( , engine)
# read_sql() is a convenience wrapper around read_sql_table() andread_sql_query()
# >>> df.to_sql( , en 'myDf' gine)

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /Users/huang/opt/anaconda3

  added / updated specs:
    - sqlalchemy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    boltons-23.0.0             |   py39hecd8cb5_0         423 KB
    conda-23.3.1               |   py39hecd8cb5_0         962 KB
    jsonpatch-1.32             |     pyhd3eb1b0_0          15 KB
    jsonpointer-2.1            |     pyhd3eb1b0_0           9 KB
    ------------------------------------------------------------
                                           Total:         1.4 MB

The following NEW packages will be INSTALLED:

  boltons            pkgs/main/osx-64::boltons-23.0.0-py39hecd8cb5_0 
  jsonpatch          pkgs/main/noarch::jsonpatch-1.32-pyhd3eb1b0_0 
  jsonpointer        pkgs/main/noarch::jsonpointer-2.1-pyhd3eb1b0_0 

The following packages will be UPDATED:

  conda                               23.1.0-py39hecd8cb5_0 --> 23.3.1-py39hecd8cb5_0 


Proceed ([y]/n)?

Selection

Getting

print(s)
s['b']

a    1
b    2
c    3
d    4
dtype: int64





2

df
df[1:] # get subset

	Country	Capital	Population
0	Belgium	Brussels	11190846
1	India	New Delhi	1303171035
2	Brazil	Brasília	207847528

	Country	Capital	Population
1	India	New Delhi	1303171035
2	Brazil	Brasília	207847528

Selecting, Boolean Indexing & Setting

# by position
# 选择(0,0)的元素
df
df.iloc[0,0]
df.iat[0,0]

	Country	Capital	Population
0	Belgium	Brussels	11190846
1	India	New Delhi	1303171035
2	Brazil	Brasília	207847528

'Belgium'






'Belgium'

# By Label
df.loc[0, 'Country'] #选择第0行Country列
df.at[0, 'Country']

# Notice:
# 在访问单个标量值时，.at比.loc快，但它只适用于一次访问一个值。

'Belgium'






'Belgium'

#  Boolean Indexing

s

# 选择Series对象s中所有<=1的元素
# ~ 取反操作
s[~(s>1)]

s[ (s<-1) | (s>2) ]

df['Population']>1200000000  # -> 返回的是一个bool类型的Series对象
df[ df['Population']>1200000000 ]

a    1
b    2
c    3
d    4
dtype: int64






a    1
dtype: int64






c    3
d    4
dtype: int64






0    False
1     True
2    False
Name: Population, dtype: bool

	Country	Capital	Population
1	India	New Delhi	1303171035

# 修改索引a的值为6
s
s['a'] = 6 #  结果会被修改 - 会直接在原始的Series对象上进行更改，而不会创建新的对象
s

a    1
b    2
c    3
d    4
dtype: int64






a    6
b    2
c    3
d    4
dtype: int64

s[0:2] # 左闭右开

a    6
b    2
dtype: int64

loc和iloc区别

loc: location - df.loc[0, 'Country']
iloc: integer location - df.iloc[0,0]

## Retrieving Series/DataFrame Information
df
df.shape
df.index
df.columns
df.info()
df.count()

	Country	Capital	Population
0	Belgium	Brussels	11190846
1	India	New Delhi	1303171035
2	Brazil	Brasília	207847528

(3, 3)






RangeIndex(start=0, stop=3, step=1)






Index(['Country', 'Capital', 'Population'], dtype='object')



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Country     3 non-null      object
 1   Capital     3 non-null      object
 2   Population  3 non-null      int64 
dtypes: int64(1), object(2)
memory usage: 200.0+ bytes





Country       3
Capital       3
Population    3
dtype: int64

### Basic Information

### Summary

## Applying Functions

## Data Alignment

### Internal Data Alignment

### Arithmetic Operations with Fill Methods

Learning Link

https://www.datacamp.com/search?q=pandas

matplotlib python pandas

数据pandas

526互联