Pandas数据分析实战(Pandas in action)第1章 Pandas 概述

发布时间 2023-12-16 21:04:32作者: 熠然

Pandas 数据分析实战

第一章 Pandas 概述

  • read_csv() 没有设置索引列

    read_csv 函数导入 movies.csv 文件,由于没有设置索引,Pandas 会生产一个从 0 开始的数字索引

    movies = pd.read_csv('./file/chapter_01/movies.csv')
    print(movies)
    
         Rank                         Title           Studio       Gross  Year
    0       1             Avengers: Endgame      Buena Vista  $2,796.30   2019
    1       2                        Avatar              Fox  $2,789.70   2009
    2       3                       Titanic        Paramount  $2,187.50   1997
    3       4  Star Wars: The Force Awakens      Buena Vista  $2,068.20   2015
    4       5        Avengers: Infinity War      Buena Vista  $2,048.40   2018
    ..    ...                           ...              ...         ...   ...
    777   778                     Yogi Bear  Warner Brothers    $201.60   2010
    778   779           Garfield: The Movie              Fox    $200.80   2004
    779   780                   Cats & Dogs  Warner Brothers    $200.70   2001
    780   781      The Hunt for Red October        Paramount    $200.50   1990
    781   782                      Valkyrie              MGM    $200.30   2008
    
    [782 rows x 5 columns]
    

    为避免显示结果过多导致页面过长,Pandas 仅显示 DataFrame 的前五行和后五行,省略的数据用 ... 代替

  • read_csv() 指定索引列(index_col)

    使用 Title 作为索引

    movies = pd.read_csv("./file/chapter_01/movies.csv", index_col="Title")
    print(movies)
    
                                  Rank           Studio       Gross  Year
    Title
    Avengers: Endgame                1      Buena Vista  $2,796.30   2019
    Avatar                           2              Fox  $2,789.70   2009
    Titanic                          3        Paramount  $2,187.50   1997
    Star Wars: The Force Awakens     4      Buena Vista  $2,068.20   2015
    Avengers: Infinity War           5      Buena Vista  $2,048.40   2018
    ...                            ...              ...         ...   ...
    Yogi Bear                      778  Warner Brothers    $201.60   2010
    Garfield: The Movie            779              Fox    $200.80   2004
    Cats & Dogs                    780  Warner Brothers    $200.70   2001
    The Hunt for Red October       781        Paramount    $200.50   1990
    Valkyrie                       782              MGM    $200.30   2008
    
    [782 rows x 4 columns]
    
  • 操作 DataFrame

    • head():

      从开头提取几行,默认是 5 行 ,可以选择传参,比如 head(4),获取前 4 行数据

      movies.head()
      
                                  Rank       Studio       Gross  Year
      Title
      Avengers: Endgame                1  Buena Vista  $2,796.30   2019
      Avatar                           2          Fox  $2,789.70   2009
      Titanic                          3    Paramount  $2,187.50   1997
      Star Wars: The Force Awakens     4  Buena Vista  $2,068.20   2015
      Avengers: Infinity War           5  Buena Vista  $2,048.40   2018
      
    • tail()

      查看数据集的尾部数据,默认是 5 行,可以选择传参,比如 tail(6),获取尾部 6 行数据

      movies.tail()
      
                                Rank           Studio     Gross  Year
      Title
      Yogi Bear                  778  Warner Brothers  $201.60   2010
      Garfield: The Movie        779              Fox  $200.80   2004
      Cats & Dogs                780  Warner Brothers  $200.70   2001
      The Hunt for Red October   781        Paramount  $200.50   1990
      Valkyrie                   782              MGM  $200.30   2008
      
    • len()

      查询 DataFrame 有多少行

      len(movies)
      
      782
      
    • 属性:shape

      查询 DataFrame 的行数和列数

      movies.shape
      
      (782, 5)
      
    • 属性:size

      查询单元格总数

      movies.size
      
        3910
      
    • 属性:dtypes

      查询列的数据类型,int64 表示整数列,object 表示文本列

      movies.dtypes
      
      Rank       int64
      Title     object
      Studio    object
      Gross     object
      Year       int64
      dtype: object
      
    • iloc[]

      根据索引位置,提取信息。索引从 0 开始计算。返回一个 Series 对象。

      movies.iloc[499]
      
        Rank                              500
        Title     Maze Runner: The Death Cure
        Studio                            Fox
        Gross                        $288.30
        Year                             2018
        Name: 499, dtype: object
      
    • loc[]

      使用索引标签提取数据,在导入数据的时候,设置了 Title 作为索引

      movies.loc["Forrest Gump"]
      
      Rank            119
      Studio    Paramount
      Gross      $677.90
      Year           1994
      Name: Forrest Gump, dtype: object
      

      索引标签可以包含重复项。尽可能保持索引标签的唯一性,唯一的标签集合可以加快 Pandas 定位和提取特定行的速度

      movies.loc["101 Dalmatians"]
      
                      Rank       Studio     Gross  Year
      Title
      101 Dalmatians   425  Buena Vista  $320.70   1996
      101 Dalmatians   708  Buena Vista  $215.90   1961
      
    • sort_values()

      按 Year 列中的值进行排序,取最新上映的 5 部电影 。by 指定列名或者索引值, ascending 是否按指定列的数组升序排序,默认为 True,即升序排列

      movies.sort_values(by='Year', ascending= False).head()
      
                                                  Rank  ...  Year
      Title                                             ...
      Avengers: Endgame                              1  ...  2019
      John Wick: Chapter 3 - Parabellum            458  ...  2019
      The Wandering Earth                          114  ...  2019
      Toy Story 4                                  198  ...  2019
      How to Train Your Dragon: The Hidden World   199  ...  2019
      
      [5 rows x 4 columns]
      

      还可以根据多列的值对 DateFrame 进行排序,先按 Studio 列的值排序,如果出现重复值,再按 Year 列的值对电影进行排序

      movies.sort_values(by=['Studio','Year']).head()
      
                              Rank       Studio     Gross  Year
      Title
      The Blair Witch Project   588      Artisan  $248.60   1999
      101 Dalmatians            708  Buena Vista  $215.90   1961
      The Jungle Book           755  Buena Vista  $205.80   1967
      Who Framed Roger Rabbit   410  Buena Vista  $329.80   1988
      Dead Poets Society        636  Buena Vista  $235.90   1989
      
    • sort_index()

      可以按照字母顺序查看电影信息(设置了 Title 为索引)

      movies.sort_index().head()
      
                        Rank           Studio     Gross  Year
      Title
      10,000 B.C.        536  Warner Brothers  $269.80   2008
      101 Dalmatians     708      Buena Vista  $215.90   1961
      101 Dalmatians     425      Buena Vista  $320.70   1996
      2 Fast 2 Furious   632        Universal  $236.40   2003
      2012                93             Sony  $769.70   2009
      
  • 计算 Series 中的值

    从 DataFrame 中提取一列数据作为 Series。Pandas 在 Series 中保存了 DataFrame 的索引,即设置的 Title

    movies['Studio']
    
    Title
    Avengers: Endgame                   Buena Vista
    Avatar                                      Fox
    Titanic                               Paramount
    Star Wars: The Force Awakens        Buena Vista
    Avengers: Infinity War              Buena Vista
                                         ...       
    Yogi Bear                       Warner Brothers
    Garfield: The Movie                         Fox
    Cats & Dogs                     Warner Brothers
    The Hunt for Red October              Paramount
    Valkyrie                                    MGM
    Name: Studio, Length: 782, dtype: object
    

    Series 中有多行数据,Pandas 会截断数据集,只显示前五行和后五行

    计算 Series 列中唯一值出现的次数,将结果限制在排名前10的电源公司:

    movies['Studio'].value_counts().head(10)
    
    Studio
    Warner Brothers    132
    Buena Vista        125
    Fox                117
    Universal          109
    Sony                86
    Paramount           76
    Dreamworks          27
    Lionsgate           21
    New Line            16
    MGM                 11
    Name: count, dtype: int64
    

    返回值是另一个 Series 对象。此时,Pandas 使用 Studio 列中的公司名称作为索引标签,他们的计算作为Series 值

  • 根据一个或多个条件筛选列

    movies[movies["Studio"] == "Universal"]
    
                                    Rank     Studio       Gross  Year
    Title                                                            
    Jurassic World                     6  Universal  $1,671.70   2015
    Furious 7                          8  Universal  $1,516.00   2015
    Jurassic World: Fallen Kingdom    13  Universal  $1,309.50   2018
    The Fate of the Furious           17  Universal  $1,236.00   2017
    Minions                           19  Universal  $1,159.40   2015
    ...                              ...        ...         ...   ...
    The Break-Up                     763  Universal    $205.00   2006
    Everest                          766  Universal    $203.40   2015
    Patch Adams                      772  Universal    $202.30   1998
    Kindergarten Cop                 775  Universal    $202.00   1990
    Straight Outta Compton           776  Universal    $201.60   2015
    
    [109 rows x 4 columns]
    

    可以将过滤条件赋给一个变量

    filter_by_universal = movies["Studio"] == "Universal"
    movies[filter_by_universal]
    
                                    Rank     Studio       Gross  Year
    Title                                                            
    Jurassic World                     6  Universal  $1,671.70   2015
    Furious 7                          8  Universal  $1,516.00   2015
    Jurassic World: Fallen Kingdom    13  Universal  $1,309.50   2018
    The Fate of the Furious           17  Universal  $1,236.00   2017
    Minions                           19  Universal  $1,159.40   2015
    

    还可以根据多个条件过滤DataFrame行,获取2015年由 Universal 发行的所有电影

    filter_by_universal = movies["Studio"] == "Universal"
    filter_by_year = movies['Year'] == 2015
    movies[filter_by_universal & filter_by_year]
    
                            Rank     Studio       Gross  Year
    Title                                                    
    Jurassic World             6  Universal  $1,671.70   2015
    Furious 7                  8  Universal  $1,516.00   2015
    Minions                   19  Universal  $1,159.40   2015
    Fifty Shades of Grey     165  Universal    $571.00   2015
    Pitch Perfect 2          504  Universal    $287.50   2015
    Ted 2                    702  Universal    $216.70   2015
    Everest                  766  Universal    $203.40   2015
    Straight Outta Compton   776  Universal    $201.60   2015
    

    若要筛选符合以下两个条件中任意一个的电源:Universal 发行的电源或2015年发行的电影

    filter_by_universal = movies["Studio"] == "Universal"
    filter_by_year = movies['Year'] == 2015
    movies[filter_by_universal | filter_by_year]
    
                                    Rank       Studio       Gross  Year
    Title                                                              
    Star Wars: The Force Awakens       4  Buena Vista  $2,068.20   2015
    Jurassic World                     6    Universal  $1,671.70   2015
    Furious 7                          8    Universal  $1,516.00   2015
    Avengers: Age of Ultron            9  Buena Vista  $1,405.40   2015
    Jurassic World: Fallen Kingdom    13    Universal  $1,309.50   2018
    ...                              ...          ...         ...   ...
    The Break-Up                     763    Universal    $205.00   2006
    Everest                          766    Universal    $203.40   2015
    Patch Adams                      772    Universal    $202.30   1998
    Kindergarten Cop                 775    Universal    $202.00   1990
    Straight Outta Compton           776    Universal    $201.60   2015
    
    [140 rows x 4 columns]
    

    可以将小于或大于特定值的列值作为筛选条件。以1975年之前发行的电影为筛选条件

    filter_by_1975 = movies['Year'] < 1975
    movies[filter_by_1975]
    
                        Rank           Studio     Gross  Year
    Title                                                    
    The Exorcist         252  Warner Brothers  $441.30   1973
    Gone with the Wind   288              MGM  $402.40   1939
    Bambi                540              RKO  $267.40   1942
    The Godfather        604        Paramount  $245.10   1972
    101 Dalmatians       708      Buena Vista  $215.90   1961
    The Jungle Book      755      Buena Vista  $205.80   1967
    

    指定一个范围,所有值必须在该范围之内。筛选1983-1986年发行的电影

    year__between = movies["Year"].between(1983, 1986)
    print(movies[year__between])
    
                                          Rank     Studio     Gross  Year
    Title                                                                
    Return of the Jedi                     222        Fox  $475.10   1983
    Back to the Future                     311  Universal  $381.10   1985
    Top Gun                                357  Paramount  $356.80   1986
    Indiana Jones and the Temple of Doom   403  Paramount  $333.10   1984
    Crocodile Dundee                       413  Paramount  $328.20   1986
    Beverly Hills Cop                      432  Paramount  $316.40   1984
    Rocky IV                               467        MGM  $300.50   1985
    Rambo: First Blood Part II             469    TriStar  $300.40   1985
    Ghostbusters                           485   Columbia  $295.20   1984
    Out of Africa                          662  Universal  $227.50   1985
    

    使用 DataFrame 索引来过滤。查找标题中带有“dark"一词的所有电影。先将索引的电影标题中的大写字母转换为小写字母。

    str_lower__str_contains = movies.index.str.lower().str.contains("dark")
    print(movies[str_lower__str_contains])
    
                                    Rank           Studio       Gross  Year
    Title                                                                  
    Transformers: Dark of the Moon    23        Paramount  $1,123.80   2011
    The Dark Knight Rises             27  Warner Brothers  $1,084.90   2012
    The Dark Knight                   39  Warner Brothers  $1,004.90   2008
    Thor: The Dark World             132      Buena Vista    $644.60   2013
    Star Trek Into Darkness          232        Paramount    $467.40   2013
    Fifty Shades Darker              309        Universal    $381.50   2017
    Dark Shadows                     600  Warner Brothers    $245.50   2012
    Dark Phoenix                     603              Fox    $245.10   2019
    
  • 对数据分组

    筛选出总票房最高的电影公司,首先按电影公司名称汇总 Gross 列的值。注意 Gross 列的值存储为文本而不是数字。Pandas 将列的值作为文本导入,以保留原始 CSV 中的美元符号和逗号符合,需要删除美元符号和逗号符合,然后将列的值转换为数字

    movies['Gross'].str.replace("$", "", regex=False).str.replace(",", "", regex=False)
    
    Title
    Avengers: Endgame               2796.30 
    Avatar                          2789.70 
    Titanic                         2187.50 
    Star Wars: The Force Awakens    2068.20 
    Avengers: Infinity War          2048.40 
                                      ...   
    Yogi Bear                        201.60 
    Garfield: The Movie              200.80 
    Cats & Dogs                      200.70 
    The Hunt for Red October         200.50 
    Valkyrie                         200.30 
    Name: Gross, Length: 782, dtype: object
    

    可以看到 dtype 类型为 object ,可以转换为浮点数

    movies['Gross'].str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype(float)
    
    Title
    Avengers: Endgame               2796.3
    Avatar                          2789.7
    Titanic                         2187.5
    Star Wars: The Force Awakens    2068.2
    Avengers: Infinity War          2048.4
                                     ...  
    Yogi Bear                        201.6
    Garfield: The Movie              200.8
    Cats & Dogs                      200.7
    The Hunt for Red October         200.5
    Valkyrie                         200.3
    Name: Gross, Length: 782, dtype: float64
    

    以上操作都是在原始数据结构的副本上进行的操作,需要修改原始数据,需要进行覆盖操作

    gross_series = movies['Gross'].str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype(float)
    movies['Gross'] = gross_series
    print(movies.dtypes)
    
    Rank        int64
    Studio     object
    Gross     float64
    Year        int64
    dtype: object
    

    计算电影的平均票房收入

    movies['Gross'].mean()
    
    439.0308184143222
    

    计算每个电影公司的总票房收入

    先对电影公司进行分组

    studios = movies.groupby("Studio")
    

    对电影出票数量由多到少的顺序排序

    studios['Gross'].count().sort_values(ascending=False).head()
    
    Studio
    Warner Brothers    132
    Buena Vista        125
    Fox                117
    Universal          109
    Sony                86
    Name: Gross, dtype: int64
    

    计算每个电影公司的总票房,找出票房最高的5家公司

    studios['Gross'].sum().sort_values(ascending=False).head()
    
    Studio
    Buena Vista        73585.0
    Warner Brothers    58643.8
    Fox                50420.8
    Universal          44302.3
    Sony               32822.5
    Name: Gross, dtype: float64