Elice AI Track Week 12 - Getting Started with Data Analysis in Python II (3/18)🔥

Max 2021. 3. 27. 01:32

✔ Week 12. Core Data Analysis Techniques

<Learning Objectives>

Be able to use the Python libraries Pandas, NumPy, and Matplotlib.
Be able to analyze data using Python libraries.
Understand how to analyze and visualize data.

[01 Advanced Pandas]

1. Searching by condition
- Masking operations work just as they do with NumPy arrays

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(5, 2), columns=['A', 'B'])
print(df)
#           A         B
# 0  0.609329  0.841142
# 1  0.451572  0.712937
# 2  0.555806  0.667044
# 3  0.113700  0.202053
# 4  0.348867  0.153225

# apply a masking operation
print(df[(df['A'] < 0.5) & (df['B'] > 0.3)])
#           A         B
# 1  0.451572  0.712937

# use the query method
print(df.query("A < 0.5 and B > 0.3"))
#           A         B
# 1  0.451572  0.712937

- For strings, other methods are available for condition search (e.g. str.contains(), str.match())
📌 Reference (to be written up separately later!!)
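
As a minimal sketch of the string-based search mentioned above, assuming a small made-up DataFrame (the `name` column and its values are hypothetical, not course data): `str.contains()` is true when the pattern occurs anywhere in the string, while `str.match()` is true only when the pattern matches from the start.

```python
import pandas as pd

# Hypothetical sample data for illustration only
df = pd.DataFrame({"name": ["apple pie", "pineapple", "banana", "apple"]})

# str.contains: True when the pattern occurs anywhere in the string
contains_apple = df[df["name"].str.contains("apple")]

# str.match: True only when the pattern matches at the start of the string
starts_with_apple = df[df["name"].str.match("apple")]

print(contains_apple["name"].tolist())
print(starts_with_apple["name"].tolist())
```

Both methods return a boolean Series, so the result can be combined with other masks using & and |.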

2. Processing data with functions
- Use apply to process data with a function

df = pd.DataFrame(np.arange(5), columns=["Num"])
def square(x):
    return x**2
df["Num"].apply(square)  # returns a new Series; df itself is not modified
print(df)
#     Num
# 0    0
# 1    1
# 2    2
# 3    3
# 4    4 

df["Square"] = df.Num.apply(lambda x: x ** 2) # == df["Square"] = df["Num"].apply(square)
print(df["Square"])
# 0     0
# 1     1
# 2     4
# 3     9
# 4    16
# Name: Square, dtype: int64

- replace: use this when you only need to substitute data values instead of applying a function

df = pd.DataFrame(["Male", "Male", "Female", "Female", "Male"], columns=["Sex"])
print(df)
#     Sex
# 0    Male
# 1    Male
# 2  Female
# 3  Female
# 4    Male

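# Assign the result back to the column to update the DataFrame
df["Sex"] = df["Sex"].replace({"Male": 0, "Female": 1})
# The course used df.Sex.replace({"Male": 0, "Female": 1}, inplace=True) to change the values
# without reassigning; recent pandas versions may not propagate that change back to df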
print(df)
#     Sex
# 0    0
# 1    0
# 2    1
# 3    1
# 4    0

3. Grouping
- Use when you want to aggregate conditionally (per group)

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'], 'data': range(6)})
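df.groupby('key') # only grouped at this point; nothing is computed yet
# <pandas.core.groupby.groupby.DataFrameGroupBy object at 0x10e3588>

print(df.groupby('key').sum())
#      data
# key
# A       3
# B       5
# C       7

# Several keys can be passed as a list. This df has no 'data1' column,
# so run this on the DataFrame with data1 and data2 defined in the aggregate example.
# print(df.groupby(['key','data1']).sum())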

- aggregate

df = pd.DataFrame({
    'key': ['A', 'B', 'C', 'A', 'B', 'C'],
    'data1': [1, 2, 3, 1, 2, 3],
    'data2': [4, 4, 6, 0, 6, 1]
})

# aggregate: compute several aggregations at once
print(df.groupby('key').aggregate(['min', 'median', 'max']))
#     data1            data2
#       min median max   min median max
# key
# A       1    1.0   1     0    2.0   4
# B       2    2.0   2     4    5.0   6
# C       3    3.0   3     1    3.5   6

print(df.groupby('key').aggregate({'data1': 'min', 'data2': 'sum'}))
#        data1  data2
# key              
# A        1      4
# B        2     10
# C        3      7

# filter: keep or drop whole groups based on a property of each group
def filter_by_mean(x):
    return x['data2'].mean() > 3
print(df.groupby('key').mean())
#      data1  data2
# key              
# A      1.0    2.0
# B      2.0    5.0
# C      3.0    3.5

print(df.groupby('key').filter(filter_by_mean))
#    key   data1  data2
# 1   B      2      4
# 2   C      3      6
# 4   B      2      6
# 5   C      3      1


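# apply: apply a function to each group produced by groupby
# (only the numeric columns are selected, because subtracting the string 'key' column would fail)
print(df.groupby('key')[['data1', 'data2']].apply(lambda x: x.max() - x.min()))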
#        data1  data2
# key              
# A        0      4
# B        0      2
# C        0      5


# get_group: fetch the rows of one group by its key value
print(df.groupby('key').get_group("A"))
#    key   data1  data2
# 0   A      1      4
# 3   A      1      0

3. MultiIndex & pivot_table
- MultiIndex: used to build a hierarchical index (works for both rows and columns). Use loc and iloc to look up entries.

df1 = pd.DataFrame(
    np.random.randn(4, 2), 
    index=[['A', 'A', 'B', 'B'], [1, 2, 1, 2]],
    columns=['data1', 'data2']
)

print(df1)
#         data1     data2
# A 1 -1.154304  0.841912
#   2 -1.317162 -1.143913
# B 1  0.611612  0.386744
#   2 -0.663984 -0.675955

print(df1.loc['A', 1]) 
# data1   -1.154304
# data2    0.841912
# Name: (A, 1), dtype: float64
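
The example above only uses a hierarchical row index with loc. As a rough sketch of the other cases mentioned (a MultiIndex on the columns, and lookup with iloc), assuming a made-up DataFrame `df2` whose labels are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical data: hierarchical index on both rows and columns
df2 = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=[["A", "A", "B", "B"], [1, 2, 1, 2]],
    columns=[["X", "X", "Y", "Y"], ["a", "b", "a", "b"]],
)

# loc with labels: row ('B', 1), column ('Y', 'a')
by_label = df2.loc[("B", 1), ("Y", "a")]

# iloc with integer positions: third row, third column (the same cell)
by_position = df2.iloc[2, 2]

# Selecting only the outer level returns every inner row under 'A'
block_a = df2.loc["A"]

print(by_label, by_position)
print(block_a.shape)
```

loc takes one tuple per axis when both levels are given, while iloc ignores the labels and counts positions from 0.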

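- pivot_table: pulls out only the needed data and summarizes it in a new table for analysis (the same idea as a pivot table in Excel)
- Index: the key(s) that become the row index
- Column: the values that become the column labels
- Value: the data to analyze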
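
The three roles listed above can be sketched as follows, assuming a small made-up sales table (the `sales` DataFrame and its column names are hypothetical):

```python
import pandas as pd

# Hypothetical sales records for illustration only
sales = pd.DataFrame({
    "region": ["East", "East", "West", "West", "East", "West"],
    "product": ["pen", "ink", "pen", "ink", "pen", "pen"],
    "amount": [10, 20, 30, 40, 50, 60],
})

table = sales.pivot_table(
    index="region",      # keys that become the row index
    columns="product",   # values that become the column labels
    values="amount",     # data to summarize
    aggfunc="sum",       # how to combine the values that fall in one cell
)
print(table)
```

When aggfunc is omitted, pivot_table averages the values in each cell, so it is worth stating the aggregation explicitly.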

4. The children who followed the Pied Piper

df = pd.read_csv("./data/the_pied_piper_of_hamelin.csv")
print(df)

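# Use a masking operation to pull only the children out of the Pied Piper data
# (column names: 구분 = category, 일차 = day, 성별 = sex, 나이 = age, 이름 = name)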
child = df[df["구분"] == "Child"]
print(child)

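# Compute the average age per day of the children who followed the Pied Piper with groupby
# (the age column is selected explicitly so that non-numeric columns are not averaged)
print(child.groupby("일차")["나이"].mean())

# Use a pivot table to show the children's average age per day, split by sex
print(child.pivot_table(index="일차", columns="성별", values="나이", aggfunc="mean"))

# Extract only the children's names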
for name in child["이름"].unique():
    print(name)

[02 Data Visualization with Matplotlib]

1. Introduction to Matplotlib
- Matplotlib: a Python library for visualizing data as graphs and charts
- Structure of Matplotlib

https://realpython.com/python-matplotlib-guide/

- Drawing a graph

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [1, 2, 3, 4, 5]
fig, ax = plt.subplots()
ax.plot(x, y)              # == plt.plot(x, y)

# Graph title
ax.set_title("First Plot") # == plt.title("First Plot")

# X-axis label
ax.set_xlabel("x")  # == plt.xlabel("x")

# Y-axis label
ax.set_ylabel("y")  # == plt.ylabel("y")

# Set the dpi
fig.set_dpi(300)

# Save to a file
fig.savefig("first_plot.png")

- Drawing multiple graphs

x = np.linspace(0, np.pi*4, 100)

# subplots(number of rows, number of columns)
fig, axes = plt.subplots(2, 1)

axes[0].plot(x, np.sin(x))
axes[1].plot(x, np.cos(x))

2. Matplotlib 그래프
- Line plot

fig, ax = plt.subplots()
x = np.arange(15)
y = x ** 2
ax.plot(
    x, y,
    linestyle=":",
    marker="*",
    color="#524FA1" 
)

- Line style: set with linestyle. -(solid), --(dashed), -.(dashdot), :(dotted)
- Color: abbreviation (red → r, green → g), color name (green), a value between 0 and 1 (grayscale), hex RGB (#524FA1)
- Marker: .(point), o(circle), v(triangle), s(square), *(star)

https://matplotlib.org/3.1.1/gallery/lines_bars_and_markers/marker_reference.html

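- Adjusting axis limits: use set_xlim(start, end) and set_ylim(start, end)
- Legend: loc (position; upper, lower, or center combined with left, right, or center), shadow (drop shadow, True/False), fancybox (rounded corners, True/False), borderpad (padding inside the border)
📌 Reference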
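
The style, axis-limit, and legend options above can be combined roughly as follows. This is only a sketch: the plotted functions, the limits, and the Agg backend (chosen so it runs without a display) are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(10)
fig, ax = plt.subplots()

# Each line specifies its color in a different way
ax.plot(x, x, linestyle="--", color="r", marker="o", label="y = x")            # abbreviation
ax.plot(x, x ** 2, linestyle=":", color="green", marker="s", label="y = x^2")  # color name
ax.plot(x, x * 3, linestyle="-.", color="0.5", marker="v", label="y = 3x")     # grayscale, 0 to 1

# Restrict the visible range of each axis
ax.set_xlim(0, 9)
ax.set_ylim(0, 40)

# Legend position and appearance
ax.legend(loc="upper left", shadow=True, fancybox=True, borderpad=1)

print(ax.get_xlim(), ax.get_ylim())
```

Points outside the limits are simply not drawn, so part of the y = x^2 curve is cut off here.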

3. Scatter
- Scatter (scatter plot)

fig, ax = plt.subplots()
x = np.arange(10)
ax.plot(
    x, x**2, "o",  # x values, y values, marker
    markersize=15, # marker size
    markerfacecolor='white', # marker fill color
    markeredgecolor="blue"   # marker edge color
)

x = np.random.randn(100)
y = np.random.randn(100)

c = np.random.randint(0, 200, 100)

s = 500 * np.pi * 1 ** 2

ax.scatter(x, y, c=c, s=s, alpha=0.2)

4. Bar&Histogram
- Bar plot

# bar graph
x = np.arange(10)
fig, ax = plt.subplots(figsize=(12, 4)) # figsize: size of the figure (canvas)
ax.bar(x, x*2)

- Stacked bar graph

x = np.random.rand(3)
y = np.random.rand(3)
z = np.random.rand(3)
data = [x, y, z]

fig, ax = plt.subplots()
x_ax = np.arange(3)
for i in x_ax:
    ax.bar(x_ax, data[i],
    # stack each series on top of the ones drawn before it
    bottom = np.sum(data[:i], axis=0))

ax.set_xticks(x_ax)
ax.set_xticklabels(["test1", "test2", "test3"])

- Histogram (frequency distribution)

fig, ax = plt.subplots()
data = np.random.randn(1000)
ax.hist(data, bins=20) # bins: number of bins (bars) to draw

5. Matplotlib with Pandas
- Heights of U.S. presidents

df = pd.read_csv("./president_heights.csv")

fig, ax = plt.subplots()
ax.plot(df["order"], df["height(cm)"], label="height")

ax.legend()
ax.set_xlabel("order")
ax.set_ylabel("height(cm)")

- Attack vs. defense of Water and Fire type Pokémon

# df is assumed to already hold the Pokémon dataset loaded with pd.read_csv
# (the file path is not given in these notes)
fire = df[(df['Type 1']=='Fire') | (df['Type 2']=='Fire')]
water = df[(df['Type 1']=='Water') | (df['Type 2']=='Water')]

fig, ax = plt.subplots()

ax.scatter(fire['Attack'], fire['Defense'], color='r', label='Fire', marker="*", s=50)
ax.scatter(water['Attack'], water['Defense'], color='b', label="Water", s=25)

ax.set_xlabel("Attack")
ax.set_ylabel("Defense")
ax.legend(loc="upper right")

6. Results of the race between the hare and the tortoise

# Read the data file
df = pd.read_csv("./data/the_hare_and_the_tortoise.csv")

# Use index_col to make the first column the index while reading
# df = pd.read_csv("./data/the_hare_and_the_tortoise.csv", index_col=0)

# To turn a column into the index after the data has been read (시간 = time)
df.set_index("시간", inplace=True)

# Set up the canvas
fig, ax = plt.subplots()

# Plot the data (토끼 = hare, 거북이 = tortoise)
ax.plot(df["토끼"], label="토끼")
ax.plot(df["거북이"], label="거북이")

# Add a legend
ax.legend()

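[03 Analyzing World Cup Data]

1. Data analysis
- Data analysis: the series of processing steps that turn given data into the information and conclusions you want
- Stages of data analysis: choose a topic → understand the data structure → preprocess the data → implement the analysis
① Choosing a topic - Set the goal of the analysis: which data to use, what hypothesis to start from, and what conclusion you are looking for
② Understanding the data structure - Before analyzing, check how the data is stored, its data types, variable names, and so on. Applying summary-statistics functions to the DataFrame also reveals the distribution and tendencies of the data
③ Preprocessing the data - Before analyzing, extract only the variables you need, or compute new variables from existing ones. Missing values and outliers must be handled properly at this stage for the analysis results to be reliable
④ Analyzing the data - Test the hypothesis from the topic stage, or extract the information you want, by computing on and transforming the data with numpy, pandas, and similar tools. The results are sometimes visualized to present them effectively.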
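
Stages ② to ④ might look like the following on a toy dataset. Everything here is an assumption made for illustration: the `df` contents, the column names, and the outlier threshold of 100 are hypothetical, and dropping rows is only one of several ways to handle missing values and outliers.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value and one implausible outlier
df = pd.DataFrame({
    "player": ["a", "b", "c", "d", "e"],
    "goals": [3, 5, np.nan, 4, 400],
    "matches": [3, 4, 5, 4, 4],
})

# (2) Inspect the structure: dtypes, non-null counts, summary statistics
df.info()
print(df.describe())

# (3) Preprocess: drop rows with missing values, then mask out the outlier
clean = df.dropna()
clean = clean[clean["goals"] < 100].copy()  # the threshold 100 is an arbitrary assumption

# Derive a new variable from existing ones
clean["goals_per_match"] = clean["goals"] / clean["matches"]

# (4) Analyze: answer the question with an aggregate
print(clean["goals_per_match"].mean())
```

The .copy() call makes `clean` an independent DataFrame, so adding the new column does not trigger a warning about writing to a slice of `df`.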

2. Printing attendance at past World Cups
- Load the WorldCups.csv file into a pandas DataFrame
- Extract and print only the Year and Attendance columns of the DataFrame

import pandas as pd
import numpy as np

world_cups = pd.read_csv("WorldCups.csv")

world_cups = world_cups[['Year', 'Attendance']]
print(world_cups)

3. Attendance at past World Cups

import matplotlib.pyplot as plt

world_cups = pd.read_csv("WorldCups.csv")

world_cups = world_cups[['Year', 'Attendance']]
print(world_cups)

plt.plot(world_cups['Year'], world_cups['Attendance'], marker='o', color='black')

plt.savefig("image.svg", format="svg")

4. Goals per match at past World Cups
- Load the WorldCups.csv file into a pandas DataFrame
- Extract only the Year, GoalsScored, and MatchesPlayed columns of the DataFrame
- Add a new column GoalsPerMatch for goals per match (GoalsPerMatch = GoalsScored / MatchesPlayed)
- Print the DataFrame and check that the column was added correctly

world_cups = pd.read_csv("WorldCups.csv")

world_cups = world_cups[['Year', 'GoalsScored', 'MatchesPlayed']]
world_cups['GoalsPerMatch'] = world_cups['GoalsScored'] / world_cups['MatchesPlayed']

print(world_cups)
#     Year  GoalsScored  MatchesPlayed  GoalsPerMatch
# 0   1930           70             18       3.888889
# 1   1934           70             17       4.117647
# 2   1938           84             18       4.666667
# ...

5. Preprocessing the World Cup match data
- As preprocessing, replace some values in the DataFrame with the replace function
- Remove duplicate rows

world_cups_matches = pd.read_csv('WorldCupMatches.csv')

world_cups_matches = world_cups_matches.replace('Germany FR', 'Germany')

dupli = world_cups_matches.duplicated()
print(len(dupli[dupli==True]))  # 16

world_cups_matches = world_cups_matches.drop_duplicates()
dupli = world_cups_matches.duplicated()
print(len(dupli[dupli==True]))  # 0

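6. Goals scored per country
- The WorldCupMatches.csv data preprocessed in the previous step is provided
- Group by Home Team Name and take the Home Team Goals column; store the result in home (goals scored as the home team)
- Group by Away Team Name and take the Away Team Goals column; store the result in away (goals scored as the away team)
- Combine home and away with the concat method and store the result in a new DataFrame, goal_per_country
- Apply fillna to replace missing values with 0
- Add Home Team Goals and Away Team Goals and store the sum in Goals
- Extract the Goals column from goal_per_country and sort it in descending order. Convert the dtype of the stored values to integer

# preprocess is assumed to be the module provided by the course exercise that holds the cleaned DataFrame
world_cups_matches = preprocess.world_cups_matches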

home = world_cups_matches.groupby(['Home Team Name'])['Home Team Goals'].sum()
away = world_cups_matches.groupby(['Away Team Name'])['Away Team Goals'].sum()

goal_per_country = pd.concat([home, away], axis=1, sort=True).fillna(0)

goal_per_country['Goals'] = goal_per_country['Home Team Goals'] + goal_per_country['Away Team Goals']

goal_per_country = goal_per_country["Goals"].sort_values(ascending = False)
goal_per_country = goal_per_country.astype(int)

print(goal_per_country)
# Germany                   224
# Brazil                    221
# Argentina                 131
# Italy                     128
# ...

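7. Top-scoring countries at the 2014 World Cup
- From the preprocessed world_cups_matches, use a masking operation to extract the rows where Year is 2014
- Group by Home Team Name, sum Home Team Goals, and store the result in home_team_goal
- Group by Away Team Name, sum Away Team Goals, and store the result in away_team_goal
- Combine the home and away goal counts into one dataset and apply fillna to remove missing values (a missing value means no goals, so it is replaced with 0)
- Create a new column goals holding the sum of home and away goals, and delete the original columns with drop
- Convert the stored values to integers, sort the DataFrame in descending order, and print it

# preprocess is assumed to be the module provided by the course exercise that holds the cleaned DataFrame
world_cups_matches = preprocess.world_cups_matches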

world_cups_matches = world_cups_matches[world_cups_matches['Year']==2014]

home_team_goal = world_cups_matches.groupby(['Home Team Name'])['Home Team Goals'].sum()
away_team_goal = world_cups_matches.groupby(['Away Team Name'])['Away Team Goals'].sum()

team_goal_2014 = pd.concat([home_team_goal, away_team_goal], axis=1).fillna(0)

team_goal_2014['goals'] = team_goal_2014['Home Team Goals'] + team_goal_2014['Away Team Goals']
team_goal_2014 = team_goal_2014.drop(['Home Team Goals', 'Away Team Goals'], axis=1)

# astype returns a new object, so assign the result back
team_goal_2014 = team_goal_2014.astype('int')
team_goal_2014 = team_goal_2014['goals'].sort_values(ascending=False)

print(team_goal_2014)
# Germany                   18
# Netherlands               15
# Colombia                  12
# ...

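8. Tallying top-four finishes at the World Cup
- Extract the 1st, 2nd, 3rd, and 4th place teams of every tournament from the DataFrame and store them in winner, runners_up, third, and fourth
- Count the values in each Series with value_counts and store the results
- Combine these into a single DataFrame, ranks; fill any NaN values in ranks with 0 and set the dtype back to int64
- Sort the countries in descending order by number of wins, runner-up finishes, third places, and fourth places, in that order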

world_cups = pd.read_csv("WorldCups.csv")

winner = world_cups["Winner"]
runners_up = world_cups["Runners-Up"]
third = world_cups["Third"]
fourth = world_cups["Fourth"]

winner_count = pd.Series(winner.value_counts())
runners_up_count = pd.Series(runners_up.value_counts())
third_count = pd.Series(third.value_counts())
fourth_count = pd.Series(fourth.value_counts())

ranks = pd.DataFrame({
  "Winner" : winner_count,
  "Runners_Up" : runners_up_count,
  "Third" : third_count,
  "Fourth" : fourth_count
})

ranks = ranks.fillna(0).astype('int64')
ranks = ranks.sort_values(['Winner', 'Runners_Up', 'Third', 'Fourth'], ascending=False)

print(ranks)
#                 Winner  Runners_Up  Third  Fourth
# Brazil               5           2      2       2
# Germany              4           4      4       1
# Italy                4           2      1       1
# ...

This is fun!!!!

※ The course materials are from K-Digital Training x Elice AI Service Development Planning, Cohort 1 (elice.io/).

License: Attribution-NonCommercial-NoDerivatives