Elice AI Track Week 12 - Getting Started with Data Analysis in Python I (3/17)🔥

Max 2021. 3. 26. 12:52

✔ Week 12. Core Data Analysis Techniques

<Learning Objectives>

Be able to use the Python libraries Pandas, NumPy, and Matplotlib.
Be able to analyze data with Python libraries.
Understand how to analyze and visualize data.

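[01 Trying Out NumPy]

1. Creating arrays
- NumPy : Numerical Python. A library for working with large multidimensional arrays in Python. Compared with Python lists, it supports faster computation and uses memory more efficiently. (Import it before use.)
- Creating arrays : use array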

# Building a sequence with a plain list
list(range(10)) # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# Building an array with numpy
import numpy as np
np.array([1, 2, 3, 4, 5]) # integer type / array([1, 2, 3, 4, 5])
# If any item is a float, every element is converted to float
np.array([3, 1.4, 2, 3, 4]) # array([3. , 1.4, 2. , 3. , 4. ])
# 2-D array
np.array([[1, 2],  # array([[1, 2],
          [3, 4]]) #        [3, 4]])

# dtype : the array's data type
np.array([1, 2, 3, 4], dtype='float')
# array([1., 2., 3., 4.])

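- An array holds a single data type. The dtype option lets you choose it.

dtype	Description	Other notations
int	integer type	i, int_, int32, int64, i8
float	floating-point type	f, float_ (removed in NumPy 2.0), float32, float64, f8
str	string type	str, U, U32
bool	boolean type	?, bool_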
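
The single-type rule above can be checked directly. The following is a minimal sketch (the variable names are illustrative, not from the course material) showing automatic upcasting and explicit conversion with astype:

```python
import numpy as np

ints = np.array([1, 2, 3])                  # all integers -> integer dtype
mixed = np.array([1, 2.5, 3])               # one float upcasts every element to float64
as_float = ints.astype(np.float64)          # explicit conversion returns a new array
as_bool = np.array([0, 1, 2], dtype=bool)   # zero -> False, nonzero -> True

print(mixed.dtype)   # float64
print(as_float)      # [1. 2. 3.]
print(as_bool)       # [False  True  True]
```

Note that astype leaves the original array untouched, so ints keeps its integer dtype.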

- Creating other kinds of arrays : use zeros, ones, arange, linspace

# zeros(number of elements, dtype) : creates an array filled with 0
np.zeros(10, dtype=int)  # array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

# ones(shape, dtype) : creates an array filled with 1
np.ones((3, 5), dtype=float) # array([[1., 1., 1., 1., 1.],
                             #        [1., 1., 1., 1., 1.],
                             #        [1., 1., 1., 1., 1.]])

# arange(start, end, step) : works like Python's range (end is excluded)
np.arange(0, 20, 2) # array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18])

# linspace(start, end, num) : num evenly spaced values, end included
np.linspace(0, 1, 5) # array([0. , 0.25, 0.5 , 0.75, 1. ])

# empty(shape) : returns an array without initializing its values
np.empty(3)  # array of 3 uninitialized (arbitrary) values
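
arange and linspace are easy to confuse, since the third argument means a step size in one and a sample count in the other. A small side-by-side sketch (variable names are illustrative):

```python
import numpy as np

by_step = np.arange(0, 1, 0.25)    # step size 0.25, end value 1 is excluded
by_count = np.linspace(0, 1, 5)    # 5 evenly spaced samples, end value 1 is included

print(by_step)    # [0.   0.25 0.5  0.75]
print(by_count)   # [0.   0.25 0.5  0.75 1.  ]
```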

- Creating arrays filled with random numbers : use random

# random.random(shape tuple) : uniform values in [0, 1)
np.random.random((2, 2)) # array([[0.30986539, 0.85863508],
                         #        [0.89151021, 0.19304196]])

# random.normal(mean, standard deviation, shape tuple) : samples from a normal distribution
np.random.normal(0, 1, (2, 2)) # array([[ 0.44050683, 0.04912487],
                               #        [-1.67023947, -0.70982067]])

# random.randint(start, end, shape tuple) : random integers, end excluded
np.random.randint(0, 10, (2, 2)) # array([[3, 9],
                                 #        [3, 2]])
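
The outputs above change on every run. If reproducible values are needed, one option is NumPy's Generator interface with a fixed seed. This is a sketch, not part of the course material, and the seed 42 is an arbitrary choice:

```python
import numpy as np

rng_a = np.random.default_rng(42)   # seeded generator (42 is an arbitrary seed)
rng_b = np.random.default_rng(42)   # second generator with the same seed

ints_a = rng_a.integers(0, 10, size=(2, 2))   # like randint: end value excluded
ints_b = rng_b.integers(0, 10, size=(2, 2))

print(np.array_equal(ints_a, ints_b))   # True: same seed, same sequence

normals = rng_a.normal(0, 1, (2, 2))    # mean 0, standard deviation 1
uniforms = rng_a.random((2, 2))         # uniform values in [0, 1)
```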

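2. Array basics
- Array basics : ndim, shape, size, dtype

# create a random array (the values shown are one sample run)
x2 = np.random.randint(10, size=(3, 4)) # array([[2, 2, 9, 0],
                                        #        [4, 2, 1, 0],
                                        #        [1, 8, 7, 3]])

# number of dimensions
x2.ndim   # 2

# shape of the array
x2.shape  # (3, 4)

# total number of elements
x2.size   # 12

# data type of the array
x2.dtype  # dtype('int64')

- Finding and cutting : Indexing & Slicing. Indexing looks up a value by its position, and slicing takes out part of an array by a range of positions.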

x = np.arange(7) # array([0, 1, 2, 3, 4, 5, 6])
x[3] # 3
x[7] # IndexError: index 7 is out of bounds
x[0] = 10 # x is now array([10, 1, 2, 3, 4, 5, 6])
x[1:4] # array([1, 2, 3])
x[1:] # array([1, 2, 3, 4, 5, 6])
x[:4] # array([10, 1, 2, 3])
x[::2] # array([10, 2, 4, 6])
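
The same indexing and slicing rules extend to 2-D arrays, with one index or slice per axis separated by a comma. A short sketch (the array grid is made up for illustration):

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)   # [[ 0  1  2  3]
                                     #  [ 4  5  6  7]
                                     #  [ 8  9 10 11]]

single = grid[1, 2]        # row 1, column 2 -> 6
first_row = grid[0]        # [0 1 2 3]
second_col = grid[:, 1]    # every row, column 1 -> [1 5 9]
block = grid[:2, 1:3]      # rows 0-1, columns 1-2 -> [[1 2] [5 6]]
last = grid[-1, -1]        # negative indices count from the end -> 11
```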

3. Reshape, concatenating & splitting
- reshape : changes the shape of an array
- np.concatenate : joins arrays along the given axis
- np.split : splits an array along the given axis (axis = 0 or 1)

# reshape
array = np.arange(8)
array.shape  # (8,)
array2 = array.reshape((2, 4)) # array([[0, 1, 2, 3],
                               #        [4, 5, 6, 7]])
array2.shape # (2, 4)

# concatenate
x = np.array([0, 1, 2])
y = np.array([3, 4, 5])
np.concatenate([x, y]) # array([0, 1, 2, 3, 4, 5])

matrix = np.arange(4).reshape(2, 2)      # [0, 1]
                                         # [2, 3]

np.concatenate([matrix, matrix], axis=0) # [0, 1]
                                         # [2, 3]
                                         # [0, 1]
                                         # [2, 3]

# split
matrix = np.arange(16).reshape(4, 4)
upper, lower = np.split(matrix, [3], axis=0) # upper: rows 0-2, lower: row 3
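
To make the effect of the split point and the axis concrete, the sketch below prints the shapes of each piece and confirms that concatenate undoes split (the names left, right, and rejoined are illustrative):

```python
import numpy as np

matrix = np.arange(16).reshape(4, 4)

upper, lower = np.split(matrix, [3], axis=0)   # cut before row 3
left, right = np.split(matrix, [1], axis=1)    # cut before column 1

print(upper.shape, lower.shape)   # (3, 4) (1, 4)
print(left.shape, right.shape)    # (4, 1) (4, 3)

rejoined = np.concatenate([upper, lower], axis=0)
print(np.array_equal(rejoined, matrix))   # True
```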

4. NumPy operations
- Basic operations : arrays support element-wise +, -, *, /

x = np.arange(4) # array([0, 1, 2, 3])
x + 5            # array([5, 6, 7, 8])
x - 5            # array([-5, -4, -3, -2])
x * 5            # array([ 0, 5, 10, 15])
x / 5            # array([0. , 0.2, 0.4, 0.6])

- Operations between matrices

x = np.arange(4).reshape((2, 2))
y = np.random.randint(10, size=(2, 2)) # random, so the results below are one sample run
x + y  # array([[1, 7],
       #        [6, 5]])
x - y  # array([[-1, -5],
       #        [-2, 1]])

5. Broadcasting
- Broadcasting : performing operations between arrays of different shapes

https://jakevdp.github.io/PythonDataScienceHandbook/02.05-computation-on-arrays-broadcasting.html

A = np.arange(3).reshape(3,1)
B = np.arange(3)
print(A + B) # [[0 1 2]
             #  [1 2 3]
             #  [2 3 4]]
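
A common practical use of broadcasting is subtracting a per-column value from every row. The following sketch uses made-up data to center each column on its mean:

```python
import numpy as np

data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0]])     # shape (3, 2)

col_means = data.mean(axis=0)      # shape (2,): [ 2. 20.]
centered = data - col_means        # (3, 2) - (2,): col_means is applied to every row

print(centered)   # [[ -1. -10.]
                  #  [  0.   0.]
                  #  [  1.  10.]]
```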

6. Aggregate functions & masking
- Aggregation : summary statistics over the data

x = np.arange(8).reshape((2, 4)) # [[0, 1, 2, 3],
                                 #  [4, 5, 6, 7]]
np.sum(x)  # sum : 28
np.min(x)  # smallest value : 0
np.max(x)  # largest value : 7
np.mean(x) # mean : 3.5
np.std(x)  # standard deviation : about 2.29
np.sum(x, axis=0)  # sums down each column, [4, 6, 8, 10]
np.sum(x, axis=1)  # sums across each row, [6, 22]

- Masking : extracting specific values by using an array of True/False

x = np.arange(5) # [0, 1, 2, 3, 4]
x < 3     # [True, True, True, False, False]
x > 5     # [False, False, False, False, False]
x[x < 3]  # [0, 1, 2]
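
Masks can also be combined. The sketch below (illustrative names) uses & and | on boolean arrays; each comparison needs its own parentheses because these operators bind more tightly than < and >:

```python
import numpy as np

x = np.arange(10)                 # [0 1 2 3 4 5 6 7 8 9]

middle = (x > 2) & (x < 7)        # element-wise AND
edges = (x < 2) | (x > 7)         # element-wise OR

print(x[middle])       # [3 4 5 6]
print(x[edges])        # [0 1 8 9]
print(middle.sum())    # 4, since True counts as 1
```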

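7. Counting the shepherd boy's lies
- Problem : the given daily_liar_data array records what the shepherd boy said over 100 days. 0 marks a day on which he lied, and 1 marks a day on which he did not. Count exactly how many times the shepherd boy lied during the 100 days and print the result.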

import numpy as np

daily_liar_data = [0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 
                   0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 
                   0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 
                   0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 
                   0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0]

lier = np.array(daily_liar_data)

print(lier[lier == 0].size)  # number of days marked 0 (days with a lie)
# print(np.count_nonzero(lier == 0))  # equivalent: counts the True entries of the mask
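
Several equivalent ways of counting the zeros can be compared on a small array. The five-day sample below is hypothetical and only stands in for the full 100-day data:

```python
import numpy as np

sample = np.array([0, 0, 1, 0, 1])            # hypothetical 5-day record, 0 = lie

by_mask_size = sample[sample == 0].size       # select the zeros, then count them
by_count = np.count_nonzero(sample == 0)      # count True entries in the mask
by_sum = (sample == 0).sum()                  # True counts as 1
by_subtraction = sample.size - sample.sum()   # works because values are only 0 or 1

print(by_mask_size, by_count, by_sum, by_subtraction)   # 3 3 3 3
```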

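[02 Pandas Basics]

1. Series data
- Pandas : a Python library for processing and storing structured data effectively. It is built on numpy, which specializes in array computation.
- Series : a specialized dict, and also an enhanced numpy array. It holds Data and an Index, and values can be accessed through the Index.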

import pandas as pd

# the name argument sets the Series name
data = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'], name="Title")
# access by index label
print(data['b'])   # 2
print(data, "\n")  # a    1
                   # b    2
                   # c    3
                   # d    4
                   # Name: Title, dtype: int64 

# creating a Series from a dict
population_dict = {
    'korea': 5180,
    'japan': 12718,
    'china': 141500,
    'usa': 32676
}
population = pd.Series(population_dict)
print(population)  # korea      5180
                   # japan     12718
                   # china    141500
                   # usa       32676
                   # dtype: int64
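
The description of a Series as both a dict and an array can be illustrated with the same population data. This is a minimal sketch; the variable names large and upto_china are illustrative:

```python
import pandas as pd

population = pd.Series({'korea': 5180, 'japan': 12718,
                        'china': 141500, 'usa': 32676})

# dict-like behaviour: membership test and lookup by key
print('korea' in population)    # True
print(population['japan'])      # 12718

# array-like behaviour: boolean masking and label slicing
large = population[population > 10000]      # japan, china, usa
upto_china = population['korea':'china']    # label slices include the end label

print(population.idxmax())      # china
```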

2. DataFrame
- DataFrame : data in which several Series are combined to form rows and columns

gdp_dict = {
    'korea': 169320000,
    'japan': 516700000,
    'china': 1409250000,
    'usa': 2041280000,
}

gdp = pd.Series(gdp_dict)
country = pd.DataFrame({
    'population': population,
    'gdp': gdp
})
print(country)  #        population         gdp
                # korea        5180   169320000
                # japan       12718   516700000
                # china      141500  1409250000
                # usa         32676  2041280000

print(country.index)    # Index(['korea', 'japan', 'china', 'usa'], dtype='object')
print(country.columns)  # Index(['population', 'gdp'], dtype='object')

# Series support operators just like numpy arrays
gdp_per_capita = country['gdp'] / country['population']
country['gdp per capita'] = gdp_per_capita
print(country['gdp per capita'])  # the new column of the country DataFrame
                                  # korea    32687.258687
                                  # japan    40627.457147
                                  # china     9959.363958
                                  # usa      62470.314604
                                  # Name: gdp per capita, dtype: float64

- Saving and loading : to_csv(), to_excel(), read_csv(), read_excel()
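
A CSV round trip can be sketched without touching the disk by writing to an in-memory buffer. The buffer is an assumption made to keep the example self-contained; a file path works the same way. The Excel functions typically need an extra engine package installed, so they are left out here:

```python
import io
import pandas as pd

country = pd.DataFrame(
    {'population': [5180, 12718], 'gdp': [169320000, 516700000]},
    index=['korea', 'japan'],
)

buffer = io.StringIO()     # in-memory stand-in for a file on disk
country.to_csv(buffer)     # the index is written as the first column
buffer.seek(0)             # rewind before reading back

restored = pd.read_csv(buffer, index_col=0)   # first column becomes the index again
print(restored.loc['japan', 'gdp'])           # 516700000
```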

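3. Indexing/Slicing
- .loc : indexing/slicing by explicit index labels
- .iloc : indexing/slicing by Python-style integer positions

print(country.loc['china'])  # the 'china' row, returned as a Series
print(country.loc['korea':'japan', :'population'])  # label slices include both ends: rows korea to japan, columns up to population

print(country.iloc[0])  # the first row (korea)
print(country.iloc[1:3, :2])  # rows at positions 1-2 (japan, china), first two columns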
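
The difference between .loc and .iloc is easiest to see when the index labels are integers that do not match the positions. A small sketch with a made-up Series:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])   # labels 1, 3, 5 at positions 0, 1, 2

print(s.loc[1])     # a  (label 1)
print(s.iloc[1])    # b  (position 1)

by_label = s.loc[1:3]       # labels 1 through 3, end included
by_position = s.iloc[1:3]   # positions 1 and 2, end excluded

print(by_label.tolist())      # ['a', 'b']
print(by_position.tolist())   # ['b', 'c']
```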

- Adding/modifying data in a DataFrame

import numpy as np
import pandas as pd

# columns: 이름 (name), 나이 (age), 주소 (address)
dataframe = pd.DataFrame(columns=['이름', '나이', '주소'])

# add a row from a list
dataframe.loc[0] = ['엘리스', '10', '서울']

# add a row from a dict, then modify one cell
dataframe.loc[1] = {'이름':'체셔', '나이':'25', '주소':'인천'}
dataframe.loc[1, '이름'] = '모자장수'

# add a new column 전화번호 (phone number) filled with NaN, then set one value
dataframe['전화번호'] = np.nan
dataframe.loc[0, '전화번호'] = '01012341234'
print(len(dataframe)) # 2
print(dataframe)      #       이름  나이  주소    전화번호
                      # 0   엘리스   10   서울   01012341234
                      # 1  모자장수  25   인천      NaN


# selecting columns
# a single column name returns a Series
print(dataframe["이름"])   # 0     엘리스
                           # 1    모자장수
                           # Name: 이름, dtype: object
# a list of column names returns a DataFrame
print(dataframe[["이름", "주소", "나이"]]) #      이름  주소  나이
                                          # 0   엘리스  서울  10
                                          # 1  모자장수  인천  25

4. Pandas operations and functions
- Checking for missing data : isnull(), notnull()

print(dataframe.isnull())   #    이름     나이     주소   전화번호
                            # 0  False  False  False  False
                            # 1  False  False  False   True

print(dataframe.notnull())  #    이름    나이    주소   전화번호
                            # 0  True  True  True   True
                            # 1  True  True  True  False

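dataframe.dropna()  # returns a new DataFrame without the rows containing NaN; dataframe itself is unchanged
dataframe['전화번호'] = dataframe['전화번호'].fillna('전화번호 없음')  # '전화번호 없음' means 'no phone number'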
print(dataframe)   #      이름  나이  주소         전화번호
                   # 0   엘리스  10  서울  01012341234
                   # 1  모자장수  25  인천      전화번호 없음
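
Both dropna and fillna return new objects rather than changing the DataFrame in place, which is why the result has to be assigned. A compact sketch with hypothetical English column names:

```python
import numpy as np
import pandas as pd

contacts = pd.DataFrame({
    'name': ['Alice', 'Cheshire'],
    'phone': ['01012341234', np.nan],
})

dropped = contacts.dropna()            # new DataFrame; rows with any NaN removed
print(len(dropped), len(contacts))     # 1 2  (the original keeps both rows)

filled = contacts.fillna({'phone': 'no phone number'})   # per-column fill value
print(filled.loc[1, 'phone'])                            # no phone number
print(contacts['phone'].isnull().sum())                  # 1  (the original is unchanged)
```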

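- Series operations : the operators used with numpy arrays work the same way on Series. Values are aligned by index, and a label present on only one side produces NaN.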

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
print(A + B) # 0    NaN
             # 1    5.0
             # 2    9.0
             # 3    NaN
             # dtype: float64

print(A.add(B, fill_value=0)) # 0    2.0
                              # 1    5.0
                              # 2    9.0
                              # 3    5.0
                              # dtype: float64

- DataFrame operations : add ( + ), sub ( - ), mul ( * ), div ( / )

# the values are random, so the outputs below are one sample run
A = pd.DataFrame(np.random.randint(0, 10, (2, 2)), columns=list("AB"))
B = pd.DataFrame(np.random.randint(0, 10, (3, 3)), columns=list("BAC"))

print(A + B)  #      A     B   C
              # 0  14.0   5.0 NaN
              # 1   4.0  11.0 NaN
              # 2   NaN   NaN NaN

print(A.add(B, fill_value=0)) #      A     B    C
                              # 0  14.0   5.0  3.0
                              # 1   4.0  11.0  2.0
                              # 2   6.0   5.0  8.0
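
Because the example above uses random values, the alignment is easier to follow with fixed numbers. The sketch below uses made-up frames named left and right; columns are matched by name and rows by index label:

```python
import pandas as pd

left = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
right = pd.DataFrame({'B': [10, 20, 30], 'A': [100, 200, 300], 'C': [7, 8, 9]})

plain = left + right                   # cells missing on either side become NaN
filled = left.add(right, fill_value=0) # a cell missing on one side is treated as 0

print(plain)    #        A     B   C
                # 0  101.0  13.0 NaN
                # 1  202.0  24.0 NaN
                # 2    NaN   NaN NaN

print(filled)   #        A     B    C
                # 0  101.0  13.0  7.0
                # 1  202.0  24.0  8.0
                # 2  300.0  30.0  9.0
```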

- Aggregate functions : the aggregate functions used with numpy arrays, such as sum and mean, work the same way

data = {
    'A': [ i+5 for i in range(3) ],
    'B': [ i**2 for i in range(3) ]
}
df = pd.DataFrame(data)
print(df['A'].sum()) # 18
print(df.sum())      # A    18
                     # B     5
                     # dtype: int64

print(df.mean())     # A    6.000000
                     # B    1.666667
                     # dtype: float64
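
By default these aggregations run down each column. A short sketch on the same data shows the row-wise variant with axis=1 and several statistics at once with agg (the names row_sums and summary are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [5, 6, 7], 'B': [0, 1, 4]})

row_sums = df.sum(axis=1)          # one value per row: 5, 7, 11
col_max = df.max()                 # one value per column: A 7, B 4
summary = df.agg(['min', 'max'])   # DataFrame with one row per statistic

print(row_sums.tolist())           # [5, 7, 11]
print(summary.loc['max', 'B'])     # 4
```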

5. Sorting a DataFrame
- Sorting by value : sort_values()

df = pd.DataFrame({
    'col1' : [2, 1, 9, 8, 7, 4],
    'col2' : ['A', 'A', 'B', np.nan, 'D', 'C'],
    'col3' : [0, 1, 9, 4, 2, 3],
})

# ascending sort (the default)
print(df.sort_values('col1'))
#     col1 col2  col3
# 1     1    A     1
# 0     2    A     0
# 5     4    C     3
# 4     7    D     2
# 3     8  NaN     4
# 2     9    B     9

# descending sort (ascending=False)
print(df.sort_values('col1', ascending=False))
#     col1 col2  col3
# 2     9    B     9
# 3     8  NaN     4
# 4     7    D     2
# 5     4    C     3
# 0     2    A     0
# 1     1    A     1

# multiple sort keys: col2 ascending, then col1 descending within equal col2 values (NaN goes last)
sorted_df3 = df.sort_values(['col2', 'col1'], ascending=[True, False])
#     col1 col2  col3
# 0     2    A     0
# 1     1    A     1
# 2     9    B     9
# 5     4    C     3
# 4     7    D     2
# 3     8  NaN     4
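
Two details of sort_values are worth checking: NaN placement can be changed with na_position, and the call returns a new DataFrame instead of reordering the original. A sketch on a trimmed copy of the data above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [2, 1, 9, 8, 7, 4],
    'col2': ['A', 'A', 'B', np.nan, 'D', 'C'],
})

nan_first = df.sort_values('col2', na_position='first')
print(nan_first.index[0])          # 3, the row whose col2 is NaN

by_col1 = df.sort_values('col1')   # returns a new, sorted DataFrame
print(df.index.tolist())           # [0, 1, 2, 3, 4, 5]  (original order is kept)

restored = by_col1.sort_index()    # sort_index puts the rows back in label order
print(restored.equals(df))         # True
```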

6. Grow, grow, beanstalk!
- When printing the tree with the greatest height, use indexing to select it

tree = pd.read_csv("./data/tree_data.csv")

tree_df = tree.sort_values("height", ascending=False)
print(tree_df.iloc[:1])
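
Sorting and taking the first row is one way to find the tallest tree. Two alternatives are sketched below on a hypothetical stand-in for tree_data.csv; only the height column comes from the exercise, and the width column and all values are assumptions:

```python
import pandas as pd

# Hypothetical stand-in for ./data/tree_data.csv
tree = pd.DataFrame({'height': [3.2, 7.5, 5.1], 'width': [0.4, 0.9, 0.6]})

tallest_sorted = tree.sort_values('height', ascending=False).iloc[:1]   # approach used above
tallest_nlargest = tree.nlargest(1, 'height')                           # top-1 row without a full sort call
tallest_row = tree.loc[tree['height'].idxmax()]                         # single row as a Series

print(tallest_row['height'])   # 7.5
```

The first two results are one-row DataFrames, while the idxmax version returns a Series, so the choice depends on which form is needed afterwards.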

※ The course materials are from K-Digital Training x Elice AI Service Development Planning, 1st cohort (elice.io/).

License: Attribution, NonCommercial, NoDerivatives