[sklearn] pipeline

2018. 11. 29. 16:26

sklearn.pipeline

scikit-learn provides a Pipeline class that helps you run a sequence of transformations in order. The example code in this post includes a simple pipeline that processes numeric features.

Pipeline

A pipeline chains several transformation steps so that they always run in the same, fixed order.

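Pipeline takes a list of name/estimator pairs that define the sequence of steps. The last step can be either a transformer or an estimator; every other step must be a transformer (that is, it must implement fit() and transform(), which also gives it fit_transform()). Step names can be anything, as long as they are unique and do not contain a double underscore (__).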
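The double-underscore rule is easier to see with a small sketch. scikit-learn uses the form <step name>__<parameter> to address a parameter of a step inside a pipeline, so a step name that itself contained __ would be ambiguous. The two steps and their names below are illustrative assumptions, not part of the housing example.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Each step is a (name, estimator) pair; the names are arbitrary labels.
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("std_scaler", StandardScaler()),
])

# "<step name>__<parameter>" reaches into a step, which is why step names
# themselves must not contain a double underscore.
pipe.set_params(imputer__strategy="median")

imputer_strategy = pipe.named_steps["imputer"].strategy
print(imputer_strategy)
```

The same naming scheme is what you would use when listing pipeline parameters in a hyperparameter search.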

Calling the pipeline's fit() method calls fit_transform() on each transformer in order, passing the output of one step as the input to the next. On the last step, only fit() is called.
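As a rough sketch of that behavior, the snippet below fits a two-step pipeline and then repeats the same work by hand: fit_transform() on the intermediate step, fit() only on the last one. The toy array is an assumption made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data with one missing value (illustrative only).
X = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
pipe.fit(X)
via_pipeline = pipe.transform(X)

# Roughly what fit() does by hand: fit_transform() on every step except
# the last, then fit() only on the last step.
imputer = SimpleImputer(strategy="median")
scaler = StandardScaler()
X_imputed = imputer.fit_transform(X)  # intermediate step
scaler.fit(X_imputed)                 # last step: fit() only
by_hand = scaler.transform(imputer.transform(X))

same_result = np.allclose(via_pipeline, by_hand)
print(same_result)
```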

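A pipeline object exposes the same methods as its final estimator. In the num_pipeline example, the final estimator is StandardScaler, a transformer, so the pipeline has a transform() method that applies all the transformations to the data in order (it also has a fit_transform() method).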
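To make this concrete, here is a minimal sketch that compares two pipelines, one ending in a transformer and one ending in a predictor. LinearRegression is only an illustrative choice for the final step; it does not appear in the housing example.

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Last step is a transformer, so the pipeline behaves like a transformer.
transformer_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Last step is a predictor, so the pipeline behaves like a predictor.
predictor_pipe = Pipeline([
    ("std_scaler", StandardScaler()),
    ("lin_reg", LinearRegression()),
])

# Check which methods each pipeline exposes.
available = {
    "transformer_pipe.transform": hasattr(transformer_pipe, "transform"),
    "transformer_pipe.predict": hasattr(transformer_pipe, "predict"),
    "predictor_pipe.transform": hasattr(predictor_pipe, "transform"),
    "predictor_pipe.predict": hasattr(predictor_pipe, "predict"),
}
for name, present in available.items():
    print(name, present)
```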

Pipeline example code

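from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer  # replaces preprocessing.Imputer (removed in scikit-learn 0.22)
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# DataFrameSelector is a custom transformer (not part of scikit-learn) that
# selects the given DataFrame columns; it must be defined before this code runs.
# num_attr and cat_attr are the lists of numeric and categorical column names.

# Pipeline that preprocesses the numeric variables
num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attr)),
    ('imputer', SimpleImputer(strategy='median')),
    ('std_scaler', StandardScaler())
])

# Pipeline that preprocesses the categorical variables
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attr)),
    # OneHotEncoder replaces CategoricalEncoder(encoding='onehot-dense'), which was never released.
    # sparse_output=False needs scikit-learn >= 1.2 (older versions use sparse=False).
    ('cat_encoder', OneHotEncoder(sparse_output=False))
])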
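DataFrameSelector is not a scikit-learn class, so it has to come from somewhere. The sketch below assumes a minimal version that simply picks the named columns from a pandas DataFrame and returns them as a NumPy array, then runs the numeric pipeline on a small made-up DataFrame. Both the class body and the toy data are assumptions for illustration; the real housing data is not reproduced here.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Assumed minimal version: pick columns from a DataFrame."""

    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X[self.attribute_names].values  # DataFrame -> NumPy array


# Hypothetical toy data standing in for the housing DataFrame.
toy = pd.DataFrame({
    "rooms": [3.0, 4.0, np.nan, 5.0],
    "income": [2.5, 3.5, 4.5, 5.5],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR BAY"],
})
num_attr = ["rooms", "income"]

num_pipeline = Pipeline([
    ("selector", DataFrameSelector(num_attr)),
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
toy_num_prepared = num_pipeline.fit_transform(toy)
print(toy_num_prepared.shape)
```

Because the selector inherits from BaseEstimator and TransformerMixin, it gets get_params() and fit_transform() for free, which is what lets it sit in a Pipeline as an intermediate step.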

Feature Union

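We have now built separate pipelines for the numeric and the categorical variables of a single dataset. How can the two be combined into one? The answer is scikit-learn's FeatureUnion. You pass it a list of transformers; when its transform() method is called, it runs each transformer's transform() method on the same input, independently of the others (and in parallel when n_jobs is set), then concatenates the results and returns them. The full pipeline that handles both numeric and categorical features looks like this.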

from sklearn.pipeline import FeatureUnion

# FeatureUnion that combines num_pipeline and cat_pipeline
full_pipeline = FeatureUnion(transformer_list=[
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline),
])

# Run the full pipeline (housing is the input DataFrame)
housing_prepared = full_pipeline.fit_transform(housing)
print(housing_prepared)
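To see what "concatenates the results" means in practice, here is a small self-contained sketch. It feeds the same toy array to two scalers through a FeatureUnion and compares the output with stacking the two results column-wise by hand. The array and the choice of MinMaxScaler as the second transformer are assumptions for illustration.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy input: 3 samples, 2 features (illustrative only).
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

union = FeatureUnion(transformer_list=[
    ("std", StandardScaler()),
    ("minmax", MinMaxScaler()),
])
combined = union.fit_transform(X)

# Same result by hand: transform separately, then stack column-wise.
by_hand = np.hstack([
    StandardScaler().fit_transform(X),
    MinMaxScaler().fit_transform(X),
])
matches = np.allclose(combined, by_hand)
print(combined.shape, matches)
```

Each transformer contributes its own block of columns, in the order given in transformer_list, so the number of output columns is the sum over all transformers.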

Pipeline full code

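from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.impute import SimpleImputer  # replaces preprocessing.Imputer (removed in scikit-learn 0.22)
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumes housing (the full DataFrame), housing_num (its numeric columns only),
# and the custom DataFrameSelector transformer are already defined.
num_attr = list(housing_num)
cat_attr = ["ocean_proximity"]

# Pipeline that preprocesses the numeric variables
num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attr)),
    ('imputer', SimpleImputer(strategy='median')),
    ('std_scaler', StandardScaler())
])

# Pipeline that preprocesses the categorical variables
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attr)),
    # OneHotEncoder replaces CategoricalEncoder(encoding='onehot-dense'), which was never released.
    # sparse_output=False needs scikit-learn >= 1.2 (older versions use sparse=False).
    ('cat_encoder', OneHotEncoder(sparse_output=False))
])

# FeatureUnion that combines num_pipeline and cat_pipeline
full_pipeline = FeatureUnion(transformer_list=[
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline),
])

# Run the full pipeline
housing_prepared = full_pipeline.fit_transform(housing)
print(housing_prepared)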

License: CC BY-NC-ND (Attribution, NonCommercial, NoDerivatives)
