[Optimization] Model Compression, AutoML, Pruning, Knowledge Distillation, Tensor Decomposition, Quantization, Compiling

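This post briefly introduces various techniques for making deep learning models lightweight.

Among these techniques the focus is on AutoML; for details on AutoML, see the separate AutoML post.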

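1. Why compress models

On device AI

IoT devices such as smartphones and watches

Limitations : battery, RAM, storage, computing power

AI on cloud (or server)

Constraints on battery, storage, and computing power are looser than for on device AI,

but the goal is still lower latency (time taken by a single request) and higher throughput (requests handled per unit time) from the same resources.

Computation as a key component of AI progress

The amount of computation used by AI models has grown sharply over time: between 2012 and 2018 it increased roughly 300,000-fold.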

※ How computation is measured

1. Counting operations (FLOPs)

2. GPU time
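As a rough illustration of the operation-counting approach, the sketch below counts multiply-accumulate operations (MACs) for a convolution layer and a fully connected layer. It assumes the common convention that one MAC equals two FLOPs and ignores bias, activation, and padding details; the function names are made up for this example.

```python
def conv2d_macs(c_in, c_out, kernel_size, h_out, w_out):
    """MACs of a 2D convolution: every output element needs
    c_in * kernel_size * kernel_size multiply-accumulates."""
    return c_in * kernel_size * kernel_size * c_out * h_out * w_out


def linear_macs(in_features, out_features):
    """MACs of a fully connected layer: one per weight."""
    return in_features * out_features


def macs_to_flops(macs):
    """Assumed convention: 1 MAC = 1 multiply + 1 add = 2 FLOPs."""
    return 2 * macs


# Example layer sizes (chosen only for illustration):
# a 3x3 convolution from 3 to 64 channels with a 224x224 output map,
# and a 512 -> 1000 fully connected layer.
conv_macs = conv2d_macs(c_in=3, c_out=64, kernel_size=3, h_out=224, w_out=224)
fc_macs = linear_macs(in_features=512, out_features=1000)

print("conv MACs :", conv_macs)
print("conv FLOPs:", macs_to_flops(conv_macs))
print("fc FLOPs  :", macs_to_flops(fc_macs))
```

Counts like these are independent of hardware, which is why they are often reported alongside measured GPU time rather than instead of it.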

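Model compression is carried out to meet the constraints and requirements that arise from on device AI, AI on cloud (or server), and computation as a key component of AI progress.

Model compression
Separate from model research, this is a step a model must go through before it is deployed in industry. It is performed by weighing the trade-offs among requirements such as hardware type, latency limits, required throughput, and model performance.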

2. Areas of model compression

Network architecture perspective
- Efficient Architecture Design (+AutoML; Neural Architecture Search (NAS))
- Network Pruning
- Knowledge Distillation
- Matrix/Tensor Decomposition
Hardware perspective
- Network Quantization
- Network Compiling

Let's go through the network architecture perspective one item at a time.

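1) Efficient Architecture Design

The block modules published each year differ in character, because each one focuses on a particular aspect among performance, parameter count, and operation count.

AutoML; Neural Architecture Search (NAS)

AutoML uses a controller to search automatically for better architectures, and it can find modules whose performance exceeds what human intuition would design.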
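The search loop can be sketched in a few lines. The example below is only a toy: the "controller" is plain random sampling rather than a learned controller, the search space (depth and width of an MLP) is hypothetical, and `toy_evaluate` is a made-up stand-in for "train the candidate and measure validation accuracy".

```python
import random

# Hypothetical search space: depth and width of a plain MLP.
SEARCH_SPACE = {"depth": [1, 2, 3, 4], "width": [16, 32, 64, 128]}


def count_params(arch, in_dim=32, out_dim=10):
    """Weights and biases of an MLP with the sampled depth and width."""
    dims = [in_dim] + [arch["width"]] * arch["depth"] + [out_dim]
    return sum(a * b + b for a, b in zip(dims[:-1], dims[1:]))


def random_search(evaluate, n_trials=50, max_params=20_000, seed=0):
    """Sample architectures, skip those over the parameter budget,
    score the rest, and keep the best one seen so far."""
    rng = random.Random(seed)
    best_arch, best_score = None, float("-inf")
    for _ in range(n_trials):
        arch = {name: rng.choice(choices) for name, choices in SEARCH_SPACE.items()}
        if count_params(arch) > max_params:  # lightweight constraint
            continue
        score = evaluate(arch)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score


def toy_evaluate(arch):
    """Made-up score; a real run would train and validate the model here."""
    return arch["depth"] * arch["width"]


best_arch, best_score = random_search(toy_evaluate)
print(best_arch, best_score, count_params(best_arch))
```

Replacing the random sampler with a model that learns from previous scores is what turns this loop into the controller-driven search described above.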

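2) Network Pruning

Parameters of low importance are removed, pruning the network the way branches are cut from a tree.

The research problem is how to define and identify important parameters well (e.g. a larger L2 norm means higher importance, a larger loss gradient means higher importance).
There are two kinds: structured and unstructured pruning.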

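Structured pruning

Parameters are pruned in groups. A group can be a channel, a filter, a layer, and so on.

Because removing whole groups leaves smaller but still dense tensors, it suits software or hardware optimized for dense computation.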
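A minimal sketch of structured pruning, assuming the L2 norm of each filter is used as its importance score (one of the criteria mentioned earlier). Filters are represented as flat Python lists to keep the example dependency-free; `prune_filters` and `keep_ratio` are names invented for this illustration.

```python
import math


def filter_l2_norms(weight):
    """weight: list of filters, each a flat list of that filter's weights."""
    return [math.sqrt(sum(w * w for w in filt)) for filt in weight]


def prune_filters(weight, keep_ratio):
    """Keep the filters with the largest L2 norm and drop the others entirely.
    Returns the smaller weight list and the indices of the kept filters."""
    n_keep = max(1, round(len(weight) * keep_ratio))
    norms = filter_l2_norms(weight)
    ranked = sorted(range(len(weight)), key=lambda i: norms[i], reverse=True)
    keep = sorted(ranked[:n_keep])  # restore the original filter order
    return [weight[i] for i in keep], keep


# Four filters with three weights each (values chosen for illustration).
weight = [
    [0.1, -0.1, 0.0],
    [1.0, 2.0, -1.0],
    [0.01, 0.02, 0.0],
    [-0.5, 0.5, 0.5],
]
pruned, kept = prune_filters(weight, keep_ratio=0.5)
print("kept filter indices:", kept)
print("pruned weight:", pruned)
```

The result is simply a layer with fewer filters, so no sparse kernels are needed to benefit from it. In a real network the matching input channels of the next layer would have to be removed as well.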

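Unstructured pruning

Parameters are pruned independently. That is, each individual value in a weight matrix is examined and then either pruned or kept.

The more pruning is applied, the sparser the network's internal matrices become.

Unlike structured pruning, it suits software or hardware optimized for sparse computation.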
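A minimal sketch of unstructured pruning, assuming absolute value (magnitude) as the importance score. The function name and the `sparsity` parameter are made up for this example, and the matrix is a plain list of lists.

```python
def magnitude_prune(matrix, sparsity):
    """Zero out roughly the `sparsity` fraction of entries with the smallest
    absolute value. Entries tied with the threshold are pruned too."""
    flat = sorted(abs(v) for row in matrix for v in row)
    n_prune = int(len(flat) * sparsity)
    if n_prune == 0:
        return [row[:] for row in matrix]
    threshold = flat[n_prune - 1]
    return [[0.0 if abs(v) <= threshold else v for v in row] for row in matrix]


def measured_sparsity(matrix):
    """Fraction of entries that are exactly zero."""
    values = [v for row in matrix for v in row]
    return sum(1 for v in values if v == 0.0) / len(values)


W = [
    [0.9, -0.05, 0.3],
    [0.01, -0.7, 0.2],
]
W_pruned = magnitude_prune(W, sparsity=0.5)
print(W_pruned)
print("sparsity:", measured_sparsity(W_pruned))
```

The matrix keeps its original shape and only gains zeros, which is why a speedup depends on sparse-aware software or hardware.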

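3) Knowledge Distillation

An already trained large network (teacher) is used to assist the training of a small network (student).

The student model's loss consists of two terms.

Cross-entropy between the student network's output and the ground truth label
KLD (Kullback-Leibler divergence) loss between the inference results of the teacher network and the student network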

$$ \mathcal{L}_{KD} = (1-\alpha)\, CE\left(\hat{y}^{S}, y\right) + \alpha T^{2}\, KL\left(\sigma\left(\hat{y}^{T} / T\right), \sigma\left(\hat{y}^{S} / T\right)\right) $$

T : temperature; smooths (softens) the output of the large teacher network
alpha : balances the weights of the two loss terms
σ : softmax function
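The loss above can be written out for a single sample as follows. This is a sketch in plain Python, assuming that the predictions are raw logits, that the cross-entropy term uses the student's softmax at temperature 1, and that the KL term is taken from teacher to student; the default values of `alpha` and `temperature` are arbitrary.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, shifted by the max for stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Cross-entropy against a hard label given as a class index."""
    return -math.log(probs[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kd_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=4.0):
    """(1 - alpha) * CE(student, label) + alpha * T^2 * KL(teacher_T || student_T)."""
    ce = cross_entropy(softmax(student_logits), label)
    kl = kl_divergence(
        softmax(teacher_logits, temperature),
        softmax(student_logits, temperature),
    )
    return (1 - alpha) * ce + alpha * temperature ** 2 * kl


student = [1.0, 0.5, 2.0]
teacher = [0.5, 0.2, 4.0]
print(kd_loss(student, teacher, label=2))
```

With `alpha=0` the loss reduces to ordinary cross-entropy, and with `alpha=1` it vanishes when the student's logits match the teacher's.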

4) Matrix/Tensor Decomposition

A single tensor is expressed as a combination (sums and products) of smaller tensors.

CP decomposition : approximates a tensor as a sum of rank-1 tensors, each of which is the outer product of vectors
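To make the "sum of outer products" concrete, the sketch below rebuilds a 3-way tensor from given CP factors and compares parameter counts. It only shows the reconstruction side; finding the factors for a trained weight tensor (for example with alternating least squares) is not covered. The factor values and the function names are made up for illustration.

```python
def cp_reconstruct(A, B, C):
    """Rebuild a 3-way tensor from CP factors.
    A, B, C each hold R vectors; component r contributes the
    outer product of A[r], B[r], and C[r]."""
    I, J, K = len(A[0]), len(B[0]), len(C[0])
    X = [[[0.0] * K for _ in range(J)] for _ in range(I)]
    for a, b, c in zip(A, B, C):
        for i in range(I):
            for j in range(J):
                for k in range(K):
                    X[i][j][k] += a[i] * b[j] * c[k]
    return X


def cp_param_count(shape, rank):
    """Numbers stored by a rank-R CP representation: R vectors per mode."""
    return rank * sum(shape)


# Rank-2 factors of a 2 x 3 x 2 tensor.
A = [[1.0, 2.0], [0.0, 1.0]]
B = [[1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
C = [[1.0, 1.0], [2.0, 0.0]]
X = cp_reconstruct(A, B, C)
print(X)

# Storage for a 64 x 64 x 9 tensor: full versus rank-8 CP factors.
print(64 * 64 * 9, "vs", cp_param_count((64, 64, 9), rank=8))
```

The saving comes from storing only the factor vectors, at the cost of an approximation error that grows as the rank is reduced.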

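Now let's go through the hardware perspective one item at a time.

1) Network Quantization

Network computation is normally expressed in the float32 data type; quantization converts it to a smaller data type (e.g. float16, int8).

In the illustrated example (figure not reproduced here), float32 values are quantized to int8 for computation and then dequantized so that the result is again float32. This process introduces quantization error, but networks generally remain robust to it, which is why the technique is so widely used.

Size : decreases
Performance (Acc) : generally drops slightly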
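The quantize and dequantize round trip can be sketched as below. This assumes a simple symmetric per-tensor scheme with a single scale factor, which is only one of several schemes used in practice; the function names and sample weights are made up for illustration.

```python
def quantize_int8(values):
    """Symmetric quantization: map floats to integers in [-127, 127]
    using one scale factor for the whole list."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map the integers back to approximate float values."""
    return [qi * scale for qi in q]


weights = [0.02, -1.27, 0.635, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Quantization error: difference between original and restored values.
errors = [abs(w - r) for w, r in zip(weights, restored)]
print("int8 values:", q)
print("max error  :", max(errors), "(at most scale / 2 =", scale / 2, ")")
```

Each value is stored in 1 byte instead of 4, and the rounding error per value is bounded by half the scale, which is the quantization error referred to above.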

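[Figure: int8 quantization example (CPU inference; speed measured on a Pixel 2 smartphone)]

2) Network Compiling

※ The techniques above can be applied one after another as a single compression pipeline.

[Figure: order in which the compression steps are applied]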

