PostgreSQL로 작성된 서적인 OREILLY의 ‘SQL로 시작하는 데이터 분석’을 MySQL로 변환하며 스터디한 내용.
시계열 분석 파트의 데이터 테이블 정리
쿼리가 실행될 테이블 형태는 아래와 같다.
A. Table Scheme
sales_month | naics_code | kind_f_business | reason_for_null | sales |
---|
1992-01-01 | 441 | Motor vehicle and parts dealers | | 29811.00 |
1992-01-01 | 4411 | Automobile dealers | | 25800.00 |
1992-01-01 | 4411, 4412 | Automobile and other motor vehicle dealers | | 26788.00 |
1992-01-01 | 44111 | New car dealers | | 24056.00 |
retail_sales 테이블
- 판매 일자, 소매 업종, 판매액이 적재 되어있다.
date | date_key | day_of_month | day_of_year | day_of_week | day_name | day_short_name | week_number | week_of_month | week | month_number | month_name | month+short_name | first_day_of_month | last_day_of_month | quarter_number | quarter_name | first_day_of_quarter | last_day_of_quarter | year | decade | centurays |
---|
1770-01-01 | 17700101 | 1 | 1 | 1 | Monday | Mon | 1 | 1 | 1770-01-01 | 1 | January | Jan | 1770-01-01 | 1770-01-31 | 1 | Q1 | 1770-01-01 | 1770-03-31 | 1770 | 1770 | 18 |
1770-01-02 | 17700102 | 2 | 2 | 2 | Tuesday | Tue | 1 | 1 | 1770-01-01 | 1 | January | Jan | 1770-01-01 | 1770-01-31 | 1 | Q1 | 1770-01-01 | 1770-03-31 | 1770 | 1770 | 18 |
date_dim 테이블
- 분석에 활용하고자 하는 테이블에 특정 일자의 레코드가 삭제되어 있다면 집계 시 혼란이 발생할 수 있다.
- 이러한 사고를 미연에 방지하고자 날짜 룩업 테이블을 정의하고, 필요 시 join하여 결측으로 발생할 수 있는 집계 오류를 방지한다.
B. 날짜간 계산과 인터벌 계산을 헷갈리지 말자.
날짜 간 계산과 인터벌 계산 두 타입을 헷갈리지 않도록 주의하자. 날짜 간 계산은 문자 그대로 산술적인 날짜간의 차이를, 인터벌은 관념적인(하이 레벨에서의) 날짜의 개념에서의 차이를 계산한다.
1
| select date_add(date('2020-01-01'), interval 1 week)
|
시계열 분석 - 비율과 차이
A. 차이 계산하기
A.1. 남여 매출 차이 Subquery ver
subquery를 활용한 계산
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
| SELECT
year(sales_month) AS years,
a.womens_sales,
a.mens_sales,
a.womens_sales - a.mens_sales AS diff_sales
FROM
retail_sales,
(
SELECT
year(sales_month) AS years,
sum(
CASE WHEN kind_of_business = 'Women''s clothing stores' THEN
sales
END) AS womens_sales,
sum(
CASE WHEN kind_of_business = 'Men''s clothing stores' THEN
sales
END) AS mens_sales
FROM
retail_sales
WHERE
kind_of_business in('Women''s clothing stores', 'Men''s clothing stores')
GROUP BY
1) a
|
B. 전체 대비 비율을 계산하기
B.1. subquery & self join을 활용한 비율 계산
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
| SELECT
sales_month,
kind_of_business,
sales,
sales / aa.total_sales * 100 AS pct_total_sales
FROM (
SELECT
a.sales_month,
a.kind_of_business,
a.sales,
sum(b.sales) AS total_sales
FROM
retail_sales a
JOIN retail_sales b ON a.sales_month = b.sales_month
AND b.kind_of_business in('Men''s clothing stores', 'Women''s clothing stores')
WHERE
a.kind_of_business in('Men''s clothing stores', 'Women''s clothing stores')
GROUP BY
1,
2,
3) aa
ORDER BY
1,
2
|
B.2. Partition by, sum 윈도우함수 활용한 비율 계산
partition by 절로 함수가 계산하는 테이블의 섹션을 구분
1
2
3
4
5
6
7
8
9
10
11
12
13
| SELECT
sales_month,
kind_of_business,
sales,
sum(sales) OVER (PARTITION BY sales_month) AS total_sales,
round(sales * 100 / sum(sales) OVER (PARTITION BY sales_month), 2) AS pct
FROM
retail_sales
WHERE
kind_of_business in('Men''s clothing stores', 'Women''s clothing stores')
ORDER BY
1,
2
|
C. 업종별, 연매출 대비 월간 매출
C.1. 서브쿼리 활용
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
| SELECT
rs.sales_month,
rs.kind_of_business,
SUM(rs.sales) AS sales_m,
aa.sales_y,
round(SUM(rs.sales) / aa.sales_y, 2) AS ratio_m_y
FROM
retail_sales rs
JOIN (
SELECT
YEAR(sales_month) AS sales_year,
SUM(sales) AS sales_y,
kind_of_business
FROM
retail_sales
WHERE
kind_of_business IN('Men''s clothing stores', 'Women''s clothing stores')
GROUP BY
sales_year,
kind_of_business) aa ON YEAR(rs.sales_month) = aa.sales_year
AND rs.kind_of_business = aa.kind_of_business
WHERE
rs.kind_of_business IN('Men''s clothing stores', 'Women''s clothing stores')
GROUP BY
rs.sales_month,
rs.kind_of_business,
aa.sales_y
ORDER BY
1
|
C.2. 윈도우 함수 활용
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
| SELECT
sales_month,
kind_of_business,
sales,
sum(sales) OVER (PARTITION BY year(sales_month),
kind_of_business) AS yearly_sales,
sales * 100 / sum(sales) OVER (PARTITION BY year(sales_month),
kind_of_business) AS pct_yearly
FROM
retail_sales
WHERE
kind_of_business in('Men''s clothing stores', 'Women''s clothing stores')
ORDER BY
1,
2;
|
D. 인덱싱으로 시계열 데이터 변화 이해하기
데이터 인덱싱은 시계열에서 베이스 구간(시작 지점)을 기준으로 데이터의 변화량을 이해하는 방법. 집계함수와 윈도우 함수를 조합하거나 self-join을 활용함. 윈도우 함수가 매우 유연한 사용성을 보이기에, 이에 익숙해지는게 좋다.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
| SELECT
sales_year,
sales,
first_value(sales) OVER (ORDER BY sales_year) AS index_sales
FROM (
SELECT
year(sales_month) AS sales_year,
sum(sales) AS sales
FROM
retail_sales
WHERE
kind_of_business in('Women''s clothing stores')
GROUP BY
1) a
|
D.1. 윈도우 함수를 활용한 인덱싱
1
2
3
4
5
6
7
8
9
10
11
12
13
14
| SELECT
sales_year,
sales,
(sales / FIRST_VALUE(sales) OVER (ORDER BY sales_year) - 1) * 100 AS index_sales
FROM (
SELECT
year(sales_month) AS sales_year,
sum(sales) AS sales
FROM
retail_sales
WHERE
kind_of_business in('Women''s clothing stores')
GROUP BY
1) a
|
시계열 분석 - 시간 윈도우 롤링
집계와는 또 다른 노이즈 제거 방법. 여러 구간을 설정해 트렌드를 분석하는 시간 윈도우 롤링 (=이동 계산) 시계열 롤링 계산의 요소는 3가지. 1) 윈도우 사이즈, 2) 집계 함수, 3) 윈도우 내 데이터의 분할 혹은 그룹화 셀프조인과 윈도우 함수를 활용
A. 시간 윈도우 롤링 누적 합
1
2
3
4
5
6
7
8
9
10
11
12
13
14
| SELECT
a.sales_month,
a.sales,
b.sales_month AS rolling_sales_month,
b.sales AS rolling_sales
FROM
retail_sales a
JOIN retail_sales b ON a.kind_of_business = b.kind_of_business
AND b.sales_month BETWEEN date_sub(a.sales_month, interval 11 MONTH)
AND a.sales_month
AND b.kind_of_business = 'Women''s clothing stores'
WHERE
a.kind_of_business = 'Women''s clothing stores'
AND a.sales_month = date('2019-12-01')
|
B. 이동평균선
B.1. self join
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
| SELECT
a.sales_month,
a.sales,
avg(b.sales) AS ma_1y,
count(b.sales) AS records
FROM
retail_sales a
JOIN retail_sales b ON a.kind_of_business = b.kind_of_business
AND b.sales_month BETWEEN date_sub(a.sales_month, interval 11 month)
AND a.sales_month
AND b.kind_of_business = 'Women''s clothing stores'
WHERE
a.kind_of_business = 'Women''s clothing stores'
AND a.sales_month >= '1993-01-01'
GROUP BY
1,
2
ORDER BY
1
|
B.2. Frame 문법
frame절은 윈도우 함수 옵션을 활용해, 각 윈도우에서 어떤 레코드를 포함할지 지정하는 문법
- frame_end는 아래 중 하나
- UNBOUNDED PRECEDING - 현재 행 이전의 모든 행
- offset PRECEDING - 현재 행 이전의 n개 행
- CURRENT ROW - 현재 행
- offset FOLLOWING - 현재 행 이후의 n개 행
- UNBOUNDED FOLOWING - 현재 행 이후의 모든 행
1
| {range | row | group} BETWEEN frame_start AND frame_end
|
1
2
3
4
5
6
7
8
| SELECT
sales_month,
avg(sales) OVER (ORDER BY sales_month ROWS BETWEEN 11 PRECEDING AND CURRENT ROW) AS moving_avg,
COUNT(sales) OVER (ORDER BY sales_month ROWS BETWEEN 11 PRECEDING AND CURRENT ROW) AS records
FROM
retail_sales
WHERE
kind_of_business = 'Women''s clothing stores'
|
C. 희소 데이터의 시간 윈도우 롤링
날짜 룩업 테이블 참조하여 쿼리
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
| SELECT d.date,
avg(s.sales) AS moving_average,
count(s.sales) AS records,
max(
CASE WHEN d.date = s.sales_month THEN
s.sales
END) AS sales_in_month
FROM
date_dim d
JOIN (
SELECT
sales_month,
sales
FROM
retail_sales
WHERE
kind_of_business = 'Women''s clothing stores'
AND month(sales_month)
IN(1, 7)) s ON s.sales_month BETWEEN DATE_SUB(d.date, INTERVAL 11 MONTH)
AND d.date
WHERE
d.date = d.first_day_of_month
AND d.date BETWEEN '1993-01-01'
AND '2020-12-01'
GROUP BY
1
ORDER BY
1
|
D. 누적값값 계산
D.1. YTD 윈도우 롤링
1
2
3
4
5
6
7
8
9
10
11
12
| SELECT
sales_month,
sales,
sum(sales) OVER (PARTITION BY year(sales_month)
ORDER BY
sales_month) AS sales_ytd
FROM
retail_sales
WHERE
kind_of_business = 'Women''s clothing stores'
ORDER BY
1
|
D.2. YTD 서브쿼리
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
| SELECT
a.sales_month,
a.sales,
sum(b.sales) AS YTD_sales
FROM
retail_sales a
JOIN retail_sales b ON year(a.sales_month) = year(b.sales_month)
AND b.sales_month <= a.sales_month
AND b.kind_of_business = 'Women''s clothing stores'
WHERE
a.kind_of_business = 'Women''s clothing stores'
GROUP BY
1,
2
ORDER BY
1
|
E. 계절성 분석
일정한 간격을 두고 규칙적으로 반복되는 패턴을 의미. 연단위, 혹은 분단위로 나타나기도 함.
E.1. lag 함수를 사용한 이전값 대비 비율 변화
- lag 함수는 이전행의 값을 반환한다. (lead = lag와 반대. 다음 행을 반환)
- offset 옵션 : 몇 행 이전의 값을 반환할 것인가?
- default 옵션 : 이전 행이 없는 경우에 반환할 값
- PARTITION BY 옵션 : 생략 시, 전체 데이터셋에 대한 계산 수행
- ORDER BY 옵션 : 생략 시, 현재 db 정렬 순서로 계산 수행
구간단위 계산은 변화를 이해하는데 활용됨. 계절성 분석하기에는 적합하지 않음.
1
2
3
4
5
6
7
8
9
10
11
12
| SELECT
kind_of_business,
sales_month,
sales,
lag(sales) OVER (
PARTITION BY kind_of_business ORDER BY sales_month
) AS t,
(sales / (lag(sales) OVER (PARTITION BY kind_of_business ORDER BY sales_month)) - 1) * 100 AS pct_growth_then_prev
FROM
retail_sales
WHERE
kind_of_business LIKE 'Book%'
|
E.2. 구간 비교 : 작년 동월과의 비교
전년대비 성장세를 비교.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
| SELECT
sales_month,
sales,
lag(sales_month) OVER (PARTITION BY month(sales_month)
ORDER BY
sales_month) AS prev_month,
lag(sales) OVER (PARTITION BY month(sales_month)
ORDER BY
sales_month) AS prev_sales,
100 * (sales / (lag(sales) OVER (PARTITION BY month(sales_month)
ORDER BY
sales_month)) - 1) AS prev_pct
FROM
retail_sales
WHERE
kind_of_business LIKE 'Book%'
|
E.3. 구간 비교 : 동월 비교. 연도별 컬럼 피벗
각 연도별 시각적 비교에 용이한 피벗
1
2
3
4
5
6
7
8
9
10
11
12
| SELECT
month(sales_month) as month_num,
monthname(sales_month) as month_char,
max(case when year(sales_month) = 1992 then sales end) as '1992 sales',
max(case when year(sales_month) = 1993 then sales end) as '1993 sales',
max(case when year(sales_month) = 1994 then sales end) as '1994 sales'
FROM
retail_sales
WHERE
kind_of_business like 'book%'
and sales_month between date('1992-01-01') and date('1994-12-01')
group by 1,2
|
E.4. 다중구간 비교 서브쿼리
계산에 하나의 행이라도 null 이면 null로 처리됨을 유의.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
| SELECT
sales_month,sales,
sales/((a.prev_sales_1 + prev_sales_2 + prev_sales_3)/3) * 100 as pct
FROM
(
SELECT
sales_month,
sales,
lag(sales,1) over (partition by month(sales_month)
order by sales_month
) as prev_sales_1,
lag(sales,2) over (partition by month(sales_month)
order by sales_month
) as prev_sales_2,
lag(sales,3) over (partition by month(sales_month)
order by sales_month
) as prev_sales_3
FROM
retail_sales
WHERE
kind_of_business like 'Book%'
) as a
|
E.5. 다중구간 비교 프레임절
서브쿼리와 결과는 같으나, 1개라도 null이 아니라면 계산 수행됨을 유의
1
2
3
4
5
6
7
8
9
10
11
| SELECT
sales_month,
sales,
sales / avg(sales) OVER (PARTITION BY month(sales_month)
ORDER BY
sales_month ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING) AS pct_pre
FROM
retail_sales
WHERE
kind_of_business LIKE 'Book%'
|