Post

시계열 분석 - SQL로 시작하는 데이터 분석 MySQL

PostgreSQL로 작성된 서적인 OREILLY의 ‘SQL로 시작하는 데이터 분석’을 MySQL로 변환하며 스터디한 내용.

시계열 분석 파트의 데이터 테이블 정리

쿼리가 실행될 테이블 형태는 아래와 같다.

A. Table Scheme

sales_monthnaics_codekind_f_businessreason_for_nullsales
1992-01-01441Motor vehicle and parts dealers 29811.00
1992-01-014411Automobile dealers 25800.00
1992-01-014411, 4412Automobile and other motor vehicle dealers 26788.00
1992-01-0144111New car dealers 24056.00

retail_sales 테이블

  • 판매 일자, 소매 업종, 판매액이 적재 되어있다.
datedate_keyday_of_monthday_of_yearday_of_weekday_nameday_short_nameweek_numberweek_of_monthweekmonth_numbermonth_namemonth+short_namefirst_day_of_monthlast_day_of_monthquarter_numberquarter_namefirst_day_of_quarterlast_day_of_quarteryeardecadecenturays
1770-01-0117700101111MondayMon111770-01-011JanuaryJan1770-01-011770-01-311Q11770-01-011770-03-311770177018
1770-01-0217700102222TuesdayTue111770-01-011JanuaryJan1770-01-011770-01-311Q11770-01-011770-03-311770177018

date_dim 테이블

  • 분석에 활용하고자 하는 테이블에 특정 일자의 레코드가 삭제되어 있다면 집계 시 혼란이 발생할 수 있다.
  • 이러한 사고를 미연에 방지하고자 날짜 룩업 테이블을 정의하고, 필요 시 join하여 결측으로 발생할 수 있는 집계 오류를 방지한다.

B. 날짜간 계산과 인터벌 계산을 헷갈리지 말자.

날짜 간 계산과 인터벌 계산 두 타입을 헷갈리지 않도록 주의하자. 날짜 간 계산은 문자 그대로 산술적인 날짜간의 차이를, 인터벌은 관념적인(하이 레벨에서의) 날짜의 개념에서의 차이를 계산한다.

1
select date_add(date('2020-01-01'), interval 1 week)

시계열 분석 - 비율과 차이

A. 차이 계산하기

A.1. 남여 매출 차이 Subquery ver

subquery를 활용한 계산

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
SELECT
	year(sales_month) AS years,
	a.womens_sales,
	a.mens_sales,
	a.womens_sales - a.mens_sales AS diff_sales
FROM
	retail_sales,
	(
		SELECT
			year(sales_month) AS years,
			sum(
				CASE WHEN kind_of_business = 'Women''s clothing stores' THEN
					sales
				END) AS womens_sales,
			sum(
				CASE WHEN kind_of_business = 'Men''s clothing stores' THEN
					sales
				END) AS mens_sales
		FROM
			retail_sales
		WHERE
			kind_of_business in('Women''s clothing stores', 'Men''s clothing stores')
		GROUP BY
			1) a

B. 전체 대비 비율을 계산하기

B.1. subquery & self join을 활용한 비율 계산

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
SELECT
	sales_month,
	kind_of_business,
	sales,
	sales / aa.total_sales * 100 AS pct_total_sales
FROM (
	SELECT
		a.sales_month,
		a.kind_of_business,
		a.sales,
		sum(b.sales) AS total_sales
	FROM
		retail_sales a
		JOIN retail_sales b ON a.sales_month = b.sales_month
			AND b.kind_of_business in('Men''s clothing stores', 'Women''s clothing stores')
	WHERE
		a.kind_of_business in('Men''s clothing stores', 'Women''s clothing stores')
	GROUP BY
		1,
		2,
		3) aa
ORDER BY
	1,
	2

B.2. Partition by, sum 윈도우함수 활용한 비율 계산

partition by 절로 함수가 계산하는 테이블의 섹션을 구분

1
2
3
4
5
6
7
8
9
10
11
12
13
SELECT
	sales_month,
	kind_of_business,
	sales,
	sum(sales) OVER (PARTITION BY sales_month) AS total_sales,
	round(sales * 100 / sum(sales) OVER (PARTITION BY sales_month), 2) AS pct
FROM
	retail_sales
WHERE
	kind_of_business in('Men''s clothing stores', 'Women''s clothing stores')
ORDER BY
	1,
	2

C. 업종별, 연매출 대비 월간 매출

C.1. 서브쿼리 활용

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
SELECT
	rs.sales_month,
	rs.kind_of_business,
	SUM(rs.sales) AS sales_m,
	aa.sales_y,
	round(SUM(rs.sales) / aa.sales_y, 2) AS ratio_m_y
FROM
	retail_sales rs
	JOIN (
		SELECT
			YEAR(sales_month) AS sales_year,
			SUM(sales) AS sales_y,
			kind_of_business
		FROM
			retail_sales
		WHERE
			kind_of_business IN('Men''s clothing stores', 'Women''s clothing stores')
		GROUP BY
			sales_year,
			kind_of_business) aa ON YEAR(rs.sales_month) = aa.sales_year
	AND rs.kind_of_business = aa.kind_of_business
WHERE
	rs.kind_of_business IN('Men''s clothing stores', 'Women''s clothing stores')
GROUP BY
	rs.sales_month,
	rs.kind_of_business,
	aa.sales_y
ORDER BY
	1

C.2. 윈도우 함수 활용

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
SELECT
	sales_month,
	kind_of_business,
	sales,
	sum(sales) OVER (PARTITION BY year(sales_month),
		kind_of_business) AS yearly_sales,
	sales * 100 / sum(sales) OVER (PARTITION BY year(sales_month),
		kind_of_business) AS pct_yearly
FROM
	retail_sales
WHERE
	kind_of_business in('Men''s clothing stores', 'Women''s clothing stores')
ORDER BY
	1,
	2;

D. 인덱싱으로 시계열 데이터 변화 이해하기

데이터 인덱싱은 시계열에서 베이스 구간(시작 지점)을 기준으로 데이터의 변화량을 이해하는 방법. 집계함수와 윈도우 함수를 조합하거나 self-join을 활용함. 윈도우 함수가 매우 유연한 사용성을 보이기에, 이에 익숙해지는게 좋다.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
SELECT
	sales_year,
	sales,
	first_value(sales) OVER (ORDER BY sales_year) AS index_sales
FROM (
	SELECT
		year(sales_month) AS sales_year,
		sum(sales) AS sales
	FROM
		retail_sales
	WHERE
		kind_of_business in('Women''s clothing stores')
	GROUP BY
		1) a

D.1. 윈도우 함수를 활용한 인덱싱

1
2
3
4
5
6
7
8
9
10
11
12
13
14
SELECT
	sales_year,
	sales,
	(sales / FIRST_VALUE(sales) OVER (ORDER BY sales_year) - 1) * 100 AS index_sales
FROM (
	SELECT
		year(sales_month) AS sales_year,
		sum(sales) AS sales
	FROM
		retail_sales
	WHERE
		kind_of_business in('Women''s clothing stores')
	GROUP BY
		1) a

시계열 분석 - 시간 윈도우 롤링

집계와는 또 다른 노이즈 제거 방법. 여러 구간을 설정해 트렌드를 분석하는 시간 윈도우 롤링 (=이동 계산) 시계열 롤링 계산의 요소는 3가지. 1) 윈도우 사이즈, 2) 집계 함수, 3) 윈도우 내 데이터의 분할 혹은 그룹화 셀프조인과 윈도우 함수를 활용

A. 시간 윈도우 롤링 누적 합

1
2
3
4
5
6
7
8
9
10
11
12
13
14
SELECT
	a.sales_month,
	a.sales,
	b.sales_month AS rolling_sales_month,
	b.sales AS rolling_sales
FROM
	retail_sales a
	JOIN retail_sales b ON a.kind_of_business = b.kind_of_business
		AND b.sales_month BETWEEN date_sub(a.sales_month, interval 11 MONTH)
		AND a.sales_month
		AND b.kind_of_business = 'Women''s clothing stores'
WHERE
	a.kind_of_business = 'Women''s clothing stores'
	AND a.sales_month = date('2019-12-01')

B. 이동평균선

B.1. self join

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
SELECT
	a.sales_month,
	a.sales,
	avg(b.sales) AS ma_1y,
	count(b.sales) AS records
FROM
	retail_sales a
	JOIN retail_sales b ON a.kind_of_business = b.kind_of_business
		AND b.sales_month BETWEEN date_sub(a.sales_month, interval 11 month)
		AND a.sales_month
		AND b.kind_of_business = 'Women''s clothing stores'
WHERE
	a.kind_of_business = 'Women''s clothing stores'
	AND a.sales_month >= '1993-01-01'
GROUP BY
	1,
	2
ORDER BY
	1

B.2. Frame 문법

frame절은 윈도우 함수 옵션을 활용해, 각 윈도우에서 어떤 레코드를 포함할지 지정하는 문법

  • frame_end는 아래 중 하나
    • UNBOUNDED PRECEDING - 현재 행 이전의 모든 행
    • offset PRECEDING - 현재 행 이전의 n개 행
    • CURRENT ROW - 현재 행
    • offset FOLLOWING - 현재 행 이후의 n개 행
    • UNBOUNDED FOLOWING - 현재 행 이후의 모든 행
1
{range | row | group} BETWEEN frame_start AND frame_end
1
2
3
4
5
6
7
8
SELECT
	sales_month,
	avg(sales) OVER (ORDER BY sales_month ROWS BETWEEN 11 PRECEDING AND CURRENT ROW) AS moving_avg,
	COUNT(sales) OVER (ORDER BY sales_month ROWS BETWEEN 11 PRECEDING AND CURRENT ROW) AS records
FROM
	retail_sales
WHERE
	kind_of_business = 'Women''s clothing stores'

C. 희소 데이터의 시간 윈도우 롤링

날짜 룩업 테이블 참조하여 쿼리

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
SELECT d.date,
avg(s.sales) AS moving_average,
count(s.sales) AS records,
max(
	CASE WHEN d.date = s.sales_month THEN
		s.sales
	END) AS sales_in_month
FROM
	date_dim d
	JOIN (
		SELECT
			sales_month,
			sales
		FROM
			retail_sales
		WHERE
			kind_of_business = 'Women''s clothing stores'
			AND month(sales_month)
			IN(1, 7)) s ON s.sales_month BETWEEN DATE_SUB(d.date, INTERVAL 11 MONTH)
	AND d.date
WHERE
	d.date = d.first_day_of_month
	AND d.date BETWEEN '1993-01-01'
	AND '2020-12-01'
GROUP BY
	1
ORDER BY
	1

D. 누적값값 계산

D.1. YTD 윈도우 롤링

1
2
3
4
5
6
7
8
9
10
11
12
SELECT
	sales_month,
	sales,
	sum(sales) OVER (PARTITION BY year(sales_month)
		ORDER BY
			sales_month) AS sales_ytd
	FROM
		retail_sales
	WHERE
		kind_of_business = 'Women''s clothing stores'
	ORDER BY
		1

D.2. YTD 서브쿼리

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
SELECT
	a.sales_month,
	a.sales,
	sum(b.sales) AS YTD_sales
FROM
	retail_sales a
	JOIN retail_sales b ON year(a.sales_month) = year(b.sales_month)
		AND b.sales_month <= a.sales_month
		AND b.kind_of_business = 'Women''s clothing stores'
WHERE
	a.kind_of_business = 'Women''s clothing stores'
GROUP BY
	1,
	2
ORDER BY
	1

E. 계절성 분석

일정한 간격을 두고 규칙적으로 반복되는 패턴을 의미. 연단위, 혹은 분단위로 나타나기도 함.

E.1. lag 함수를 사용한 이전값 대비 비율 변화

  • lag 함수는 이전행의 값을 반환한다. (lead = lag와 반대. 다음 행을 반환)
  • offset 옵션 : 몇 행 이전의 값을 반환할 것인가?
  • default 옵션 : 이전 행이 없는 경우에 반환할 값
  • PARTITION BY 옵션 : 생략 시, 전체 데이터셋에 대한 계산 수행
  • ORDER BY 옵션 : 생략 시, 현재 db 정렬 순서로 계산 수행

구간단위 계산은 변화를 이해하는데 활용됨. 계절성 분석하기에는 적합하지 않음.

1
2
3
4
5
6
7
8
9
10
11
12
SELECT
	kind_of_business,
	sales_month,
	sales,
	lag(sales) OVER (
				PARTITION BY kind_of_business ORDER BY sales_month
				) AS t,
	(sales / (lag(sales) OVER (PARTITION BY kind_of_business ORDER BY sales_month)) - 1) * 100 AS pct_growth_then_prev
FROM
	retail_sales
WHERE
	kind_of_business LIKE 'Book%'

E.2. 구간 비교 : 작년 동월과의 비교

전년대비 성장세를 비교.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
SELECT
	sales_month,
	sales,
	lag(sales_month) OVER (PARTITION BY month(sales_month)
		ORDER BY
			sales_month) AS prev_month,
		lag(sales) OVER (PARTITION BY month(sales_month)
			ORDER BY
				sales_month) AS prev_sales,
			100 * (sales / (lag(sales) OVER (PARTITION BY month(sales_month)
						ORDER BY
							sales_month)) - 1) AS prev_pct
			FROM
				retail_sales
			WHERE
				kind_of_business LIKE 'Book%'

E.3. 구간 비교 : 동월 비교. 연도별 컬럼 피벗

각 연도별 시각적 비교에 용이한 피벗

1
2
3
4
5
6
7
8
9
10
11
12
SELECT
	month(sales_month) as month_num, 
	monthname(sales_month) as month_char,
	max(case when year(sales_month) = 1992 then sales end) as '1992 sales',
	max(case when year(sales_month) = 1993 then sales end) as '1993 sales',
	max(case when year(sales_month) = 1994 then sales end) as '1994 sales'
FROM
	retail_sales
WHERE
	kind_of_business like 'book%'
	and sales_month between date('1992-01-01') and date('1994-12-01')
group by 1,2

E.4. 다중구간 비교 서브쿼리

계산에 하나의 행이라도 null 이면 null로 처리됨을 유의.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
SELECT
	sales_month,sales,
	sales/((a.prev_sales_1 + prev_sales_2 + prev_sales_3)/3) * 100 as pct
FROM
	(
		SELECT
			sales_month,
			sales,
			lag(sales,1) over (partition by month(sales_month) 
							   order by sales_month
							  ) as prev_sales_1,
			lag(sales,2) over (partition by month(sales_month) 
							   order by sales_month
							  ) as prev_sales_2,
			lag(sales,3) over (partition by month(sales_month) 
							   order by sales_month
							  ) as prev_sales_3

		FROM
			retail_sales
		WHERE
			kind_of_business like 'Book%'
	) as a

E.5. 다중구간 비교 프레임절

서브쿼리와 결과는 같으나, 1개라도 null이 아니라면 계산 수행됨을 유의

1
2
3
4
5
6
7
8
9
10
11
SELECT
	sales_month,
	sales,
	sales / avg(sales) OVER (PARTITION BY month(sales_month)
		ORDER BY
			sales_month ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING) AS pct_pre
	FROM
		retail_sales
	WHERE
		kind_of_business LIKE 'Book%'

This post is licensed under CC BY 4.0 by the author.