PostgreSQL로 작성된 서적인 OREILLY의 ‘SQL로 시작하는 데이터 분석’을 MySQL로 변환하며 스터디한 내용.
시계열 분석 파트의 데이터 테이블 정리
쿼리가 실행될 테이블 형태는 아래와 같다.
A. Table Scheme
| sales_month | naics_code | kind_f_business | reason_for_null | sales | 
|---|
| 1992-01-01 | 441 | Motor vehicle and parts dealers |   | 29811.00 | 
| 1992-01-01 | 4411 | Automobile dealers |   | 25800.00 | 
| 1992-01-01 | 4411, 4412 | Automobile and other motor vehicle dealers |   | 26788.00 | 
| 1992-01-01 | 44111 | New car dealers |   | 24056.00 | 
retail_sales 테이블
- 판매 일자, 소매 업종, 판매액이 적재 되어있다.
 
| date | date_key | day_of_month | day_of_year | day_of_week | day_name | day_short_name | week_number | week_of_month | week | month_number | month_name | month+short_name | first_day_of_month | last_day_of_month | quarter_number | quarter_name | first_day_of_quarter | last_day_of_quarter | year | decade | centurays | 
|---|
| 1770-01-01 | 17700101 | 1 | 1 | 1 | Monday | Mon | 1 | 1 | 1770-01-01 | 1 | January | Jan | 1770-01-01 | 1770-01-31 | 1 | Q1 | 1770-01-01 | 1770-03-31 | 1770 | 1770 | 18 | 
| 1770-01-02 | 17700102 | 2 | 2 | 2 | Tuesday | Tue | 1 | 1 | 1770-01-01 | 1 | January | Jan | 1770-01-01 | 1770-01-31 | 1 | Q1 | 1770-01-01 | 1770-03-31 | 1770 | 1770 | 18 | 
date_dim 테이블
- 분석에 활용하고자 하는 테이블에 특정 일자의 레코드가 삭제되어 있다면 집계 시 혼란이 발생할 수 있다.
 - 이러한 사고를 미연에 방지하고자 날짜 룩업 테이블을 정의하고, 필요 시 join하여 결측으로 발생할 수 있는 집계 오류를 방지한다.
 
B. 날짜간 계산과 인터벌 계산을 헷갈리지 말자.
날짜 간 계산과 인터벌 계산 두 타입을 헷갈리지 않도록 주의하자. 날짜 간 계산은 문자 그대로 산술적인 날짜간의 차이를, 인터벌은 관념적인(하이 레벨에서의) 날짜의 개념에서의 차이를 계산한다.
1
  | select date_add(date('2020-01-01'), interval 1 week)
 | 
시계열 분석 - 비율과 차이
A. 차이 계산하기
A.1. 남여 매출 차이 Subquery ver
subquery를 활용한 계산
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
  | SELECT
	year(sales_month) AS years,
	a.womens_sales,
	a.mens_sales,
	a.womens_sales - a.mens_sales AS diff_sales
FROM
	retail_sales,
	(
		SELECT
			year(sales_month) AS years,
			sum(
				CASE WHEN kind_of_business = 'Women''s clothing stores' THEN
					sales
				END) AS womens_sales,
			sum(
				CASE WHEN kind_of_business = 'Men''s clothing stores' THEN
					sales
				END) AS mens_sales
		FROM
			retail_sales
		WHERE
			kind_of_business in('Women''s clothing stores', 'Men''s clothing stores')
		GROUP BY
			1) a
 | 
B. 전체 대비 비율을 계산하기
B.1. subquery & self join을 활용한 비율 계산
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
  | SELECT
	sales_month,
	kind_of_business,
	sales,
	sales / aa.total_sales * 100 AS pct_total_sales
FROM (
	SELECT
		a.sales_month,
		a.kind_of_business,
		a.sales,
		sum(b.sales) AS total_sales
	FROM
		retail_sales a
		JOIN retail_sales b ON a.sales_month = b.sales_month
			AND b.kind_of_business in('Men''s clothing stores', 'Women''s clothing stores')
	WHERE
		a.kind_of_business in('Men''s clothing stores', 'Women''s clothing stores')
	GROUP BY
		1,
		2,
		3) aa
ORDER BY
	1,
	2
 | 
B.2. Partition by, sum 윈도우함수 활용한 비율 계산
partition by 절로 함수가 계산하는 테이블의 섹션을 구분
1
2
3
4
5
6
7
8
9
10
11
12
13
  | SELECT
	sales_month,
	kind_of_business,
	sales,
	sum(sales) OVER (PARTITION BY sales_month) AS total_sales,
	round(sales * 100 / sum(sales) OVER (PARTITION BY sales_month), 2) AS pct
FROM
	retail_sales
WHERE
	kind_of_business in('Men''s clothing stores', 'Women''s clothing stores')
ORDER BY
	1,
	2
 | 
C. 업종별, 연매출 대비 월간 매출
C.1. 서브쿼리 활용
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
  | SELECT
	rs.sales_month,
	rs.kind_of_business,
	SUM(rs.sales) AS sales_m,
	aa.sales_y,
	round(SUM(rs.sales) / aa.sales_y, 2) AS ratio_m_y
FROM
	retail_sales rs
	JOIN (
		SELECT
			YEAR(sales_month) AS sales_year,
			SUM(sales) AS sales_y,
			kind_of_business
		FROM
			retail_sales
		WHERE
			kind_of_business IN('Men''s clothing stores', 'Women''s clothing stores')
		GROUP BY
			sales_year,
			kind_of_business) aa ON YEAR(rs.sales_month) = aa.sales_year
	AND rs.kind_of_business = aa.kind_of_business
WHERE
	rs.kind_of_business IN('Men''s clothing stores', 'Women''s clothing stores')
GROUP BY
	rs.sales_month,
	rs.kind_of_business,
	aa.sales_y
ORDER BY
	1
 | 
C.2. 윈도우 함수 활용
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
  | SELECT
	sales_month,
	kind_of_business,
	sales,
	sum(sales) OVER (PARTITION BY year(sales_month),
		kind_of_business) AS yearly_sales,
	sales * 100 / sum(sales) OVER (PARTITION BY year(sales_month),
		kind_of_business) AS pct_yearly
FROM
	retail_sales
WHERE
	kind_of_business in('Men''s clothing stores', 'Women''s clothing stores')
ORDER BY
	1,
	2;
 | 
D. 인덱싱으로 시계열 데이터 변화 이해하기
데이터 인덱싱은 시계열에서 베이스 구간(시작 지점)을 기준으로 데이터의 변화량을 이해하는 방법. 집계함수와 윈도우 함수를 조합하거나 self-join을 활용함. 윈도우 함수가 매우 유연한 사용성을 보이기에, 이에 익숙해지는게 좋다.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
  | SELECT
	sales_year,
	sales,
	first_value(sales) OVER (ORDER BY sales_year) AS index_sales
FROM (
	SELECT
		year(sales_month) AS sales_year,
		sum(sales) AS sales
	FROM
		retail_sales
	WHERE
		kind_of_business in('Women''s clothing stores')
	GROUP BY
		1) a
 | 
D.1. 윈도우 함수를 활용한 인덱싱
1
2
3
4
5
6
7
8
9
10
11
12
13
14
  | SELECT
	sales_year,
	sales,
	(sales / FIRST_VALUE(sales) OVER (ORDER BY sales_year) - 1) * 100 AS index_sales
FROM (
	SELECT
		year(sales_month) AS sales_year,
		sum(sales) AS sales
	FROM
		retail_sales
	WHERE
		kind_of_business in('Women''s clothing stores')
	GROUP BY
		1) a
 | 
시계열 분석 - 시간 윈도우 롤링
집계와는 또 다른 노이즈 제거 방법. 여러 구간을 설정해 트렌드를 분석하는 시간 윈도우 롤링 (=이동 계산) 시계열 롤링 계산의 요소는 3가지. 1) 윈도우 사이즈, 2) 집계 함수, 3) 윈도우 내 데이터의 분할 혹은 그룹화 셀프조인과 윈도우 함수를 활용
A. 시간 윈도우 롤링 누적 합
1
2
3
4
5
6
7
8
9
10
11
12
13
14
  | SELECT
	a.sales_month,
	a.sales,
	b.sales_month AS rolling_sales_month,
	b.sales AS rolling_sales
FROM
	retail_sales a
	JOIN retail_sales b ON a.kind_of_business = b.kind_of_business
		AND b.sales_month BETWEEN date_sub(a.sales_month, interval 11 MONTH)
		AND a.sales_month
		AND b.kind_of_business = 'Women''s clothing stores'
WHERE
	a.kind_of_business = 'Women''s clothing stores'
	AND a.sales_month = date('2019-12-01')
 | 
B. 이동평균선
B.1. self join
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
  | SELECT
	a.sales_month,
	a.sales,
	avg(b.sales) AS ma_1y,
	count(b.sales) AS records
FROM
	retail_sales a
	JOIN retail_sales b ON a.kind_of_business = b.kind_of_business
		AND b.sales_month BETWEEN date_sub(a.sales_month, interval 11 month)
		AND a.sales_month
		AND b.kind_of_business = 'Women''s clothing stores'
WHERE
	a.kind_of_business = 'Women''s clothing stores'
	AND a.sales_month >= '1993-01-01'
GROUP BY
	1,
	2
ORDER BY
	1
  | 
B.2. Frame 문법
frame절은 윈도우 함수 옵션을 활용해, 각 윈도우에서 어떤 레코드를 포함할지 지정하는 문법
- frame_end는 아래 중 하나
- UNBOUNDED PRECEDING - 현재 행 이전의 모든 행
 - offset PRECEDING - 현재 행 이전의 n개 행
 - CURRENT ROW - 현재 행
 - offset FOLLOWING - 현재 행 이후의 n개 행
 - UNBOUNDED FOLOWING - 현재 행 이후의 모든 행
 
 
1
  | {range | row | group} BETWEEN frame_start AND frame_end
 | 
1
2
3
4
5
6
7
8
  | SELECT
	sales_month,
	avg(sales) OVER (ORDER BY sales_month ROWS BETWEEN 11 PRECEDING AND CURRENT ROW) AS moving_avg,
	COUNT(sales) OVER (ORDER BY sales_month ROWS BETWEEN 11 PRECEDING AND CURRENT ROW) AS records
FROM
	retail_sales
WHERE
	kind_of_business = 'Women''s clothing stores'
  | 
C. 희소 데이터의 시간 윈도우 롤링
날짜 룩업 테이블 참조하여 쿼리
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
  | SELECT d.date,
avg(s.sales) AS moving_average,
count(s.sales) AS records,
max(
	CASE WHEN d.date = s.sales_month THEN
		s.sales
	END) AS sales_in_month
FROM
	date_dim d
	JOIN (
		SELECT
			sales_month,
			sales
		FROM
			retail_sales
		WHERE
			kind_of_business = 'Women''s clothing stores'
			AND month(sales_month)
			IN(1, 7)) s ON s.sales_month BETWEEN DATE_SUB(d.date, INTERVAL 11 MONTH)
	AND d.date
WHERE
	d.date = d.first_day_of_month
	AND d.date BETWEEN '1993-01-01'
	AND '2020-12-01'
GROUP BY
	1
ORDER BY
	1
  | 
D. 누적값값 계산
D.1. YTD 윈도우 롤링
1
2
3
4
5
6
7
8
9
10
11
12
  | SELECT
	sales_month,
	sales,
	sum(sales) OVER (PARTITION BY year(sales_month)
		ORDER BY
			sales_month) AS sales_ytd
	FROM
		retail_sales
	WHERE
		kind_of_business = 'Women''s clothing stores'
	ORDER BY
		1
  | 
D.2. YTD 서브쿼리
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
  | SELECT
	a.sales_month,
	a.sales,
	sum(b.sales) AS YTD_sales
FROM
	retail_sales a
	JOIN retail_sales b ON year(a.sales_month) = year(b.sales_month)
		AND b.sales_month <= a.sales_month
		AND b.kind_of_business = 'Women''s clothing stores'
WHERE
	a.kind_of_business = 'Women''s clothing stores'
GROUP BY
	1,
	2
ORDER BY
	1
  | 
E. 계절성 분석
일정한 간격을 두고 규칙적으로 반복되는 패턴을 의미. 연단위, 혹은 분단위로 나타나기도 함.
E.1. lag 함수를 사용한 이전값 대비 비율 변화
- lag 함수는 이전행의 값을 반환한다. (lead = lag와 반대. 다음 행을 반환)
 - offset 옵션 : 몇 행 이전의 값을 반환할 것인가?
 - default 옵션 : 이전 행이 없는 경우에 반환할 값
 - PARTITION BY 옵션 : 생략 시, 전체 데이터셋에 대한 계산 수행
 - ORDER BY 옵션 : 생략 시, 현재 db 정렬 순서로 계산 수행
 
구간단위 계산은 변화를 이해하는데 활용됨. 계절성 분석하기에는 적합하지 않음.
1
2
3
4
5
6
7
8
9
10
11
12
  | SELECT
	kind_of_business,
	sales_month,
	sales,
	lag(sales) OVER (
				PARTITION BY kind_of_business ORDER BY sales_month
				) AS t,
	(sales / (lag(sales) OVER (PARTITION BY kind_of_business ORDER BY sales_month)) - 1) * 100 AS pct_growth_then_prev
FROM
	retail_sales
WHERE
	kind_of_business LIKE 'Book%'
  | 
E.2. 구간 비교 : 작년 동월과의 비교
전년대비 성장세를 비교.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
  | SELECT
	sales_month,
	sales,
	lag(sales_month) OVER (PARTITION BY month(sales_month)
		ORDER BY
			sales_month) AS prev_month,
		lag(sales) OVER (PARTITION BY month(sales_month)
			ORDER BY
				sales_month) AS prev_sales,
			100 * (sales / (lag(sales) OVER (PARTITION BY month(sales_month)
						ORDER BY
							sales_month)) - 1) AS prev_pct
			FROM
				retail_sales
			WHERE
				kind_of_business LIKE 'Book%'
  | 
E.3. 구간 비교 : 동월 비교. 연도별 컬럼 피벗
각 연도별 시각적 비교에 용이한 피벗
1
2
3
4
5
6
7
8
9
10
11
12
  | SELECT
	month(sales_month) as month_num, 
	monthname(sales_month) as month_char,
	max(case when year(sales_month) = 1992 then sales end) as '1992 sales',
	max(case when year(sales_month) = 1993 then sales end) as '1993 sales',
	max(case when year(sales_month) = 1994 then sales end) as '1994 sales'
FROM
	retail_sales
WHERE
	kind_of_business like 'book%'
	and sales_month between date('1992-01-01') and date('1994-12-01')
group by 1,2
 | 
E.4. 다중구간 비교 서브쿼리
계산에 하나의 행이라도 null 이면 null로 처리됨을 유의.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
  | SELECT
	sales_month,sales,
	sales/((a.prev_sales_1 + prev_sales_2 + prev_sales_3)/3) * 100 as pct
FROM
	(
		SELECT
			sales_month,
			sales,
			lag(sales,1) over (partition by month(sales_month) 
							   order by sales_month
							  ) as prev_sales_1,
			lag(sales,2) over (partition by month(sales_month) 
							   order by sales_month
							  ) as prev_sales_2,
			lag(sales,3) over (partition by month(sales_month) 
							   order by sales_month
							  ) as prev_sales_3
		FROM
			retail_sales
		WHERE
			kind_of_business like 'Book%'
	) as a
  | 
E.5. 다중구간 비교 프레임절
서브쿼리와 결과는 같으나, 1개라도 null이 아니라면 계산 수행됨을 유의
1
2
3
4
5
6
7
8
9
10
11
  | SELECT
	sales_month,
	sales,
	sales / avg(sales) OVER (PARTITION BY month(sales_month)
		ORDER BY
			sales_month ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING) AS pct_pre
	FROM
		retail_sales
	WHERE
		kind_of_business LIKE 'Book%'
  |