<style type="text/css">
 .reveal h1, .reveal h2, .reveal h3, .reveal h4, .reveal h5, .reveal h6 { 
    text-transform: none; 
    font-size: .9em;
    text-align: left;}
</style>

<style type="text/css">
 p,li {
    text-align: left; 
    font-size: 0.8em; 
 }
</style>

<style>
    .custom-small table {
    font-size: .6em
    }
</style>
<style>
    .custom-ultra-small table {
    font-size: .4em
    }
</style>

<style>
.container{
    display: flex;
}
.col{
    flex: 1;
}
</style>

<link href="https://use.fontawesome.com/releases/v6.6.0/css/all.css" rel="stylesheet">

<style> .reveal i.fa { font-family:FontAwesome; font-style: normal; } </style>

# Technical Highlights for Becoming a Data Engineer and How to Study for the Data Analysis Practical Exam

<div style="text-align:right; font-size:0.6em">
Study Session and Networking Event for Python/Data Analysis Educators<br>June 11, 2025<br>Shingo Tsuji (www.tsjshg.info)
</div>

---

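# About Me
- Specializes in data science with Python
    - Recently also doing research with large language models (LLMs)

- Research Center for Advanced Science and Technology, The University of Tokyo (Advanced Data Science)

- Author of 10 books on Python, data science, and algorithms (including co-authored and supervised titles)

- Hosts the monthly online meetup "みんなのPython勉強会"

- www.tsjshg.info

I wrote the mathematics chapter.

[Shoeisha's site](https://www.seshop.com/product/detail/26879)

Featured in [2025年版：独断と偏見で選ぶ、データ分析職の方々にお薦めしたい定番の書籍リスト](https://tjo.hatenablog.com/entry/2025/02/26/181918). [Available from Kodansha](https://bookclub.kodansha.co.jp/product?item=0000275788)

[A revised edition came out this February](https://bookclub.kodansha.co.jp/product?item=0000374026)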

---

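# Practice Is Everything

- Learn how to use libraries while analyzing real data

- If you have no suitable data at hand, look for sample datasets

- Kaggle is a good place to start
    - https://www.kaggle.com/
    - Create an account (linking a Google or similar account is easy)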

# Datasets

- The site has a search function, so try looking for data in a field you know well

# Searching for "revenue"

- Pay attention to the number of upvotes and the Usability score

# Today's Data

- Pay some attention to the License
- Download the files, for example in CSV format
    - You can also load data directly from the Kaggle site using a library

# Looking at Code

- The Code tab contains analyses written by other people, which show what kinds of analysis are possible

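# Checking the Columns

- Name: restaurant name (an ID number)
- Location: location (e.g., rural, downtown)
- Cuisine: type of cuisine (e.g., Japanese, Mexican, Italian)
- Rating: average rating
- Seating Capacity: number of seats
- Average Meal Price: average price of a meal
- Marketing Budget: marketing budget
- Social Media Followers: number of social media followers
- Chef Experience Years: head chef's years of experience
- Number of Reviews: total number of reviews
- Avg Review Length: average review length
- Ambience Score: score for ambience
- Service Quality Score: score for service quality
- Parking Availability: whether parking is available (Yes/No)
- Weekend Reservations: number of weekend reservations
- Weekday Reservations: number of weekday reservations
- Revenue: total revenue (target variable)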


# Loading the Data

```python
import pandas as pd

data = pd.read_csv('restaurant_data.csv')
data.info()
```
<span style="font-size: 50%">

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8368 entries, 0 to 8367
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Name                    8368 non-null   object 
 1   Location                8368 non-null   object 
 2   Cuisine                 8368 non-null   object 
 3   Rating                  8368 non-null   float64
 4   Seating Capacity        8368 non-null   int64  
 5   Average Meal Price      8368 non-null   float64
 6   Marketing Budget        8368 non-null   int64  
 7   Social Media Followers  8368 non-null   int64  
 8   Chef Experience Years   8368 non-null   int64  
 9   Number of Reviews       8368 non-null   int64  
 10  Avg Review Length       8368 non-null   float64
 11  Ambience Score          8368 non-null   float64
 12  Service Quality Score   8368 non-null   float64
 13  Parking Availability    8368 non-null   object 
 14  Weekend Reservations    8368 non-null   int64  
 15  Weekday Reservations    8368 non-null   int64  
 16  Revenue                 8368 non-null   float64
dtypes: float64(6), int64(7), object(4)
memory usage: 1.1+ MB
```
</span>

- Checking the data types matters
    - Numeric types vs. strings (object)
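
As a small illustration of separating numeric and string columns, here is a minimal sketch. The `toy` frame and its values are made up for this example and only mimic a few columns of the restaurant data.

```python
import pandas as pd

# Tiny made-up frame that mimics a few columns of the restaurant data
toy = pd.DataFrame({
    "Location": ["Rural", "Downtown", "Suburban"],
    "Rating": [4.0, 3.2, 4.7],
    "Seating Capacity": [38, 76, 48],
})

# Columns stored as numbers vs. everything else (here: strings)
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
string_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(string_cols)
```

Columns that land in the non-numeric list usually need encoding before they can be passed to a model.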

---

# EDA (Exploratory Data Analysis)

- Analyze the data without preconceptions to grasp the overall picture
- Process tabular data with `pandas` and visualize it with `Matplotlib` and similar tools
- It also shows how clean the data is

# Checking the String Columns

```python
# Three values: rural, downtown, suburban
data['Location'].unique()
```
```
array(['Rural', 'Downtown', 'Suburban'], dtype=object)
```

```python
data['Cuisine'].unique()
```
```
array(['Japanese', 'Mexican', 'Italian', 'Indian', 'French', 'American'], dtype=object)
```

```python
data['Parking Availability'].unique()
```
```
array(['Yes', 'No'], dtype=object)
```
Check that the labels are consistent, for example that the column does not mix `'Yes'`, `'yes'`, and `'No'`.
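
If inconsistent labels do turn up, one possible cleanup is sketched below. The `parking` series is made up for illustration; the restaurant data itself contains only `'Yes'` and `'No'`.

```python
import pandas as pd

# Made-up column with inconsistent spellings of the same two labels
parking = pd.Series(["Yes", "yes", "No", " no ", "YES"])

# Raw labels: five different spellings
raw_labels = sorted(parking.unique())

# Normalize: trim whitespace, then capitalize the first letter only
cleaned = parking.str.strip().str.capitalize()
clean_labels = sorted(cleaned.unique())

print(raw_labels)
print(clean_labels)
```

Comparing `unique()` before and after the cleanup is a quick way to confirm that only the intended labels remain.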

# Histograms for Continuous Values

```python
data['Rating'].hist()
```

- Check for outliers and look at the distribution of the data
- Check every column (as far as possible)
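
A histogram check can be complemented with a numeric rule. The sketch below applies Tukey's 1.5 × IQR rule to made-up ratings; the values and the assumption that ratings lie between 0 and 5 are for illustration only.

```python
import pandas as pd

# Made-up ratings with one implausible value (ratings are assumed to lie in 0-5)
ratings = pd.Series([3.5, 4.0, 4.2, 3.8, 4.5, 3.9, 4.1, 40.0], name="Rating")

q1, q3 = ratings.quantile([0.25, 0.75])
iqr = q3 - q1
# Tukey's rule: flag values more than 1.5 * IQR outside the quartiles
outliers = ratings[(ratings < q1 - 1.5 * iqr) | (ratings > q3 + 1.5 * iqr)]
print(outliers)
```

The rule is only a heuristic: a flagged value still needs a human decision about whether it is an error or a genuine extreme.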

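# Scatter Matrix

```python
_ = pd.plotting.scatter_matrix(data)
```

- Histograms on the diagonal, scatter plots off the diagonal
- With many columns the plot becomes crowded and hard to read
- With many records, drawing can take a long time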

# Relationships Between Variables

```python
import seaborn as sns
# Keep only the numeric columns
num_data = data.select_dtypes(include='number')
# corr() computes Pearson's product-moment correlation coefficient
sns.heatmap(num_data.corr(), cmap='coolwarm', linewidths=0.5)
```
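
To make clear what the correlation matrix contains, here is a minimal sketch that computes Pearson's correlation coefficient by hand and compares it with pandas. The column names `seats` and `revenue` and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up numbers, for illustration only
df = pd.DataFrame({
    "seats": [30, 45, 60, 75, 90],
    "revenue": [210.0, 330.0, 390.0, 560.0, 610.0],
})

# Pearson's r = sum of products of deviations / sqrt(product of squared deviations)
x = df["seats"] - df["seats"].mean()
y = df["revenue"] - df["revenue"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# pandas computes the same quantity (method='pearson' is the default)
r_pandas = df.corr().loc["seats", "revenue"]
print(r_manual, r_pandas)
```

Pearson's coefficient only measures linear association, so a value near zero does not rule out a nonlinear relationship.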

# Creating an HTML File with plotly

```python
import plotly.express as px

fig = px.imshow(num_data.corr())
fig.write_html('corr_heatmap.html')
```

[See the result](corr_heatmap.html)

# Making Use of seaborn

```python
# Stratify with the hue argument
sns.barplot(x='Cuisine', y='Revenue', data=data, hue='Location')
```

Japanese restaurants in downtown areas bring in the highest revenue.

---

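# Dimensionality Reduction

- Excluding the restaurant name, there are 16 columns
    - 3 of them are strings
- That makes the data 16-dimensional, so it cannot be plotted directly
- Use a dimensionality reduction method to map the 16 dimensions to 2 and plot them on a plane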

# First, Convert Strings to Numbers

```python
from sklearn.preprocessing import LabelEncoder

le_data = data.copy().drop('Name', axis=1)
loc_le = LabelEncoder()
le_data['Location'] = loc_le.fit_transform(le_data['Location'])
cui_le = LabelEncoder()
le_data['Cuisine'] = cui_le.fit_transform(le_data['Cuisine'])
park_le = LabelEncoder()
le_data['Parking Availability'] = park_le.fit_transform(le_data['Parking Availability'])
```

# Checking the Labels

```python
le_data['Location'].unique()
```
```
array([1, 0, 2])
```

Convert back to the original labels:
```python
loc_le.inverse_transform(le_data['Location'])
```
```
array(['Rural', 'Downtown', 'Rural', ..., 'Downtown', 'Rural', 'Rural'], shape=(8368,), dtype=object)
```
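
The mapping between labels and integer codes can be read from the encoder's `classes_` attribute. The sketch below uses a made-up `loc` series and also shows one-hot encoding with `pd.get_dummies` as an alternative, since integer codes imply an order among categories that may not exist. Note that scikit-learn documents `LabelEncoder` as intended for target values; using it on feature columns, as in this walkthrough, is a convenient shortcut.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up values using the three location labels from the data
loc = pd.Series(["Rural", "Downtown", "Suburban", "Rural"], name="Location")

le = LabelEncoder()
codes = le.fit_transform(loc)

# classes_ holds the labels in sorted order; position == integer code
mapping = {label: int(code) for code, label in enumerate(le.classes_)}
print(mapping)  # {'Downtown': 0, 'Rural': 1, 'Suburban': 2}

# One-hot encoding avoids implying an order among the categories
one_hot = pd.get_dummies(loc, prefix="Location")
print(one_hot.columns.tolist())
```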

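# PCA (Principal Component Analysis)

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
# Each restaurant goes from 16 dimensions to 2
pca_emb = pca.fit_transform(le_data)
# Column 0 on the x-axis, column 1 on the y-axis (a bit of NumPy indexing)
plt.scatter(pca_emb[:,0], pca_emb[:,1])
```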
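
PCA is sensitive to the scale of each column, so it can help to inspect `explained_variance_ratio_` and to consider standardizing first. The following is a minimal sketch on synthetic data (not the restaurant data), where one column is deliberately given a much larger scale.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic data: one column on a much larger scale than the other two
X = np.column_stack([
    rng.normal(0, 10000, 200),  # large scale, e.g. a revenue-like amount
    rng.normal(0, 1, 200),
    rng.normal(0, 1, 200),
])

# Without scaling, the first component is almost entirely the large column
raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_

# After standardization every column contributes on an equal footing
X_std = StandardScaler().fit_transform(X)
std_ratio = PCA(n_components=2).fit(X_std).explained_variance_ratio_

print(raw_ratio.round(3), std_ratio.round(3))
```

Whether to standardize before PCA depends on the goal; when columns have different units, standardizing is a common choice.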

# Visualization with Plotly

```python
data_with_pca = data.copy()
data_with_pca['PCA1'] = pca_emb[:,0]
data_with_pca['PCA2'] = pca_emb[:,1]

fig = px.scatter(data_with_pca, x='PCA1', y='PCA2',
                 color='Cuisine', size='Revenue',
                 hover_data=['Location'])
fig.write_html('pca_plot.html')
```

[See the result](pca_plot.html)

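# There Are Many Dimensionality Reduction Methods

- PCA is a classical method based on linear algebra (singular value decomposition)
- Newer methods such as t-SNE (t-distributed Stochastic Neighbor Embedding) and UMAP have been proposed this century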

```python
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2)
tsne_emb = tsne.fit_transform(le_data)
data_with_tsne = data.copy()
data_with_tsne['tSNE1'] = tsne_emb[:,0]
data_with_tsne['tSNE2'] = tsne_emb[:,1]

fig = px.scatter(data_with_tsne, x='tSNE1', y='tSNE2',
                 color='Cuisine', size='Revenue',
                 hover_data=['Location'])
fig.write_html('tsne_plot.html')
```
[See the result](tsne_plot.html)
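
t-SNE results depend on the random seed and on the `perplexity` parameter, so fixing `random_state` helps when a plot needs to be reproduced. The sketch below runs on small synthetic data; the group sizes and the parameter values are assumptions chosen only to keep the example fast.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic data: two well-separated groups of 30 points in 5 dimensions
X = np.vstack([rng.normal(0, 1, (30, 5)), rng.normal(8, 1, (30, 5))])

X_std = StandardScaler().fit_transform(X)
# perplexity must be smaller than the number of samples;
# random_state fixes the otherwise random result
tsne = TSNE(n_components=2, perplexity=10, random_state=0)
emb = tsne.fit_transform(X_std)
print(emb.shape)  # (60, 2)
```

Distances between clusters in a t-SNE plot are hard to interpret, so it is safer to read the plot for neighborhood structure only.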

---

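# Supervised Learning

- Build a regression model that predicts total restaurant revenue (Revenue) as the target variable
    - Use the remaining 15 columns as explanatory variables
- Linear regression
    - Standardizing the variables
    - A brief look at multicollinearity
- Random Forests
    - Easy to use and performs well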
# Preparing the Data

```python
# Split into target and explanatory variables
y = le_data['Revenue']
X_data = le_data.drop('Revenue', axis=1)
```

```python
from sklearn.model_selection import train_test_split
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_data, y, test_size=0.25)
```

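# Standardizing the Data

For each explanatory variable, subtract the mean and divide by the standard deviation.
```python
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
# Learn the mean and standard deviation from the training data
ss_X_train = ss.fit_transform(X_train)
# Apply the same training statistics to the test data (no refitting)
ss_X_test = ss.transform(X_test)
```

- With linear regression, standardization makes the coefficients easier to interpret and makes methods such as Lasso easier to apply
- Not needed for Random Forests, discussed later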
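
The standardization formula can be checked by hand. The following minimal sketch uses made-up matrices and shows that `StandardScaler` matches `(x - mean) / std`, with the statistics learned from the training split only and then reused for the test split.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrices (rows = samples, columns = features)
X_train = np.array([[10.0, 200.0], [20.0, 400.0], [30.0, 600.0]])
X_test = np.array([[40.0, 800.0]])

scaler = StandardScaler()
# Learn mean and standard deviation from the training data only
Z_train = scaler.fit_transform(X_train)
# Reuse those training statistics for the test data
Z_test = scaler.transform(X_test)

# Same calculation by hand: (x - mean) / std, with the population std (ddof=0)
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)
print(np.allclose(Z_train, manual))  # True
print(Z_test)
```

Fitting the scaler on the test data as well would let information from the test set leak into preprocessing, which is why only `transform` is applied there.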

# Building the Model and Predicting

```python
from sklearn.linear_model import LinearRegression
# Prepare a linear regression model
lr = LinearRegression()
# Fit on the training data
lr.fit(ss_X_train, y_train)
# Compute predictions
lr_pred = lr.predict(ss_X_test)
# Plot true vs. predicted values
pd.DataFrame({'true': y_test, 'predicted': lr_pred}).plot.scatter('true', 'predicted')
```

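# Evaluating Performance

RMSE (root mean squared error): the positive square root of the mean of the squared errors between the predictions and the true values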
```python
from sklearn.metrics import root_mean_squared_error

root_mean_squared_error(y_test, lr_pred)
```
```output
55172.43111226483
```
We compare this with Random Forests later.
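
As a check on the definition, here is a minimal sketch that computes RMSE by hand on made-up values and compares it with the square root of scikit-learn's `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true values and predictions
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])

# RMSE by hand: square the errors, average them, take the square root
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Same value via scikit-learn's mean squared error
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
print(rmse_manual, rmse_sklearn)
```

RMSE is expressed in the same unit as the target, so it can be read directly as a typical prediction error.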

# Looking at the Coefficients

```python
pd.DataFrame({"coef": X_data.columns, "value": lr.coef_}).sort_values('value', ascending=False)
```
<span style="font-size: 40%">

|    | coef                   |       value |
|---:|:-----------------------|------------:|
|  4 | Average Meal Price     | 189231      |
|  3 | Seating Capacity       | 184701      |
|  6 | Social Media Followers |   6291.03   |
|  7 | Chef Experience Years  |   5448.68   |
|  0 | Location               |    971.391  |
|  2 | Rating                 |    497.25   |
| 10 | Ambience Score         |    365.018  |
| 11 | Service Quality Score  |    283.57   |
|  8 | Number of Reviews      |    203.14   |
|  1 | Cuisine                |     86.7572 |
|  9 | Avg Review Length      |    -17.7084 |
| 14 | Weekday Reservations   |    -69.392  |
| 13 | Weekend Reservations   |   -368.546  |
| 12 | Parking Availability   |   -700.881  |
|  5 | Marketing Budget       |  -2258.17   |

</span>

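# Multicollinearity

Marketing Budget and Social Media Followers carry almost the same information.

Feeding such variables into a linear regression model together is a bad idea: the individual coefficients become unstable and hard to interpret.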
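
One common way to quantify multicollinearity is the variance inflation factor (VIF), defined as 1 / (1 - R²), where R² comes from regressing one explanatory variable on all the others. The sketch below uses synthetic data; the variables `budget`, `followers`, and `rating` and the helper `vif` are hypothetical names made up for this example, not values from the restaurant data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500
budget = rng.normal(0, 1, n)
followers = budget + rng.normal(0, 0.1, n)   # almost a copy of budget
rating = rng.normal(0, 1, n)                 # unrelated to the others
X = np.column_stack([budget, followers, rating])

def vif(X, j):
    """Variance inflation factor of column j: 1 / (1 - R^2)."""
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
print(np.round(vifs, 1))
```

A frequently quoted rule of thumb treats a VIF above about 10 as a sign of trouble; dropping one of the two near-duplicate variables is a simple remedy.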

# Random Forests to the Rescue

In general, you do not need to worry about standardizing variables or about correlations between them.

```python
from sklearn.ensemble import RandomForestRegressor
# Prepare a Random Forests regression model
rfr = RandomForestRegressor()
# Fit on the training data
rfr.fit(X_train, y_train)
# Predict
rfr_pred = rfr.predict(X_test)
# Linear regression scored 55172.4
root_mean_squared_error(y_test, rfr_pred)
```
```output
7706.090207362632
```
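
One situation in which a tree-based model can beat linear regression by a wide margin is a target that depends on an interaction between features. The sketch below uses synthetic data in which the target is the product of two features; this is an assumption for illustration, not a claim about how the restaurant data was generated, and the names `seats`, `price`, and `rmse` are made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600
seats = rng.uniform(30, 90, n)
price = rng.uniform(25, 75, n)
noise_feature = rng.normal(0, 1, n)
X = np.column_stack([seats, price, noise_feature])
# Assumption for illustration: the target is the product of two features
y = seats * price

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

def rmse(model):
    """Fit on the training split and return the RMSE on the test split."""
    pred = model.fit(X_train, y_train).predict(X_test)
    return float(np.sqrt(mean_squared_error(y_test, pred)))

lr_rmse = rmse(LinearRegression())
rf_rmse = rmse(RandomForestRegressor(n_estimators=100, random_state=0))
print(lr_rmse, rf_rmse)
```

A linear model can only add up the effects of each feature, while the trees in a random forest can approximate the product by splitting on both features.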

# Almost Spot On...

```python
plt.scatter(y_test, rfr_pred)
```

# Variables That Contribute to the Model

```python
pd.DataFrame({'feature': X_data.columns, 'importance': rfr.feature_importances_}).sort_values('importance', ascending=False)
```

|    | feature                |   importance |
|---:|:-----------------------|-------------:|
|  3 | Seating Capacity       |  0.500778    |
|  4 | Average Meal Price     |  0.496983    |
|  0 | Location               |  0.000573404 |
|  1 | Cuisine                |  0.000273175 |
|  7 | Chef Experience Years  |  0.000219338 |
|  9 | Avg Review Length      |  0.000136851 |
| 10 | Ambience Score         |  0.000134595 |
| 13 | Weekend Reservations   |  0.000133325 |
|  6 | Social Media Followers |  0.000130593 |
|  8 | Number of Reviews      |  0.000129901 |
| 14 | Weekday Reservations   |  0.000127822 |
| 11 | Service Quality Score  |  0.000126139 |
|  5 | Marketing Budget       |  0.00012486  |
|  2 | Rating                 |  0.000107127 |
| 12 | Parking Availability   |  2.22524e-05 |


---

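# Summary

- Learning libraries by actually analyzing data is recommended
- Kaggle offers a wealth of data
    - It also has plenty of sample code that uses the data
    - There is no reason not to use it
- Going back and forth between textbooks and practice is likely an efficient way to learn
    - Check fundamentals, such as the difference between classification and regression, in a textbook
- Work with many kinds of data
    - Time series data, data for natural language processing, and so on

# Q&A

If there is a topic you wanted to hear about, please speak up.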