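5. Researching cross-sectional stock-selection factors with HistoryPanel

Cross-sectional stock selection: multi-factor combination -> grouped portfolio returns and CAGR comparison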

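When doing cross-sectional stock selection, we often run into a situation that is very real but very awkward:

We have a pool of stocks in hand (dozens, maybe hundreds) and we know we want to screen on several metrics (PE, PB, EBITDA, momentum, volatility, ...).
But once the conditions are written as code, it is easy to end up with mismatched shapes, conditions that broadcast the wrong way, or an unstable set of selected stocks.
In the end we get a portfolio curve but cannot say which conditions are actually driving the difference in returns.

This tutorial follows a "get it running first, then strengthen it" rhythm and builds the chain into a reusable research workflow: multiple factors (cross-sectional) -> cross-sectional conditions -> where mask -> portfolio + benchmark -> normalize/cum_return + CAGR -> plot/highlight for explanation.

As before, a note on positioning: this is still research-oriented coarse aggregation, not a trading backtest engine.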

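5.1. 0. Opening: get a minimal "cross-sectional screen -> portfolio curve comparison" running first

We are not aiming for many factors or finely tuned parameters yet. A minimal runnable proof only needs to do two things:

cross-sectional screening runs on multiple shares;
we can finally plot a portfolio curve and compare it against 000300.SH.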

import qteasy as qt

benchmark = '000300.SH'
shares = [
    '000001.SZ', '600519.SH', '300750.SZ', '000333.SZ', '600036.SH',
    '601318.SH', '002415.SZ', '000858.SZ', '600276.SH', '000725.SZ',
    benchmark,
]

hp = qt.get_kline(
    shares=shares,
    start='20220101',
    end='20221231',
    freq='D',
    as_panel=True,
)

fig = hp.plot(interactive=True)
fig

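But merely being able to draw the chart is far from enough. If we really want to use this as an everyday stock-selection research tool, we will run into at least the following problems:

How to align multi-factor conditions: PE, PB, momentum and volatility are easy to understand as concepts, but the pitfalls appear as soon as you write code. Some factors are (M, L), others end up as (M, L, 1), and HistoryPanel carries its own third axis of field columns, so broadcasting can easily "line up, but not the way you think it did". The worst part is that this kind of mistake often raises no error; it just makes the results look odd.
Stock selection is a cross-sectional decision: cross-sectional selection is not "judging each stock on its own" but "picking a batch from the pool on the same day". That means the holding set can change every day: 10 stocks today, 3 tomorrow, possibly 0 the day after. Unless you explicitly pin down each day's basket (for example with a mask), you cannot review who was actually selected on a given day.
No benchmark comparison, no conclusion: a good-looking portfolio curve does not prove the screen works; it may simply have caught market beta. Bringing in 000300.SH as a reference at least answers the key question: is the screen creating excess return, or just following the market?
Returns without explanation: cross-sectional screening easily leaves you with "just a curve". Real research needs an explanation: on the days with the largest return gap, did a style rotation happen? Did the screening conditions push us into a high-volatility or high-drawdown corner? Can we quickly locate the key divergence stretch and then decide which factor to strengthen next?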
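
The shape pitfall in the first point can be made concrete with plain NumPy. The sketch below uses synthetic arrays only; the guard function `as_ml` is a hypothetical helper written for this illustration and is not part of qteasy. It shows one case where an (M, L) array and an (M, L, 1) array combine without any error, and one simple way to force every factor back to (M, L) before combining.

```python
import numpy as np

M, L = 3, 3  # square on purpose: the case where a shape mistake broadcasts silently

factor_2d = np.arange(M * L, dtype=float).reshape(M, L)   # intended shape (M, L)
factor_3d = factor_2d.reshape(M, L, 1)                    # same data, stray trailing axis

# (M, L) is treated as (1, M, L) and combined with (M, L, 1): no error, wrong shape.
silently_wrong = (factor_2d > 2) & (factor_3d > 2)
print(silently_wrong.shape)  # (3, 3, 3)


def as_ml(x, shape):
    """Hypothetical guard: drop a trailing length-1 axis, then insist on (M, L)."""
    x = np.asarray(x)
    if x.ndim == 3 and x.shape[2] == 1:
        x = x[:, :, 0]
    if x.shape != tuple(shape):
        raise ValueError(f'expected shape {tuple(shape)}, got {x.shape}')
    return x


cond = (as_ml(factor_2d, (M, L)) > 2) & (as_ml(factor_3d, (M, L)) > 2)
print(cond.shape)  # (3, 3)
```

When M and L differ (and neither is 1), NumPy raises a broadcasting error for this particular pair of shapes, which is the lucky case. The guard is mainly useful for the unlucky cases where the shapes happen to be compatible.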

Fortunately, all of these capabilities can be filled in step by step. We start with factor construction.

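5.2. 0.5 The final result first (what we will end up with)

After working through this article you will have three research-friendly outputs:

A portfolio curve comparison chart: LONG / SHORT / 000300.SH (or at least LONG / 000300.SH), all normalized to start at 1.0, so you can see at a glance whether the screening rule separates the groups within the sample period.
A CAGR summary table: cumulative returns converted to an annualized basis, so that different periods can be compared.
A chart highlighting the key divergence day or stretch: it marks the points where the long portfolio and the benchmark diverge most, for review and explanation.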

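5.3. 1. Goals (what this article will accomplish)

Fetch a HistoryPanel containing multiple shares plus 000300.SH
Construct multiple factors (two parallel routes: a proxy-factor version with fixed thresholds, plus an optional version with real valuation fields)
Combine the multi-factor conditions into an (M, L) bool array and use where() to generate the research mask
Use portfolio(mask=...) to get the long and short portfolio curves and compare them with the benchmark
Derive a CAGR table from normalize/cum_return
Use plot(highlight=...) to highlight the day of largest difference or the key divergence stretch, so the result is explainable
Provide a complete single-function runnable listing at the end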

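5.4. 2. Preparing the data: multiple shares + benchmark (keep the count under control so the chart stays readable)

2.1 What this section solves

One of the most common traps in cross-sectional research: with too many stocks a single chart becomes unreadable; with too few, "cross-sectional screening" loses its meaning.

So we suggest keeping the sample within roughly 30 to 80 stocks at first (you can substitute your own pool). This section does only one thing: get the data and confirm that close exists.

2.2 Minimum necessary background

All the portfolio aggregation that follows works on close by default, so as long as close is in hp.htypes we can run the full loop. Whether OHLC is complete determines whether we can draw candlesticks and do event-style explanation; this article focuses on cross-sectional screening, so OHLC is optional.

2.3 Runnable code + expected result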

import qteasy as qt

benchmark = '000300.SH'
shares = [
    # A small illustrative set; replace it with your own stock pool for real work
    '000001.SZ', '600519.SH', '300750.SZ', '000333.SZ', '600036.SH',
    '601318.SH', '002415.SZ', '000858.SZ', '600276.SH', '000725.SZ',
    benchmark,
]

hp = qt.get_kline(
    shares=shares,
    start='20220101',
    end='20221231',
    freq='D',
    as_panel=True,
)

print('hp.shape:', hp.shape)
print('hp.shares count:', len(hp.shares))
print('hp.htypes:', hp.htypes)
if 'close' not in hp.htypes:
    raise ValueError(f'Missing close column, htypes: {hp.htypes}')

# Avoid carrying on when no data was actually loaded
if hp.shape[1] < 50:
    raise ValueError(
        'Not enough data points loaded (too few hdates). '
        'Please check your local datasource and date range.'
    )

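5.5. 3. Multi-factor construction (two parallel routes): fixed-threshold proxy version + optional real-valuation version

3.1 What this section solves

The goal of this step is a set of factor matrices that can be combined directly into screening conditions. In a cross-sectional setting we want every factor to end up in the same shape: (M, L).

To keep things simple, we split this into two routes:

Route A (recommended for getting started): derive "proxy factors" purely from price data and technical indicators, with fixed thresholds, so that everyone can run it
Route B (optional enhancement): if your local datasource already has valuation fields such as PE/PB/EBITDA, substitute them in

Both routes output the same cond_long/cond_short, and the rest of the workflow is identical.

3.2 Minimum necessary background

HistoryPanel.kline.* returns a new panel with added columns, for example:

sma_20
macd_hist_12_26_9
bbands_upper_20_2_2, and so on

Each of these columns can be located with hp.htypes.index(name) and sliced out as a 2-D (M, L) matrix. We can then write fixed-threshold multi-factor conditions and make sure the final condition is an (M, L) bool array.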
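
The column lookup described above is ordinary list indexing plus array slicing. The sketch below imitates it with a synthetic 3-D array and a plain Python list standing in for `hp.values` and `hp.htypes`; the helper `column` is a hypothetical convenience, not a qteasy function.

```python
import numpy as np

# Stand-in for a panel: M=2 shares, L=4 dates, N=3 columns (all values synthetic).
htypes = ['close', 'sma_20', 'macd_hist_12_26_9']
values = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)


def column(values, htypes, name):
    """Return the (M, L) slice for one named column, with a readable error if absent."""
    if name not in htypes:
        raise KeyError(f'{name!r} not found; available: {htypes}')
    return values[:, :, htypes.index(name)]


close = column(values, htypes, 'close')
sma20 = column(values, htypes, 'sma_20')
print(close.shape, sma20.shape)  # (2, 4) (2, 4)
```

Slicing with a single integer on the last axis drops that axis, which is what keeps every factor at (M, L) rather than (M, L, 1).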

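3.3 Route A: proxy factors (fixed thresholds, recommended to run first)

3.3.1 What to solve

We use three very intuitive proxy factors to get readers up to speed quickly:

Value proxy: close / sma20 (the further below the moving average, the "cheaper")
Momentum proxy: macd_hist (strength)
Risk proxy: Bollinger band width (size of volatility)

Two points of "research honesty" need stressing here:

These are all proxies, not "true valuation" in the financial-statement sense. Their value is that anyone can derive them from price data, and under some market styles they do produce a layering effect.
They also have clear limitations: close/sma20 may mistake a trend for cheap or expensive; MACD can whipsaw in range-bound markets; the band-width filter also removes the opportunities that volatility brings. The point of this article is to get the chain running; the thresholds are only a reproducible starting point.

3.3.2 Runnable code + expected result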

import numpy as np

hp2 = hp.kline.sma(window=20, price_htype='close')       # sma_20
hp2 = hp2.kline.macd(price_htype='close')                # macd_hist_12_26_9
hp2 = hp2.kline.bbands(window=20, price_htype='close')   # bbands_*_20_2_2

vals = hp2.values.astype(float)

close = vals[:, :, hp2.htypes.index('close')]
sma20 = vals[:, :, hp2.htypes.index('sma_20')]
macd_hist = vals[:, :, hp2.htypes.index('macd_hist_12_26_9')]

upper = vals[:, :, hp2.htypes.index('bbands_upper_20_2_2')]
mid   = vals[:, :, hp2.htypes.index('bbands_middle_20_2_2')]
lower = vals[:, :, hp2.htypes.index('bbands_lower_20_2_2')]

value_proxy = close / sma20
momentum_proxy = macd_hist
risk_proxy = (upper - lower) / mid

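# Fixed thresholds (beginner-first: simple, runnable, understandable)
# Note: the long leg requires close/sma20 above A_VALUE, so the ratio acts as a
# strength filter here rather than a "cheapness" filter.
A_VALUE = 1.02     # stricter: price must be clearly above the moving average to count as "strong"
C_MOM = 0.0        # MACD histogram > 0 is treated as bullish
B_RISK = 0.18      # a very wide band means excessive volatility; filter it out for now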

cond_long = (value_proxy > A_VALUE) & (momentum_proxy > C_MOM) & (risk_proxy < B_RISK)
cond_short = (value_proxy < 1.0 / A_VALUE) & (momentum_proxy < -C_MOM) & (risk_proxy < B_RISK)

print('cond_long shape:', cond_long.shape)   # expected (M, L)
print('cond_short shape:', cond_short.shape)
print('selected_count_last_day(long):', int(cond_long[:, -1].sum()))

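At this point we have the core raw material for cross-sectional screening: cond_long/cond_short, both (M, L) bool arrays.

Next comes a very practical sanity check: how many stocks are actually selected each day. If the count is often 0 (an empty basket), the portfolio curve will be patchy and the conclusions unstable; if it is almost always the whole pool, the screen is meaningless.

You can add the check below (it does not change the logic; it only helps judge whether the thresholds are too strict or too loose):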

selected_count_by_day = cond_long.sum(axis=0)  # (L,)
print('selected_count stats (long):')
print('  min/max:', int(selected_count_by_day.min()), int(selected_count_by_day.max()))
print('  mean:', float(selected_count_by_day.mean()))
print('  p10/p50/p90:', np.quantile(selected_count_by_day.astype(float), [0.1, 0.5, 0.9]))
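
Two ratios are often easier to read than quantiles: the share of days with an empty basket and the share of days where every stock is selected. The sketch below computes both on synthetic data. It assumes, as is typical for rolling indicators, that the first few days of a factor are NaN during the warm-up window; comparisons against NaN evaluate to False, so those days show up as empty baskets. Whether qteasy's indicator columns behave exactly this way is an assumption to verify on your own panel.

```python
import numpy as np

# Synthetic (M, L) factor with a 3-day NaN warm-up, as rolling indicators usually produce.
factor = np.array([
    [np.nan, np.nan, np.nan, 1.05, 0.98, 1.03],
    [np.nan, np.nan, np.nan, 1.01, 1.04, 1.06],
])
cond = factor > 1.02                 # comparisons against NaN evaluate to False

count_by_day = cond.sum(axis=0)      # (L,) number of selected shares per day
empty_ratio = float((count_by_day == 0).mean())              # days with nothing selected
full_ratio = float((count_by_day == cond.shape[0]).mean())   # days with everything selected

print(count_by_day.tolist())  # [0, 0, 0, 1, 1, 2]
print(empty_ratio, full_ratio)
```

If the empty ratio is dominated by the warm-up window, it may be worth computing the statistics only from the first day on which all factors are defined.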

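3.4 Route B: real valuation factors (optional enhancement; field names follow `hp.htypes`)

3.4.1 What to solve

If valuation fields (for example PE/PB/EBITDA) have already been downloaded into your local datasource, we can substitute them in. This section does not hard-code field names, because htype naming can differ between datasources and data types. The safest approach is:

Print hp.htypes first
Find the actual column names in your local data
Then write the conditions with fixed thresholds

3.4.2 Runnable code (illustrative)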

print('available htypes:', hp.htypes)

# Suppose you found these three fields in htypes (use your local names)
# pe_name = 'pe' or 'pe_ttm' ...
# pb_name = 'pb' ...
# ebitda_name = 'ebitda' ...

# pe = hp.values[:, :, hp.htypes.index(pe_name)]
# pb = hp.values[:, :, hp.htypes.index(pb_name)]
# ebitda = hp.values[:, :, hp.htypes.index(ebitda_name)]

# Fixed-threshold example (illustrative only; tune the thresholds to your pool and definitions)
# cond_long = (pe < 15.0) & (pb < 2.0) & (ebitda > 1e9)

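As long as you still end up with (M, L) cond_long/cond_short, the rest of the workflow is identical to Route A.

If your local data has none of these fields, that is fine: this is exactly why the proxy-factor version is the main line of the article. Route B exists as an enhancement branch, and the full research loop runs without it.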
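
Because field names vary, the "print, find, then use" steps of Route B can be wrapped in a small lookup. The sketch below is a hypothetical helper (`find_htype` is not a qteasy function) and the candidate names such as 'pe_ttm' are illustrative guesses; the real names are whatever your own `hp.htypes` shows.

```python
def find_htype(htypes, candidates):
    """Return the first candidate present in htypes (case-insensitive), else None."""
    lookup = {str(h).lower(): h for h in htypes}
    for name in candidates:
        hit = lookup.get(name.lower())
        if hit is not None:
            return hit
    return None


# Example column list; replace it with the real hp.htypes from your panel.
available = ['open', 'close', 'PE_TTM', 'pb']

pe_name = find_htype(available, ['pe', 'pe_ttm'])      # hypothetical candidate names
pb_name = find_htype(available, ['pb'])
ebitda_name = find_htype(available, ['ebitda'])

print(pe_name, pb_name, ebitda_name)  # PE_TTM pb None
```

A None result is the signal to fall back to Route A for that factor instead of failing later with an index error.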

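5.6. 4. Cross-sectional screening: combining conditions -> `where()` research mask

4.1 What this section solves

We turn cond_long/cond_short into research masks that can be fed directly to portfolio(mask=...).

4.2 Minimum necessary background

hp.where() can regularize an (M, L) bool condition into an (M, L, N) mask. This step fixes the cross-sectional screening rule as the research definition, and all later aggregation and return calculations follow it.

This is also the habit most worth building in cross-sectional research: never keep only a portfolio curve; keep the rule for each day's basket explicitly as well. The mask is that rule. It lets you answer review questions such as "who was actually selected that day".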
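
To make the (M, L) -> (M, L, N) step and the "who was selected that day" question tangible, here is a NumPy-only sketch. It uses `np.broadcast_to` as a stand-in for the expansion the text describes; this is an assumption about the resulting shape, not a description of how `hp.where()` is implemented. The share names are placeholders.

```python
import numpy as np

shares = ['AAA', 'BBB', 'CCC']          # placeholder share names
M, L, N = 3, 4, 2                        # shares, dates, columns

cond = np.array([
    [True,  False, True,  False],
    [False, False, True,  True],
    [True,  False, False, True],
])                                        # (M, L) cross-sectional condition

# Repeat the (M, L) decision across all N columns to get an (M, L, N) mask.
mask = np.broadcast_to(cond[:, :, None], (M, L, N))
print(mask.shape)  # (3, 4, 2)


def basket_on(day_index):
    """Names of the shares selected on one day, read back from the mask."""
    return [s for s, picked in zip(shares, mask[:, day_index, 0]) if picked]


print(basket_on(0))  # ['AAA', 'CCC']
print(basket_on(1))  # []
```

Keeping the mask around means the basket for any date can be reconstructed later without re-running the factor code.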

4.3 Runnable code + expected result

mask_long = hp2.where(cond_long)
mask_short = hp2.where(cond_short)

print('mask_long shape:', mask_long.shape)   # expected (M, L, N)
print('mask_long dtype:', mask_long.dtype)

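5.7. 5. Two portfolio curves + benchmark: `portfolio(mask=...)`

5.1 What this section solves

For each trading day, aggregate the set of stocks that satisfy the condition into a portfolio curve (long/short) and compare it with 000300.SH.

5.2 Minimum necessary background

portfolio is coarse research aggregation, not trade execution
benchmark_output='tag_along' appends the benchmark row to the output, which makes same-chart comparison easy

You can think of it this way: for each day, take an equal-weight average over the stocks selected that day to get one curve. Because the set changes daily, the curve is really answering one question:

If I hold only the batch that meets the conditions each day, how does this dynamic basket perform over the long run?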
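
The "equal-weight average over the day's basket" reading can be sketched in a few lines of NumPy. This is only a toy model of the idea on synthetic closes: it averages the selected values per day and leaves NaN where the basket is empty. qteasy's `portfolio` may aggregate differently (for example on returns rather than levels, or with different handling of empty days), so treat this as an assumption-laden illustration.

```python
import numpy as np

close = np.array([
    [10.0, 11.0, 12.0, 13.0],
    [20.0, 19.0, 18.0, 17.0],
    [30.0, 30.0, 33.0, 36.0],
])                                        # (M, L) synthetic closes
cond = np.array([
    [True,  True,  False, False],
    [True,  False, False, True],
    [False, True,  False, True],
])                                        # (M, L) daily basket membership

counts = cond.sum(axis=0)                              # basket size per day
sums = np.where(cond, close, 0.0).sum(axis=0)          # sum of selected closes per day

# Equal-weight mean of the selected closes; NaN on days with an empty basket.
curve = np.full(close.shape[1], np.nan)
np.divide(sums, counts, out=curve, where=counts > 0)
print(curve.tolist())  # [15.0, 20.5, nan, 26.5]
```

The jump from 15.0 to 20.5 comes purely from the basket changing, not from any stock moving that much, which is one reason the text calls this coarse aggregation rather than a backtest.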

5.3 Runnable code + expected result

benchmark = '000300.SH'

pf_long = hp2.portfolio(
    htypes='close',
    mode='equal',
    mask=mask_long,
    benchmark=benchmark,
    benchmark_output='tag_along',
    new_share_name='LONG',
)

pf_short = hp2.portfolio(
    htypes='close',
    mode='equal',
    mask=mask_short,
    benchmark=benchmark,
    benchmark_output='tag_along',
    new_share_name='SHORT',
)

print('pf_long.shares:', pf_long.shares)
print('pf_long.shape:', pf_long.shape)

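5.8. 6. `normalize / cum_return` + CAGR: a comparable annualized summary table

6.1 What this section solves

We want both to look at the curves and to have a reusable numeric summary. So we do two things:

normalize: put all curves on a common starting point for visual comparison
cum_return + CAGR: output a summary table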
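
The relationship between the two steps is simple arithmetic, shown here on a synthetic curve. The sketch assumes normalization means dividing by the first value and that the simple cumulative return is the normalized level minus 1; the exact options of qteasy's `normalize` and `cum_return` are not reproduced here.

```python
import numpy as np

curve = np.array([50.0, 52.0, 49.0, 55.0])    # synthetic portfolio levels

normalized = curve / curve[0]                 # starts at 1.0
cum_return = normalized - 1.0                 # simple cumulative return, starts at 0.0

print(normalized.tolist())  # [1.0, 1.04, 0.98, 1.1]
print(round(float(cum_return[-1]), 6))  # 0.1
```

If the first value of a curve is NaN (for instance an empty basket on day one), dividing by it turns the whole normalized curve into NaN, so it is worth checking the base point before plotting.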

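6.2 Minimum necessary background

cum_return outputs the cumulative return cumret_*; its final value together with the number of years gives the CAGR. (See the earlier discussion for the exact CAGR definition; the derivation is not repeated here.)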
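
As a quick numeric check of the annualization, the sketch below uses the common definition CAGR = (1 + R_end) ** (1 / years) - 1, where R_end is the final simple cumulative return. The guard for a total loss is a choice made for this illustration, not something taken from qteasy.

```python
def cagr_from_cumret(cumret_end: float, years: float) -> float:
    """Annualize a simple cumulative return: (1 + R) ** (1 / years) - 1."""
    if years <= 0:
        raise ValueError('years must be positive')
    if cumret_end <= -1.0:
        return -1.0   # total loss: a fractional power of a non-positive base is not meaningful
    return (1.0 + cumret_end) ** (1.0 / years) - 1.0


# 21% over two years annualizes to 10% per year, since 1.1 ** 2 == 1.21.
two_year = cagr_from_cumret(0.21, 2.0)
# The same 21% earned in half a year annualizes to 1.21 ** 2 - 1 = 46.41%.
half_year = cagr_from_cumret(0.21, 0.5)

print(round(two_year, 6), round(half_year, 6))  # 0.1 0.4641
```

The second case shows why CAGR on short samples should be read with care: a sub-year window inflates the annualized figure quickly.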

6.3 Runnable code (illustrative)

import pandas as pd

def _years_between(hdates) -> float:
    idx = pd.DatetimeIndex(hdates)
    days = (idx[-1] - idx[0]).days
    return max(1e-9, days / 365.25)

def _cagr_from_cumret(cumret_end: float, years: float) -> float:
    return (1.0 + cumret_end) ** (1.0 / years) - 1.0

years = _years_between(pf_long.hdates)
cr_long = pf_long.cum_return(htypes='close', method='simple')

cumret_long_end = float(cr_long.values[cr_long.shares.index('LONG'), -1, 0])
cumret_bm_end = float(cr_long.values[cr_long.shares.index('000300.SH'), -1, 0])

print('CAGR(long):', _cagr_from_cumret(cumret_long_end, years))
print('CAGR(bm):', _cagr_from_cumret(cumret_bm_end, years))

We suggest arranging this into a small table (with at least the three rows LONG/SHORT/benchmark) so that readers can compare at a glance:

cr_short = pf_short.cum_return(htypes='close', method='simple')
cumret_short_end = float(cr_short.values[cr_short.shares.index('SHORT'), -1, 0])

summary = pd.DataFrame(
    {
        'cum_return_end': [cumret_long_end, cumret_short_end, cumret_bm_end],
        'CAGR': [
            _cagr_from_cumret(cumret_long_end, years),
            _cagr_from_cumret(cumret_short_end, years),
            _cagr_from_cumret(cumret_bm_end, years),
        ],
    },
    index=['LONG', 'SHORT', '000300.SH'],
)
print('\n[CAGR summary]')
print(summary)

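5.9. 7. Visualization and explanation: highlighting the key difference day with `plot(highlight=...)`

7.1 What this section solves

At the end of a cross-sectional study, the question that most needs explaining is: what actually happened during the stretch where long and benchmark differ most?

So we use highlight to mark a "key difference day" (for example the maximum or minimum of cumulative return, or an interval you choose yourself) and pull the reader's attention back to the chart.

7.2 Minimum necessary background

highlight accepts the shorthands 'max'/'min' and also a 1D bool condition. To keep the tutorial stable, we demonstrate the shorthand first (the least error-prone), and give the 1D bool form later.

7.3 Runnable code + expected result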

fig = pf_long.plot(interactive=True, highlight='max')
fig

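Expected result: the chart marks the maximum point of the LONG portfolio curve (or the maximum as defined by the chart type). In cross-sectional research we usually treat this as a reminder: the stretch around the maximum is often when long had the strongest tailwind relative to the market, and it is worth going back to check whether the screening conditions led us into a particular style (for example strong trend or low volatility).

If you want something closer to "long vs benchmark divergence", a more practical approach is to compute the excess curve first (the difference between long and benchmark), then extract the position of its maximum and turn that into a 1D bool highlight condition. (This is an optional enhancement and is not expanded in the main line.)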
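
The excess-curve idea can be sketched with NumPy alone. The two normalized curves below are synthetic; in practice they would come from the normalized LONG and benchmark rows. The result is a 1D bool array of length L with a single True at the day of largest divergence. How exactly that array is passed to `plot(highlight=...)` is left to the library documentation.

```python
import numpy as np

long_norm = np.array([1.00, 1.02, 1.08, 1.05, 1.03])   # synthetic normalized LONG curve
bm_norm = np.array([1.00, 1.01, 1.02, 1.03, 1.04])     # synthetic normalized benchmark

excess = long_norm - bm_norm                 # (L,) excess curve
peak = int(np.nanargmax(excess))             # position of the largest divergence

highlight_cond = np.zeros(excess.shape[0], dtype=bool)
highlight_cond[peak] = True                  # 1D bool, True only on the key day

print(peak, highlight_cond.tolist())  # 2 [False, False, True, False, False]
```

`np.nanargmax` skips NaN entries but raises if the whole excess curve is NaN, so a completely empty long basket needs to be handled before this step. To highlight a stretch rather than a single day, a threshold such as `excess > some_level` gives a 1D bool array in the same form.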

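5.10. 8. Complete code (single-function runnable version)

Below is a complete single-function version that you can paste into a Notebook and run in one go. It covers:

the Route A (fixed-threshold proxy factors) main line;
the sanity check on the daily selection count;
the full where -> portfolio -> cum_return -> CAGR loop;
a stable demonstration of plot(highlight=...).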

import numpy as np
import pandas as pd
import qteasy as qt


def demo_horizontal_multifactor(
        shares: list,
        benchmark: str = '000300.SH',
        start: str = '20220101',
        end: str = '20221231',
):
    """Demo of cross-sectional multi-factor screening: proxy factors -> daily basket -> portfolio -> CAGR -> highlighted explanation.

    Parameters
    ----------
    shares : list
        Stock pool (must include benchmark; 30 to 80 stocks feels more like a real cross-sectional screen).
    benchmark : str, default '000300.SH'
        Benchmark index code.
    start : str, default '20220101'
        Start date (YYYYMMDD).
    end : str, default '20221231'
        End date (YYYYMMDD).

    Returns
    -------
    dict
        Result objects: hp, hp2, pf_long, pf_short, summary and fig.
    """
    if benchmark not in shares:
        raise ValueError('benchmark must be included in shares')

    hp = qt.get_kline(
        shares=shares,
        start=start,
        end=end,
        freq='D',
        as_panel=True,
    )
    if 'close' not in hp.htypes:
        raise ValueError('Missing close column in htypes')
    if hp.shape[1] < 50:
        raise ValueError(
            'Not enough data points loaded (too few hdates). '
            'Please check your local datasource and date range.'
        )

    # 1) Route A: proxy factors (fixed thresholds)
    hp2 = hp.kline.sma(window=20, price_htype='close')
    hp2 = hp2.kline.macd(price_htype='close')
    hp2 = hp2.kline.bbands(window=20, price_htype='close')

    vals = hp2.values.astype(float)
    close = vals[:, :, hp2.htypes.index('close')]
    sma20 = vals[:, :, hp2.htypes.index('sma_20')]
    macd_hist = vals[:, :, hp2.htypes.index('macd_hist_12_26_9')]
    upper = vals[:, :, hp2.htypes.index('bbands_upper_20_2_2')]
    mid = vals[:, :, hp2.htypes.index('bbands_middle_20_2_2')]
    lower = vals[:, :, hp2.htypes.index('bbands_lower_20_2_2')]

    value_proxy = close / sma20
    momentum_proxy = macd_hist
    risk_proxy = (upper - lower) / mid

    A_VALUE = 1.02
    C_MOM = 0.0
    B_RISK = 0.18

    cond_long = (value_proxy > A_VALUE) & (momentum_proxy > C_MOM) & (risk_proxy < B_RISK)
    cond_short = (value_proxy < 1.0 / A_VALUE) & (momentum_proxy < -C_MOM) & (risk_proxy < B_RISK)

    # 2) Sanity check: number of stocks selected each day
    selected_count_by_day = cond_long.sum(axis=0)
    print('\n[Selection count stats]')
    print('  min/max:', int(selected_count_by_day.min()), int(selected_count_by_day.max()))
    print('  mean:', float(selected_count_by_day.mean()))
    print('  p10/p50/p90:', np.quantile(selected_count_by_day.astype(float), [0.1, 0.5, 0.9]))

    # 3) Conditions -> mask (research definition)
    mask_long = hp2.where(cond_long)
    mask_short = hp2.where(cond_short)

    # 4) Portfolio aggregation + benchmark
    pf_long = hp2.portfolio(
        htypes='close',
        mode='equal',
        mask=mask_long,
        benchmark=benchmark,
        benchmark_output='tag_along',
        new_share_name='LONG',
    )
    pf_short = hp2.portfolio(
        htypes='close',
        mode='equal',
        mask=mask_short,
        benchmark=benchmark,
        benchmark_output='tag_along',
        new_share_name='SHORT',
    )

    # 5) cum_return -> CAGR
    def _years_between(hdates) -> float:
        idx = pd.DatetimeIndex(hdates)
        days = (idx[-1] - idx[0]).days
        return max(1e-9, days / 365.25)

    def _cagr_from_cumret(cumret_end: float, years: float) -> float:
        return (1.0 + cumret_end) ** (1.0 / years) - 1.0

    years = _years_between(pf_long.hdates)
    cr_long = pf_long.cum_return(htypes='close', method='simple')
    cr_short = pf_short.cum_return(htypes='close', method='simple')

    cumret_long_end = float(cr_long.values[cr_long.shares.index('LONG'), -1, 0])
    cumret_short_end = float(cr_short.values[cr_short.shares.index('SHORT'), -1, 0])
    cumret_bm_end = float(cr_long.values[cr_long.shares.index(benchmark), -1, 0])

    summary = pd.DataFrame(
        {
            'cum_return_end': [cumret_long_end, cumret_short_end, cumret_bm_end],
            'CAGR': [
                _cagr_from_cumret(cumret_long_end, years),
                _cagr_from_cumret(cumret_short_end, years),
                _cagr_from_cumret(cumret_bm_end, years),
            ],
        },
        index=['LONG', 'SHORT', benchmark],
    )
    print('\n[CAGR summary]')
    print(summary)

    # 6) Chart: portfolio comparison (normalized for readability)
    fig = pf_long.normalize(htypes='close', base_index=0).plot(interactive=True, highlight='max')
    return {
        'hp': hp,
        'hp2': hp2,
        'pf_long': pf_long,
        'pf_short': pf_short,
        'summary': summary,
        'fig': fig,
    }


res = demo_horizontal_multifactor(
    shares=[
        '000001.SZ', '600519.SH', '300750.SZ', '000333.SZ', '600036.SH',
        '601318.SH', '002415.SZ', '000858.SZ', '600276.SH', '000725.SZ',
        '000300.SH',
    ],
    benchmark='000300.SH',
    start='20220101',
    end='20221231',
)
res['fig']

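5.11. 9. Summary and boundaries

At this point the research loop for cross-sectional multi-factor stock selection is running end to end. To stress it once more: portfolio/cum_return are research-oriented coarse aggregation and carry no real trade-execution semantics. If you want to move this screening logic into a strategy backtest, output the conditions/factors as strategy signals and hand them to Operator/Backtester to deal with the trading-layer details.

5.12. Appendix: figure index (we suggest generating/screenshotting these in a Notebook)

Figure	Suggested placement	What you will see
`img/3.2_minimal_run.png`	§0	Minimal runnable chart with multiple shares
`img/3.2_pf_compare.png`	§0.5 or §6	Normalized curve comparison of `LONG/SHORT/000300.SH`
`img/3.2_highlight_key_day.png`	§7	Example of a highlighted key point (for example the max point)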