5. Using HistoryPanel to study cross-sectional stock-picking factors

Prerequisites: Tutorial 2.0 (minimal dataset) + Tutorial 2.5 (completing 2.5 §0 first is recommended).

Cross-sectional stock selection: multi-factor weighting -> compare grouped portfolio returns and CAGR

When doing cross-sectional stock selection, we often run into a situation that feels “very real but very awkward”:

We clearly have a bunch of stocks (dozens, hundreds), and we also know we should use multiple metrics to screen (PE, PB, EBITDA, momentum, volatility…)
But once you write the conditions into code, it’s easy to run into “shapes don’t match,” “condition broadcasting is wrong,” or “the screened stocks are unstable”
In the end you get a single portfolio curve, yet you can’t clearly explain: which conditions are actually driving the return differences

This tutorial follows a “get it working first, then enhance it” rhythm to build this pipeline into a reusable research workflow: multi-factor (cross-sectional) -> cross-sectional conditions -> where mask -> portfolio + benchmark -> normalize/cum_return + CAGR -> plot/highlight explanation.

Again, to clarify the positioning: this is still research-oriented coarse aggregation, not a trading backtest engine.

5.1. 0. Opening: first get a minimal “cross-sectional screening -> portfolio curve comparison” version running

We won’t pursue lots of factors for now, nor fine-tuned parameters. A minimal runnable proof only needs to do two things:

Load multiple shares so that cross-sectional screening can run end to end;
Plot the resulting curves and compare them with 000300.SH.

import qteasy as qt

benchmark = '000300.SH'
shares = [
    '000001.SZ', '600519.SH', '300750.SZ', '000333.SZ', '600036.SH',
    '601318.SH', '002415.SZ', '000858.SZ', '600276.SH', '000725.SZ',
    benchmark,
]

hp = qt.get_kline(
    shares=shares,
    start='20220101',
    end='20221231',
    freq='D',
    as_panel=True,
)

fig = hp.plot(interactive=True)
fig

However, merely “being able to plot it” is far from enough. If we really want to use it as a day-to-day stock selection research tool, we’ll run into at least the following issues:

How to align multi-factor conditions: concepts like PE/PB/momentum/volatility are easy to understand, but once you write code you’ll hit pitfalls: some factors are (M,L), some you wrote as (M,L,1), and on top of that HistoryPanel comes with its own third dimension for field columns—so it’s easy for broadcasting to “match up, but not in the way you think.” The worst part is that these mistakes often don’t throw errors; they just make the results look “off.”
Stock selection is a cross-sectional decision: cross-sectional picking isn’t “making a judgment on each stock individually,” but “on the same day, choosing a subset from a bunch of stocks.” This means your holdings set can change every day: 10 stocks today, 3 tomorrow, and possibly 0 the day after. If you don’t explicitly solidify “the selected basket each day” (e.g., with a mask), you simply can’t review “which ones were actually selected that day.”
Without a benchmark comparison, there’s no conclusion: a combined curve looking good doesn’t mean it’s effective—it may just be riding market beta. Bring 000300.SH in as a reference; at minimum it answers the most critical question: is this screen creating alpha, or just following the broad market?
Returns without an explanation: cross-sectional screening can easily end up with “only one curve left.” But real research needs an explanation: on the days with the largest return differences, did a style rotation happen to occur? Did the screening criteria push us into a corner with high volatility/high drawdown? Can we quickly pinpoint the “key divergence segment,” and then decide which factor to strengthen next?
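The shape pitfall in the first point can be reproduced with plain NumPy. The sketch below uses a random array as a stand-in for a panel's value cube, and the helper `to_ml` is a hypothetical name introduced here, not part of qteasy. It shows how a factor sliced as (M, L, 1) combines with the full (M, L, N) cube without any error, and how a small shape guard catches it.

```python
import numpy as np

M, L, N = 4, 6, 3                      # shares, dates, fields (toy sizes)
rng = np.random.default_rng(0)
vals = rng.normal(size=(M, L, N))      # stand-in for a panel's value cube

factor_a = vals[:, :, 0]               # integer index drops the field axis -> (M, L)
factor_b = vals[:, :, 1:2]             # slice keeps the field axis -> (M, L, 1)

# No error is raised: the (M, L, 1) condition broadcasts across all N fields.
silent = (factor_b > 0) & (vals > 0)


def to_ml(arr, n_shares, n_dates):
    """Return arr as (M, L), dropping a trailing length-1 axis; raise otherwise."""
    arr = np.asarray(arr)
    if arr.ndim == 3 and arr.shape[2] == 1:
        arr = arr[:, :, 0]
    if arr.shape != (n_shares, n_dates):
        raise ValueError(f'expected {(n_shares, n_dates)}, got {arr.shape}')
    return arr


cond = (to_ml(factor_a, M, L) > 0) & (to_ml(factor_b, M, L) > 0)
print(silent.shape, cond.shape)        # (4, 6, 3) (4, 6)
```

Running every factor through one guard like this before combining conditions turns a silent shape drift into an immediate error.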

Fortunately, these capabilities can all be filled in step by step. Next, we’ll start with “factor construction”.

5.2. 0.5 First, show the final result (what we’ll end up with)

After following this article, you’ll get three very “research-friendly” outputs:

A portfolio curve comparison chart: LONG / SHORT / 000300.SH (or at least LONG / 000300.SH) normalized to start from 1.0, so you can tell at a glance whether the screening rule has discriminative power within the sample period.
A CAGR summary table: convert cumulative returns into an annualized metric, making it easy to compare across different periods.
A single “key divergence day/divergence segment” highlight chart: mark the points where the long portfolio diverges most clearly from the benchmark, making it easier to review and explain.

5.3. 1. Goals (what this article sets out to accomplish)

Fetch the HistoryPanel for multiple shares + 000300.SH
Construct multiple factors (two routes in parallel: a proxy-factor fixed-threshold version + an optional true-valuation version)
Combine multi-factor conditions into a (M,L) bool, and use where() to generate the research mask
Use portfolio(mask=...) to get the long/short portfolio curves, and compare them with the benchmark
Use normalize/cum_return to derive the CAGR table
Use plot(highlight=...) to highlight the “max-difference day/key divergence segment” and make it explainable
At the end, we provide a complete piece of code that “runs as a single function”

5.4. 2. Prepare the data: multiple shares + benchmark (keep the count small so the chart stays readable)

2.1 What this section aims to solve

One of the most common traps in cross-sectional research is: too many stocks, and you can’t digest the plots in one go; too few stocks, and you can’t reflect the meaning of “cross-sectional screening.”

So we suggest keeping the sample within 30–80 stocks first (in practice you can replace it with your own stock pool). This section does only one thing: get the data in hand and confirm that close exists.

2.2 Minimum necessary principles

All subsequent portfolio aggregation defaults to being centered around close, so as long as close is in hp.htypes, we can run the full closed loop end-to-end. Whether OHLC is complete determines whether we can plot candlesticks and whether we can do event-style explanations; this article focuses mainly on cross-sectional screening, and OHLC can be optional.

2.3 Runnable code + expected results

import qteasy as qt

benchmark = '000300.SH'
shares = [
    # A small illustrative sample; replace it with your own stock pool for real use
    '000001.SZ', '600519.SH', '300750.SZ', '000333.SZ', '600036.SH',
    '601318.SH', '002415.SZ', '000858.SZ', '600276.SH', '000725.SZ',
    benchmark,
]

hp = qt.get_kline(
    shares=shares,
    start='20220101',
    end='20221231',
    freq='D',
    as_panel=True,
)

print('hp.shape:', hp.shape)
print('hp.shares count:', len(hp.shares))
print('hp.htypes:', hp.htypes)
if 'close' not in hp.htypes:
    raise ValueError(f'Missing close column, htypes: {hp.htypes}')

# Guard against carrying on when the data is actually missing
if hp.shape[1] < 50:
    raise ValueError(
        'Not enough data points loaded (too few hdates). '
        'Please check your local datasource and date range.'
    )

5.5. 3. Multi-factor construction (two routes in parallel): fixed-threshold proxy version + optional true-valuation version

3.1 What this section aims to solve

In this step, what we want is: a factor matrix that can be directly combined into screening conditions. In a “cross section,” we want each factor to ultimately land in the same shape: (M, L).

For simplicity, we split the work into two routes:

Route A (recommended for beginners): derive “proxy factors” entirely from market data/technical indicators, paired with fixed thresholds, so that everyone can run it end to end
Route B (optional enhancement): if your local data source already has valuation fields like PE/PB/EBITDA, swap them in here

Both routes ultimately output the same cond_long/cond_short, and the subsequent workflow is exactly the same.

3.2 Minimum necessary principles

HistoryPanel.kline.* will return a new panel with additional columns, for example:

sma_20
macd_hist_12_26_9
bbands_upper_20_2_2 etc.

All of these columns can be located via hp.htypes.index(name) into a 2D matrix of (M, L). Then we can write multi-factor conditions with fixed thresholds, and make sure the final condition is a (M, L) bool.
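As a minimal sketch of that lookup, the snippet below builds a tiny (M, L, N) array by hand and pulls 2D factors out by column name. The column name `sma_3`, the numbers, and the helper `field_2d` are all made up for illustration. It also shows a detail that matters for fixed thresholds: rolling indicators have no value during their warm-up window, and a comparison against NaN evaluates to False, so those dates can never be selected.

```python
import numpy as np

htypes = ['close', 'sma_3']                    # hypothetical column names
close = np.array([[10.0, 11.0, 12.0, 13.0],
                  [20.0, 19.0, 18.0, 17.0]])   # (M=2, L=4)

# Toy 3-day simple moving average; the first two dates have no full window.
sma = np.full_like(close, np.nan)
sma[:, 2:] = (close[:, :-2] + close[:, 1:-1] + close[:, 2:]) / 3.0

values = np.stack([close, sma], axis=2)        # (M, L, N)


def field_2d(values, htypes, name):
    """Return the (M, L) slice for column `name`, with a readable error."""
    if name not in htypes:
        raise KeyError(f'{name!r} not in htypes: {htypes}')
    return values[:, :, htypes.index(name)]


ratio = field_2d(values, htypes, 'close') / field_2d(values, htypes, 'sma_3')
cond = ratio > 1.0                             # NaN > 1.0 evaluates to False
warmup = int(np.isnan(ratio).all(axis=0).sum())
print(cond.shape, 'warm-up dates with no signal:', warmup)
```

When you later count how many names are selected per day, the warm-up dates will show up as empty baskets for this reason rather than because the thresholds are too strict.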

3.3 Route A: Proxy factors (fixed thresholds; recommended to get this working first)

3.3.1 What to solve

We use three “very intuitive” proxy factors to help readers get into the flow quickly:

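Value proxy: close / sma20 (the lower the ratio of price to its moving average, the “cheaper” the stock looks; note that cond_long in the code takes the strong side, with the ratio above the threshold, so in this demo the proxy acts as a trend-strength filter)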
Momentum proxy: macd_hist (strength/weakness)
Risk proxy: Bollinger Band width (volatility magnitude)

Two points of “honesty in research conventions” are worth emphasizing here:

These are all proxies, not “true valuation” in the financial sense. Their value lies in: everyone can derive them from market data, and under certain styles they can indeed produce a stratification effect.
They also have clear limitations: close/sma20 may mistake a trend for cheap/expensive; MACD may keep getting slapped in a choppy range; a bandwidth filter will also filter out “opportunities created by volatility.” The focus of this article is to get the pipeline running end to end; the thresholds themselves are just a reproducible starting point.
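To make the risk threshold easier to interpret, here is a small sketch under one explicit assumption: the bands are the moving average plus/minus k = 2 population standard deviations over the same 20-day window (the column suffix suggests this, but check your own data). Under that assumption (upper - lower) / mid equals 4 * std / mean, so a bandwidth cutoff of 0.18 is the same as requiring the 20-day standard deviation to stay below about 4.5% of the average price. The prices are synthetic.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(1)
close = 100.0 + np.cumsum(rng.normal(scale=1.0, size=120))   # synthetic prices

window, k = 20, 2.0
win = sliding_window_view(close, window)    # (L - window + 1, window)
mid = win.mean(axis=1)
sd = win.std(axis=1)                        # population std (ddof=0): an assumption
upper, lower = mid + k * sd, mid - k * sd

width = (upper - lower) / mid               # the risk proxy used in this tutorial
cv = sd / mid                               # std relative to the average price

B_RISK = 0.18
print('width equals 2*k*cv:', bool(np.allclose(width, 2 * k * cv)))
print('equivalent cv cutoff:', B_RISK / (2 * k))
```

Reading the threshold as a relative-volatility cutoff makes it easier to judge whether 0.18 is strict or loose for a given stock pool.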

3.3.2 Runnable code + expected results

import numpy as np

hp2 = hp.kline.sma(window=20, price_htype='close')       # sma_20
hp2 = hp2.kline.macd(price_htype='close')                # macd_hist_12_26_9
hp2 = hp2.kline.bbands(window=20, price_htype='close')   # bbands_*_20_2_2

vals = hp2.values.astype(float)

close = vals[:, :, hp2.htypes.index('close')]
sma20 = vals[:, :, hp2.htypes.index('sma_20')]
macd_hist = vals[:, :, hp2.htypes.index('macd_hist_12_26_9')]

upper = vals[:, :, hp2.htypes.index('bbands_upper_20_2_2')]
mid   = vals[:, :, hp2.htypes.index('bbands_middle_20_2_2')]
lower = vals[:, :, hp2.htypes.index('bbands_lower_20_2_2')]

value_proxy = close / sma20
momentum_proxy = macd_hist
risk_proxy = (upper - lower) / mid

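# Fixed thresholds (beginner-first: simple, runnable, easy to understand)
A_VALUE = 1.02     # a bit stricter: price must be clearly above its moving average to count as "strong"
C_MOM = 0.0        # MACD histogram > 0 is treated as bullish
B_RISK = 0.18      # a very wide band means excessive volatility; filter it out for now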

cond_long = (value_proxy > A_VALUE) & (momentum_proxy > C_MOM) & (risk_proxy < B_RISK)
cond_short = (value_proxy < 1.0 / A_VALUE) & (momentum_proxy < -C_MOM) & (risk_proxy < B_RISK)

print('cond_long shape:', cond_long.shape)   # expected: (M, L)
print('cond_short shape:', cond_short.shape)
print('selected_count_last_day(long):', int(cond_long[:, -1].sum()))

At this point we’ve obtained the “core ingredients” for cross-sectional filtering: cond_long/cond_short (M,L bool).

Next, we still need to do a very practical sanity check: how many stocks are selected each day. If this number is often 0 (an empty basket), your portfolio curve will be intermittent and your conclusions will be unstable; if this number is almost always all stocks, then the screening is meaningless.

You can add this check (it doesn’t change the logic; it just helps us judge whether the thresholds are “too strict/too loose”):

selected_count_by_day = cond_long.sum(axis=0)  # (L,)
print('selected_count stats (long):')
print('  min/max:', int(selected_count_by_day.min()), int(selected_count_by_day.max()))
print('  mean:', float(selected_count_by_day.mean()))
print('  p10/p50/p90:', np.quantile(selected_count_by_day.astype(float), [0.1, 0.5, 0.9]))

3.4 Route B: true valuation factors (optional enhancement; field names are subject to `hp.htypes`)

3.4.1 What needs to be solved

If your local data source has already downloaded valuation fields (e.g., PE/PB/EBITDA), then we can swap them in. In this section we won’t hard-code field names, because the htype naming may differ across data sources/data types. The most robust approach is:

First print hp.htypes
Find the actual column names on your local setup
Then write the conditions using fixed thresholds

3.4.2 Runnable code (illustrative)

print('available htypes:', hp.htypes)

# Suppose you found these three fields in htypes (use the names from your local setup)
# pe_name = 'pe' or 'pe_ttm' ...
# pb_name = 'pb' ...
# ebitda_name = 'ebitda' ...

# pe = hp.values[:, :, hp.htypes.index(pe_name)]
# pb = hp.values[:, :, hp.htypes.index(pb_name)]
# ebitda = hp.values[:, :, hp.htypes.index(ebitda_name)]

# Fixed-threshold example (illustrative only; tune the thresholds to your stock pool and data conventions)
# cond_long = (pe < 15.0) & (pb < 2.0) & (ebitda > 1e9)

As long as you still end up with (M, L) cond_long/cond_short, the rest of the workflow is exactly the same as Route A.
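The lookup-then-threshold steps for Route B can be sketched without touching a real data source. Everything below is hypothetical: the column names (`pe_ttm`, `pb`), the numbers, and the helper `first_available`. The point it illustrates is a common trap with valuation thresholds: a loss-making company has a negative PE, which passes `pe < 15.0`, so a positivity guard is usually needed.

```python
import numpy as np


def first_available(htypes, candidates):
    """Return the first candidate name present in htypes, or None."""
    for name in candidates:
        if name in htypes:
            return name
    return None


# Hypothetical panel: column names and numbers are made up for illustration.
htypes = ['close', 'pe_ttm', 'pb']
values = np.array([
    [[10.0, 12.0, 1.5], [10.5, 12.5, 1.6]],     # profitable and cheap
    [[30.0, -8.0, 0.9], [29.0, -7.5, 0.9]],     # loss-making: negative PE
    [[50.0, 40.0, 6.0], [51.0, np.nan, 6.1]],   # expensive, one missing PE
])                                               # (M=3, L=2, N=3)

pe_name = first_available(htypes, ['pe', 'pe_ttm'])
pb_name = first_available(htypes, ['pb'])
if pe_name is None or pb_name is None:
    raise KeyError(f'valuation columns not found in {htypes}')

pe = values[:, :, htypes.index(pe_name)]
pb = values[:, :, htypes.index(pb_name)]

naive = (pe < 15.0) & (pb < 2.0)                     # also admits negative PE
cond_long = (pe > 0.0) & (pe < 15.0) & (pb < 2.0)    # (M, L) bool
print(naive[:, 0], cond_long[:, 0])
```

Missing values need no special handling here, since a comparison against NaN is False and the name simply stays out of the basket on that date.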

If you find that your local data simply doesn’t have these fields, that’s fine: that’s exactly why we put the “proxy-factor version” on the main narrative path. Route B exists as an enhanced branch, but you can still run the full research loop end-to-end without relying on it.

5.6. 4. Cross-sectional screening: combining conditions -> `where()` research mask

4.1 What this section aims to solve

We turn cond_long/cond_short into a research mask that can be fed directly into portfolio(mask=...).

4.2 Minimum necessary principles

hp.where() supports reshaping (M, L) boolean conditions into an (M, L, N) mask. This step solidifies the “cross-sectional screening rules” into a “research definition,” and all subsequent aggregation and return calculations follow this as the standard.

This is also one of the most valuable habits to build in cross-sectional research: never keep only a single “portfolio curve”—explicitly keep the rule for “which names enter the basket each day” as well. The mask is that rule. It lets you answer review questions like “who exactly did we pick that day?”
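To make the idea concrete, the following NumPy sketch mimics what such a mask represents. It is a conceptual stand-in, not qteasy's implementation of where(). The share codes and date labels are hypothetical. It also includes one defensive step that is an assumption on our part: since the benchmark is loaded as a row of the same panel, its row is forced to False so the index cannot enter its own comparison basket; whether portfolio(benchmark=...) already does this is not stated in this tutorial.

```python
import numpy as np

shares = ['AAA', 'BBB', 'CCC', 'INDEX']         # hypothetical codes
hdates = ['d0', 'd1', 'd2']                     # hypothetical date labels
benchmark = 'INDEX'
n_fields = 2

cond = np.array([
    [True,  False, False],
    [True,  True,  False],
    [False, True,  False],
    [True,  True,  False],                      # the index row can satisfy the rule too
])                                              # (M, L)

# Defensive step (an assumption, not documented behaviour):
# keep the benchmark itself out of the basket.
cond = cond.copy()
cond[shares.index(benchmark), :] = False

# Conceptual equivalent of expanding an (M, L) condition to (M, L, N).
mask = np.broadcast_to(cond[:, :, None], cond.shape + (n_fields,))

# The review question "who did we pick that day?" becomes a lookup.
basket_by_day = {
    d: [s for s, picked in zip(shares, cond[:, j]) if picked]
    for j, d in enumerate(hdates)
}
print(mask.shape, basket_by_day)
```

Keeping `basket_by_day` (or the condition matrix itself) next to the portfolio curve is what makes a later review possible.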

4.3 Runnable code + expected result

mask_long = hp2.where(cond_long)
mask_short = hp2.where(cond_short)

print('mask_long shape:', mask_long.shape)   # expected: (M, L, N)
print('mask_long dtype:', mask_long.dtype)

5.7. 5. Two portfolio curves + benchmark: `portfolio(mask=...)`

5.1 What this section aims to solve

For each trading day, aggregate the set of stocks that meet the conditions into a portfolio curve (long/short), and compare it with 000300.SH.

5.2 The minimum necessary principles

portfolio is for coarse research aggregation, not trade execution
benchmark_output='tag_along' will append the benchmark row to the output, making it easy to compare on the same chart

You can think of it as: for each day, take an equal-weight average of the “set of stocks selected that day” to get a curve. Because the set changes every day, this curve is essentially answering a question:

If every day I only hold “the batch that meets the criteria,” how does this dynamic basket perform over the long run?
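The "equal-weight average of the day's basket" idea can be written in a few lines of NumPy. This is only a conceptual sketch: the function name `equal_weight_curve` is hypothetical, the inputs are made-up normalized prices, and how qteasy's portfolio treats empty baskets or missing values may differ. Here an empty basket yields NaN, so gaps stay visible instead of turning into a misleading 0.

```python
import numpy as np


def equal_weight_curve(values, cond):
    """Per-date mean of `values` over the rows where `cond` is True.

    Dates with an empty basket give NaN instead of a silent 0.
    """
    values = np.asarray(values, dtype=float)
    cond = np.asarray(cond, dtype=bool)
    if values.shape != cond.shape:
        raise ValueError(f'shape mismatch: {values.shape} vs {cond.shape}')
    counts = cond.sum(axis=0)                          # basket size per date
    totals = np.where(cond, values, 0.0).sum(axis=0)   # sum over selected rows
    curve = np.full(values.shape[1], np.nan)
    np.divide(totals, counts, out=curve, where=counts > 0)
    return curve


norm_close = np.array([
    [1.00, 1.10, 1.20],
    [1.00, 0.90, 0.80],
    [1.00, 1.00, 1.30],
])                                                     # (M=3, L=3), made up
cond = np.array([
    [True,  True,  False],
    [True,  False, False],
    [False, True,  False],
])
curve = equal_weight_curve(norm_close, cond)
print(curve)
```

On the third date nobody qualifies, so the curve has a gap there; this is the "intermittent curve" symptom that the selection-count check is meant to catch early.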

5.3 Runnable code + expected results

benchmark = '000300.SH'

pf_long = hp2.portfolio(
    htypes='close',
    mode='equal',
    mask=mask_long,
    benchmark=benchmark,
    benchmark_output='tag_along',
    new_share_name='LONG',
)

pf_short = hp2.portfolio(
    htypes='close',
    mode='equal',
    mask=mask_short,
    benchmark=benchmark,
    benchmark_output='tag_along',
    new_share_name='SHORT',
)

print('pf_long.shares:', pf_long.shares)
print('pf_long.shape:', pf_long.shape)

5.8. 6. `normalize / cum_return` + CAGR: a comparable annualized summary table

6.1 What this section aims to solve

We want to both “look at the curve” and “have a reusable numeric summary.” So we do two things:

normalize: align the starting point (for easier visual comparison)
cum_return + CAGR: output a summary table

6.2 The principle of minimum necessity

cum_return outputs cumulative return cumret_*; using the ending value together with the number of years lets you derive CAGR. (For the exact definition of CAGR, see the earlier discussion; we won’t expand the derivation here.)
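As a quick worked example of that conversion, CAGR = (1 + R) ** (1 / years) - 1, where R is the cumulative simple return at the end of the period. The sketch below uses standalone helper names (`cagr`, `years_between`) chosen for illustration and adds two input checks that the tutorial's own helpers do not have. It also shows why short samples deserve caution: a 10% gain over half a year annualizes to 21%.

```python
from datetime import date


def years_between(start: date, end: date) -> float:
    """Calendar-day year fraction, using 365.25 days per year."""
    return (end - start).days / 365.25


def cagr(cumret_end: float, years: float) -> float:
    """Annualize a cumulative simple return; reject degenerate inputs."""
    if years <= 0:
        raise ValueError('years must be positive')
    if cumret_end <= -1.0:
        raise ValueError('cumulative return must be greater than -100%')
    return (1.0 + cumret_end) ** (1.0 / years) - 1.0


two_year = cagr(0.21, 2.0)     # 1.21 ** 0.5 - 1, about 0.10
half_year = cagr(0.10, 0.5)    # 1.10 ** 2 - 1, about 0.21
span = years_between(date(2020, 1, 1), date(2022, 1, 1))

print(round(two_year, 6), round(half_year, 6), round(span, 4))
```

For a sample of roughly one year, as in this tutorial, CAGR and the raw cumulative return will be close, so the table is mainly useful once you compare periods of different lengths.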

6.3 Runnable code (illustrative)

import pandas as pd

def _years_between(hdates) -> float:
    idx = pd.DatetimeIndex(hdates)
    days = (idx[-1] - idx[0]).days
    return max(1e-9, days / 365.25)

def _cagr_from_cumret(cumret_end: float, years: float) -> float:
    return (1.0 + cumret_end) ** (1.0 / years) - 1.0

years = _years_between(pf_long.hdates)
cr_long = pf_long.cum_return(htypes='close', method='simple')

cumret_long_end = float(cr_long.values[cr_long.shares.index('LONG'), -1, 0])
cumret_bm_end = float(cr_long.values[cr_long.shares.index('000300.SH'), -1, 0])

print('CAGR(long):', _cagr_from_cumret(cumret_long_end, years))
print('CAGR(bm):', _cagr_from_cumret(cumret_bm_end, years))

I suggest you organize it into a small table (at least including three rows: LONG/SHORT/benchmark), so readers can compare at a glance:

cr_short = pf_short.cum_return(htypes='close', method='simple')
cumret_short_end = float(cr_short.values[cr_short.shares.index('SHORT'), -1, 0])

summary = pd.DataFrame(
    {
        'cum_return_end': [cumret_long_end, cumret_short_end, cumret_bm_end],
        'CAGR': [
            _cagr_from_cumret(cumret_long_end, years),
            _cagr_from_cumret(cumret_short_end, years),
            _cagr_from_cumret(cumret_bm_end, years),
        ],
    },
    index=['LONG', 'SHORT', '000300.SH'],
)
print('\n[CAGR summary]')
print(summary)

5.9. 7. Visualization and explanation: use `plot(highlight=...)` to highlight the “key divergence day”

7.1 What this section aims to solve

When you take cross-sectional research all the way to the end, the type of question that most needs explaining is: in the stretch where long and the benchmark diverge the most, what exactly happened?

So we use highlight to highlight a “key divergence day” segment (e.g., the point of maximum/minimum cumulative return, or an interval you pick yourself), pulling the reader’s attention back to the chart.

7.2 The principle of minimum necessity

highlight supports the shorthand 'max'/'min', and also supports a 1D bool condition. To keep the tutorial stable, we’ll first demonstrate the shorthand version (least likely to trip you up), and later provide the 1D bool form.

7.3 Runnable code + expected output

fig = pf_long.plot(interactive=True, highlight='max')
fig

Expected result: the chart will mark the maximum point of the LONG portfolio curve (or the maximum point as defined by a given chart). In cross-sectional research, we usually treat this as a “reminder”: around the maximum point is often the period when long had the strongest tailwind relative to the market, and it’s worth looking back to see whether the screening criteria led us into a particular style (e.g., strong trend, low volatility, etc.).

If you want to align more closely with “the divergence between long vs benchmark,” you can use a more practical approach: first compute the excess-return curve (the difference between long and benchmark), then extract the position of the maximum point on that excess curve and turn it into a 1D bool highlight condition. (This is an optional enhancement; you don’t have to expand on it in the main narrative.)
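A minimal sketch of that optional enhancement, using made-up normalized curves in place of the real LONG and benchmark series: it builds the excess curve, finds the date where the gap is largest, and produces a 1D bool array of length L. Passing such an array as the highlight condition is assumed to work as described in 7.2; the exact form plot() accepts has not been verified here.

```python
import numpy as np

# Made-up normalized curves; in practice take them from the normalized panel.
long_curve = np.array([1.00, 1.02, 1.08, 1.05, 1.01])
bm_curve = np.array([1.00, 1.01, 1.00, 1.02, 1.03])

excess = long_curve - bm_curve
peak = int(np.nanargmax(excess))            # position of the largest gap

highlight_cond = np.zeros(excess.shape, dtype=bool)
highlight_cond[peak] = True                 # single "key divergence day"

# Optional: widen the single day into a short segment around the peak.
half_width = 1
lo = max(0, peak - half_width)
hi = min(len(excess), peak + half_width + 1)
segment_cond = np.zeros(excess.shape, dtype=bool)
segment_cond[lo:hi] = True

print(peak, highlight_cond, segment_cond)
```

Using `np.nanargmin` instead gives the day on which the long basket lagged the benchmark the most, which is often the more instructive one to review.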

5.10. 8. Complete code (single-function runnable version)

Below is a complete “single-function runnable” version for you to copy into a Notebook and run with one click. It covers:

Route A (proxy factors with fixed thresholds) as the main line;
A sanity check on the number selected each day;
A complete closed loop of where -> portfolio -> cum_return -> CAGR;
A stable demo of plot(highlight=...).

import numpy as np
import pandas as pd
import qteasy as qt


def demo_horizontal_multifactor(
        shares: list,
        benchmark: str = '000300.SH',
        start: str = '20220101',
        end: str = '20221231',
):
    """Demonstrate cross-sectional multi-factor screening: proxy factors -> daily basket -> portfolio -> CAGR -> highlight explanation.

    Parameters
    ----------
    shares : list
        Stock pool (must include benchmark; 30 to 80 names gives a more meaningful cross-section).
    benchmark : str, default '000300.SH'
        Benchmark index code.
    start : str, default '20220101'
        Start date (YYYYMMDD).
    end : str, default '20221231'
        End date (YYYYMMDD).

    Returns
    -------
    dict
        Result objects: hp, hp2, pf_long, pf_short, summary, fig.
    """
    if benchmark not in shares:
        raise ValueError('benchmark must be included in shares')

    hp = qt.get_kline(
        shares=shares,
        start=start,
        end=end,
        freq='D',
        as_panel=True,
    )
    if 'close' not in hp.htypes:
        raise ValueError('Missing close column in htypes')
    if hp.shape[1] < 50:
        raise ValueError(
            'Not enough data points loaded (too few hdates). '
            'Please check your local datasource and date range.'
        )

    # 1) Route A: proxy factors (fixed thresholds)
    hp2 = hp.kline.sma(window=20, price_htype='close')
    hp2 = hp2.kline.macd(price_htype='close')
    hp2 = hp2.kline.bbands(window=20, price_htype='close')

    vals = hp2.values.astype(float)
    close = vals[:, :, hp2.htypes.index('close')]
    sma20 = vals[:, :, hp2.htypes.index('sma_20')]
    macd_hist = vals[:, :, hp2.htypes.index('macd_hist_12_26_9')]
    upper = vals[:, :, hp2.htypes.index('bbands_upper_20_2_2')]
    mid = vals[:, :, hp2.htypes.index('bbands_middle_20_2_2')]
    lower = vals[:, :, hp2.htypes.index('bbands_lower_20_2_2')]

    value_proxy = close / sma20
    momentum_proxy = macd_hist
    risk_proxy = (upper - lower) / mid

    A_VALUE = 1.02
    C_MOM = 0.0
    B_RISK = 0.18

    cond_long = (value_proxy > A_VALUE) & (momentum_proxy > C_MOM) & (risk_proxy < B_RISK)
    cond_short = (value_proxy < 1.0 / A_VALUE) & (momentum_proxy < -C_MOM) & (risk_proxy < B_RISK)

    # 2) Sanity check: number of stocks selected each day
    selected_count_by_day = cond_long.sum(axis=0)
    print('\n[Selection count stats]')
    print('  min/max:', int(selected_count_by_day.min()), int(selected_count_by_day.max()))
    print('  mean:', float(selected_count_by_day.mean()))
    print('  p10/p50/p90:', np.quantile(selected_count_by_day.astype(float), [0.1, 0.5, 0.9]))

    # 3) Conditions -> mask (research definition)
    mask_long = hp2.where(cond_long)
    mask_short = hp2.where(cond_short)

    # 4) Portfolio aggregation + benchmark
    pf_long = hp2.portfolio(
        htypes='close',
        mode='equal',
        mask=mask_long,
        benchmark=benchmark,
        benchmark_output='tag_along',
        new_share_name='LONG',
    )
    pf_short = hp2.portfolio(
        htypes='close',
        mode='equal',
        mask=mask_short,
        benchmark=benchmark,
        benchmark_output='tag_along',
        new_share_name='SHORT',
    )

    # 5) cum_return -> CAGR
    def _years_between(hdates) -> float:
        idx = pd.DatetimeIndex(hdates)
        days = (idx[-1] - idx[0]).days
        return max(1e-9, days / 365.25)

    def _cagr_from_cumret(cumret_end: float, years: float) -> float:
        return (1.0 + cumret_end) ** (1.0 / years) - 1.0

    years = _years_between(pf_long.hdates)
    cr_long = pf_long.cum_return(htypes='close', method='simple')
    cr_short = pf_short.cum_return(htypes='close', method='simple')

    cumret_long_end = float(cr_long.values[cr_long.shares.index('LONG'), -1, 0])
    cumret_short_end = float(cr_short.values[cr_short.shares.index('SHORT'), -1, 0])
    cumret_bm_end = float(cr_long.values[cr_long.shares.index(benchmark), -1, 0])

    summary = pd.DataFrame(
        {
            'cum_return_end': [cumret_long_end, cumret_short_end, cumret_bm_end],
            'CAGR': [
                _cagr_from_cumret(cumret_long_end, years),
                _cagr_from_cumret(cumret_short_end, years),
                _cagr_from_cumret(cumret_bm_end, years),
            ],
        },
        index=['LONG', 'SHORT', benchmark],
    )
    print('\n[CAGR summary]')
    print(summary)

    # 6) Chart: portfolio comparison (normalizing makes it easier to read)
    fig = pf_long.normalize(htypes='close', base_index=0).plot(interactive=True, highlight='max')
    return {
        'hp': hp,
        'hp2': hp2,
        'pf_long': pf_long,
        'pf_short': pf_short,
        'summary': summary,
        'fig': fig,
    }


res = demo_horizontal_multifactor(
    shares=[
        '000001.SZ', '600519.SH', '300750.SZ', '000333.SZ', '600036.SH',
        '601318.SH', '002415.SZ', '000858.SZ', '600276.SH', '000725.SZ',
        '000300.SH',
    ],
    benchmark='000300.SH',
    start='20220101',
    end='20221231',
)
res['fig']

5.11. 9. Summary and limitations

At this point, we’ve completed the research loop for “cross-sectional multi-factor stock selection.” To emphasize once more: portfolio/cum_return are research-oriented coarse aggregations and carry no real trade-execution semantics. If you want to migrate this screening logic into a strategy backtest, we recommend outputting the conditions/factors as strategy signals and handing them off to Operator/Backtester to handle the trading-layer details.

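5.12. Appendix: figure index (optional: generate the figures in a Notebook, then take screenshots)

The file names below are suggestions; the repository does not necessarily contain the corresponding png files. After running each section in a Notebook, take the screenshots yourself.

Suggested file name	Suggested placement	What you’ll see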
`3.2_minimal_run.png`	§0	The minimal runnable chart with more shares
`3.2_pf_compare.png`	§0.5 or §6	Normalized curve comparison of `LONG/SHORT/000300.SH`
`3.2_highlight_key_day.png`	§7	Example of highlighting a “key point” (e.g., the max point)