Tinker: Post-Training MaaS
- tinker: the training SDK. Users issue training requests by calling the API; the underlying training logic is implemented by the tinker backend.
- tinker-cookbook: practical examples of fine-tuning LLMs, built on the Tinker API, plus a set of common abstraction layers.
Tinker
Sampling
Loss Function
When calling `forward_backward`, specify the loss function.
- Input: `forward_backward` expects a certain set of input tensors, passed in via `datum.loss_fn_inputs`, which is a dict mapping `str` to either a numpy or torch tensor.
- Output: `forward_backward` returns a `ForwardBackwardOutput`, which has a set of output tensors in `fwd_bwd_result.loss_fn_outputs`.
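As a minimal sketch of the per-datum payload described above (the exact tinker `Datum` type and field names should be checked against the real SDK; the values here are illustrative):

```python
import numpy as np

# Hypothetical loss_fn_inputs dict: str keys mapping to numpy tensors,
# matching the schema of the cross-entropy loss function below.
loss_fn_inputs = {
    "target_tokens": np.array([42, 7, 99], dtype=np.int64),   # shape (N,), int
    "weights": np.array([0.0, 1.0, 1.0], dtype=np.float32),   # shape (N,), float
}

# forward_backward would then return per-token outputs with the same
# leading dimension, e.g. {"logprobs": array of shape (N,)}.
```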
SFT: Cross Entropy
For SL, we implement the standard cross-entropy loss (i.e., negative log-likelihood), which optimizes the policy $p_{\theta}$ to maximize the weighted log-probability of the tokens $x$:

$$\mathcal{L}(\theta) = -\mathbb{E}_x \left[ w(x) \log p_\theta(x) \right]$$

where `weights` is either 0 or 1, typically generated from `renderer.build_supervised_example()`, which returns `(model_input, weights)` (i.e., to specify the desired assistant turns to train on).
- Input tensors:
  - `target_tokens: array[(N,), int]` - Target token IDs
  - `weights: array[(N,), float]` - Token-level loss weights (typically from the renderer)
- Output tensors:
  - `logprobs: array[(N,), float]` - Log probabilities of predicted tokens
- Output diagnostics:
  - `loss:sum` (scalar) - Sum of weighted cross-entropy losses
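A reference computation of this loss in plain numpy (not the tinker backend; the per-token logprobs are taken as given):

```python
import numpy as np

def weighted_cross_entropy(logprobs: np.ndarray, weights: np.ndarray) -> float:
    """Sum of weighted negative log-likelihoods.

    logprobs: (N,) log p_theta(x_n) for each target token (the output tensor).
    weights:  (N,) 0/1 mask from the renderer selecting assistant tokens.
    """
    return float(-(weights * logprobs).sum())

# Example: only the last two tokens are assistant tokens, so the first
# token contributes nothing to the loss.
logprobs = np.log(np.array([0.5, 0.25, 0.125]))
weights = np.array([0.0, 1.0, 1.0])
loss = weighted_cross_entropy(logprobs, weights)  # -(log 0.25 + log 0.125)
```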
Policy Gradient: importance sampling
For RL, we implement a common variant of the policy gradient objective, used in practical settings where the learner policy $p$ may differ from the sampling policy $q$, which is common due to, e.g., non-determinism. The issue is that if these policies differ, then the objective:

$$\mathcal{L}(\theta) = \mathbb{E}_{x \sim p_\theta} \left[ A(x) \right]$$

is not computed in an unbiased way, due to $x \sim q$ (sampler) not exactly matching the desired $x \sim p_\theta$ (learner). To correct the bias, we use a modified "importance sampling" objective:

$$\mathcal{L}_{\text{IS}}(\theta) = \mathbb{E}_{x \sim q} \left[ \frac{p_\theta(x)}{q(x)} A(x) \right],$$

which yields the correct expected reward. In the formula above:
- $\log p_\theta(x)$ – `target_logprobs` is from the learner, computed on the forward part of the `forward_backward` pass.
- $\log q(x)$ – `sampling_logprobs` is from the sampler, recorded during sampling as a correction term.
- Input tensors:
  - `target_tokens: array[(N,), int]` - Target token IDs (from the sampler $q$)
  - `logprobs: array[(N,), float]` - `sampling_logprobs` for the tokens
  - `advantages: array[(N,), float]` - Advantage values for RL (positive to reinforce, negative to discourage)
- Output tensors:
  - `logprobs: array[(N,), float]` - `target_logprobs` for the tokens
- Output diagnostics:
  - `loss:sum` (scalar) - Sum of importance-weighted policy gradient losses
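The per-token form of $\mathcal{L}_{\text{IS}}$ can be sketched in plain numpy (again, a reference computation, not the tinker backend):

```python
import numpy as np

def importance_sampling_pg_loss(target_logprobs, sampling_logprobs, advantages):
    """Importance-weighted policy gradient loss, summed over tokens.

    ratio_n = p_theta(x_n) / q(x_n) = exp(target_logprobs_n - sampling_logprobs_n);
    the objective is negated so that minimizing the loss reinforces tokens
    with positive advantage.
    """
    ratio = np.exp(target_logprobs - sampling_logprobs)
    return float(-(ratio * advantages).sum())

# Learner agrees with the sampler on token 0 (ratio 1.0) and halves the
# probability of token 1 (ratio 0.5), which damps that token's gradient.
loss = importance_sampling_pg_loss(
    target_logprobs=np.log([0.5, 0.2]),
    sampling_logprobs=np.log([0.5, 0.4]),
    advantages=np.array([1.0, -1.0]),
)
```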
PPO
Customizing the clip threshold
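PPO clips the importance ratio before applying the advantage. A sketch of the clipped surrogate with a configurable threshold (the `clip_eps` parameter name is illustrative, not necessarily tinker's):

```python
import numpy as np

def ppo_clip_loss(target_logprobs, sampling_logprobs, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate loss with a custom clip threshold.

    Takes the pessimistic (elementwise min) of the unclipped and clipped
    objectives, then negates so minimizing the loss maximizes the objective.
    """
    ratio = np.exp(target_logprobs - sampling_logprobs)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    objective = np.minimum(ratio * advantages, clipped * advantages)
    return float(-objective.sum())

# Ratio 2.0 with positive advantage is clipped down to 1 + clip_eps = 1.2,
# limiting how far a single update can push the policy.
loss = ppo_clip_loss(
    target_logprobs=np.log([0.8]),
    sampling_logprobs=np.log([0.4]),
    advantages=np.array([1.0]),
    clip_eps=0.2,
)
```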
CISPO
DRO
Saving and loading weights
Main APIs
- `save_weights_for_sampler()`: saves a copy of the model weights that can be used for sampling.
- `save_state()`: saves the weights and the optimizer state. You can fully resume training from this checkpoint.
- `load_state()`: loads the weights and the optimizer state. You can fully resume training from this checkpoint.
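A toy illustration of the difference between the two checkpoint kinds (plain Python dicts standing in for real checkpoints; this is the semantics described above, not the tinker implementation):

```python
def save_weights_for_sampler(model):
    """Sampler checkpoint: weights only, enough to generate tokens."""
    return {"weights": dict(model["weights"])}

def save_state(model):
    """Full checkpoint: weights plus optimizer state, enough to resume training."""
    return {"weights": dict(model["weights"]),
            "optimizer": dict(model["optimizer"])}

def load_state(checkpoint):
    """Restore a model that can continue training from a full checkpoint."""
    return {"weights": dict(checkpoint["weights"]),
            "optimizer": dict(checkpoint["optimizer"])}

model = {"weights": {"w0": 0.1},
         "optimizer": {"step": 42, "m": {"w0": 0.01}}}
full = save_state(model)
resumed = load_state(full)                       # optimizer state intact
sampler_ckpt = save_weights_for_sampler(model)   # no optimizer state: sampling only
```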
Downloading Weights
Tinker Cookbook