Mac M1 & Win10 Intel NVIDIA RTX 2070: model replies are garbled

Runtime environment

Mac M1 Pro, 16 GB
Windows 10
Intel i7-8700
RTX 2070, 8 GB

The run log is shown in the attached screenshot.

I previously ran the 3B model and it worked fine; after switching to the 7B model, the output became garbled. I'm not sure what the problem is.

Is the model a directly downloaded .st file, or did you convert it yourself? If you converted it, did you use the executable converter? If so, please convert the model to .st with the Python script instead, or download a pre-converted model from https://huggingface.co/cgisky/ai00_rwkv_x060/tree/main or https://huggingface.co/cgisky/RWKV-x070-Ai00/tree/main/world_v3

The model is an .st file downloaded directly from HF.

Have you tried updating the GPU driver? Also, could you share your config file?

What Ai00 version are you running?

Ai00 version is the latest, 0.5.13.

Config is as follows:

[model]
embed_device = "Cpu"                                   # Device to put the embed tensor ("Cpu" or "Gpu").
max_batch = 8                                          # The maximum batches that are cached on GPU.
name = "RWKV-x060-World-7B-v3-20241112-ctx4096.st" # Name of the model.
path = "assets/models"                                 # Path to the folder containing all models.
precision = "Fp16"                                     # Precision for intermediate tensors ("Fp16" or "Fp32"). "Fp32" yields better outputs but slower.
quant = 0                                              # Layers to be quantized.
quant_type = "Int8"                                    # Quantization type ("Int8" or "NF4").
stop = ["\n\n"]                                        # Additional stop words in generation.
token_chunk_size = 128                                 # Size of token chunk that is inferred at once. For high end GPUs, this could be 64 or 128 (faster).

# [[state]] # State-tuned initial state.
# id = "fd7a60ed-7807-449f-8256-bccae3246222"                      # UUID for this state, which is used to specify which one to use in the APIs.
# name = "x060-3B"                                                 # Given name for this state (optional).
# path = "rwkv-x060-chn_single_round_qa-3B-20240505-ctx1024.state"

# [[state]] # Load another initial state.
# id = "6a9c60a4-0f4c-40b1-a31f-987f73e20315"                      # UUID for this state.
# path = "rwkv-x060-chn_single_round_qa-3B-20240502-ctx1024.state"

# [[lora]] # LoRA and blend factors.
# alpha = 192
# path = "assets/models/rwkv-x060-3b.lora"

[tokenizer]
path = "assets/tokenizer/rwkv_vocab_v20230424.json" # Path to the tokenizer.

[bnf]
enable_bytes_cache = true   # Enable the cache that accelerates the expansion of certain short schemas.
start_nonterminal = "start" # The initial nonterminal of the BNF schemas.

[adapter]
Auto = {} # Choose the best GPU.
# Manual = 0 # Manually specify which GPU to use.

[listen]
acme = false
domain = "local"
ip = "0.0.0.0"   # Use IpV4.
# ip = "::"        # Use IpV6.
force_pass = true
port = 65530
slot = "permisionkey"
tls = false

[[listen.app_keys]] # Allow multiple app keys.
app_id = "admin"
secret_key = "ai00_is_good"

[web] # Remove this to disable WebUI.
path = "assets/www/index.zip" # Path to the WebUI.

# [embed] # Uncomment to enable embed models (via fast-embedding onnx models).
# endpoint = "https://hf-mirror.com"
# home = "assets/models/hf"
# lib = "assets/ort/onnxruntime.dll"  # Only used under windows.
# name = { MultilingualE5Small = {} }

See the config above. It shouldn't be a driver issue, since the 3B model runs fine.

It looks like the loaded model is much larger than your VRAM, which may be why the computation goes wrong. Could you try NF4-quantizing all 32 layers?
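A rough size estimate supports this: 7B parameters at Fp16 (2 bytes each) need about 14 GB for the weights alone, well beyond the RTX 2070's 8 GB, whereas NF4 (~0.5 bytes per weight) brings the quantized layers down to roughly a quarter of that. Assuming the 7B model's 32 layers, the suggestion amounts to changing two keys in the `[model]` section of the config above:

```toml
[model]
quant = 32          # Quantize all 32 layers.
quant_type = "NF4"  # NF4 is ~4 bits per weight, about half of Int8's footprint.
```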

Indeed. On the Mac I changed the config to NF4-quantize 32 layers, and the output is now intelligible. I'll try the Windows 10 environment later.