Question about the document count when converting JSONL to bin with json2binidx_tool

Hi everyone, I have a question.
I used json2binidx_tool-main to convert a JSONL file to bin format. According to the log, the JSONL file contains 1874500 documents, but when I count with `wc -l largefile.jsonl` I get 1874538. Which count is correct? Or does json2binidx_tool-main drop the last two digits? The JSONL file is too large for me to open and inspect directly, so seeing the two numbers differ makes me worried that something went wrong along the way.

Here is the log:

#######################################################################################################################

This tokenizer is not used in any RWKV models yet. I plan to use it for the future multilang RWKV models.

Benefits:

* Good support of most languages, from European to CJK to Arabic and Hindi and more.

* Clean vocab. Good for code too. Vocab size = 65525 (use 0 for <|endoftext|>).

* Good at numbers: the numerical tokens are '0'~'9', '10'~'99', ' 0'~' 9', ' 10'~' 99'.

* Very easy tokenization:

** The input text must be in UTF-8.

** Greedy encoding: always pick the longest (in bytes) token (with the highest id) that matches your UTF-8 bytes.

* The tokenization result is surprisingly good, because the vocab respects word boundaries and UTF-8 boundaries.

For 10x faster speed:
mypyc rwkv_tokenizer.py
python3 -c "import rwkv_tokenizer"

#######################################################################################################################

> building RWKVTokenizer tokenizer ...
 > padded vocab (size: 65525) with 11 dummy tokens (new size: 65536)
Vocab size: 65525
Output prefix: ./data/0.merged_output
> building RWKVTokenizer tokenizer ...
 > padded vocab (size: 65525) with 11 dummy tokens (new size: 65536)
Processed 1874500 documents (1513.87 docs/s, 2.02 MB/s).: : 1874500it [20:38, 1513.65it/s]
PS F:\json2binidx_tool-main>
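Since the file is too large to open in an editor, one way to check is to compare the raw line count (what `wc -l` measures, which includes blank lines) against the number of lines that actually parse as JSON (what the converter should treat as documents). A minimal sketch, assuming one JSON object per line and the file name from the post:

```python
import json

# Minimal sketch: compare the raw line count (what `wc -l` reports)
# with the number of lines that parse as JSON documents (what the
# converter should count). File name taken from the post.
raw_lines = 0
documents = 0
with open("largefile.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        raw_lines += 1
        stripped = line.strip()
        if not stripped:
            continue  # blank lines inflate `wc -l` but produce no document
        try:
            json.loads(stripped)
            documents += 1
        except json.JSONDecodeError:
            pass  # malformed lines are counted by `wc -l` only
print(f"raw lines: {raw_lines}, JSON documents: {documents}")
```

If both numbers come out as 1874538, nothing was lost in conversion and the discrepancy is purely in how the log reports progress.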

Want to try my tokenizer? GitHub - Seikaijyu/RWKV-PEFT-Simple: simpler fine-tuning, with convenient scripts and fine-tuning instructions


OK, thank you!

That's right, it counts in units of 100: the progress counter updates every 100 documents, so the final log line shows the total rounded down to the last multiple of 100 (1874538 becomes 1874500). No documents were dropped.
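For reference, the authoritative document count can also be read straight from the generated .idx file rather than from the progress log. A minimal sketch, assuming json2binidx_tool writes the Megatron-LM MMapIndexedDataset index layout (9-byte magic "MMIDIDX", little-endian header fields) and that the output file name follows the prefix shown in the log; both are assumptions, not confirmed from the tool's source:

```python
import struct

# Minimal sketch: read sequence/document counts from the .idx header.
# ASSUMPTION: json2binidx_tool writes Megatron-LM's MMapIndexedDataset
# index format (magic b"MMIDIDX\x00\x00", little-endian fields).
def read_counts(idx_path):
    with open(idx_path, "rb") as f:
        assert f.read(9) == b"MMIDIDX\x00\x00", "unexpected index format"
        (version,) = struct.unpack("<Q", f.read(8))      # format version
        (dtype_code,) = struct.unpack("<B", f.read(1))   # token dtype id
        (num_sequences,) = struct.unpack("<Q", f.read(8))
        (doc_idx_len,) = struct.unpack("<Q", f.read(8))
        # Megatron's builder seeds the document index with a leading 0,
        # so the document count is one less than the stored array length
        # (assumption based on Megatron-LM's indexed_dataset.py).
        return num_sequences, doc_idx_len - 1

# File name is an assumption derived from the log's
# "Output prefix: ./data/0.merged_output".
print(read_counts("./data/0.merged_output_text_document.idx"))
```

If this prints a document count of 1874538, the .bin/.idx pair contains every line of the JSONL file.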
