Question about the document count when converting JSONL to bin with json2binidx_tool

Hi everyone, I have a question.
I used json2binidx_tool-main to convert a JSONL file to bin format. According to the log, the JSONL file contains 1874500 documents, but when I count with `wc -l largefile.jsonl` I get 1874538. Which count is correct? Or does json2binidx_tool-main drop the last two digits? The JSONL file is too large for me to open and inspect directly, so seeing the two numbers differ makes me worried that something went wrong along the way.

Here is the log:

#######################################################################################################################

This tokenizer is not used in any RWKV models yet. I plan to use it for the future multilang RWKV models.

Benefits:

* Good support of most languages, from European to CJK to Arabic and Hindi and more.

* Clean vocab. Good for code too. Vocab size = 65525 (use 0 for <|endoftext|>).

* Good at numbers: the numerical tokens are '0'~'9', '10'~'99', ' 0'~' 9', ' 10'~' 99'.

* Very easy tokenization:

** The input text must be in UTF-8.

** Greedy encoding: always pick the longest (in bytes) token (with the highest id) that matches your UTF-8 bytes.

* The tokenization result is surprisingly good, because the vocab respects word boundaries and UTF-8 boundaries.

For 10x faster speed:
mypyc rwkv_tokenizer.py
python3 -c "import rwkv_tokenizer"

#######################################################################################################################

> building RWKVTokenizer tokenizer ...
 > padded vocab (size: 65525) with 11 dummy tokens (new size: 65536)
Vocab size: 65525
Output prefix: ./data/0.merged_output
> building RWKVTokenizer tokenizer ...
 > padded vocab (size: 65525) with 11 dummy tokens (new size: 65536)
Processed 1874500 documents (1513.87 docs/s, 2.02 MB/s).: : 1874500it [20:38, 1513.65it/s]
PS F:\json2binidx_tool-main>
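Since the file is too large to open in an editor, one way to check is to compare the raw line count (what `wc -l` measures, which includes blank lines) against the number of lines that actually parse as JSON (what the converter should treat as documents). A minimal sketch, assuming one JSON object per line and the file name from the post:

```python
import json

# Minimal sketch: compare the raw line count (what `wc -l` reports)
# with the number of lines that parse as JSON documents (what the
# converter should count). File name taken from the post.
raw_lines = 0
documents = 0
with open("largefile.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        raw_lines += 1
        stripped = line.strip()
        if not stripped:
            continue  # blank lines inflate `wc -l` but produce no document
        try:
            json.loads(stripped)
            documents += 1
        except json.JSONDecodeError:
            pass  # malformed lines are counted by `wc -l` only
print(f"raw lines: {raw_lines}, JSON documents: {documents}")
```

If both numbers come out as 1874538, nothing was lost in conversion and the discrepancy is purely in how the log reports progress.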

Want to try my tokenizer? GitHub - Seikaijyu/RWKV-PEFT-Simple: simpler fine-tuning, with convenient scripts and fine-tuning instructions


OK, thank you!

That's right, it counts in units of 100: the progress counter updates every 100 documents, so the final log line shows the total rounded down to the last multiple of 100 (1874538 becomes 1874500). No documents were dropped.
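For reference, the authoritative document count can also be read straight from the generated .idx file rather than from the progress log. A minimal sketch, assuming json2binidx_tool writes the Megatron-LM MMapIndexedDataset index layout (9-byte magic "MMIDIDX", little-endian header fields) and that the output file name follows the prefix shown in the log; both are assumptions, not confirmed from the tool's source:

```python
import struct

# Minimal sketch: read sequence/document counts from the .idx header.
# ASSUMPTION: json2binidx_tool writes Megatron-LM's MMapIndexedDataset
# index format (magic b"MMIDIDX\x00\x00", little-endian fields).
def read_counts(idx_path):
    with open(idx_path, "rb") as f:
        assert f.read(9) == b"MMIDIDX\x00\x00", "unexpected index format"
        (version,) = struct.unpack("<Q", f.read(8))      # format version
        (dtype_code,) = struct.unpack("<B", f.read(1))   # token dtype id
        (num_sequences,) = struct.unpack("<Q", f.read(8))
        (doc_idx_len,) = struct.unpack("<Q", f.read(8))
        # Megatron's builder seeds the document index with a leading 0,
        # so the document count is one less than the stored array length
        # (assumption based on Megatron-LM's indexed_dataset.py).
        return num_sequences, doc_idx_len - 1

# File name is an assumption derived from the log's
# "Output prefix: ./data/0.merged_output".
print(read_counts("./data/0.merged_output_text_document.idx"))
```

If this prints a document count of 1874538, the .bin/.idx pair contains every line of the JSONL file.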
