Hi everyone, quick question.
I used json2binidx_tool-main to convert a jsonl file into bin format. The log reports 1874500 documents, but running "wc -l largefile.jsonl" gives 1874538. Which number is the correct document count? Or does json2binidx_tool-main simply truncate the last two digits in its progress output? The jsonl file is too large for me to open and inspect directly, so seeing the two numbers disagree makes me worry that something went wrong along the way.
Here is the log:
#######################################################################################################################
This tokenizer is not used in any RWKV models yet. I plan to use it for the future multilang RWKV models.
Benefits:
* Good support of most languages, from European to CJK to Arabic and Hindi and more.
* Clean vocab. Good for code too. Vocab size = 65525 (use 0 for <|endoftext|>).
* Good at numbers: the numerical tokens are '0'~'9', '10'~'99', ' 0'~' 9', ' 10'~' 99'.
* Very easy tokenization:
** The input text must be in UTF-8.
** Greedy encoding: always pick the longest (in bytes) token (with the highest id) that matches your UTF-8 bytes.
* The tokenization result is surprisingly good, because the vocab respects word boundaries and UTF-8 boundaries.
For 10x faster speed:
mypyc rwkv_tokenizer.py
python3 -c "import rwkv_tokenizer"
#######################################################################################################################
> building RWKVTokenizer tokenizer ...
> padded vocab (size: 65525) with 11 dummy tokens (new size: 65536)
Vocab size: 65525
Output prefix: ./data/0.merged_output
> building RWKVTokenizer tokenizer ...
> padded vocab (size: 65525) with 11 dummy tokens (new size: 65536)
Processed 1874500 documents (1513.87 docs/s, 2.02 MB/s).: : 1874500it [20:38, 1513.65it/s]
PS F:\json2binidx_tool-main>
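To double-check on my side, my plan is to stream through the file once with a small Python script and count blank or unparsable lines, since "wc -l" counts every line while the converter presumably only counts actual JSON documents. This is only a rough sketch under my own assumptions (plain UTF-8 jsonl, one object per line; "largefile.jsonl" is just the placeholder name from above):

import json

# Count total lines, blank lines, and lines that fail to parse as JSON,
# to see whether the 38-line gap could come from empty or malformed lines.
total = blank = bad = 0
with open("largefile.jsonl", "r", encoding="utf-8", errors="replace") as f:
    for line in f:
        total += 1
        if not line.strip():
            blank += 1
            continue
        try:
            json.loads(line)
        except json.JSONDecodeError:
            bad += 1

print(f"total lines: {total}, blank: {blank}, unparsable: {bad}")
print(f"lines that look like valid JSON documents: {total - blank - bad}")

Would this be a reasonable way to verify, or is there a simpler explanation for the mismatch?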