4Tb assert #260

AskAlexSharov · 2022-01-13T10:26:25Z

Hi. It looks like mdbx hits the following assertion at the 4Tb threshold:
Assertion failed: ((pl)[1]) < limit (mdbx: mdbx_pnl_check: 6368)

erthink · 2022-01-13T11:54:21Z

Could you provide the coredump, or at least a stack backtrace?

AskAlexSharov · 2022-01-13T13:55:49Z

coredump will come tomorrow

flywukong · 2022-01-17T07:43:39Z

Hi, I tested this on bsc-erigon and got an error when the mdbx.dat file reached 4T. The error looks like the one below.

Then I noticed that you had fixed the issue and merged the commit into the devel branch, so I updated the Go package to the devel branch, recompiled erigon, and restarted it (continuing the sync), but I still get the errors below.

You can get the log here: https://transfer.toolsfdg.net/ySKaN/nohup.out. I wonder whether the issue is completely fixed.

erthink · 2022-01-17T08:17:25Z

@flywukong, this issue is not fixed yet, but I made some changes to dig into it.

The line number for this assertion differs in the current code. Please use the current devel or issue-260 branch for your test(s).
As I noted above, a coredump or at least a stack backtrace is required if the problem persists.
The link you provided to the log is inaccessible.

flywukong · 2022-01-17T08:38:59Z

@erthink thanks for your reply. It seems I have already switched the package to devel by running a command like "go get github.com/erthink/libmdbx@devel"; the go.mod file changed and the id is the latest commit id in the devel branch.

But the log shows that it is not the devel branch? I will try to use go replace to update the package instead.
You can see the log here:
log.zip

AskAlexSharov · 2022-01-17T09:34:34Z

@flywukong please don’t be confused - erigon and mdbx are not related projects. mdbx is a C language project and has no go.mod (you can’t “go get” it).

There are several steps to get another version of mdbx into erigon (if you need another version of erigon - better ask about it in erigon’s channel/repo).

There is a branch with the same name, “issue-260”, in erigon, with the right version of mdbx. What needs to be done now is to run it on the existing db and get a core dump. That core dump can be attached here.

flywukong · 2022-01-17T10:26:52Z

@AskAlexSharov thanks for your advice. So I think that, leaving the mdbx-go layer aside, if this part has not been fixed, erigon cannot work well. But I was testing on the bsc branch of erigon, so I wonder whether I can't just use “issue-260” of erigon; maybe I can merge the changes from this branch to fix the problem.

AskAlexSharov · 2022-01-17T11:25:46Z

It’s not a fix yet; it’s a debug branch to get a coredump, which will help us understand the root cause and fix it.

flywukong · 2022-01-17T11:59:14Z

@AskAlexSharov thanks. I am not sure whether this commit, 1813bf9, has solved the problem. It is merged into the devel branch, so I think maybe we can update the mdbx-go code to download and call this branch for testing. If the problem is solved in the test, we just need to wait for this commit to be merged into master of libmdbx. Syncing 4T of data from scratch to get a coredump would take too much time; this way may be faster.

erthink · 2022-01-17T12:46:00Z

@flywukong, AFAIK erigon uses a DB with the default 4K pages.
If so, then the mentioned commit is relevant to the issue, but does not fix it,
because with a 4K page size such an arithmetic overflow will happen when the size is 8T, not 4T.
For an overflow at 4T, 2K-sized pages would have to be used.

So the output of mdbx_chk -vv for your 4T DB will help to understand the situation, and if we see 2K-sized pages then we could assume that the issue is fixed.
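
As a rough illustration of this arithmetic (a sketch only, not libmdbx code): assuming the overflow point corresponds to 2^31 page numbers, the DB size at which it is reached is the page count times the page size. The helper name and the `SKETCH_PGNO_LIMIT` constant are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed limit for this sketch: 2^31 page numbers. */
#define SKETCH_PGNO_LIMIT (UINT64_C(1) << 31)

/* Hypothetical helper (not part of libmdbx): DB size in bytes at which
 * the page count reaches `pgno_limit` for the given page size. */
uint64_t db_size_at_limit(uint64_t pgno_limit, uint64_t page_size) {
  /* Guard against a zero page size and 64-bit overflow of the product. */
  assert(page_size > 0 && pgno_limit <= UINT64_MAX / page_size);
  return pgno_limit * page_size;
}
```

Under that assumption, `db_size_at_limit(SKETCH_PGNO_LIMIT, 4096)` gives 8T and `db_size_at_limit(SKETCH_PGNO_LIMIT, 2048)` gives 4T, matching the figures above.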

flywukong · 2022-01-18T02:33:12Z

@erthink thanks, the page size in our DB is also 4K.

By the way, our bug doesn't seem to be triggered by reaching 4T. When it reached 4T, there was an error indicating that the mapsize was not enough (the mapsize we configured is 4T). When I tried to adjust the mapsize, I kept restarting erigon, and that then triggered the core dump.

AskAlexSharov · 2022-01-18T05:31:26Z

@flywukong on which Erigon branch? If on "bsc" or "devel", try the "issue-260" branch. If you still see a crash, please send us the coredump. Thanks.

flywukong · 2022-01-18T07:24:17Z

@AskAlexSharov we are using the "bsc" branch, which has just had some merges with devel. I tried to merge the issue-260 commits from the "issue-260" branch yesterday, but there were some problems when compiling erigon. I will re-run the issue-260 branch directly.

AskAlexSharov · 2022-01-18T11:07:15Z

@erthink I have confirmation from 1 person that the “issue-260” branch solved the problem.

flywukong · 2022-01-18T11:30:12Z

@AskAlexSharov with the previous branch, the core dump occurred after less than 20 minutes of running. With this “issue-260” branch it ran for nearly three hours, but the core dump still occurs. For various reasons the core file was not generated successfully; I will continue to test until I get the core file.

AskAlexSharov · 2022-01-18T11:31:28Z

Thank you

flywukong · 2022-01-18T16:27:55Z

@AskAlexSharov The crash occurred again after the process ran for more than an hour, but the strange thing is that no core file was generated, even after I repeated the test twice. I'm sure the branch I'm using is the correct one, “issue-260”.

I have carefully checked and tested the core-file-related system configuration. It should be able to generate a core file normally.
Here is the related log file:
log (1).zip

AskAlexSharov · 2022-01-19T02:33:47Z

Try GOTRACEBACK=crash
See more: https://stackoverflow.com/questions/44430117/go-gotraceback-crash-with-no-core-file
https://stackoverflow.com/questions/45855414/unwind-stack-for-goroutine-in-gdb-for-a-golang-exes-core-dump

flywukong · 2022-01-19T03:42:03Z

@AskAlexSharov it works. The core file is 9G. I have sent it to your gmail; please check it.

AskAlexSharov · 2022-01-19T09:25:27Z

Thank you

erthink · 2022-01-19T11:08:49Z

@flywukong, the following files are required from your build and/or system to analyze the core(s):

/server/bsc-erigon/test-node/erigon
/usr/lib64/libnss_files-2.26.so
/usr/lib64/libc-2.26.so
/usr/lib64/libpthread-2.26.so
/usr/lib64/librt-2.26.so
/usr/lib64/ld-2.26.so

erthink · 2022-01-19T20:10:59Z

@flywukong, the circumstances are such that I need to address this issue in the very near future or postpone it for a long time.
Therefore, it would be nice if you provide the necessary files today, or provide remote ssh access for a debugging session using gdb (for details please contact me through the telegram group libmdbx).

flywukong · 2022-01-20T02:29:07Z

@AskAlexSharov @erthink ok, I will send these files to your emails today.

flywukong · 2022-01-20T07:27:33Z

@erthink @AskAlexSharov email sent. You can also download the files via this link: https://drive.google.com/file/d/1b-34gnU3JK4OfkE-wcvcDLk721ep1MVd/view

erthink · 2022-01-20T11:04:59Z

The backtrace:

...
#5  0x00007fd290c60c20 in raise () from /lib64/libc.so.6
#6  0x00007fd290c620c8 in abort () from /lib64/libc.so.6
#7  0x00007fd290c599ca in __assert_fail_base () from /lib64/libc.so.6
#8  0x00007fd290c59a42 in __assert_fail () from /lib64/libc.so.6
#9  0x00000000004061e3 in mdbx_assert_fail (env=<optimized out>, msg=<optimized out>, func=<optimized out>, line=<optimized out>) at mdbx.c:26342
#10 0x00000000012575d6 in mdbx_pnl_check (pl=pl@entry=0x73d157796014, limit=limit@entry=2147483648) at mdbx.c:6367
#11 0x0000000000421543 in mdbx_pnl_sort (pnl=0x73d157796014) at mdbx.c:6477
#12 0x0000000001269189 in mdbx_txn_spill (txn=<optimized out>, m0=m0@entry=0x7fd22c104de0, need=89) at mdbx.c:8773
#13 0x0000000001269ebf in mdbx_cursor_spill (mc=mc@entry=0x7fd22c104de0, key=key@entry=0x7fd244ff8df0, data=<optimized out>) at mdbx.c:8838
#14 0x000000000127f60c in mdbx_cursor_put (mc=0x7fd22c104de0, key=key@entry=0x7fd244ff8df0, data=data@entry=0x7fd244ff8e00, flags=131072) at mdbx.c:18422
#15 0x000000000128d017 in mdbxgo_cursor_put2 (cur=<optimized out>, kdata=<optimized out>, kn=<optimized out>, vdata=<optimized out>, vn=<optimized out>, flags=<optimized out>) at mdbxgo.c:61
#16 0x000000000125444b in _cgo_afc3699e7033_Cfunc_mdbxgo_cursor_put2 (v=0xc05014abb8) at cgo-gcc-prolog:272
...

erthink · 2022-01-20T11:32:08Z

The problem arises from an overly strict check of the PNL of pages-to-spill, which holds left-shifted page numbers.

So this bug is triggered only in DEBUG builds or when assertion checking is forcibly enabled.
It does not affect any core logic and cannot lead to DB corruption, data loss, and so on.

Hopefully I'll fix it today, but as a temporary workaround you can just use non-DEBUG builds without the -DMDBX_FORCE_ASSERTIONS option.
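
The mechanism described here can be illustrated with a small sketch. It is not libmdbx code: the function names are hypothetical, and it assumes 4K pages, a limit of 2^31 page numbers (the `limit=2147483648` from the backtrace above), and spill-list entries stored as `pgno << 1`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a page-number-list bound check; this is NOT the
 * real mdbx_pnl_check(). Returns true if every entry is below `limit`. */
bool sketch_pnl_in_range(const uint32_t *pl, size_t n, uint64_t limit) {
  assert(pl != NULL || n == 0);
  for (size_t i = 0; i < n; i++)
    if (pl[i] >= limit)
      return false;
  return true;
}

/* Assumption of this sketch: a spill-list entry is the page number
 * shifted left by one bit. */
uint32_t sketch_spill_entry(uint32_t pgno) {
  assert(pgno <= (UINT32_MAX >> 1)); /* the shift must not lose the top bit */
  return pgno << 1;
}
```

With 4K pages, a 4T database reaches page number 2^30. Its shifted entry is 2^31, which fails a check against the unshifted limit of 2^31 even though the page number itself is valid, while the same entry passes a check against a limit shifted the same way (2^32 here).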

AskAlexSharov · 2022-01-20T12:33:16Z

nice.
@flywukong I created the branch "mdbx_no_assert", which should fix the 4Tb issue; please try it.

flywukong · 2022-01-21T03:17:13Z

@AskAlexSharov ok, it is being tested now.

easeev · 2022-01-21T03:22:42Z

Two nodes are running with 4TB+ ledger on this branch without crashes so far

koen84 · 2022-01-21T12:12:45Z

I got past the crashloop and reached chainhead with mdbx_no_assert branch.

AskAlexSharov · 2022-01-23T06:42:26Z

I have 1 report that the latest mdbx master with assertions enabled still asserts at 4Tb.

erigon: mdbx:6375: mdbx_pnl_check: Assertion `((pl)[1]) < limit' failed.

mdbx src: https://github.com/torquem-ch/mdbx-go/blob/v0.22.6/mdbx/mdbx.c
erigon’s branch: ledgerwatch/erigon#3324

erthink · 2022-01-23T09:09:58Z

Earlier I reproduced the previous case by internally overriding MAX_PAGENO (to reduce the required DB size, i.e. the required RAM and disk space), and the test of the provided fix is still running successfully in a continuous loop.

This seems to be another case that I am unable to reproduce yet.
So a backtrace and/or coredump is needed.

flywukong · 2022-01-23T11:08:31Z

@AskAlexSharov It has run for 60 hours with no problems on the mdbx_no_assert branch; my data has reached 4.3T.

AskAlexSharov · 2022-01-23T11:13:25Z

@flywukong because no assert means “disabled asserts” :-) Here is the branch where I switched to the latest mdbx and enabled asserts: ledgerwatch/erigon#3324 (likely you will get the error there). You are running ok because the bug is not in the mdbx logic but in the assert (invariant check) logic.

erthink · 2022-01-23T14:52:39Z

Please provide stack backtrace, core dump or ssh access for remote debugging.

flywukong · 2022-01-24T07:29:21Z

@AskAlexSharov I can also see that "disabled asserts" is merged into the devel branch. Should I revert the changes from this commit before running?

AskAlexSharov · 2022-01-24T07:59:13Z

@flywukong it depends on what you need - if you need a working version, just use devel without any actions. If you need to create a coredump on crash, use ledgerwatch/erigon#3324 (without any actions).

erthink · 2022-01-24T11:44:07Z

To clarify the current status:

It looks like we have at least two cases of this issue.
I reproduced the first case, fixed it and checked it: both the fact that the particular issue was present and the fact that it has been fixed.
As I noted, the cause of the first case was a bug in the PNL (page number list) checking code, not in the core logic. So it is safe just to disable assertion checking to avoid exactly and only this case.
Besides the first case, I found and fixed a minor bug due to which the page with the maximum number (0x7FFFffff) could not be used. This fix has now also been checked by tests.
The cause of the second case is unknown for now. Therefore, it cannot be said that it is safe to disable the assertions for it.
I'm still waiting for a backtrace, core dump or remote ssh+gdb access to investigate the second case.

flywukong · 2022-01-25T09:50:06Z

@AskAlexSharov @erthink I got a core file after disable assertions
https://drive.google.com/file/d/1aKog0n9Su1-w-DydTFE7x-6UpdDapUHi/view?usp=sharing

Here are the lib files:
https://drive.google.com/file/d/1BbkB29cQpbTvz7OuY0HrSxCj-L2ODdUW/view?usp=sharing

I used the devel branch and ran git revert on that commit.

erthink · 2022-01-25T10:17:52Z

> @AskAlexSharov @erthink I got a core file after disable assertions https://drive.google.com/file/d/1aKog0n9Su1-w-DydTFE7x-6UpdDapUHi/view?usp=sharing
>
> here is lib files: https://drive.google.com/file/d/1BbkB29cQpbTvz7OuY0HrSxCj-L2ODdUW/view?usp=sharing

No access is granted for these files.

> I used devel branch and git revert to these commit.

But why?
To investigate this issue I need a coredump from the current master branch of libmdbx with assertion checks enabled.

flywukong · 2022-01-25T10:47:59Z

@erthink you can download them now; the link permissions have been updated.

I mean the devel branch of erigon, not libmdbx. The erigon commit I took should have enabled the current master branch of libmdbx with assertion checks enabled. You can check this in ledgerwatch/erigon#3324.

erthink · 2022-01-25T11:24:20Z

@flywukong, the /server/bsc-erigon/test-node/erigon is absent.

flywukong · 2022-01-25T12:50:25Z

@erthink sorry , here is erigon
https://drive.google.com/file/d/1uEsHU6Q28M3DdABSjzzJrTPHbUqw_G99/view?usp=sharing

erthink · 2022-01-25T13:26:19Z

The backtrace of the last coredump:

#9  0x0000000000405d39 in mdbx_assert_fail (msg=<optimized out>, func=<optimized out>, line=<optimized out>, env=0x0) at mdbx.c:26368
#10 0x00000000004203e5 in mdbx_pnl_check (limit=<optimized out>, pl=<optimized out>) at mdbx.c:6374
#11 mdbx_pnl_check4assert (limit=<optimized out>, pl=<optimized out>, pl@entry=0x332b010) at mdbx.c:6401
#12 mdbx_pnl_search (pnl=pnl@entry=0x734349c967d4, pgno=pgno@entry=2032894574) at mdbx.c:6494
#13 0x0000000000422b6f in mdbx_pnl_exist (pgno=2032894574, pnl=0x734349c967d4) at mdbx.c:6507
#14 mdbx_page_get_ex (front=897, pgno=1016447287, mc=0x7f449c096120) at mdbx.c:16693
#15 mdbx_page_get (front=897, mp=<synthetic pointer>, pgno=1016447287, mc=0x7f449c096120) at mdbx.c:7041
#16 mdbx_page_search_root (mc=mc@entry=0x7f449c096120, key=key@entry=0x7f44b5c40c20, flags=flags@entry=0) at mdbx.c:16802
#17 0x00000000004231e2 in mdbx_page_search (mc=mc@entry=0x7f449c096120, key=key@entry=0x7f44b5c40c20, flags=flags@entry=0) at mdbx.c:17012
#18 0x0000000001277312 in mdbx_cursor_set (mc=mc@entry=0x7f449c096120, key=key@entry=0x7f44b5c40dd0, data=data@entry=0x7f44b5c40d10, op=op@entry=MDBX_SET) at mdbx.c:17536
#19 0x0000000001281341 in mdbx_cursor_put (mc=0x7f449c096120, key=key@entry=0x7f44b5c40dd0, data=data@entry=0x7f44b5c40de0, flags=16) at mdbx.c:18374

erthink · 2022-01-25T13:35:47Z

The last stack backtrace shows the same bug as noted above, but in another execution path. So we can ignore it by disabling assertion checking.

However, I need to understand why the problem was not reproduced in the tests, improve them so that they reproduce this case, and only then fix the issue.

erthink · 2022-01-26T12:45:00Z

I think the issue has been fixed completely and the code is ready for testing.

I also found out the reason why the second case was not reproduced by the tests.
Briefly, the test cases were "too stochastic", giving too low a probability of some states and of transitions between them within the narrowed page-number range configuration, which is required for testing this issue on hardware with less than 512 Gb RAM.

In particular, the tests used earlier were more likely to end due to exhaustion of the available range of pages before enough stochastic iterations were performed using more than 50% of the page range, which is required to reproduce the problem.
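
A small sketch of why more than half of the page range must be in use, and why narrowing the range helps on smaller machines. It is only an illustration with hypothetical names, assuming 4K pages and that the failing check compares page numbers shifted left by one bit against the unshifted limit.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest page number whose left-shifted value reaches `limit`
 * (hypothetical helper, not libmdbx code). */
uint64_t sketch_first_failing_pgno(uint64_t limit) {
  assert(limit > 0);
  return limit / 2 + limit % 2; /* ceil(limit / 2) without overflow */
}

/* Approximate DB size in bytes needed before that page number is in use. */
uint64_t sketch_db_size_to_reproduce(uint64_t limit, uint64_t page_size) {
  assert(page_size > 0);
  return sketch_first_failing_pgno(limit) * page_size;
}
```

For the full range of 2^31 page numbers this gives 4T, matching the threshold in this issue; for a hypothetical narrowed range of 2^20 page numbers it gives 2 GiB.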

AskAlexSharov · 2022-01-26T15:24:42Z

@flywukong I created mdbx_4tb_fix branch in erigon's repo. Feel free to try.

erthink · 2022-01-27T18:40:42Z

Any new info?

AskAlexSharov · 2022-01-28T04:29:40Z

I have 1 confirmation that the issue is fixed.

allada mentioned this issue Jan 21, 2022

erigon: mdbx:6368: mdbx_pnl_check: Assertion `((pl)[1]) < limit' failed bnb-chain/bsc-erigon#17
