4Tb assert #260
Comments
Could you provide the coredump, or at least a stack backtrace? |
coredump will come tomorrow |
Hi, I tested it on bsc-erigon and got an error when the mdbx.dat file reached 4T. The errors look like below:
|
@flywukong, this issue is not fixed yet, but I made some changes to help dig into it.
|
@erthink thanks for your reply. It seems I have already switched the package to devel by running a command like "go get github.com/erthink/libmdbx@devel": the go.mod file changed, and the id is the latest commit id in the devel branch. But the log shows that it is not the devel branch? I will try to use go replace to update the package instead of this way. |
@flywukong please don’t be confused: erigon and mdbx are not related projects. mdbx is a C language project and has no go.mod (you can’t “go get” it). There are several steps to get another version of mdbx into erigon (if you need another version of erigon, better ask about it in erigon’s channel/repo). There is a branch with the same name in erigon, “issue-260”, with the right version of mdbx. What needs to be done now: run it on the existing db and get a core dump. That core dump can be attached here. |
@AskAlexSharov thanks for your advice. So I think I can ignore the mdbx-go level; if this part has not been fixed, erigon cannot work well. But I was testing on the bsc branch of erigon, so I suppose I cannot just use “issue-260” of erigon. Maybe I can merge the changes from that branch to fix the problem. |
It’s not a fix yet, it’s a debug branch to get a coredump, which will help us understand the root cause and fix it. |
@AskAlexSharov thanks. I am not sure whether this commit, 1813bf9, has solved the problem; it is merged into the devel branch. I think we could update the mdbx-go code to download and call this branch for testing. If the problem is solved in testing, we just need to wait for this commit to be merged into master of libmdbx. Syncing 4T of data from scratch to get a coredump would take too much time, so this way may be faster. |
@flywukong, AFAIK erigon uses DB with default 4K pages. So the output of |
@erthink thanks , the pagesize in our DB is also 4K |
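For readers following the numbers: a short sketch of how many pages a 4T database holds with 4K pages. It assumes "4T" means 4 TiB (binary units) and uses a hypothetical helper `page_count`; neither is taken from libmdbx itself.

```python
PAGE_SIZE = 4 * 1024       # 4K pages, as confirmed in this thread
DB_SIZE = 4 * 1024**4      # assumption: "4T" is read as 4 TiB


def page_count(db_bytes, page_size):
    """Number of whole pages that fit into db_bytes (hypothetical helper)."""
    return db_bytes // page_size


pages = page_count(DB_SIZE, PAGE_SIZE)
print(pages)            # 1073741824
print(pages == 2**30)   # True
```

Under these assumptions the page numbers reach 2^30 right at the size where the assert was reported.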
@flywukong on which Erigon branch? If on "bsc" or "devel", try the "issue-260" branch. If you still see the crash, please send us the coredump. Thanks. |
@AskAlexSharov we are using the "bsc" branch, which has just merged in some of devel. Yesterday I tried to merge the issue-260 commits from the "issue-260" branch, but there were some problems when compiling erigon. I will re-run the issue-260 branch directly. |
@erthink I have confirmation from 1 person that the “issue-260” branch solved the problem. |
@AskAlexSharov the previous branch ran for less than 20 minutes before the crash occurred. After I ran this “issue-260” branch for nearly three hours, the crash still occurred. For various reasons the core file was not generated successfully, and I will continue to test until I get the core file. |
Thank you |
@AskAlexSharov The crash occurred again after the process ran for more than an hour, but the strange thing is that no core file was generated, even after I repeated the test twice. I'm sure the branch I'm using is the correct one, “issue-260”. I have also carefully checked the corefile-related system configuration and tested it; it should be able to generate a core normally. |
@AskAlexSharov it works, the corefile is 9G. I have sent it to your gmail, please check it. |
Thank you |
@flywukong, the following files are required from your build and/or system to analyze the core(s):
|
@flywukong, the circumstances are such that I need to address this issue in the very near future or postpone it for a long time. |
@AskAlexSharov @erthink ok, I will send these files to your emails today. |
@erthink @AskAlexSharov email sent; you can also download via this link: https://drive.google.com/file/d/1b-34gnU3JK4OfkE-wcvcDLk721ep1MVd/view |
The backtrace:
|
The problem arises from an excessive/too-strict check of the PNL of pages-to-spill, which holds left-shifted page numbers. So this bug is triggered only in DEBUG builds or when assertion checking is forcibly enabled. Hopefully I'll fix it today, but as a temporary workaround you can just use non-DEBUG and without the |
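To illustrate the explanation above, here is a minimal model of a too-strict bound check applied to left-shifted page numbers. Everything in it is an assumption made for illustration: the limit value `PAGE_LIMIT`, the one-bit shift in `spill_entry`, and the function names are hypothetical and are not libmdbx code. The `shift_aware_check` variant is just one possible way to relax the bound, not necessarily how the fix was done.

```python
PAGE_LIMIT = 2**31  # assumed page-number limit, for illustration only


def spill_entry(pgno):
    """Assumption: the spill list stores page numbers shifted left by one bit."""
    return pgno << 1


def strict_check(pnl, limit=PAGE_LIMIT):
    """Models a check like `pl[1] < limit` applied to shifted entries."""
    return all(entry < limit for entry in pnl)


def shift_aware_check(pnl, limit=PAGE_LIMIT):
    """Same check, but with the limit scaled to match the shifted entries."""
    return all(entry < (limit << 1) for entry in pnl)


below_4t = [spill_entry(2**30 - 1)]  # last page below 4 TiB with 4K pages
at_4t = [spill_entry(2**30)]         # first page at 4 TiB with 4K pages

print(strict_check(below_4t))      # True
print(strict_check(at_4t))         # False: the assert would fire here
print(shift_aware_check(at_4t))    # True
```

In this model the data itself is valid in both cases; only the invariant check is wrong, which matches the observation that non-DEBUG builds keep working.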
nice. |
@AskAlexSharov ok , it is testing now |
Two nodes are running with 4TB+ ledger on this branch without crashes so far |
I got past the crashloop and reached chainhead with |
I have 1 report that the latest mdbx master with asserts enabled still asserts at 4Tb. erigon: mdbx:6375: mdbx_pnl_check: Assertion `((pl)[1]) < limit' failed. mdbx src: https://github.com/torquem-ch/mdbx-go/blob/v0.22.6/mdbx/mdbx.c |
Earlier I reproduced the previous case by internally overriding … It seems this is another case that I am unable to reproduce yet. |
@AskAlexSharov It has run for 60 hours with no problems on the mdbx_no_assert branch; my data has reached 4.3T. |
@flywukong because no assert means “disabled asserts” :-) Here is the branch where I switched to the latest mdbx and enabled asserts: ledgerwatch/erigon#3324 (likely you will get the error here). You are running ok because the bug is not in mdbx logic but in the assert (invariant check) logic. |
Please provide stack backtrace, core dump or ssh access for remote debugging. |
@AskAlexSharov I can also see that "disabled asserts" is merged into the devel branch. |
@flywukong depends on what you need: if you need a working version, just use devel without any actions. If you need to create a coredump on crash, use ledgerwatch/erigon#3324 (without any actions). |
To clarify the current status:
|
@AskAlexSharov @erthink I got a core file after disabling assertions; here are the lib files: |
No access granted for these files.
But why? |
@erthink you can download now, the link permissions have been updated. I mean the devel branch of erigon, not libmdbx. The erigon commit that I took should have enabled the current master branch of libmdbx with assertion checks enabled. You can check this in ledgerwatch/erigon#3324 |
@flywukong, the |
@erthink sorry , here is erigon |
The backtrace of the last coredump:
|
The last stack backtrace shows the same bug as noted above but in another execution path. So we can ignore it by disabling assertion checking. However, I need to understand why the problem was not reproduced in the tests, improve them so that this case is reproducible, and only then fix the issue. |
I think the issue has been fixed completely and the code is ready for testing. I also found out why the second case was not reproduced by the tests. In particular, the tests used earlier were likely to end due to exhaustion of the available range of pages before a sufficient number of stochastic iterations were performed using more than 50% of the page range, which is required to reproduce the problem. |
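A toy model of why a stochastic test may miss this case unless it uses more than half of the page range. It assumes the failing condition is a one-bit left-shifted page number reaching the limit, which is one plausible reading of the earlier explanation and not confirmed here; `run_test`, its parameters, and the small `limit` are hypothetical.

```python
import random


def run_test(used_pages, limit, iterations, seed=0):
    """Return True if any randomly touched page would trip a strict
    `shifted < limit` check; a hypothetical stand-in for a stochastic test."""
    rng = random.Random(seed)
    for _ in range(iterations):
        pgno = rng.randrange(used_pages)
        if (pgno << 1) >= limit:
            return True
    return False


LIMIT = 1024  # small illustrative page range

# Using only half of the range: the check can never trip, however long we run.
print(run_test(used_pages=512, limit=LIMIT, iterations=10_000))  # False

# Using the whole range: about half of the picks land above 50% of the range.
print(run_test(used_pages=1024, limit=LIMIT, iterations=1_000))
```

In this model a test that runs out of pages or stops early, before touching the upper half of the range, reports success even though the check is too strict.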
@flywukong I created |
Any new info? |
I have 1 confirmation that issue fixed |
AskAlexSharov commented Jan 13, 2022
Hi. It looks like at the 4Tb threshold mdbx hits the following assert:
Assertion failed: ((pl)[1]) < limit (mdbx: mdbx_pnl_check: 6368)