This repository has been archived by the owner. It is now read-only.

4Tb assert #260

Closed
AskAlexSharov opened this issue Jan 13, 2022 · 49 comments

@AskAlexSharov
Contributor

AskAlexSharov commented Jan 13, 2022

Hi. It looks like mdbx hits the following assertion at the 4Tb threshold:
Assertion failed: ((pl)[1]) < limit (mdbx: mdbx_pnl_check: 6368)

@erthink
Owner

erthink commented Jan 13, 2022

Could you provide the coredump, or at least a stack backtrace?

@AskAlexSharov
Contributor Author

AskAlexSharov commented Jan 13, 2022

coredump will come tomorrow

@flywukong

flywukong commented Jan 17, 2022

Hi, I tested it on bsc-erigon and got an error when the mdbx.dat file reached 4T. The errors look like the ones below:
image

image
Then I noticed that you had fixed the issue and merged the commit into the devel branch, so I updated the Go package to the devel branch, recompiled erigon, and restarted it (continuing the sync), but I still get the errors below.
image
You can get the log here: https://transfer.toolsfdg.net/ySKaN/nohup.out. I wonder whether the issue is completely fixed.

@erthink
Owner

erthink commented Jan 17, 2022

@flywukong, this issue is not fixed yet, but I made some changes to dig into it.

  1. The line number for this assertion differs in the current code. Please use the current devel or issue-260 branch for your test(s).
  2. As I noted above, the coredump or at least a stack backtrace is required if the problem persists.
  3. The link you provided to the log is inaccessible.

@flywukong

flywukong commented Jan 17, 2022

@erthink thanks for your reply. I seem to have already switched the package to devel by running a command like "go get github.com/erthink/libmdbx@devel"; the go.mod file changed, and the id is the latest commit id in the devel branch.
image

But the log shows that it is not the devel branch? I will try to use go replace to update the package instead.
You can see the log here:
log.zip

@AskAlexSharov
Contributor Author

AskAlexSharov commented Jan 17, 2022

@flywukong please don’t be confused: erigon and mdbx are not related projects. mdbx is a C project and has no go.mod (you can’t “go get” it).

There are several steps to get another version of mdbx into erigon (if you need another version of erigon, better ask about it in erigon’s channel/repo).

There is a branch of the same name in erigon, “issue-260”, with the right version of mdbx. What needs to be done now is to run it on the existing db and get a core dump. That core dump can be attached here.

@flywukong

flywukong commented Jan 17, 2022

@AskAlexSharov thanks for your advice. So I think that, leaving the mdbx-go level aside, erigon cannot work well until this part is fixed. But I was testing on the bsc branch of erigon, so I wonder whether I can't just use “issue-260” of erigon; maybe I can merge the changes from that branch to fix the problem.

@AskAlexSharov
Contributor Author

AskAlexSharov commented Jan 17, 2022

> @AskAlexSharov thanks for your advice. So I think that, leaving the mdbx-go level aside, erigon cannot work well until this part is fixed. But I was testing on the bsc branch of erigon, so I wonder whether I can't just use “issue-260” of erigon; maybe I can merge the changes from that branch to fix the problem.

It’s not a fix yet; it’s a debug branch to get a coredump, which will help us understand the root cause and fix it.

@flywukong

flywukong commented Jan 17, 2022

@AskAlexSharov thanks. I am not sure whether this commit, 1813bf9, has solved the problem. It is merged into the devel branch, so I think maybe we can update the mdbx-go code to download and call this branch for testing. If the problem is solved in the test, we just need to wait for this commit to be merged into master of libmdbx. Syncing 4T of data from scratch to get a coredump would take too much time; this way may be faster.

@erthink
Owner

erthink commented Jan 17, 2022

@flywukong, AFAIK erigon uses a DB with the default 4K pages.
If so, then the mentioned commit is relevant to the issue, but does not fix it,
because with a 4K page size such an arithmetic overflow will happen when the size is 8T, not 4T.
For an overflow at 4T, 2K-sized pages would have to be used.

So the output of mdbx_chk -vv for your 4T DB will help to understand the situation, and if we see 2K-sized pages then we could assume that the issue is fixed.
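
The arithmetic in the comment above can be sketched as follows. This is only an illustration, assuming the overflow in question occurs when the page count reaches 2^31 (a value chosen because it reproduces the 8T and 4T figures quoted above); `PAGE_COUNT_LIMIT` and `size_at_page_limit` are hypothetical names, not libmdbx identifiers.

```c
#include <stdint.h>

/* Assumption: the overflow is reached at 2^31 pages. */
#define PAGE_COUNT_LIMIT UINT64_C(2147483648)

/* DB size in bytes at which the page count reaches PAGE_COUNT_LIMIT. */
uint64_t size_at_page_limit(uint64_t page_size) {
  return PAGE_COUNT_LIMIT * page_size;
}
```

Under this assumption, `size_at_page_limit(4096)` is 8 TiB and `size_at_page_limit(2048)` is 4 TiB, which is why a 4K-page DB failing at 4T does not fit the commit being discussed.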

@flywukong

flywukong commented Jan 18, 2022

@erthink thanks, the page size in our DB is also 4K
image
By the way, our bug doesn't seem to be triggered by 4T. When it reached 4T, there was an error indicating that the mapsize was not enough (the mapsize we configured is 4T). When I tried to adjust the mapsize, I kept restarting erigon, and that then triggered the core dump.

@AskAlexSharov
Contributor Author

AskAlexSharov commented Jan 18, 2022

@flywukong on which Erigon branch? If on "bsc" or "devel", try the "issue-260" branch. If you still see the crash, please send us the coredump. Thanks.

@flywukong

flywukong commented Jan 18, 2022

@AskAlexSharov we are using the "bsc" branch, which was recently merged with some of devel. I tried to merge the issue-260 commits from the "issue-260" branch yesterday, but there were some problems when compiling erigon. I will re-run the issue-260 branch directly.

@AskAlexSharov
Contributor Author

AskAlexSharov commented Jan 18, 2022

@erthink I have confirmation from one person that the “issue-260” branch solved the problem.

@flywukong

flywukong commented Jan 18, 2022

@AskAlexSharov with the previous branch, the core dump occurred after less than 20 minutes. With this “issue-260” branch it ran for nearly three hours, but the core dump still occurs. For various reasons the core file was not generated successfully; I will continue testing until I get the core file.

@AskAlexSharov
Contributor Author

AskAlexSharov commented Jan 18, 2022

Thank you

@flywukong

flywukong commented Jan 18, 2022

@AskAlexSharov The crash occurred again after the process had run for more than an hour, but the strange thing is that no core file was generated, even after I repeated the test twice. I'm sure the branch I'm using is the correct one, “issue-260”.

I have also carefully checked and tested the core-file-related system configuration. It should be able to generate a core file normally.
Here is the related log file:
log (1).zip

@flywukong

flywukong commented Jan 19, 2022

@AskAlexSharov it works; the core file is 9G. I have sent it to your Gmail, please check it.

@AskAlexSharov
Contributor Author

AskAlexSharov commented Jan 19, 2022

Thank you

@erthink
Owner

erthink commented Jan 19, 2022

@flywukong, the following files are required from your build and/or system to analyze the core(s):

/server/bsc-erigon/test-node/erigon
/usr/lib64/libnss_files-2.26.so
/usr/lib64/libc-2.26.so
/usr/lib64/libpthread-2.26.so
/usr/lib64/librt-2.26.so
/usr/lib64/ld-2.26.so

@erthink
Owner

erthink commented Jan 19, 2022

@flywukong, the circumstances are such that I need to address this issue in the very near future or postpone it for a long time.
Therefore, it would be nice if you provide the necessary files today, or provide remote ssh access for a debugging session using gdb (for details please contact me through the telegram group libmdbx).

@flywukong

flywukong commented Jan 20, 2022

@AskAlexSharov @erthink OK, I will send these files to your emails today.

@flywukong

flywukong commented Jan 20, 2022

@erthink @AskAlexSharov email sent; you can also download via this link: https://drive.google.com/file/d/1b-34gnU3JK4OfkE-wcvcDLk721ep1MVd/view

@erthink
Owner

erthink commented Jan 20, 2022

The backtrace:

...
#5  0x00007fd290c60c20 in raise () from /lib64/libc.so.6
#6  0x00007fd290c620c8 in abort () from /lib64/libc.so.6
#7  0x00007fd290c599ca in __assert_fail_base () from /lib64/libc.so.6
#8  0x00007fd290c59a42 in __assert_fail () from /lib64/libc.so.6
#9  0x00000000004061e3 in mdbx_assert_fail (env=<optimized out>, msg=<optimized out>, func=<optimized out>, line=<optimized out>) at mdbx.c:26342
#10 0x00000000012575d6 in mdbx_pnl_check (pl=pl@entry=0x73d157796014, limit=limit@entry=2147483648) at mdbx.c:6367
#11 0x0000000000421543 in mdbx_pnl_sort (pnl=0x73d157796014) at mdbx.c:6477
#12 0x0000000001269189 in mdbx_txn_spill (txn=<optimized out>, m0=m0@entry=0x7fd22c104de0, need=89) at mdbx.c:8773
#13 0x0000000001269ebf in mdbx_cursor_spill (mc=mc@entry=0x7fd22c104de0, key=key@entry=0x7fd244ff8df0, data=<optimized out>) at mdbx.c:8838
#14 0x000000000127f60c in mdbx_cursor_put (mc=0x7fd22c104de0, key=key@entry=0x7fd244ff8df0, data=data@entry=0x7fd244ff8e00, flags=131072) at mdbx.c:18422
#15 0x000000000128d017 in mdbxgo_cursor_put2 (cur=<optimized out>, kdata=<optimized out>, kn=<optimized out>, vdata=<optimized out>, vn=<optimized out>, flags=<optimized out>) at mdbxgo.c:61
#16 0x000000000125444b in _cgo_afc3699e7033_Cfunc_mdbxgo_cursor_put2 (v=0xc05014abb8) at cgo-gcc-prolog:272
...

@erthink
Owner

erthink commented Jan 20, 2022

The problem arises from excessive/too-strict checking of the PNL of pages-to-spill, which holds left-shifted numbers.

So this bug is triggered only in DEBUG builds or when assertion checking is forcibly enabled.
It does not affect any core logic and cannot lead to DB corruption, data loss, and so on.

Hopefully I'll fix it today, but as a temporary workaround you can just use non-DEBUG builds without the -DMDBX_FORCE_ASSERTIONS option.
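
A minimal sketch of why a check on left-shifted page numbers can trip at 4 TB with 4K pages. It assumes a one-bit shift (consistent with the later backtrace in this thread, where pgno=1016447287 is searched as 2032894574) and the limit=2147483648 from the backtrace above. The function names are hypothetical, and this is neither the libmdbx code nor its fix.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: spilled page numbers are stored shifted left by one bit. */
#define SPILL_SHIFT 1

/* Too-strict check: compares the stored (shifted) value with a limit
 * that is expressed in plain, unshifted page numbers. */
bool strict_check(uint32_t stored, uint64_t limit) {
  return stored < limit;
}

/* Shift-aware variant: undo the shift before comparing. */
bool shift_aware_check(uint32_t stored, uint64_t limit) {
  return (uint64_t)(stored >> SPILL_SHIFT) < limit;
}

/* Smallest page number whose shifted form fails strict_check. */
uint64_t first_failing_pgno(uint64_t limit) {
  return (limit + ((UINT64_C(1) << SPILL_SHIFT) - 1)) >> SPILL_SHIFT;
}
```

With limit = 2147483648, `first_failing_pgno` gives 1073741824 (2^30); at 4096 bytes per page that is 4 TiB, which matches where the assertion was reported. The same page number passes `shift_aware_check`, which illustrates why the failure is in the check rather than in the data.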

@AskAlexSharov
Contributor Author

AskAlexSharov commented Jan 20, 2022

Nice.
@flywukong I created the branch "mdbx_no_assert", which should fix the 4Tb issue. Please try it.

@flywukong

flywukong commented Jan 21, 2022

@AskAlexSharov OK, it is being tested now.

@easeev

easeev commented Jan 21, 2022

> Nice. @flywukong I created the branch "mdbx_no_assert", which should fix the 4Tb issue. Please try it.

Two nodes are running with a 4TB+ ledger on this branch without crashes so far.

@koen84

koen84 commented Jan 21, 2022

I got past the crash loop and reached chain head with the mdbx_no_assert branch.

@AskAlexSharov
Contributor Author

AskAlexSharov commented Jan 23, 2022

I have one report that the latest mdbx master with asserts enabled is still asserting at 4Tb.

erigon: mdbx:6375: mdbx_pnl_check: Assertion `((pl)[1]) < limit' failed.

mdbx src: https://github.com/torquem-ch/mdbx-go/blob/v0.22.6/mdbx/mdbx.c
erigon’s branch: ledgerwatch/erigon#3324

@erthink
Owner

erthink commented Jan 23, 2022

Earlier I reproduced the previous case by internally overriding MAX_PAGENO (to reduce the required DB size, i.e. the required RAM and disk space), and the test of the provided fix is still running successfully in a continuous loop.

It seems this is another case, which I have been unable to reproduce yet.
So the backtrace and/or coredump is needed.

@flywukong

flywukong commented Jan 23, 2022

@AskAlexSharov It has run for 60 hours with no problems on the mdbx_no_assert branch; my data has reached 4.3T.

@AskAlexSharov
Contributor Author

AskAlexSharov commented Jan 23, 2022

@flywukong because no assert means “disabled asserts” :-) Here is the branch where I switched to the latest mdbx and enabled asserts: ledgerwatch/erigon#3324 (you will likely get the error there). You are running OK because the bug is not in the mdbx logic but in the assert (invariant check) logic.

@erthink
Owner

erthink commented Jan 23, 2022

Please provide stack backtrace, core dump or ssh access for remote debugging.

@flywukong

flywukong commented Jan 24, 2022

@AskAlexSharov I can also see that "disabled asserts" has been merged into the devel branch
image
Should I revert the changes from this commit before running?

@AskAlexSharov
Contributor Author

AskAlexSharov commented Jan 24, 2022

@flywukong it depends on what you need. If you need a working version, just use devel without any changes. If you need to create a coredump on the crash, use ledgerwatch/erigon#3324 (without any changes).

@erthink
Owner

erthink commented Jan 24, 2022

To clarify the current status:

  • It looks like we have at least two cases of this issue.
  • I reproduced the first case, fixed it, and verified both that the particular issue was present and that it has been fixed.
  • As I noted, the cause of the first case was a bug in the PNL (page number list) checking code, not in the core logic. So it is safe to just disable assertion checking to avoid exactly (and only) this case.
  • Besides the first case, I found and fixed a minor bug due to which the page with the maximum number (0x7FFFffff) could not be used. This fix has now also been checked by tests.
  • The cause of the second case is unknown for now. Therefore, it cannot be said that it is safe to disable the assertions for it.
  • I'm still waiting for a backtrace, core dump or remote ssh+gdb access to investigate the second case.

@flywukong

flywukong commented Jan 25, 2022

@AskAlexSharov @erthink I got a core file after disable assertions
https://drive.google.com/file/d/1aKog0n9Su1-w-DydTFE7x-6UpdDapUHi/view?usp=sharing

Here are the lib files:
https://drive.google.com/file/d/1BbkB29cQpbTvz7OuY0HrSxCj-L2ODdUW/view?usp=sharing

I used the devel branch and git-reverted these commits:
image

@erthink
Owner

erthink commented Jan 25, 2022

> @AskAlexSharov @erthink I got a core file after disable assertions https://drive.google.com/file/d/1aKog0n9Su1-w-DydTFE7x-6UpdDapUHi/view?usp=sharing
>
> here is lib files: https://drive.google.com/file/d/1BbkB29cQpbTvz7OuY0HrSxCj-L2ODdUW/view?usp=sharing

No access has been granted to these files.

> I used devel branch and git revert to these commit.

But why?
To dig into this issue I need a coredump from the current master branch of libmdbx with assertion checks enabled.

@flywukong

flywukong commented Jan 25, 2022

@erthink you can download now; the link permissions have been updated.

I mean the devel branch of erigon, not libmdbx. The erigon commits I used should have enabled the current master branch of libmdbx with assertion checks enabled. You can check this in ledgerwatch/erigon#3324

@erthink
Owner

erthink commented Jan 25, 2022

@flywukong, the /server/bsc-erigon/test-node/erigon file is missing.

@flywukong

flywukong commented Jan 25, 2022

@erthink
Owner

erthink commented Jan 25, 2022

The backtrace of the last coredump:

#9  0x0000000000405d39 in mdbx_assert_fail (msg=<optimized out>, func=<optimized out>, line=<optimized out>, env=0x0) at mdbx.c:26368
#10 0x00000000004203e5 in mdbx_pnl_check (limit=<optimized out>, pl=<optimized out>) at mdbx.c:6374
#11 mdbx_pnl_check4assert (limit=<optimized out>, pl=<optimized out>, pl@entry=0x332b010) at mdbx.c:6401
#12 mdbx_pnl_search (pnl=pnl@entry=0x734349c967d4, pgno=pgno@entry=2032894574) at mdbx.c:6494
#13 0x0000000000422b6f in mdbx_pnl_exist (pgno=2032894574, pnl=0x734349c967d4) at mdbx.c:6507
#14 mdbx_page_get_ex (front=897, pgno=1016447287, mc=0x7f449c096120) at mdbx.c:16693
#15 mdbx_page_get (front=897, mp=<synthetic pointer>, pgno=1016447287, mc=0x7f449c096120) at mdbx.c:7041
#16 mdbx_page_search_root (mc=mc@entry=0x7f449c096120, key=key@entry=0x7f44b5c40c20, flags=flags@entry=0) at mdbx.c:16802
#17 0x00000000004231e2 in mdbx_page_search (mc=mc@entry=0x7f449c096120, key=key@entry=0x7f44b5c40c20, flags=flags@entry=0) at mdbx.c:17012
#18 0x0000000001277312 in mdbx_cursor_set (mc=mc@entry=0x7f449c096120, key=key@entry=0x7f44b5c40dd0, data=data@entry=0x7f44b5c40d10, op=op@entry=MDBX_SET) at mdbx.c:17536
#19 0x0000000001281341 in mdbx_cursor_put (mc=0x7f449c096120, key=key@entry=0x7f44b5c40dd0, data=data@entry=0x7f44b5c40de0, flags=16) at mdbx.c:18374

@erthink
Owner

erthink commented Jan 25, 2022

The last stack backtrace shows the same bug as noted above, but in another execution path. So we can ignore it by disabling assertion checking.

However, I need to understand why the problem was not reproduced in the tests, improve them so that this case is reproducible, and only then fix the issue.

@erthink
Owner

erthink commented Jan 26, 2022

I think the issue has been fixed completely and the code is ready for testing.

I also found out why the second case was not reproduced by the tests.
Briefly, the test cases were "too stochastic", so some states and transitions between them had too low a probability within the narrowed page-number range configuration, which is required for testing this issue on hardware with less than 512 Gb of RAM.

In particular, the tests used earlier were more likely to end due to exhaustion of the available page range before enough stochastic iterations had been performed using more than 50% of the page range, which is required to reproduce the problem.

@AskAlexSharov
Contributor Author

AskAlexSharov commented Jan 26, 2022

@flywukong I created mdbx_4tb_fix branch in erigon's repo. Feel free to try.

@erthink
Owner

erthink commented Jan 27, 2022

Any new info?

@AskAlexSharov
Contributor Author

AskAlexSharov commented Jan 28, 2022

I have one confirmation that the issue is fixed.
