Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Coredump in complete_aio #390

Open
neverchanje opened this issue Sep 7, 2019 · 1 comment
Open

Coredump in complete_aio #390

neverchanje opened this issue Sep 7, 2019 · 1 comment
Labels
type/bug This issue reports a bug.

Comments

@neverchanje
Copy link
Contributor

neverchanje commented Sep 7, 2019

Bug Report

Please answer these questions before submitting your issue. Thanks!

  1. What did you do?
    If possible, provide a recipe for reproducing the error.

Test bootstrap, after a previous cleanup. Though this is a new bootstrap,
there was still some log files not cleared. The pegasus-server replayed those
log files during initialization, which caused this core dump.

D2019-09-07 16:13:18.846 (1567843998846027369 1e500) replica.default0.0000e4a500010001: mutation_log.cpp:864:replay(): start to replay mutation log /home/work/app/pegasus/c4tst-performance/replica/replica/slog/log.3356.462664068974, offset = [462664068974, 462902133466), size = 238064492
D2019-09-07 16:13:19.873 (1567843999873498904 1e500) replica.default0.0000e4a500010001: mutation_log.cpp:914:replay(): finish to replay mutation log /home/work/app/pegasus/c4tst-performance/replica/replica/slog/log.3356.462664068974, err = ERR_HANDLE_EOF
D2019-09-07 16:13:19.873 (1567843999873593538 1e500) replica.default0.0000e4a500010001: mutation_log.cpp:864:replay(): start to replay mutation log /home/work/app/pegasus/c4tst-performance/replica/replica/slog/log.3357.462902133466, offset = [462902133466, 463173023339), size = 270889873
D2019-09-07 16:13:21.035 (1567844001035586779 1e500) replica.default0.0000e4a500010001: mutation_log.cpp:914:replay(): finish to replay mutation log /home/work/app/pegasus/c4tst-performance/replica/replica/slog/log.3357.462902133466, err = ERR_HANDLE_EOF
D2019-09-07 16:13:21.035 (1567844001035664808 1e500) replica.default0.0000e4a500010001: mutation_log.cpp:864:replay(): start to replay mutation log /home/work/app/pegasus/c4tst-performance/replica/replica/slog/log.3358.463173023339, offset = [463173023339, 463345410168), size = 172386829
D2019-09-07 16:13:21.788 (1567844001788814689 1e500) replica.default0.0000e4a500010001: mutation_log.cpp:914:replay(): finish to replay mutation log /home/work/app/pegasus/c4tst-performance/replica/replica/slog/log.3358.463173023339, err = ERR_HANDLE_EOF
D2019-09-07 16:13:21.789 (1567844001789227538 1e500) replica.default0.0000e4a500010001: mutation_log.cpp:864:replay(): start to replay mutation log /home/work/app/pegasus/c4tst-performance/replica/replica/slog/log.3359.463345410168, offset = [463345410168, 463526005305), size = 180595137
D2019-09-07 16:13:22.568 (1567844002568888601 1e500) replica.default0.0000e4a500010001: mutation_log.cpp:914:replay(): finish to replay mutation log /home/work/app/pegasus/c4tst-performance/replica/replica/slog/log.3359.463345410168, err = ERR_HANDLE_EOF
D2019-09-07 16:13:22.568 (1567844002568946492 1e500) replica.default0.0000e4a500010001: mutation_log.cpp:864:replay(): start to replay mutation log /home/work/app/pegasus/c4tst-performance/replica/replica/slog/log.3360.463526005305, offset = [463526005305, 463665560054), size = 139554749
  1. What did you expect to see?

No coredump.

  1. What did you see instead?
#0  load (__m=std::memory_order_relaxed, this=0x3d343e3c7b553a69) at /home/wutao1/app/include/c++/4.8.2/bits/atomic_base.h:496
#1  signal (this=0x3d343e3c7b553a69) at /home/wutao1/pegasus/rdsn/include/dsn/utility/hpc_locks/autoresetevent.h:33
#2  notify (this=0x3d343e3c7b553a69) at /home/wutao1/pegasus/rdsn/include/dsn/utility/synchronize.h:99
#3  dsn::tools::native_linux_aio_provider::complete_aio (this=this@entry=0x36f6f60, io=0xe462728, bytes=2145385340, err=<optimized out>)
    at /home/wutao1/pegasus/rdsn/src/core/tools/common/native_aio_provider.linux.cpp:160
#4  0x00007f9d5ae50d01 in dsn::tools::native_linux_aio_provider::get_event (this=0x36f6f60) at /home/wutao1/pegasus/rdsn/src/core/tools/common/native_aio_provider.linux.cpp:134
#5  0x00007f9d577e5600 in std::(anonymous namespace)::execute_native_thread_routine (__p=<optimized out>)
    at /home/qinzuoyan/git.xiaomi/pegasus/toolchain/objdir/../gcc-4.8.2/libstdc++-v3/src/c++11/thread.cc:84
#6  0x00007f9d58ddedc5 in start_thread () from /lib64/libpthread.so.0
#7  0x00007f9d56f4f73d in clone () from /lib64/libc.so.6

error logs

tcmalloc: large alloc 1502150656 bytes == 0x6dfcc000 @  0x7f9d580f1e07 0x7f9d581141a9 0x5b5d32 0x7f9d5ad89bde 0x7f9d5ad8a2b2 0x7f9d5ad8a50e 0x7f9d5ac68b8d 0x7f9d5ac5ec76 0x7f9d5ac5ef3c 0x7f9d5ac5f47f 0x7f9d5ac60b52 0x7f9d5acd22a2 0x7f9d5acf3bfc 0x5da801 0x7f9d5adbb1e2 0x7f9d5addea7b 0x7f9d5adc6859 0x7f9d5adda50d 0x7f9d5adda6d9 0x7f9d577e5600
tcmalloc: large alloc 1674551296 bytes == 0x1dbd6000 @  0x7f9d580f1e07 0x7f9d581141a9 0x5b5d32 0x7f9d5ad89bde 0x7f9d5ad8a2b2 0x7f9d5ad8a50e 0x7f9d5ac68b8d 0x7f9d5ac5ec76 0x7f9d5ac5ef3c 0x7f9d5ac5f47f 0x7f9d5ac60b52 0x7f9d5acd22a2 0x7f9d5acf3bfc 0x5da801 0x7f9d5adbb1e2 0x7f9d5addea7b 0x7f9d5adc6859 0x7f9d5adda50d 0x7f9d5adda6d9 0x7f9d577e5600
task_spec.cpp:94:register_task_code():overwrite default thread pool for task RPC_CM_QUERY_PARTITION_CONFIG_BY_INDEX from THREAD_POOL_META_SERVER to THREAD_POOL_DEFAULT
task_spec.cpp:94:register_task_code():overwrite default thread pool for task RPC_CM_QUERY_PARTITION_CONFIG_BY_INDEX_ACK from THREAD_POOL_META_SERVER to THREAD_POOL_DEFAULT
main.cpp:92:main():pegasus server starting, pid(122326), version($Version: Pegasus Server 1.12.SNAPSHOT (38dccf6d774751a1f5bd67d6572c2156412979a5) Release, built with rDSN 1.0.0 (f631ed925c291ac40570fdcbd447655650a607cd), built by gcc 4.8.2, built on c3-hadoop-build03.bj, built at Sep  5 2019 16:10:19 $)
got signal id: 11
  1. What version of Pegasus are you using?

Pegasus Server 1.12.SNAPSHOT (38dccf6) release

@neverchanje neverchanje added the type/bug This issue reports a bug. label Sep 7, 2019
@neverchanje
Copy link
Contributor Author

neverchanje commented Sep 7, 2019

Code point

The bug is caused by inaccessible address 0x3d343e3c7b553a69, which is
aio->evt in dsn::tools::native_linux_aio_provider::complete_aio, which indicates memory corruption of linux_disk_aio_context.

void native_linux_aio_provider::complete_aio(struct iocb *io, int bytes, int err)
{
    linux_disk_aio_context *aio = CONTAINING_RECORD(io, linux_disk_aio_context, cb);
    error_code ec;
    if (err != 0) {
        derror("aio error, err = %s", strerror(err));
        ec = ERR_FILE_OPERATION_FAILED;
    } else {
        ec = bytes > 0 ? ERR_OK : ERR_HANDLE_EOF;
    }

    if (!aio->evt) {
        aio_task *aio_ptr(aio->tsk);
        aio->this_->complete_io(aio_ptr, ec, bytes);
    } else {
        aio->err = ec; // ec = 0
        aio->bytes = bytes; // bytes = 2,145,385,340, nearly exceeds int max = 2,147,483,647
        aio->evt->notify(); // coredump point, evt is inaccessible
    }
}

Memory dump of linux_disk_aio_context:

$6 = {
  <dsn::disk_aio> = {
    file = 0x5c3d225c76292532, 
    buffer = 0x316e3a3271492e22, 
    support_write_vec = 92, 
    write_buffer_vec = 0x212f423a774e3673, 
    buffer_size = 1161178450, 
    file_offset = 5346938506896954679, 
    type = (dsn::AIO_Read | unknown: 808398908), 
    engine = 0x6a2530363730614f, 
    file_object = 0x422563473b7c3f3c
  }, 
  members of dsn::tools::native_linux_aio_provider::linux_disk_aio_context: 
  cb = {
    data = 0x343b523e3a2c242f, 
    key = 1178021723, 
    __pad2 = 1029324393, 
    aio_lio_opcode = 19250, 
    aio_reqprio = 13949, 
    aio_fildes = 892874815, 
    u = {
      c = {
        buf = 0x3d24242a39482b68, 
        nbytes = 5278068530193660960, 
        offset = 3475146331730683773, 
        __pad3 = 8806003118902164527, 
        flags = 913847603, 
        resfd = 1546202665
      }, 
      v = {
        vec = 0x3d24242a39482b68, 
        nr = 576486432, 
        offset = 3475146331730683773
      }, 
      poll = {
        events = 961031016, 
        __pad1 = 1025778730
      }, 
      saddr = {
        addr = 0x3d24242a39482b68, 
        len = 576486432
      }
    }
  }, 
  tsk = 0x3353212d5223755c, 
  this_ = 0x472937503a7c3c31, 
  evt = 0x3d343e3c7b553a69, 
  err = {
    _internal_code = 0
  }, 
  bytes = 2145385340
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/bug This issue reports a bug.
Projects
None yet
Development

No branches or pull requests

1 participant