[Core] Sort Sparse Matrix Rows In-place (reopen #12151) #12153

loumalouomega · 2024-03-06T11:22:18Z

We need to study the segmentation fault.

Sort values and column indices in the rows of a sparse matrix in-place. For each row of the input matrix, this saves

one allocation
four row-sized copies
one deallocation

loumalouomega · 2024-03-06T11:22:50Z

Yes, a 10% time reduction.

 |  /           |                  
 ' /   __| _` | __|  _ \   __|    
 . \  |   (   | |   (   |\__ \  
_|\_\_|  \__,_|\__|\___/ ____/
           Multi-Physics 9.4."3"-core/improve-f91598bfeb-Release-x86_64
           Compiled for GNU/Linux and Python3.10 with GCC-11.4
Compiled with threading and MPI support.
Maximum number of threads: 16.
Running without MPI.
Matrix reading execution time:  65.3339204788208  seconds
Matrix multiplication execution time:  4.3560473918914795  seconds
 *  Las tareas reutilizarán el terminal, presione cualquier tecla para cerrarlo.

Proposal:

 |  /           |                  
 ' /   __| _` | __|  _ \   __|    
 . \  |   (   | |   (   |\__ \  
_|\_\_|  \__,_|\__|\___/ ____/
           Multi-Physics 9.4."3"-core/matrix-product-performance-40bb818051-Release-x86_64
           Compiled for GNU/Linux and Python3.10 with GCC-11.4
Compiled with threading and MPI support.
Maximum number of threads: 16.
Running without MPI.
Matrix reading execution time:  65.36043405532837  seconds
Matrix multiplication execution time:  3.974817991256714  seconds
 *  Las tareas reutilizarán el terminal, presione cualquier tecla para cerrarlo.

loumalouomega · 2024-03-06T13:24:08Z

I couldn't solve the memory error.

loumalouomega · 2024-03-06T13:24:46Z

I couldn't solve the memory error.

Making draft to avoid merge

RiccardoRossi · 2024-03-06T14:59:44Z

might be of help

https://stackoverflow.com/questions/37368787/c-sort-one-vector-based-on-another-one

loumalouomega · 2024-03-07T09:34:08Z

@matekelemen would you mind to merge #12143 first?

matekelemen · 2024-03-07T10:03:23Z

Give me until Monday to debug the problem.

loumalouomega · 2024-03-07T10:33:07Z

Give me until Monday to debug the problem.

On the other side I will try proposal of @RiccardoRossi

loumalouomega · 2024-03-07T13:04:07Z

Give me until Monday to debug the problem.

On the other side I will try proposal of @RiccardoRossi

Implementing this as:

template <typename TSize, typename Col, typename TIndexType, typename ValueType>
    static inline void SortRows(
        const TIndexType* CPtr,
        const TSize NRows,
        const TSize NCols,
        Col* Columns,
        ValueType* Values
        )
    {
        IndexPartition<TSize>(NRows).for_each([&](TSize i_row) {
            const TIndexType row_begin = CPtr[i_row];
            const TIndexType row_end = CPtr[i_row + 1];
            TSize length = row_end - row_begin;

            // Create and sort a vector of indices based on column indices
            std::vector<IndexType> indices(length);
            std::iota(indices.begin(), indices.end(), row_begin); // Starting from row_begin
            std::sort(indices.begin(), indices.end(), [&](IndexType i, IndexType j) {
                return Columns[i] < Columns[j];
            });

            // Apply the sorted indices to permute Columns and Values in place
            // Note: This requires a careful approach to avoid overwriting
            std::vector<bool> visited(length, false);
            for (TSize i = 0; i < length; ++i) {
                if (visited[i] || indices[i] == row_begin + i)
                    continue; // Already in place or visited

                // Cycle detection in permutation
                TSize cycle_start = i;
                IndexType prev_index = i;
                Col temp_col = Columns[row_begin + i];
                ValueType temp_val = Values[row_begin + i];

                do {
                    IndexType next_index = indices[prev_index] - row_begin;
                    std::swap(Columns[row_begin + prev_index], temp_col);
                    std::swap(Values[row_begin + prev_index], temp_val);

                    visited[prev_index] = true;
                    prev_index = next_index;
                } while (prev_index != cycle_start);
            }
        });
    }

It is slower than this proposal (but faster than #12143):

 |  /           |                  
 ' /   __| _` | __|  _ \   __|    
 . \  |   (   | |   (   |\__ \  
_|\_\_|  \__,_|\__|\___/ ____/
           Multi-Physics 9.4."3"-core/improve-9a1ba2bc3b-Release-x86_64
           Compiled for GNU/Linux and Python3.10 with GCC-11.4
Compiled with threading and MPI support.
Maximum number of threads: 16.
Running without MPI.
Matrix reading execution time:  65.73923110961914  seconds
Matrix multiplication execution time:  4.149796724319458  seconds
 *  Las tareas reutilizarán el terminal, presione cualquier tecla para cerrarlo.

It addition it also crashes, so I think the problem is on the std::swap, maybe some values are not defined and induces a memory error.

…efore-sort-matrix

matekelemen · 2024-03-08T10:43:55Z

Ok I found the problem: std::sort doesn't always use std::swap. Specifically on small ranges, it will use the move (or copy) constructor and move (or copy) assignment operator, which is a problem in our case because the value type of our iterator is just a pair of pointers pointing inside the range, and thus they cannot moved (or copied) outside it.

I added some logic to move these values in and out the array, at some extra copying expense. We'll see how it affects performance (I'll run some tests and provide the results here).

P.S.: I didn't see a measurable improvement

loumalouomega · 2024-03-11T10:15:09Z

With teh suggested change:

 |  /           |                  
 ' /   __| _` | __|  _ \   __|    
 . \  |   (   | |   (   |\__ \  
_|\_\_|  \__,_|\__|\___/ ____/
           Multi-Physics 9.4."3"-core/avoid-copy-before-sort-matrix-37ffadf0e5-Release-x86_64
           Compiled for GNU/Linux and Python3.10 with GCC-11.4
Compiled with threading and MPI support.
Maximum number of threads: 16.
Running without MPI.
Matrix reading execution time:  67.21308064460754  seconds
Matrix multiplication execution time:  4.146542310714722  seconds
 *  Las tareas reutilizarán el terminal, presione cualquier tecla para cerrarlo.

Similar time to the @RiccardoRossi propossal. Faster than #12143, but much more code required. What option doi you prefer? (I prefer @RiccardoRossi option as it is cleaner and simpler)

loumalouomega · 2024-03-11T10:15:43Z

With teh suggested change:

 |  /           |                  
 ' /   __| _` | __|  _ \   __|    
 . \  |   (   | |   (   |\__ \  
_|\_\_|  \__,_|\__|\___/ ____/
           Multi-Physics 9.4."3"-core/avoid-copy-before-sort-matrix-37ffadf0e5-Release-x86_64
           Compiled for GNU/Linux and Python3.10 with GCC-11.4
Compiled with threading and MPI support.
Maximum number of threads: 16.
Running without MPI.
Matrix reading execution time:  67.21308064460754  seconds
Matrix multiplication execution time:  4.146542310714722  seconds
 *  Las tareas reutilizarán el terminal, presione cualquier tecla para cerrarlo.

Similar time to the @RiccardoRossi propossal. Faster than #12143, but much more code required. What option doi you prefer? (I prefer @RiccardoRossi option as it is cleaner and simpler)

Myabe we should test more matrices to actually perform a benchmark

matekelemen · 2024-03-11T11:07:39Z

Similar time to the @RiccardoRossi propossal. Faster than #12143, but much more code required. What option doi you prefer? (I prefer @RiccardoRossi option as it is cleaner and simpler)

I think the original implementation in #12143 (copy both arrays to a temporary array of pairs and sort that, then copy them back) is the most clean, and we should go with it for now. I ran my test case 5 times for both implementations and couldn't measure a difference in performance.

My specs for running the benchmarks:

11th Gen Intel(R) Core(TM) i9-11900 @ 2.50GHz
384K L1, 4M L2, 16M L3
OMP_NUM_THREADS=8 (only physical cores)

Myabe we should test more matrices to actually perform a benchmark

That's a good idea, if you're willing to generate some test operands that lead to a variety of row densities in the output matrix.

loumalouomega · 2024-03-11T11:09:51Z

Similar time to the @RiccardoRossi propossal. Faster than #12143, but much more code required. What option doi you prefer? (I prefer @RiccardoRossi option as it is cleaner and simpler)

I think the original implementation in #12143 (copy both arrays to a temporary array of pairs and sort that, then copy them back) is the most clean, and we should go with it for now. I ran my test case 5 times for both implementations and couldn't measure a difference in performance.

My specs for running the benchmarks:
* 11th Gen Intel(R) Core(TM) i9-11900 @ 2.50GHz

* 384K L1, 4M L2, 16M L3

* `OMP_NUM_THREADS=8` (only physical cores)
Myabe we should test more matrices to actually perform a benchmark

That's a good idea, if you're willing to generate some test operands that lead to a variety of row densities in the output matrix.

In that case we can merge #12143 and then point this branch to master for merge, and expereiment changes and benchmark there.

loumalouomega · 2024-03-22T14:49:01Z

FYI @RiccardoRossi

loumalouomega · 2024-03-22T15:30:31Z

Copying message from @RiccardoRossi:

this code, taken from:
https://codereview.stackexchange.com/questions/231352/c17-zip-iterator-compatible-with-stdsort

would most probably do what you need without copies. Having said this, admittedly i am not sure wheter it is faster or not, but i would expect that allocation of a lot of pairs is not for free. 
(the main is at the end of the post ... everything at the beginning are the helper functions)

~~~cpp
#include <cstdint>
#include <algorithm>
#include <functional>
#include <tuple>
#include <utility>

template <typename ...T>
class ZipRef {
    std::tuple<T*...> ptr;
public:
    ZipRef() = delete;
    ZipRef(const ZipRef& z) = default;
    ZipRef(ZipRef&& z) = default;
    ZipRef(T* const... p): ptr(p...) {}

    ZipRef& operator=(const ZipRef& z)             { return copy_assign(z); }
    ZipRef& operator=(const std::tuple<T...>& val) { return val_assign(val); }

    template <std::size_t I = 0>
    ZipRef& copy_assign(const ZipRef& z) {
        *(std::get<I>(ptr)) = *(std::get<I>(z.ptr));
        if constexpr(I+1 < sizeof...(T)) return copy_assign<I+1>(z);
        return *this;
    }
    template <std::size_t I = 0>
    ZipRef& val_assign(const std::tuple<T...>& t) {
        *(std::get<I>(ptr)) = std::get<I>(t);
        if constexpr(I+1 < sizeof...(T)) return val_assign<I+1>(t);
        return *this;
    }

    std::tuple<T...> val() const {return std::apply([](auto&&...args) { return std::tuple((*args)...); }, ptr);}
    operator std::tuple<T...>() const { return val(); }

    void swap(const ZipRef& o) const {
        swap_impl<sizeof...(T)-1>(o);
    }

private:
    template <std::size_t I>
    void swap_impl(const ZipRef& o) const {
        using std::swap;
        swap(*std::get<I>(ptr), *std::get<I>(o.ptr));
        if constexpr(I) swap_impl<I-1>(o);
    }

public:
#define OPERATOR(OP)                                                    \
    bool operator OP(const ZipRef & o) const { return val() OP o.val(); } \
    inline friend bool operator OP(const ZipRef& r, const std::tuple<T...>& t) { return r.val() OP t; } \
    inline friend bool operator OP(const std::tuple<T...>& t, const ZipRef& r) { return t OP r.val(); }

    OPERATOR(==) OPERATOR(<=) OPERATOR(>=)
    OPERATOR(!=) OPERATOR(<)  OPERATOR(>)
#undef OPERATOR

};

template<typename ...IT>
class ZipIter {
    std::tuple<IT...> it;

    template<int N, typename... T> using NthTypeOf =
        typename std::tuple_element<N, std::tuple<T...>>::type;
    template<typename... T> using FirstTypeOf = NthTypeOf<0, T...>;

public:
    using iterator_category = typename std::iterator_traits<FirstTypeOf<IT...>>::iterator_category;
    using difference_type   = typename std::iterator_traits<FirstTypeOf<IT...>>::difference_type;
    using value_type        = std::tuple<typename std::iterator_traits<IT>::value_type ...>;
    using pointer           = std::tuple<typename std::iterator_traits<IT>::pointer ...>;
    using reference         = ZipRef<typename std::iterator_traits<IT>::value_type ...>;

    ZipIter() = default;
    ZipIter(const ZipIter &rhs) = default;
    ZipIter(ZipIter&& rhs) = default;
    ZipIter(IT... rhs): it(std::move(rhs)...) {}

    ZipIter& operator=(const ZipIter& rhs) = default;
    ZipIter& operator=(ZipIter&& rhs) = default;

    ZipIter& operator+=(const difference_type d) {
        std::apply([&d](auto&&...args) {((std::advance(args,d)),...);}, it); return *this;
    }
    ZipIter& operator-=(const difference_type d) { return operator+=(-d); }

    reference operator* () const {return std::apply([](auto&&...args) {return reference(&(*(args))...);}, it);}
    pointer   operator->() const {return std::apply([](auto&&...args) {return pointer(&(*(args))...);}, it);}
    reference operator[](difference_type rhs) const {return *(operator+(rhs));}

    ZipIter& operator++() { return operator+=(1); }
    ZipIter& operator--() { return operator+=(-1); }
    ZipIter operator++(int) {ZipIter tmp(*this); operator++(); return tmp;}
    ZipIter operator--(int) {ZipIter tmp(*this); operator--(); return tmp;}

    difference_type operator-(const ZipIter& rhs) const {return std::get<0>(it)-std::get<0>(rhs.it);}
    ZipIter operator+(const difference_type d) const {ZipIter tmp(*this); tmp += d; return tmp;}
    ZipIter operator-(const difference_type d) const {ZipIter tmp(*this); tmp -= d; return tmp;}
    inline friend ZipIter operator+(const difference_type d, const ZipIter& z) {return z+d;}
    inline friend ZipIter operator-(const difference_type d, const ZipIter& z) {return z-d;}

#define OPERATOR(OP)                                                    \
    bool operator OP(const ZipIter& rhs) const {return it OP rhs.it;}
    OPERATOR(==) OPERATOR(<=) OPERATOR(>=)
    OPERATOR(!=) OPERATOR(<)  OPERATOR(>)
#undef OPERATOR
};

template<typename ...Container>
class Zip {
    std::tuple<Container&...> zip;

public:
    Zip() = delete;
    Zip(const Zip& z) = default;
    Zip(Zip&& z) = default;
    Zip(Container&... z)
        : zip {z...}
    {
        auto const len = {(z.size())...,};
        // if (std::adjacent_find(len.begin(), len.end(), std::not_equal_to()) != len.end())
        //     throw std::error("array lengths differ");
    }

#define HELPER(OP)                                                      \
    auto OP() {return std::apply([](auto&&... args) { return ZipIter((args.OP())...);}, zip);}
    HELPER(begin) HELPER(end)
    HELPER(rbegin) HELPER(rend)
#undef HELPER

#define HELPER(OP)                                                      \
    auto OP() const {return std::apply([](auto&&... args) { return ZipIter((args.OP())...);}, zip);}
    HELPER(begin) HELPER(end)
    HELPER(rbegin) HELPER(rend)
    HELPER(cbegin) HELPER(cend)
    HELPER(crbegin) HELPER(crend)
#undef HELPER
};

#include <utility>
using std::swap;
template<typename ...T> void swap(const ZipRef<T...>& a, const ZipRef<T...>& b) { a.swap(b); }

#include <sstream>
template< class Ch, class Tr, class...IT, typename std::enable_if<(sizeof...(IT)>0), int>::type = 0>
auto& operator<<(std::basic_ostream<Ch, Tr>& os, const ZipRef<IT...>& t) {
    std::basic_stringstream<Ch, Tr> ss;
    ss << "[ ";
    std::apply([&ss](auto&&... args) {((ss << args << ", "), ...);}, t.val());
    ss.seekp(-2, ss.cur);
    ss << " ]";
    return os << ss.str();
}


#include <vector>
#include <string>
#include <algorithm>
#include <iostream>

int main() {
    std::vector<int> a {3,1,4,2};
    //std::vector<std::string> b {"Alice","Bob","Charles","David"};
    std::vector<double> b {3000,1000,4000,2000};

    auto const zip = Zip(a,b);

    for (const auto & z: zip) std::cout << z << std::endl;

    std::cout << std::endl;
    std::sort(zip.begin(), zip.end());
    for (const auto & z: zip) std::cout << z << std::endl;
}

alternatively also boost has a zip_iterator which probably would allow us doing the same

loumalouomega · 2024-10-17T12:15:29Z

@matekelemen I don't remember what is this and if we agreed something

matekelemen · 2024-10-17T13:01:14Z

I think the initial version that copied to a vector of pairs, sorted, then copied back was merged instead of in-place sorting, so we can safely close this PR.

sort sparse matrix rows in-place

ee90662

loumalouomega requested a review from matekelemen March 6, 2024 11:22

loumalouomega requested a review from a team as a code owner March 6, 2024 11:22

loumalouomega marked this pull request as draft March 6, 2024 13:24

loumalouomega mentioned this pull request Mar 7, 2024

[Core] In SparseMatrixMultiplicationUtility some modernization + plus SortRows minor optimization + replace omp simple loops with IndexPartition<IndexType> #12143

Merged

matekelemen added 3 commits March 8, 2024 11:36

fix tolerance argument

99f4117

support std::sort on small arrays

78613b5

Merge branch 'core/matrix-product-performance' into core/avoid-copy-b…

37ffadf

…efore-sort-matrix

loumalouomega marked this pull request as ready for review March 11, 2024 09:31

Minor

e354171

Merge branch 'core/improve' into core/avoid-copy-before-sort-matrix

82f8180

Base automatically changed from core/improve to master March 26, 2024 12:21

loumalouomega closed this Oct 17, 2024

loumalouomega deleted the core/avoid-copy-before-sort-matrix branch October 17, 2024 13:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Core] Sort Sparse Matrix Rows In-place (reopen #12151) #12153

[Core] Sort Sparse Matrix Rows In-place (reopen #12151) #12153

loumalouomega commented Mar 6, 2024

loumalouomega commented Mar 6, 2024

loumalouomega commented Mar 6, 2024

loumalouomega commented Mar 6, 2024

RiccardoRossi commented Mar 6, 2024

loumalouomega commented Mar 7, 2024

matekelemen commented Mar 7, 2024

loumalouomega commented Mar 7, 2024

loumalouomega commented Mar 7, 2024

matekelemen commented Mar 8, 2024 •

edited

Loading

loumalouomega commented Mar 11, 2024

loumalouomega commented Mar 11, 2024

matekelemen commented Mar 11, 2024 •

edited

Loading

loumalouomega commented Mar 11, 2024

loumalouomega commented Mar 22, 2024

loumalouomega commented Mar 22, 2024

loumalouomega commented Oct 17, 2024

matekelemen commented Oct 17, 2024

[Core] Sort Sparse Matrix Rows In-place (reopen #12151) #12153

[Core] Sort Sparse Matrix Rows In-place (reopen #12151) #12153

Conversation

loumalouomega commented Mar 6, 2024

loumalouomega commented Mar 6, 2024

loumalouomega commented Mar 6, 2024

loumalouomega commented Mar 6, 2024

RiccardoRossi commented Mar 6, 2024

loumalouomega commented Mar 7, 2024

matekelemen commented Mar 7, 2024

loumalouomega commented Mar 7, 2024

loumalouomega commented Mar 7, 2024

matekelemen commented Mar 8, 2024 • edited Loading

loumalouomega commented Mar 11, 2024

loumalouomega commented Mar 11, 2024

matekelemen commented Mar 11, 2024 • edited Loading

loumalouomega commented Mar 11, 2024

loumalouomega commented Mar 22, 2024

loumalouomega commented Mar 22, 2024

loumalouomega commented Oct 17, 2024

matekelemen commented Oct 17, 2024

matekelemen commented Mar 8, 2024 •

edited

Loading

matekelemen commented Mar 11, 2024 •

edited

Loading