-
Notifications
You must be signed in to change notification settings - Fork 252
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Core] Sort Sparse Matrix Rows In-place (reopen #12151) #12153
Conversation
Yes, a 10% time reduction. | / |
' / __| _` | __| _ \ __|
. \ | ( | | ( |\__ \
_|\_\_| \__,_|\__|\___/ ____/
Multi-Physics 9.4."3"-core/improve-f91598bfeb-Release-x86_64
Compiled for GNU/Linux and Python3.10 with GCC-11.4
Compiled with threading and MPI support.
Maximum number of threads: 16.
Running without MPI.
Matrix reading execution time: 65.3339204788208 seconds
Matrix multiplication execution time: 4.3560473918914795 seconds
* Las tareas reutilizarán el terminal, presione cualquier tecla para cerrarlo. Proposal: | / |
' / __| _` | __| _ \ __|
. \ | ( | | ( |\__ \
_|\_\_| \__,_|\__|\___/ ____/
Multi-Physics 9.4."3"-core/matrix-product-performance-40bb818051-Release-x86_64
Compiled for GNU/Linux and Python3.10 with GCC-11.4
Compiled with threading and MPI support.
Maximum number of threads: 16.
Running without MPI.
Matrix reading execution time: 65.36043405532837 seconds
Matrix multiplication execution time: 3.974817991256714 seconds
* Las tareas reutilizarán el terminal, presione cualquier tecla para cerrarlo. |
I couldn't solve the memory error. |
Making draft to avoid merge |
@matekelemen would you mind to merge #12143 first? |
Give me until Monday to debug the problem. |
On the other side I will try proposal of @RiccardoRossi |
Implementing this as: template <typename TSize, typename Col, typename TIndexType, typename ValueType>
static inline void SortRows(
const TIndexType* CPtr,
const TSize NRows,
const TSize NCols,
Col* Columns,
ValueType* Values
)
{
IndexPartition<TSize>(NRows).for_each([&](TSize i_row) {
const TIndexType row_begin = CPtr[i_row];
const TIndexType row_end = CPtr[i_row + 1];
TSize length = row_end - row_begin;
// Create and sort a vector of indices based on column indices
std::vector<IndexType> indices(length);
std::iota(indices.begin(), indices.end(), row_begin); // Starting from row_begin
std::sort(indices.begin(), indices.end(), [&](IndexType i, IndexType j) {
return Columns[i] < Columns[j];
});
// Apply the sorted indices to permute Columns and Values in place
// Note: This requires a careful approach to avoid overwriting
std::vector<bool> visited(length, false);
for (TSize i = 0; i < length; ++i) {
if (visited[i] || indices[i] == row_begin + i)
continue; // Already in place or visited
// Cycle detection in permutation
TSize cycle_start = i;
IndexType prev_index = i;
Col temp_col = Columns[row_begin + i];
ValueType temp_val = Values[row_begin + i];
do {
IndexType next_index = indices[prev_index] - row_begin;
std::swap(Columns[row_begin + prev_index], temp_col);
std::swap(Values[row_begin + prev_index], temp_val);
visited[prev_index] = true;
prev_index = next_index;
} while (prev_index != cycle_start);
}
});
} It is slower than this proposal (but faster than #12143): | / |
' / __| _` | __| _ \ __|
. \ | ( | | ( |\__ \
_|\_\_| \__,_|\__|\___/ ____/
Multi-Physics 9.4."3"-core/improve-9a1ba2bc3b-Release-x86_64
Compiled for GNU/Linux and Python3.10 with GCC-11.4
Compiled with threading and MPI support.
Maximum number of threads: 16.
Running without MPI.
Matrix reading execution time: 65.73923110961914 seconds
Matrix multiplication execution time: 4.149796724319458 seconds
* Las tareas reutilizarán el terminal, presione cualquier tecla para cerrarlo. It addition it also crashes, so I think the problem is on the |
Ok I found the problem: I added some logic to move these values in and out the array, at some extra copying expense. We'll see how it affects performance (I'll run some tests and provide the results here). P.S.: I didn't see a measurable improvement |
With teh suggested change: | / |
' / __| _` | __| _ \ __|
. \ | ( | | ( |\__ \
_|\_\_| \__,_|\__|\___/ ____/
Multi-Physics 9.4."3"-core/avoid-copy-before-sort-matrix-37ffadf0e5-Release-x86_64
Compiled for GNU/Linux and Python3.10 with GCC-11.4
Compiled with threading and MPI support.
Maximum number of threads: 16.
Running without MPI.
Matrix reading execution time: 67.21308064460754 seconds
Matrix multiplication execution time: 4.146542310714722 seconds
* Las tareas reutilizarán el terminal, presione cualquier tecla para cerrarlo. Similar time to the @RiccardoRossi propossal. Faster than #12143, but much more code required. What option doi you prefer? (I prefer @RiccardoRossi option as it is cleaner and simpler) |
Myabe we should test more matrices to actually perform a benchmark |
I think the original implementation in #12143 (copy both arrays to a temporary array of pairs and sort that, then copy them back) is the most clean, and we should go with it for now. I ran my test case 5 times for both implementations and couldn't measure a difference in performance. My specs for running the benchmarks:
That's a good idea, if you're willing to generate some test operands that lead to a variety of row densities in the output matrix. |
In that case we can merge #12143 and then point this branch to master for merge, and expereiment changes and benchmark there. |
FYI @RiccardoRossi |
Copying message from @RiccardoRossi:
alternatively also boost has a zip_iterator which probably would allow us doing the same
|
@matekelemen I don't remember what is this and if we agreed something |
I think the initial version that copied to a vector of pairs, sorted, then copied back was merged instead of in-place sorting, so we can safely close this PR. |
Reopen #12151
We need to study the segmentation fault.
Sort values and column indices in the rows of a sparse matrix in-place. For each row of the input matrix, this saves
FYI @loumalouomega