From ad4f74dec8c4dda84004004e73fb295067e1b0ab Mon Sep 17 00:00:00 2001 From: Anirban Date: Thu, 21 Mar 2024 14:56:14 -0700 Subject: [PATCH 01/23] Wrote function-specific parallelization details and OpenMP use for between(), CJ(), and fcoalesce() --- man/openmp-utils.Rd | 23 +++++++++++++++++++++++ 1 file changed, 23 insertions(+) diff --git a/man/openmp-utils.Rd b/man/openmp-utils.Rd index df942009c..a8d234661 100644 --- a/man/openmp-utils.Rd +++ b/man/openmp-utils.Rd @@ -38,8 +38,31 @@ \itemize{ \item\file{between.c} - \code{\link{between}()} + + Parallelism is used here to speed up range checks that are performed over the elements of the supplied vector. OpenMP is used here to parallelize: + \itemize{ + \item The loops that check if each element of the vector provided is between the specified \code{lower} and \code{upper} bounds, for integer (\code{INTSXP}) and real (\code{REALSXP}) types + \item Checking and handling of undefined values (such as NaNs) + } + \item\file{cj.c} - \code{\link{CJ}()} + + Parallelism is used here to speed up the creation of all combinations of the input vectors over the cross-product space. Better speedup can be expected when dealing with large vectors or a multitude of combinations. OpenMP is used here to parallelize: + + \itemize{ + \item Element assignment in vectors + \item Memory copying operations (blockwise replication of data using \code{memcpy}) + } + \item\file{coalesce.c} - \code{\link{fcoalesce}()} + + Parallelism is used here to reduce the time taken to replace NA values with the first non-NA value from other vectors. Operates over the columns provided. 
OpenMP is used here to parallelize: + \itemize{ + \item The operation that iterates over the rows to coalesce the data (which can be of type integer, real, or complex) + \item The replacement of NAs with non-NA values from subsequent vectors + \item The conditional checks within parallelized loops + } + \item\file{fifelse.c} - \code{\link{fifelse}()} \item\file{fread.c} - \code{\link{fread}()} \item\file{forder.c}, \file{fsort.c}, and \file{reorder.c} - \code{\link{forder}()} and related From f9e29d5bab45c984378f0af3f7b665e465240aa2 Mon Sep 17 00:00:00 2001 From: Anirban Date: Thu, 21 Mar 2024 16:59:29 -0700 Subject: [PATCH 02/23] Wrote about function-specific use of OpenMP in parallelization for fifelse() and nafill() (taking longer to read and understand fread due to its size, and will probably make more changes to these too) --- man/openmp-utils.Rd | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/man/openmp-utils.Rd b/man/openmp-utils.Rd index a8d234661..09cf539f4 100644 --- a/man/openmp-utils.Rd +++ b/man/openmp-utils.Rd @@ -47,7 +47,7 @@ \item\file{cj.c} - \code{\link{CJ}()} - Parallelism is used here to speed up the creation of all combinations of the input vectors over the cross-product space. Better speedup can be expected when dealing with large vectors or a multitude of combinations. OpenMP is used here to parallelize: + Parallelism is used here to expedite the creation of all combinations of the input vectors over the cross-product space. Better speedup can be expected when dealing with large vectors or a multitude of combinations. OpenMP is used here to parallelize: \itemize{ \item Element assignment in vectors @@ -56,7 +56,7 @@ \item\file{coalesce.c} - \code{\link{fcoalesce}()} - Parallelism is used here to reduce the time taken to replace NA values with the first non-NA value from other vectors. Operates over the columns provided. 
OpenMP is used here to parallelize: + Parallelism is used here to reduce the time taken to fill NA values with the first non-NA value from other vectors. Operates over columns to replace NAs. OpenMP is used here to parallelize: \itemize{ \item The operation that iterates over the rows to coalesce the data (which can be of type integer, real, or complex) \item The replacement of NAs with non-NA values from subsequent vectors @@ -64,12 +64,18 @@ } \item\file{fifelse.c} - \code{\link{fifelse}()} + + For logical, integer, and real types, OpenMP is being used here to parallelize loops that perform conditional checks along with assignment operations over the elements of the supplied logical vector based on the condition (\code{test}) and values provided for the remaining arguments (\code{yes}, \code{no}, and \code{na}). + \item\file{fread.c} - \code{\link{fread}()} \item\file{forder.c}, \file{fsort.c}, and \file{reorder.c} - \code{\link{forder}()} and related \item\file{froll.c}, \file{frolladaptive.c}, and \file{frollR.c} - \code{\link{froll}()} and family \item\file{fwrite.c} - \code{\link{fwrite}()} \item\file{gsumm.c} - GForce in various places, see \link{GForce} \item\file{nafill.c} - \code{\link{nafill}()} + + Parallelism is used here for faster filling of missing values. OpenMP is being used here to parallelize the loop that achieves the same, over columns of the input data. This includes handling different data types (double, integer, and integer64) and applying the designated filling method (constant, last observation carried forward, or next observation carried backward) to each column in parallel. 
+ \item\file{subset.c} - Used in \code{\link[=data.table]{[.data.table}} subsetting \item\file{types.c} - Internal testing usage } From fe2fad53b1e367c2a77127b733cd128926490ee0 Mon Sep 17 00:00:00 2001 From: Anirban Date: Thu, 21 Mar 2024 17:31:22 -0700 Subject: [PATCH 03/23] Wrote about use of OpenMP in parallelization for types.c (fun to recall schedule(dynamic) collapse(2)) --- man/openmp-utils.Rd | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/man/openmp-utils.Rd b/man/openmp-utils.Rd index 09cf539f4..75a6f4fe4 100644 --- a/man/openmp-utils.Rd +++ b/man/openmp-utils.Rd @@ -74,10 +74,12 @@ \item\file{gsumm.c} - GForce in various places, see \link{GForce} \item\file{nafill.c} - \code{\link{nafill}()} - Parallelism is used here for faster filling of missing values. OpenMP is being used here to parallelize the loop that achieves the same, over columns of the input data. This includes handling different data types (double, integer, and integer64) and applying the designated filling method (constant, last observation carried forward, or next observation carried backward) to each column in parallel. + Parallelism is used here for faster filling of missing values. OpenMP is being used here to parallelize the loop that achieves the same, over columns of the input data. This includes handling different data types (double, integer, and integer64) and applying the designated filling method (constant, last observation carried forward, or next observation carried backward) to each column in parallel.() \item\file{subset.c} - Used in \code{\link[=data.table]{[.data.table}} subsetting \item\file{types.c} - Internal testing usage + + Parallelism is being used here for enhancing the performance of internal tests (not impacting any user-facing operations or functions). 
OpenMP is being used here to test a message printing function inside a nested loop which has been collapsed into a single loop of the combined iteration space using \code{collapse(2)}, along with specification of dynamic scheduling for distributing the iterations in a way that can balance the workload among the threads. } } \examples{ From f3c53554dcbfe47ae06d8b0655ac54f5d18f1f2c Mon Sep 17 00:00:00 2001 From: Anirban Date: Fri, 22 Mar 2024 15:01:40 -0700 Subject: [PATCH 04/23] Wrote about function-specific use of OpenMP in parallelization for fwrite(), then a bit on gfroce (gsumm.c) and just the basics for subset.c --- man/openmp-utils.Rd | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/man/openmp-utils.Rd b/man/openmp-utils.Rd index 75a6f4fe4..d59e3a1c7 100644 --- a/man/openmp-utils.Rd +++ b/man/openmp-utils.Rd @@ -71,12 +71,21 @@ \item\file{forder.c}, \file{fsort.c}, and \file{reorder.c} - \code{\link{forder}()} and related \item\file{froll.c}, \file{frolladaptive.c}, and \file{frollR.c} - \code{\link{froll}()} and family \item\file{fwrite.c} - \code{\link{fwrite}()} + + OpenMP is primarily used here to parallelize the process of writing rows to the output file. Error handling and compression (if enabled) are also managed within this parallel region, with special attention to thread safety and synchronization, especially in the ordered sections where output to the file and handling of errors is serialized to maintain the correct sequence of rows. + \item\file{gsumm.c} - GForce in various places, see \link{GForce} + + Functions with GForce optimization are internally parallelized to speed up grouped summaries over a large \code{data.table}. OpenMP is used here to parallelize operations involved in calculating group-wise statistics like sum, mean, and median. The input data is split into batches (groups), and each thread processes a subset of the data based on them. 
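The batch/group idea described above for \file{gsumm.c} — threads each process a subset of the rows while accumulating per-group statistics — can be illustrated with a toy grouped sum. This is a generic sketch under assumed names, not GForce's actual implementation, which is considerably more elaborate:

```c
#include <stddef.h>

/* Toy parallel grouped sum: rows are split across threads, and each
   accumulation into the shared per-group totals is protected with an
   atomic update. group[i] holds the group id (0..ngroup-1) of x[i]. */
void group_sum(const double *x, const int *group, size_t n,
               int ngroup, double *out)
{
  for (int g = 0; g < ngroup; g++)
    out[g] = 0.0;
  #pragma omp parallel for
  for (size_t i = 0; i < n; i++) {
    #pragma omp atomic
    out[group[i]] += x[i]; /* safe concurrent accumulation */
  }
}
```

A production version would typically let each thread accumulate into private per-group buffers and merge once at the end, to avoid contention on the atomic; the atomic form is used here for brevity.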
+ \item\file{nafill.c} - \code{\link{nafill}()} - Parallelism is used here for faster filling of missing values. OpenMP is being used here to parallelize the loop that achieves the same, over columns of the input data. This includes handling different data types (double, integer, and integer64) and applying the designated filling method (constant, last observation carried forward, or next observation carried backward) to each column in parallel.() + Parallelism is used here for faster filling of missing values. OpenMP is being used here to parallelize the loop that achieves the same, over columns of the input data. This includes handling different data types (double, integer, and integer64) and applying the designated filling method (constant, last observation carried forward, or next observation carried backward) to each column in parallel. \item\file{subset.c} - Used in \code{\link[=data.table]{[.data.table}} subsetting + + Parallelism is used her to expedite the filtering of data. OpenMP is utilized here to parallelize the process of subsetting vectors that have sufficient elements to warrant multi-threaded processing. + \item\file{types.c} - Internal testing usage Parallelism is being used here for enhancing the performance of internal tests (not impacting any user-facing operations or functions). OpenMP is being used here to test a message printing function inside a nested loop which has been collapsed into a single loop of the combined iteration space using \code{collapse(2)}, along with specification of dynamic scheduling for distributing the iterations in a way that can balance the workload among the threads. 
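The \code{collapse(2)} plus dynamic-scheduling pattern mentioned for \file{types.c} can be sketched generically as below. The summation is a stand-in workload (the real internal test prints messages instead); names are assumptions, not the actual test code:

```c
/* A nested loop collapsed into a single ni*nj iteration space via
   collapse(2); schedule(dynamic) hands iterations out on demand, so
   uneven per-iteration work still balances across threads. */
long collapsed_total(int ni, int nj)
{
  long total = 0;
  #pragma omp parallel for collapse(2) schedule(dynamic) reduction(+:total)
  for (int i = 0; i < ni; i++)
    for (int j = 0; j < nj; j++)
      total += (long)i * nj + j; /* linear index in the collapsed space */
  return total;
}
```

With \code{collapse(2)} the scheduler distributes all \code{ni*nj} iterations as one pool, rather than only the \code{ni} outer iterations, which matters when \code{ni} is smaller than the thread count.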
From 79d7b01fa65cc3ddd040a8c0da141841a48e8657 Mon Sep 17 00:00:00 2001 From: Anirban Date: Fri, 22 Mar 2024 15:57:53 -0700 Subject: [PATCH 05/23] Wrote about function-specific use of OpenMP in parallelization for fread(), added a minor bit to the gsumm.c part --- man/openmp-utils.Rd | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/man/openmp-utils.Rd b/man/openmp-utils.Rd index d59e3a1c7..b339c4415 100644 --- a/man/openmp-utils.Rd +++ b/man/openmp-utils.Rd @@ -68,6 +68,15 @@ For logical, integer, and real types, OpenMP is being used here to parallelize loops that perform conditional checks along with assignment operations over the elements of the supplied logical vector based on the condition (\code{test}) and values provided for the remaining arguments (\code{yes}, \code{no}, and \code{na}). \item\file{fread.c} - \code{\link{fread}()} + + Parallelism is used here to read and process data in chunks (blocks of lines/rows). Expect significant speedup for large files, as I/O operations benefit greatly from parallel processing. OpenMP is used here to: + + \itemize{ + \item Avoid race conditions or concurrent writes to the output \code{data.table} by having atomic operations on the string data + \item Managing synchronized updates to the progress bar and serializing the output to the console + } + There are no explicit pragmas for parallelizing loops, and instead the use of OpenMP here is in controlling access to shared resources (with the use of critical sections, for instance) in a multi-threaded environment. 
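The coordination pattern described for \file{fread.c} — parallel work on chunks, with access to shared state serialized in critical sections rather than parallel-for pragmas over data — can be sketched as follows. The struct and function names are hypothetical, not fread's actual internals:

```c
#include <stddef.h>

/* Shared progress state, mimicking the synchronized progress-bar and
   console updates described above (hypothetical simplification). */
typedef struct {
  size_t chunks_done;
  size_t rows_total;
} progress_t;

void process_chunks(const size_t *rows_per_chunk, size_t nchunk, progress_t *p)
{
  p->chunks_done = 0;
  p->rows_total = 0;
  #pragma omp parallel for
  for (size_t c = 0; c < nchunk; c++) {
    size_t rows = rows_per_chunk[c]; /* stand-in for parsing one chunk */
    #pragma omp critical
    {
      /* only one thread at a time mutates the shared progress state */
      p->chunks_done += 1;
      p->rows_total += rows;
    }
  }
}
```

The parsing work runs concurrently; only the brief bookkeeping inside the critical section is serialized, so contention stays low as long as chunks are large relative to the update.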
+ \item\file{forder.c}, \file{fsort.c}, and \file{reorder.c} - \code{\link{forder}()} and related \item\file{froll.c}, \file{frolladaptive.c}, and \file{frollR.c} - \code{\link{froll}()} and family \item\file{fwrite.c} - \code{\link{fwrite}()} @@ -76,7 +85,7 @@ \item\file{gsumm.c} - GForce in various places, see \link{GForce} - Functions with GForce optimization are internally parallelized to speed up grouped summaries over a large \code{data.table}. OpenMP is used here to parallelize operations involved in calculating group-wise statistics like sum, mean, and median. The input data is split into batches (groups), and each thread processes a subset of the data based on them. + Functions with GForce optimization are internally parallelized to speed up grouped summaries over a large \code{data.table}. OpenMP is used here to parallelize operations involved in calculating group-wise statistics like sum, mean, and median (implying faster computation of \code{sd}, \code{var}, and \code{prod} as well). The input data is split into batches (groups), and each thread processes a subset of the data based on them. \item\file{nafill.c} - \code{\link{nafill}()} From f53c9526877acd40e04e904b41a68926ad4acffb Mon Sep 17 00:00:00 2001 From: Anirban Date: Fri, 22 Mar 2024 17:09:54 -0700 Subject: [PATCH 06/23] Wrote about use of OpenMP in parallelization for fsort() (involving condensed details for forder, fsort and reorder C files after some digging and understanding of the code) --- man/openmp-utils.Rd | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/man/openmp-utils.Rd b/man/openmp-utils.Rd index b339c4415..81ab2fab5 100644 --- a/man/openmp-utils.Rd +++ b/man/openmp-utils.Rd @@ -56,7 +56,7 @@ \item\file{coalesce.c} - \code{\link{fcoalesce}()} - Parallelism is used here to reduce the time taken to fill NA values with the first non-NA value from other vectors. Operates over columns to replace NAs. 
OpenMP is used here to parallelize: + Parallelism is used here to reduce the time taken to replace NA values with the first non-NA value from other vectors. OpenMP is used here to parallelize: \itemize{ \item The operation that iterates over the rows to coalesce the data (which can be of type integer, real, or complex) \item The replacement of NAs with non-NA values from subsequent vectors @@ -78,6 +78,16 @@ There are no explicit pragmas for parallelizing loops, and instead the use of OpenMP here is in controlling access to shared resources (with the use of critical sections, for instance) in a multi-threaded environment. \item\file{forder.c}, \file{fsort.c}, and \file{reorder.c} - \code{\link{forder}()} and related + + Parallelism is used here in multiple operations that come together to sort a \code{data.table} using the Radix algorithm. OpenMP is used here to parallelize: + + \itemize{ + \item The counting of unique values and recursively sorting subsets of data across different threads (specific to \file{forder.c}) + \item The process of finding the range and distribution of data for efficient grouping and sorting (applies for both \file{forder.c} and \file{fsort.c}) + \item Creation of histograms which are used to sort data based on significant bits (each thread processes a separate batch of the data, computes the MSB of each element, and then increments the corresponding bins), with the distribution and merging of buckets (specific to \file{fsort.c}) + \item The process of reordering a vector or each column in a list of vectors (such as in a \code{data.table}) based on a given vector that dictates the new ordering of elements (specific to \file{reorder.c}) + } + \item\file{froll.c}, \file{frolladaptive.c}, and \file{frollR.c} - \code{\link{froll}()} and family \item\file{fwrite.c} - \code{\link{fwrite}()} From fed71699c8cd3e2bb99022345672f88c56b441c3 Mon Sep 17 00:00:00 2001 From: Anirban Date: Fri, 22 Mar 2024 20:48:22 -0700 Subject: [PATCH 07/23] Wrote 
about points to help with speedup gains, finished going through of all the 12 enlisted use cases, made some minor tweaks and corrections to statements --- man/openmp-utils.Rd | 33 +++++++++++++++++++++++++-------- 1 file changed, 25 insertions(+), 8 deletions(-) diff --git a/man/openmp-utils.Rd b/man/openmp-utils.Rd index 81ab2fab5..1a43c85ca 100644 --- a/man/openmp-utils.Rd +++ b/man/openmp-utils.Rd @@ -47,7 +47,7 @@ \item\file{cj.c} - \code{\link{CJ}()} - Parallelism is used here to expedite the creation of all combinations of the input vectors over the cross-product space. Better speedup can be expected when dealing with large vectors or a multitude of combinations. OpenMP is used here to parallelize: + Parallelism is used here to expedite the creation of all combinations of the input vectors over the cross-product space. OpenMP is used here to parallelize: \itemize{ \item Element assignment in vectors @@ -69,29 +69,32 @@ \item\file{fread.c} - \code{\link{fread}()} - Parallelism is used here to read and process data in chunks (blocks of lines/rows). Expect significant speedup for large files, as I/O operations benefit greatly from parallel processing. OpenMP is used here to: + Parallelism is used here to speed up the reading and processing of data in chunks (blocks of lines/rows). OpenMP is used here to: \itemize{ \item Avoid race conditions or concurrent writes to the output \code{data.table} by having atomic operations on the string data - \item Managing synchronized updates to the progress bar and serializing the output to the console + \item Manage synchronized updates to the progress bar and serialize the output to the console } - There are no explicit pragmas for parallelizing loops, and instead the use of OpenMP here is in controlling access to shared resources (with the use of critical sections, for instance) in a multi-threaded environment. 
+ There are no explicit pragmas for parallelizing loops, and instead the use of OpenMP here is mainly in controlling access to shared resources (with the use of critical sections, for instance) in a multi-threaded environment. \item\file{forder.c}, \file{fsort.c}, and \file{reorder.c} - \code{\link{forder}()} and related - Parallelism is used here in multiple operations that come together to sort a \code{data.table} using the Radix algorithm. OpenMP is used here to parallelize: + Parallelism is used here to reduce the time taken in multiple operations that come together to sort a \code{data.table} using the Radix algorithm. OpenMP is used here to parallelize: \itemize{ \item The counting of unique values and recursively sorting subsets of data across different threads (specific to \file{forder.c}) - \item The process of finding the range and distribution of data for efficient grouping and sorting (applies for both \file{forder.c} and \file{fsort.c}) + \item The process of finding the range and distribution of data for efficient grouping and sorting (applies to both \file{forder.c} and \file{fsort.c}) \item Creation of histograms which are used to sort data based on significant bits (each thread processes a separate batch of the data, computes the MSB of each element, and then increments the corresponding bins), with the distribution and merging of buckets (specific to \file{fsort.c}) \item The process of reordering a vector or each column in a list of vectors (such as in a \code{data.table}) based on a given vector that dictates the new ordering of elements (specific to \file{reorder.c}) } \item\file{froll.c}, \file{frolladaptive.c}, and \file{frollR.c} - \code{\link{froll}()} and family + + Parallelism is used here to speed up the computation of rolling statistics. OpenMP is used here to parallelize the loops that compute the rolling means (\code{frollmean}) and sums (\code{frollsum}) over a sliding window for each position in the input vector. 
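The per-position independence that makes the rolling loops parallelizable can be shown with a naive rolling mean. This sketch recomputes each window from scratch (O(n*k)) purely to keep positions independent; it is not froll.c's actual algorithm, which is more sophisticated (e.g. running sums):

```c
#include <stddef.h>

/* Parallel rolling mean: each output position depends only on its own
   k-element window, so the outer loop parallelizes with no shared
   writes. Positions with an incomplete window receive `fill`. */
void roll_mean(const double *x, size_t n, size_t k, double fill, double *out)
{
  #pragma omp parallel for
  for (size_t i = 0; i < n; i++) {
    if (i + 1 < k) {            /* window not yet complete */
      out[i] = fill;
      continue;
    }
    double s = 0.0;
    for (size_t j = i + 1 - k; j <= i; j++)
      s += x[j];
    out[i] = s / (double)k;
  }
}
```

The trade-off illustrated here is typical of rolling statistics: an online running-sum update is cheaper serially but creates a dependency between consecutive positions, whereas the windowed recomputation is embarrassingly parallel.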
+ \item\file{fwrite.c} - \code{\link{fwrite}()} - OpenMP is primarily used here to parallelize the process of writing rows to the output file. Error handling and compression (if enabled) are also managed within this parallel region, with special attention to thread safety and synchronization, especially in the ordered sections where output to the file and handling of errors is serialized to maintain the correct sequence of rows. + Parallelism is used here to expedite the process of writing rows to the output file. OpenMP is used primarily to achieve the same here, but error handling and compression (if enabled) are also managed within the parallel region. Special attention is paid to thread safety and synchronization, especially in the ordered sections where output to the file and handling of errors is serialized to maintain the correct sequence of rows. \item\file{gsumm.c} - GForce in various places, see \link{GForce} @@ -103,12 +106,26 @@ \item\file{subset.c} - Used in \code{\link[=data.table]{[.data.table}} subsetting - Parallelism is used her to expedite the filtering of data. OpenMP is utilized here to parallelize the process of subsetting vectors that have sufficient elements to warrant multi-threaded processing. + Parallelism is used here to expedite the process of subsetting vectors. OpenMP is used here to parallelize the loops that achieve the same, with conditional checks and filtering of data. \item\file{types.c} - Internal testing usage Parallelism is being used here for enhancing the performance of internal tests (not impacting any user-facing operations or functions). OpenMP is being used here to test a message printing function inside a nested loop which has been collapsed into a single loop of the combined iteration space using \code{collapse(2)}, along with specification of dynamic scheduling for distributing the iterations in a way that can balance the workload among the threads. 
} + +In general, or as applicable to all the aforementioned use cases, better speedup can be expected when dealing with large datasets. + +Having such data when using \code{\link{fread}()} or \code{\link{fwrite}()} (ones with significant speedups for larger file sizes) also means that while one part of the data is being read from or written to disk (I/O operations), another part can be simultaneously processed using multiple cores (parallel computations). This overlap reduces the total time taken for the read or write operation (as the system can perform computations during otherwise idle I/O time). + +Apart from increasing the size of the input data, function-specific parameters when considered can benefit more from parallelization or lead to an increase in speedup. For instance, these can be: + + \itemize{ + \item Having a large number of groups when using \code{\link{forder}()} or a multitude of combinations when using \code{\link{CJ}()} + \item Having several missing values in your data when using \code{\link{fcoalesce}()} or \code{\link{nafill}()} + \item Using larger window sizes and/or time series data when using \code{\link{froll}()} + \item Having more and/or complex conditional logic when using \code{\link{fifelse}()} or \code{\link{subset}()} + } + } \examples{ getDTthreads(verbose=TRUE) From 9c1d14d5edaf3773f441e41e376e85859c2b0c17 Mon Sep 17 00:00:00 2001 From: Anirban Date: Mon, 25 Mar 2024 10:52:13 -0700 Subject: [PATCH 08/23] Added a note specifying the time (month and year) this got updated, since this is prone to being outdated --- man/openmp-utils.Rd | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/man/openmp-utils.Rd b/man/openmp-utils.Rd index 1a43c85ca..fd4e093a9 100644 --- a/man/openmp-utils.Rd +++ b/man/openmp-utils.Rd @@ -126,8 +126,10 @@ Apart from increasing the size of the input data, function-specific parameters w \item Having more and/or complex conditional logic when using \code{\link{fifelse}()} or 
\code{\link{subset}()} } +Note: The information above is based on implementation-specific details as of March 2024. + } + \examples{ getDTthreads(verbose=TRUE) -} -\keyword{ data } +} \ No newline at end of file From ba96c0757c42f1ee2d25228dc97f45359bc73718 Mon Sep 17 00:00:00 2001 From: Anirban Date: Mon, 25 Mar 2024 11:12:15 -0700 Subject: [PATCH 09/23] Removed 'Parallelism is used here to ...' statements and refactored my text accordingly --- man/openmp-utils.Rd | 27 ++++++++++++++------------- 1 file changed, 14 insertions(+), 13 deletions(-) diff --git a/man/openmp-utils.Rd b/man/openmp-utils.Rd index fd4e093a9..f64ed65a2 100644 --- a/man/openmp-utils.Rd +++ b/man/openmp-utils.Rd @@ -39,24 +39,25 @@ \itemize{ \item\file{between.c} - \code{\link{between}()} - Parallelism is used here to speed up range checks that are performed over the elements of the supplied vector. OpenMP is used here to parallelize: + OpenMP is used here to parallelize: \itemize{ \item The loops that check if each element of the vector provided is between the specified \code{lower} and \code{upper} bounds, for integer (\code{INTSXP}) and real (\code{REALSXP}) types - \item Checking and handling of undefined values (such as NaNs) + \item The checking and handling of undefined values (such as NaNs) } \item\file{cj.c} - \code{\link{CJ}()} - Parallelism is used here to expedite the creation of all combinations of the input vectors over the cross-product space. 
OpenMP is used here to parallelize: + OpenMP is used here to parallelize: \itemize{ - \item Element assignment in vectors - \item Memory copying operations (blockwise replication of data using \code{memcpy}) + \item The element assignment in vectors + \item The memory copying operations (blockwise replication of data using \code{memcpy}) + \item The creation of all combinations of the input vectors over the cross-product space } \item\file{coalesce.c} - \code{\link{fcoalesce}()} - Parallelism is used here to reduce the time taken to replace NA values with the first non-NA value from other vectors. OpenMP is used here to parallelize: + OpenMP is used here to parallelize: \itemize{ \item The operation that iterates over the rows to coalesce the data (which can be of type integer, real, or complex) \item The replacement of NAs with non-NA values from subsequent vectors @@ -69,7 +70,7 @@ \item\file{fread.c} - \code{\link{fread}()} - Parallelism is used here to speed up the reading and processing of data in chunks (blocks of lines/rows). OpenMP is used here to: + OpenMP is used here to: \itemize{ \item Avoid race conditions or concurrent writes to the output \code{data.table} by having atomic operations on the string data @@ -79,7 +80,7 @@ \item\file{forder.c}, \file{fsort.c}, and \file{reorder.c} - \code{\link{forder}()} and related - Parallelism is used here to reduce the time taken in multiple operations that come together to sort a \code{data.table} using the Radix algorithm. OpenMP is used here to parallelize: + OpenMP is used here to parallelize multiple operations that come together to sort a \code{data.table} using the Radix algorithm. 
These include: \itemize{ \item The counting of unique values and recursively sorting subsets of data across different threads (specific to \file{forder.c}) @@ -90,11 +91,11 @@ \item\file{froll.c}, \file{frolladaptive.c}, and \file{frollR.c} - \code{\link{froll}()} and family - Parallelism is used here to speed up the computation of rolling statistics. OpenMP is used here to parallelize the loops that compute the rolling means (\code{frollmean}) and sums (\code{frollsum}) over a sliding window for each position in the input vector. + OpenMP is used here to parallelize the loops that compute the rolling means (\code{frollmean}) and sums (\code{frollsum}) over a sliding window for each position in the input vector. \item\file{fwrite.c} - \code{\link{fwrite}()} - Parallelism is used here to expedite the process of writing rows to the output file. OpenMP is used primarily to achieve the same here, but error handling and compression (if enabled) are also managed within the parallel region. Special attention is paid to thread safety and synchronization, especially in the ordered sections where output to the file and handling of errors is serialized to maintain the correct sequence of rows. + OpenMP is used here primarily to parallelize the process of writing rows to the output file, but error handling and compression (if enabled) are also managed within the parallel region. Special attention is paid to thread safety and synchronization, especially in the ordered sections where output to the file and handling of errors is serialized to maintain the correct sequence of rows. \item\file{gsumm.c} - GForce in various places, see \link{GForce} @@ -102,15 +103,15 @@ \item\file{nafill.c} - \code{\link{nafill}()} - Parallelism is used here for faster filling of missing values. OpenMP is being used here to parallelize the loop that achieves the same, over columns of the input data. 
This includes handling different data types (double, integer, and integer64) and applying the designated filling method (constant, last observation carried forward, or next observation carried backward) to each column in parallel. + OpenMP is being used here to parallelize the loop that fills missing values over columns of the input data. This includes handling different data types (double, integer, and integer64) and applying the designated filling method (constant, last observation carried forward, or next observation carried backward) to each column in parallel. \item\file{subset.c} - Used in \code{\link[=data.table]{[.data.table}} subsetting - Parallelism is used here to expedite the process of subsetting vectors. OpenMP is used here to parallelize the loops that achieve the same, with conditional checks and filtering of data. + OpenMP is used here to parallelize the loops that perform the subsetting of vectors, with conditional checks and filtering of data. \item\file{types.c} - Internal testing usage - Parallelism is being used here for enhancing the performance of internal tests (not impacting any user-facing operations or functions). OpenMP is being used here to test a message printing function inside a nested loop which has been collapsed into a single loop of the combined iteration space using \code{collapse(2)}, along with specification of dynamic scheduling for distributing the iterations in a way that can balance the workload among the threads. + This caters to internal tests (not impacting any user-facing operations or functions), and OpenMP is being used here to test a message printing function inside a nested loop which has been collapsed into a single loop of the combined iteration space using \code{collapse(2)}, along with specification of dynamic scheduling for distributing the iterations in a way that can balance the workload among the threads. 
} In general, or as applicable to all the aforementioned use cases, better speedup can be expected when dealing with large datasets. From 2005487ee429d46f30d93d2ee3df5d36fa12a76f Mon Sep 17 00:00:00 2001 From: Anirban Date: Thu, 28 Mar 2024 19:21:03 -0700 Subject: [PATCH 10/23] Wrote regarding better speedups for all of those functions in terms of having more rows vs having more columns after some testing (benchmarking) and research --- man/openmp-utils.Rd | 28 +++++++++++++++++++++++++--- 1 file changed, 25 insertions(+), 3 deletions(-) diff --git a/man/openmp-utils.Rd b/man/openmp-utils.Rd index f64ed65a2..a3410c43c 100644 --- a/man/openmp-utils.Rd +++ b/man/openmp-utils.Rd @@ -45,6 +45,8 @@ \item The checking and handling of undefined values (such as NaNs) } + Since this function is used to find rows where a column's value falls within a specific range, it benefits more from parallelization when the input data consists of a large number of rows. + \item\file{cj.c} - \code{\link{CJ}()} OpenMP is used here to parallelize: @@ -55,6 +57,8 @@ \item The creation of all combinations of the input vectors over the cross-product space } + Given that the number of combinations increases exponentially as more columns are added, better speedup can be expected when dealing with a large number of columns. + \item\file{coalesce.c} - \code{\link{fcoalesce}()} OpenMP is used here to parallelize: @@ -64,10 +68,14 @@ \item The conditional checks within parallelized loops } + Significant speedup can be expected for a larger number of columns here, given that this function operates efficiently across multiple columns to find non-NA values.
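The row-wise first-non-NA scan that \file{coalesce.c} parallelizes can be sketched for the integer case as follows. This is a minimal illustration under assumed names and a sentinel NA encoding, not data.table's actual code:

```c
#include <limits.h>
#include <stddef.h>

/* NA modelled as INT_MIN, mirroring R's NA_integer_ (an assumption for this sketch) */
#define NA_INT INT_MIN

/* For each row, scan the candidate columns left to right and keep the
   first non-NA value. Rows are independent, so the row loop is
   parallelized with no shared writes between threads. */
void coalesce_int(int **cols, size_t ncol, size_t nrow, int *out)
{
  #pragma omp parallel for
  for (size_t i = 0; i < nrow; i++) {
    int v = NA_INT;
    for (size_t j = 0; j < ncol && v == NA_INT; j++)
      v = cols[j][i]; /* stop as soon as a non-NA is found */
    out[i] = v;
  }
}
```

A row whose value is NA in every column stays NA in the output, matching coalesce semantics.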
+ \item\file{fifelse.c} - \code{\link{fifelse}()} For logical, integer, and real types, OpenMP is being used here to parallelize loops that perform conditional checks along with assignment operations over the elements of the supplied logical vector based on the condition (\code{test}) and values provided for the remaining arguments (\code{yes}, \code{no}, and \code{na}). + Better speedup can be expected with a larger number of columns here as well, given that this function operates column-wise with independent vector operations. + \item\file{fread.c} - \code{\link{fread}()} OpenMP is used here to: @@ -78,6 +86,8 @@ } There are no explicit pragmas for parallelizing loops, and instead the use of OpenMP here is mainly in controlling access to shared resources (with the use of critical sections, for instance) in a multi-threaded environment. + This function is highly optimized in reading and processing data with both large numbers of rows and columns, but the efficiency is more pronounced across rows. + \item\file{forder.c}, \file{fsort.c}, and \file{reorder.c} - \code{\link{forder}()} and related OpenMP is used here to parallelize multiple operations that come together to sort a \code{data.table} using the Radix algorithm. These include: @@ -88,26 +98,38 @@ \item The counting of unique values and recursively sorting subsets of data across different threads (specific to \file{forder.c}) \item The process of finding the range and distribution of data for efficient grouping and sorting (applies to both \file{forder.c} and \file{fsort.c}) \item Creation of histograms which are used to sort data based on significant bits (each thread processes a separate batch of the data, computes the MSB of each element, and then increments the corresponding bins), with the distribution and merging of buckets (specific to \file{fsort.c}) \item The process of reordering a vector or each column in a list of vectors (such as in a \code{data.table}) based on a given vector that dictates the new ordering of elements (specific to \file{reorder.c}) } + + Better speedups can be expected when the input data contains a large number of rows as the sorting complexity increases with more rows. 
\item\file{froll.c}, \file{frolladaptive.c}, and \file{frollR.c} - \code{\link{froll}()} and family OpenMP is used here to parallelize the loops that compute the rolling means (\code{frollmean}) and sums (\code{frollsum}) over a sliding window for each position in the input vector. + + These functions benefit more in terms of speedup when the data has a large number of columns, primarily due to the efficient memory access patterns (cache-friendly) used when processing the data for each column sequentially in memory to compute the rolling statistic. \item\file{fwrite.c} - \code{\link{fwrite}()} OpenMP is used here primarily to parallelize the process of writing rows to the output file, but error handling and compression (if enabled) are also managed within the parallel region. Special attention is paid to thread safety and synchronization, especially in the ordered sections where output to the file and handling of errors is serialized to maintain the correct sequence of rows. + Similar to \code{\link{fread}()}, this function is highly efficient at processing data with large numbers of both rows and columns in parallel, but it has more notable speedups with an increased number of rows. + \item\file{gsumm.c} - GForce in various places, see \link{GForce} - Functions with GForce optimization are internally parallelized to speed up grouped summaries over a large \code{data.table}. OpenMP is used here to parallelize operations involved in calculating group-wise statistics like sum, mean, and median (implying faster computation of \code{sd}, \code{var}, and \code{prod} as well). The input data is split into batches (groups), and each thread processes a subset of the data based on them. + Functions with GForce optimization are internally parallelized to speed up grouped summaries over a large \code{data.table}. 
OpenMP is used here to parallelize operations involved in calculating group-wise statistics like sum, mean, and median (implying faster computation of \code{sd}, \code{var}, and \code{prod} as well). + + These optimized grouping operations benefit more in terms of speedup if the input data contains a large number of rows (since they are often used to aggregate data across groups). \item\file{nafill.c} - \code{\link{nafill}()} - OpenMP is being used here to parallelize the loop that fills missing values over columns of the input data. This includes handling different data types (double, integer, and integer64) and applying the designated filling method (constant, last observation carried forward, or next observation carried backward) to each column in parallel. + OpenMP is being used here to parallelize the loop that fills missing values over columns of the input data. This includes handling different data types (double, integer, and integer64) and applying the designated filling method (constant, last observation carried forward, or next observation carried backward) to each column in parallel. + + Given its optimization for column-wise operations, better speedups can be expected when the input data consists of a large number of columns. \item\file{subset.c} - Used in \code{\link[=data.table]{[.data.table}} subsetting - OpenMP is used here to parallelize the loops that perform the subsetting of vectors, with conditional checks and filtering of data. + OpenMP is used here to parallelize the loops that perform the subsetting of vectors, with conditional checks and filtering of data. + + Since subset operations tend to be row-dependent, better speedups can be expected when dealing with a large number of rows. However, it also depends on whether the computations are focused on rows or columns (as dictated by the subsetting criteria). 
\item\file{types.c} - Internal testing usage From 653fc7a2c0c843c87133e620586b2ea38371aa19 Mon Sep 17 00:00:00 2001 From: nitish jha Date: Thu, 30 May 2024 04:29:51 +0530 Subject: [PATCH 11/23] added freadR.c in openmp doc --- man/openmp-utils.Rd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/man/openmp-utils.Rd b/man/openmp-utils.Rd index df942009c..ca81622ed 100644 --- a/man/openmp-utils.Rd +++ b/man/openmp-utils.Rd @@ -41,7 +41,7 @@ \item\file{cj.c} - \code{\link{CJ}()} \item\file{coalesce.c} - \code{\link{fcoalesce}()} \item\file{fifelse.c} - \code{\link{fifelse}()} - \item\file{fread.c} - \code{\link{fread}()} + \item\file{fread.c}, \file{freadR.c} - \code{\link{fread}()} \item\file{forder.c}, \file{fsort.c}, and \file{reorder.c} - \code{\link{forder}()} and related \item\file{froll.c}, \file{frolladaptive.c}, and \file{frollR.c} - \code{\link{froll}()} and family \item\file{fwrite.c} - \code{\link{fwrite}()} From 7a6920e1ee21c1ad013e73a9a7e0076a8681b464 Mon Sep 17 00:00:00 2001 From: Anirban Date: Tue, 18 Jun 2024 14:02:14 -0700 Subject: [PATCH 12/23] Updated the fread part and rm whitespaces added from newlines --- man/openmp-utils.Rd | 66 ++++++++++++++++++++++----------------------- 1 file changed, 33 insertions(+), 33 deletions(-) diff --git a/man/openmp-utils.Rd b/man/openmp-utils.Rd index a3410c43c..07007e41d 100644 --- a/man/openmp-utils.Rd +++ b/man/openmp-utils.Rd @@ -38,17 +38,17 @@ \itemize{ \item\file{between.c} - \code{\link{between}()} - + OpenMP is used here to parallelize: \itemize{ \item The loops that check if each element of the vector provided is between the specified \code{lower} and \code{upper} bounds, for integer (\code{INTSXP}) and real (\code{REALSXP}) types \item The checking and handling of undefined values (such as NaNs) } - + Since this function is used to find rows where a column's value falls within a specific range, it benefits more from parallelization when the input data consists of a large number of rows. 
- + \item\file{cj.c} - \code{\link{CJ}()} - + OpenMP is used here to parallelize: \itemize{ @@ -56,42 +56,42 @@ \item The memory copying operations (blockwise replication of data using \code{memcpy}) \item The creation of all combinations of the input vectors over the cross-product space } - + Given that the number of combinations increases exponentially as more columns are added, better speedup can be expected when dealing with a large number of columns. - + \item\file{coalesce.c} - \code{\link{fcoalesce}()} - + OpenMP is used here to parallelize: \itemize{ \item The operation that iterates over the rows to coalesce the data (which can be of type integer, real, or complex) \item The replacement of NAs with non-NA values from subsequent vectors \item The conditional checks within parallelized loops } - + Significant speedup can be expected with a larger number of columns here, given that this function operates efficiently across multiple columns to find non-NA values. - + \item\file{fifelse.c} - \code{\link{fifelse}()} - + For logical, integer, and real types, OpenMP is being used here to parallelize loops that perform conditional checks along with assignment operations over the elements of the supplied logical vector based on the condition (\code{test}) and values provided for the remaining arguments (\code{yes}, \code{no}, and \code{na}). - + Better speedup can be expected with a larger number of columns here as well, given that this function operates column-wise with independent vector operations. 
- + \item\file{fread.c} - \code{\link{fread}()} - + OpenMP is used here to: \itemize{ + \item Parallelize the reading of data in chunks \item Avoid race conditions or concurrent writes to the output \code{data.table} by having atomic operations on the string data \item Manage synchronized updates to the progress bar and serialize the output to the console } - There are no explicit pragmas for parallelizing loops, and instead the use of OpenMP here is mainly in controlling access to shared resources (with the use of critical sections, for instance) in a multi-threaded environment. - + This function is highly optimized in reading and processing data with both large numbers of rows and columns, but the efficiency is more pronounced across rows. - + \item\file{forder.c}, \file{fsort.c}, and \file{reorder.c} - \code{\link{forder}()} and related - + OpenMP is used here to parallelize multiple operations that come together to sort a \code{data.table} using the Radix algorithm. These include: - + \itemize{ \item The counting of unique values and recursively sorting subsets of data across different threads (specific to \file{forder.c}) \item The process of finding the range and distribution of data for efficient grouping and sorting (applies to both \file{forder.c} and \file{fsort.c}) @@ -100,11 +100,11 @@ } Better speedups can be expected when the input data contains a large number of rows as the sorting complexity increases with more rows. - + \item\file{froll.c}, \file{frolladaptive.c}, and \file{frollR.c} - \code{\link{froll}()} and family - + OpenMP is used here to parallelize the loops that compute the rolling means (\code{frollmean}) and sums (\code{frollsum}) over a sliding window for each position in the input vector. 
- + These functions benefit more in terms of speedup when the data has a large number of columns, primarily due to the efficient memory access patterns (cache-friendly) used when processing the data for each column sequentially in memory to compute the rolling statistic. \item\file{fwrite.c} - \code{\link{fwrite}()} @@ -112,30 +112,30 @@ OpenMP is used here primarily to parallelize the process of writing rows to the output file, but error handling and compression (if enabled) are also managed within the parallel region. Special attention is paid to thread safety and synchronization, especially in the ordered sections where output to the file and handling of errors is serialized to maintain the correct sequence of rows. Similar to \code{\link{fread}()}, this function is highly efficient at processing data with large numbers of both rows and columns in parallel, but it has more notable speedups with an increased number of rows. - + \item\file{gsumm.c} - GForce in various places, see \link{GForce} - + Functions with GForce optimization are internally parallelized to speed up grouped summaries over a large \code{data.table}. OpenMP is used here to parallelize operations involved in calculating group-wise statistics like sum, mean, and median (implying faster computation of \code{sd}, \code{var}, and \code{prod} as well). - + These optimized grouping operations benefit more in terms of speedup if the input data contains a large number of rows (since they are often used to aggregate data across groups). - + \item\file{nafill.c} - \code{\link{nafill}()} - + OpenMP is being used here to parallelize the loop that fills missing values over columns of the input data. This includes handling different data types (double, integer, and integer64) and applying the designated filling method (constant, last observation carried forward, or next observation carried backward) to each column in parallel. 
- + Given its optimization for column-wise operations, better speedups can be expected when the input data consists of a large number of columns. - + \item\file{subset.c} - Used in \code{\link[=data.table]{[.data.table}} subsetting - + OpenMP is used here to parallelize the loops that perform the subsetting of vectors, with conditional checks and filtering of data. - + Since subset operations tend to be row-dependent, better speedups can be expected when dealing with a large number of rows. However, it also depends on whether the computations are focused on rows or columns (as dictated by the subsetting criteria). - + \item\file{types.c} - Internal testing usage - + This caters to internal tests (not impacting any user-facing operations or functions), and OpenMP is being used here to test a message printing function inside a nested loop which has been collapsed into a single loop of the combined iteration space using \code{collapse(2)}, along with specification of dynamic scheduling for distributing the iterations in a way that can balance the workload among the threads. } - + In general, or as applicable to all the aforementioned use cases, better speedup can be expected when dealing with large datasets. Having such data when using \code{\link{fread}()} or \code{\link{fwrite}()} (ones with significant speedups for larger file sizes) also means that while one part of the data is being read from or written to disk (I/O operations), another part can be simultaneously processed using multiple cores (parallel computations). This overlap reduces the total time taken for the read or write operation (as the system can perform computations during otherwise idle I/O time). 
From 4e121d1d6de45c4d0b46a000df80623de9d8b361 Mon Sep 17 00:00:00 2001 From: Anirban Date: Tue, 18 Jun 2024 14:27:39 -0700 Subject: [PATCH 13/23] Updated the GForce part --- man/openmp-utils.Rd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/man/openmp-utils.Rd b/man/openmp-utils.Rd index 07007e41d..100b24e54 100644 --- a/man/openmp-utils.Rd +++ b/man/openmp-utils.Rd @@ -117,7 +117,7 @@ Functions with GForce optimization are internally parallelized to speed up grouped summaries over a large \code{data.table}. OpenMP is used here to parallelize operations involved in calculating group-wise statistics like sum, mean, and median (implying faster computation of \code{sd}, \code{var}, and \code{prod} as well). - These optimized grouping operations benefit more in terms of speedup if the input data contains a large number of rows (since they are often used to aggregate data across groups). + These optimized grouping operations benefit more in terms of speedup if the input data contains a large number of groups since they leverage parallelization more efficiently by eliminating the overhead of individual group evaluations. \item\file{nafill.c} - \code{\link{nafill}()} From 42718f4794a0b5a541ca2399f4fe35f764c87623 Mon Sep 17 00:00:00 2001 From: Anirban Date: Tue, 18 Jun 2024 14:43:02 -0700 Subject: [PATCH 14/23] Added a statement on throttling as per Ben's suggestion --- man/openmp-utils.Rd | 2 ++ 1 file changed, 2 insertions(+) diff --git a/man/openmp-utils.Rd b/man/openmp-utils.Rd index 100b24e54..94773288c 100644 --- a/man/openmp-utils.Rd +++ b/man/openmp-utils.Rd @@ -149,6 +149,8 @@ Apart from increasing the size of the input data, function-specific parameters w \item Having more and/or complex conditional logic when using \code{\link{fifelse}()} or \code{\link{subset}()} } +\code{data.table} also incorporates throttling mechanisms to prevent the excessive creation of threads (can lead to significant overhead) for small data tasks. 
+ Note: The information above is based on implementation-specific details as of March 2024. } From a56c7a9b5014dc4f31d99ffa7c210ffef66b7aaf Mon Sep 17 00:00:00 2001 From: Anirban Date: Wed, 19 Jun 2024 04:29:37 -0700 Subject: [PATCH 15/23] rm throttle statement as the argument to setDTthreads and getDTthreads already does a better job at explaining than what I wrote --- man/openmp-utils.Rd | 2 -- 1 file changed, 2 deletions(-) diff --git a/man/openmp-utils.Rd b/man/openmp-utils.Rd index 94773288c..100b24e54 100644 --- a/man/openmp-utils.Rd +++ b/man/openmp-utils.Rd @@ -149,8 +149,6 @@ Apart from increasing the size of the input data, function-specific parameters w \item Having more and/or complex conditional logic when using \code{\link{fifelse}()} or \code{\link{subset}()} } -\code{data.table} also incorporates throttling mechanisms to prevent the excessive creation of threads (can lead to significant overhead) for small data tasks. - Note: The information above is based on implementation-specific details as of March 2024. } From eda8d959bf24eb94f67c53f7ca86377a1b163a81 Mon Sep 17 00:00:00 2001 From: Michael Chirico Date: Mon, 2 Sep 2024 17:45:37 +0000 Subject: [PATCH 16/23] fwrite respects dec=',' for time columns when needed --- NEWS.md | 2 ++ inst/tests/other.Rraw | 5 +++++ inst/tests/tests.Rraw | 4 ++++ src/fwrite.c | 9 ++++++--- 4 files changed, 17 insertions(+), 3 deletions(-) diff --git a/NEWS.md b/NEWS.md index a21546903..48f384ba9 100644 --- a/NEWS.md +++ b/NEWS.md @@ -10,6 +10,8 @@ 1. Using `print.data.table()` with character truncation using `datatable.prettyprint.char` no longer errors with `NA` entries, [#6441](https://github.com/Rdatatable/data.table/issues/6441). Thanks to @r2evans for the bug report, and @joshhwuu for the fix. +2. `fwrite()` respects `dec=','` for timestamp columns (`POSIXct` or `nanotime`) with sub-second accuracy, [#6446](https://github.com/Rdatatable/data.table/issues/6446). 
Thanks @kav2k for pointing out the inconsistency and @MichaelChirico for the PR. + ## NOTES 1. Tests run again when some Suggests packages are missing, [#6411](https://github.com/Rdatatable/data.table/issues/6411). Thanks @aadler for the note and @MichaelChirico for the fix. diff --git a/inst/tests/other.Rraw b/inst/tests/other.Rraw index eb3e461f7..044d82cfa 100644 --- a/inst/tests/other.Rraw +++ b/inst/tests/other.Rraw @@ -761,3 +761,8 @@ if (loaded[["dplyr"]]) { DT = data.table(a = 1, b = 2, c = '1,2,3,4', d = 4) test(30, DT[, c := strsplit(c, ',', fixed = TRUE) %>% lapply(as.integer) %>% as.list]$c, list(1:4)) # nolint: pipe_call_linter. Mimicking MRE as filed. } + +if (loaded[["nanotime"]]) { + # respect dec=',' for nanotime, related to #6446, corresponding to tests 2281.* + test(31, fwrite(data.table(as.nanotime(.POSIXct(0))), dec=',', sep=';'), output="1970-01-01T00:00:00,000000000Z") +} diff --git a/inst/tests/tests.Rraw b/inst/tests/tests.Rraw index bc06b4517..e0566fc9b 100644 --- a/inst/tests/tests.Rraw +++ b/inst/tests/tests.Rraw @@ -19059,3 +19059,7 @@ test(2280.1, internal_error("broken"), error="Internal error.*broken") test(2280.2, internal_error("broken %d%s", 1, "2"), error="Internal error.*broken 12") foo = function(...) 
internal_error("broken") test(2280.3, foo(), error="Internal error in foo: broken") + +# fwrite respects dec=',' for sub-second timestamps, #6446 +test(2281.1, fwrite(data.table(a=.POSIXct(0.001)), dec=',', sep=';'), output="1970-01-01T00:00:00,001Z") +test(2281.2, fwrite(data.table(a=.POSIXct(0.0001)), dec=',', sep=';'), output="1970-01-01T00:00:00,000100Z") diff --git a/src/fwrite.c b/src/fwrite.c index 94f2f2323..0720d5b83 100644 --- a/src/fwrite.c +++ b/src/fwrite.c @@ -438,19 +438,21 @@ void writePOSIXct(const void *col, int64_t row, char **pch) ch -= squashDateTime; write_time(t, &ch); if (squashDateTime || (m && m%1000==0)) { + Rprintf("here i\n"); // when squashDateTime always write 3 digits of milliseconds even if 000, for consistent scale of squash integer64 // don't use writeInteger() because it doesn't 0 pad which we need here // integer64 is big enough for squash with milli but not micro; trunc (not round) micro when squash m /= 1000; - *ch++ = '.'; + *ch++ = dec; ch -= squashDateTime; *(ch+2) = '0'+m%10; m/=10; *(ch+1) = '0'+m%10; m/=10; *ch = '0'+m; ch += 3; } else if (m) { + Rprintf("here ii\n"); // microseconds are present and !squashDateTime - *ch++ = '.'; + *ch++ = dec; *(ch+5) = '0'+m%10; m/=10; *(ch+4) = '0'+m%10; m/=10; *(ch+3) = '0'+m%10; m/=10; @@ -472,6 +474,7 @@ void writeNanotime(const void *col, int64_t row, char **pch) if (x == INT64_MIN) { write_chars(na, &ch); } else { + Rprintf("here iii\n"); int d/*days*/, s/*secs*/, n/*nanos*/; n = x % 1000000000; x /= 1000000000; @@ -488,7 +491,7 @@ void writeNanotime(const void *col, int64_t row, char **pch) *ch++ = 'T'; ch -= squashDateTime; write_time(s, &ch); - *ch++ = '.'; + *ch++ = dec; ch -= squashDateTime; for (int i=8; i>=0; i--) { *(ch+i) = '0'+n%10; n/=10; } // always 9 digits for nanoseconds ch += 9; From 1c3fd8b78496419e0a54edc9cb62cdf09f6b843d Mon Sep 17 00:00:00 2001 From: Michael Chirico Date: Mon, 2 Sep 2024 17:46:51 +0000 Subject: [PATCH 17/23] rm debug markers --- src/fwrite.c | 
3 --- 1 file changed, 3 deletions(-) diff --git a/src/fwrite.c b/src/fwrite.c index 0720d5b83..bd2898388 100644 --- a/src/fwrite.c +++ b/src/fwrite.c @@ -438,7 +438,6 @@ void writePOSIXct(const void *col, int64_t row, char **pch) ch -= squashDateTime; write_time(t, &ch); if (squashDateTime || (m && m%1000==0)) { - Rprintf("here i\n"); // when squashDateTime always write 3 digits of milliseconds even if 000, for consistent scale of squash integer64 // don't use writeInteger() because it doesn't 0 pad which we need here // integer64 is big enough for squash with milli but not micro; trunc (not round) micro when squash @@ -450,7 +449,6 @@ void writePOSIXct(const void *col, int64_t row, char **pch) *ch = '0'+m; ch += 3; } else if (m) { - Rprintf("here ii\n"); // microseconds are present and !squashDateTime *ch++ = dec; *(ch+5) = '0'+m%10; m/=10; @@ -474,7 +472,6 @@ void writeNanotime(const void *col, int64_t row, char **pch) if (x == INT64_MIN) { write_chars(na, &ch); } else { - Rprintf("here iii\n"); int d/*days*/, s/*secs*/, n/*nanos*/; n = x % 1000000000; x /= 1000000000; From dc947380075c6c3483a957f170fd7f4aef03fa7e Mon Sep 17 00:00:00 2001 From: Ani Date: Mon, 2 Sep 2024 13:44:57 -0700 Subject: [PATCH 18/23] Get pinged for comments on this file and such, as and when changes are needed --- CODEOWNERS | 3 +++ 1 file changed, 3 insertions(+) diff --git a/CODEOWNERS b/CODEOWNERS index d90dc24d6..8855db941 100644 --- a/CODEOWNERS +++ b/CODEOWNERS @@ -60,3 +60,6 @@ # performance testing /.ci/atime/tests.R @tdhock @Anirban166 /.github/workflows/performance-tests.yaml @Anirban166 + +# openmp docs +/man/openmp-utils.Rd @Anirban166 From 1b5b1ebde2af9c56735e24c5091302aa4e5b3dea Mon Sep 17 00:00:00 2001 From: Ani Date: Mon, 2 Sep 2024 13:46:27 -0700 Subject: [PATCH 19/23] Update CODEOWNERS --- CODEOWNERS | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/CODEOWNERS b/CODEOWNERS index 8855db941..af741a6a4 100644 --- a/CODEOWNERS +++ b/CODEOWNERS @@ -61,5 +61,5 
@@ /.ci/atime/tests.R @tdhock @Anirban166 /.github/workflows/performance-tests.yaml @Anirban166 -# openmp docs +# docs /man/openmp-utils.Rd @Anirban166 From fdeb4c38cb35eeb49580b67ba08ab7cc70529179 Mon Sep 17 00:00:00 2001 From: Michael Chirico Date: Tue, 3 Sep 2024 10:32:48 -0700 Subject: [PATCH 20/23] Distribute changes to the src/ files as comments --- man/openmp-utils.Rd | 109 ++------------------------------------------ src/between.c | 6 +++ src/cj.c | 6 +++ src/coalesce.c | 6 +++ src/fifelse.c | 6 +++ src/forder.c | 13 ++++++ src/fread.c | 8 ++++ src/froll.c | 9 ++++ src/fsort.c | 4 ++ src/fwrite.c | 8 ++++ src/gsumm.c | 5 ++ src/nafill.c | 5 ++ src/reorder.c | 5 ++ src/subset.c | 3 +- src/types.c | 7 +++ 15 files changed, 95 insertions(+), 105 deletions(-) diff --git a/man/openmp-utils.Rd b/man/openmp-utils.Rd index 100b24e54..aa5315bf2 100644 --- a/man/openmp-utils.Rd +++ b/man/openmp-utils.Rd @@ -38,121 +38,22 @@ \itemize{ \item\file{between.c} - \code{\link{between}()} - - OpenMP is used here to parallelize: - \itemize{ - \item The loops that check if each element of the vector provided is between the specified \code{lower} and \code{upper} bounds, for integer (\code{INTSXP}) and real (\code{REALSXP}) types - \item The checking and handling of undefined values (such as NaNs) - } - - Since this function is used to find rows where a column's value falls within a specific range, it benefits more from parallelization when the input data consists of a large number of rows. - \item\file{cj.c} - \code{\link{CJ}()} - - OpenMP is used here to parallelize: - - \itemize{ - \item The element assignment in vectors - \item The memory copying operations (blockwise replication of data using \code{memcpy}) - \item The creation of all combinations of the input vectors over the cross-product space - } - - Given that the number of combinations increases exponentially as more columns are added, better speedup can be expected when dealing with a large number of columns. 
- \item\file{coalesce.c} - \code{\link{fcoalesce}()} - - OpenMP is used here to parallelize: - \itemize{ - \item The operation that iterates over the rows to coalesce the data (which can be of type integer, real, or complex) - \item The replacement of NAs with non-NA values from subsequent vectors - \item The conditional checks within parallelized loops - } - - Significant speedup can be expected for more number of columns here, given that this function operates efficiently across multiple columns to find non-NA values. - \item\file{fifelse.c} - \code{\link{fifelse}()} - - For logical, integer, and real types, OpenMP is being used here to parallelize loops that perform conditional checks along with assignment operations over the elements of the supplied logical vector based on the condition (\code{test}) and values provided for the remaining arguments (\code{yes}, \code{no}, and \code{na}). - - Better speedup can be expected for more number of columns here as well, given that this function operates column-wise with independent vector operations. - - \item\file{fread.c} - \code{\link{fread}()} - - OpenMP is used here to: - - \itemize{ - \item Parallelize the reading of data in chunks - \item Avoid race conditions or concurrent writes to the output \code{data.table} by having atomic operations on the string data - \item Manage synchronized updates to the progress bar and serialize the output to the console - } - - This function is highly optimized in reading and processing data with both large numbers of rows and columns, but the efficiency is more pronounced across rows. - + \item\file{fread.c} - \code{\link{fread}(). Parallelized across row-based chunks of the file.} \item\file{forder.c}, \file{fsort.c}, and \file{reorder.c} - \code{\link{forder}()} and related - - OpenMP is used here to parallelize multiple operations that come together to sort a \code{data.table} using the Radix algorithm. 
These include: - - \itemize{ - \item The counting of unique values and recursively sorting subsets of data across different threads (specific to \file{forder.c}) - \item The process of finding the range and distribution of data for efficient grouping and sorting (applies to both \file{forder.c} and \file{fsort.c}) - \item Creation of histograms which are used to sort data based on significant bits (each thread processes a separate batch of the data, computes the MSB of each element, and then increments the corresponding bins), with the distribution and merging of buckets (specific to \file{fsort.c}) - \item The process of reordering a vector or each column in a list of vectors (such as in a \code{data.table}) based on a given vector that dictates the new ordering of elements (specific to \file{reorder.c}) - } - - Better speedups can be expected when the input data contains a large number of rows as the sorting complexity increases with more rows. - \item\file{froll.c}, \file{frolladaptive.c}, and \file{frollR.c} - \code{\link{froll}()} and family - - OpenMP is used here to parallelize the loops that compute the rolling means (\code{frollmean}) and sums (\code{frollsum}) over a sliding window for each position in the input vector. - - These functions benefit more in terms of speedup when the data has a large number of columns, primarily due to the efficient memory access patterns (cache-friendly) used when processing the data for each column sequentially in memory to compute the rolling statistic. - - \item\file{fwrite.c} - \code{\link{fwrite}()} - - OpenMP is used here primarily to parallelize the process of writing rows to the output file, but error handling and compression (if enabled) are also managed within the parallel region. Special attention is paid to thread safety and synchronization, especially in the ordered sections where output to the file and handling of errors is serialized to maintain the correct sequence of rows. 
- - Similar to \code{\link{fread}()}, this function is highly efficient in parallely processing data with large numbers of both rows and columns, but it has more notable speedups with an increased number of rows. - - \item\file{gsumm.c} - GForce in various places, see \link{GForce} - - Functions with GForce optimization are internally parallelized to speed up grouped summaries over a large \code{data.table}. OpenMP is used here to parallelize operations involved in calculating group-wise statistics like sum, mean, and median (implying faster computation of \code{sd}, \code{var}, and \code{prod} as well). - - These optimized grouping operations benefit more in terms of speedup if the input data contains a large number of groups since they leverage parallelization more efficiently by eliminating the overhead of individual group evaluations. - + \item\file{fwrite.c} - \code{\link{fwrite}(). Parallelized across rows.} + \item\file{gsumm.c} - GForce in various places, see \link{GForce}. Parallelized across groups. \item\file{nafill.c} - \code{\link{nafill}()} - - OpenMP is being used here to parallelize the loop that fills missing values over columns of the input data. This includes handling different data types (double, integer, and integer64) and applying the designated filling method (constant, last observation carried forward, or next observation carried backward) to each column in parallel. - - Given its optimization for column-wise operations, better speedups can be expected when the input data consists of a large number of columns. - \item\file{subset.c} - Used in \code{\link[=data.table]{[.data.table}} subsetting - - OpenMP is used here to parallelize the loops that perform the subsetting of vectors, with conditional checks and filtering of data. - - Since subset operations tend to be usually row-dependent, better speedups can be expected when dealing with a large number of rows. 
However, it also depends on whether the computations are focused on rows or columns (as dictated by the subsetting criteria). - \item\file{types.c} - Internal testing usage - - This caters to internal tests (not impacting any user-facing operations or functions), and OpenMP is being used here to test a message printing function inside a nested loop which has been collapsed into a single loop of the combined iteration space using \code{collapse(2)}, along with specification of dynamic scheduling for distributing the iterations in a way that can balance the workload among the threads. } -In general, or as applicable to all the aforementioned use cases, better speedup can be expected when dealing with large datasets. - -Having such data when using \code{\link{fread}()} or \code{\link{fwrite}()} (ones with significant speedups for larger file sizes) also means that while one part of the data is being read from or written to disk (I/O operations), another part can be simultaneously processed using multiple cores (parallel computations). This overlap reduces the total time taken for the read or write operation (as the system can perform computations during otherwise idle I/O time). - -Apart from increasing the size of the input data, function-specific parameters when considered can benefit more from parallelization or lead to an increase in speedup. For instance, these can be: - - \itemize{ - \item Having a large number of groups when using \code{\link{forder}()} or a multitude of combinations when using \code{\link{CJ}()} - \item Having several missing values in your data when using \code{\link{fcoalesce}()} or \code{\link{nafill}()} - \item Using larger window sizes and/or time series data when using \code{\link{froll}()} - \item Having more and/or complex conditional logic when using \code{\link{fifelse}()} or \code{\link{subset}()} - } - -Note: The information above is based on implementation-specific details as of March 2024. 
-
+We endeavor to keep this list up to date, but note that the canonical reference here is the source code itself.
 }
 \examples{
 getDTthreads(verbose=TRUE)
-}
\ No newline at end of file
+}
diff --git a/src/between.c b/src/between.c
index 3d03ae5da..25d85b57c 100644
--- a/src/between.c
+++ b/src/between.c
@@ -1,5 +1,11 @@
 #include "data.table.h"
+/*
+  OpenMP is used here to parallelize:
+  - The loops that check if each element of the vector provided is between
+    the specified lower and upper bounds, for INTSXP and REALSXP types
+  - The checking and handling of undefined values (such as NaNs)
+*/
 SEXP between(SEXP x, SEXP lower, SEXP upper, SEXP incbounds, SEXP NAboundsArg, SEXP checkArg) {
   int nprotect = 0;
   R_len_t nx = length(x), nl = length(lower), nu = length(upper);
diff --git a/src/cj.c b/src/cj.c
index 0f9c940e5..e9b7dd541 100644
--- a/src/cj.c
+++ b/src/cj.c
@@ -1,5 +1,11 @@
 #include "data.table.h"
+/*
+  OpenMP is used here to parallelize:
+  - The element assignment in vectors
+  - The memory copying operations (blockwise replication of data using memcpy)
+  - The creation of all combinations of the input vectors over the cross-product space
+*/
 SEXP cj(SEXP base_list) {
   int ncol = LENGTH(base_list);
   SEXP out = PROTECT(allocVector(VECSXP, ncol));
diff --git a/src/coalesce.c b/src/coalesce.c
index 996db837d..c6e18da9e 100644
--- a/src/coalesce.c
+++ b/src/coalesce.c
@@ -1,5 +1,11 @@
 #include "data.table.h"
+/*
+  OpenMP is used here to parallelize:
+  - The operation that iterates over the rows to coalesce the data
+  - The replacement of NAs with non-NA values from subsequent vectors
+  - The conditional checks within parallelized loops
+*/
 SEXP coalesce(SEXP x, SEXP inplaceArg) {
   if (TYPEOF(x)!=VECSXP) internal_error(__func__, "input is list(...) at R level"); // # nocov
   if (!IS_TRUE_OR_FALSE(inplaceArg)) internal_error(__func__, "argument 'inplaceArg' must be TRUE or FALSE"); // # nocov
diff --git a/src/fifelse.c b/src/fifelse.c
index bd28e88ad..5159aca4a 100644
--- a/src/fifelse.c
+++ b/src/fifelse.c
@@ -1,5 +1,11 @@
 #include "data.table.h"
+/*
+  OpenMP is being used here to parallelize loops that perform conditional
+  checks along with assignment operations over the elements of the
+  supplied logical vector based on the condition (test) and values
+  provided for the remaining arguments (yes, no, and na).
+*/
 SEXP fifelseR(SEXP l, SEXP a, SEXP b, SEXP na) {
   if (!isLogical(l)) {
     error(_("Argument 'test' must be logical."));
diff --git a/src/forder.c b/src/forder.c
index 754c565a2..4939ade28 100644
--- a/src/forder.c
+++ b/src/forder.c
@@ -433,6 +433,19 @@ uint64_t dtwiddle(double x) //const void *p, int i)
 
 void radix_r(const int from, const int to, const int radix);
 
+/*
+  OpenMP is used here to parallelize multiple operations that come together to
+  sort a data.table using the Radix algorithm. These include:
+
+  - The counting of unique values and recursively sorting subsets of data
+    across different threads
+  - The process of finding the range and distribution of data for efficient
+    grouping and sorting
+  - Creation of histograms which are used to sort data based on significant
+    bits (each thread processes a separate batch of the data, computes the
+    MSB of each element, and then increments the corresponding bins), with
+    the distribution and merging of buckets
+*/
 SEXP forder(SEXP DT, SEXP by, SEXP retGrpArg, SEXP retStatsArg, SEXP sortGroupsArg, SEXP ascArg, SEXP naArg)
 // sortGroups TRUE from setkey and regular forder, FALSE from by= for efficiency so strings don't have to be sorted and can be left in appearance order
 // when sortGroups is TRUE, ascArg contains +1/-1 for ascending/descending of each by column; when FALSE ascArg is ignored
diff --git a/src/fread.c b/src/fread.c
index 6d943621a..932da5a76 100644
--- a/src/fread.c
+++ b/src/fread.c
@@ -1268,6 +1268,14 @@ static int detect_types( const char **pch, int8_t type[], int ncol, bool *bumped
 //
 // Returns 1 if it finishes successfully, and 0 otherwise.
 //
+// OpenMP is used here to:
+// - Parallelize the reading of data in chunks
+// - Avoid race conditions or concurrent writes to the output data.table by having atomic
+//   operations on the string data
+// - Manage synchronized updates to the progress bar and serialize the output to the console
+// This function is highly optimized in reading and processing data with both large numbers of
+// rows and columns, but the efficiency is more pronounced across rows.
+// //================================================================================================= int freadMain(freadMainArgs _args) { args = _args; // assign to global for use by DTPRINT() in other functions diff --git a/src/froll.c b/src/froll.c index 3ab7bd927..c7e5cfb31 100644 --- a/src/froll.c +++ b/src/froll.c @@ -1,5 +1,14 @@ #include "data.table.h" +/* + OpenMP is used here to parallelize the loops in frollmean and frollsum. + + These functions benefit more in terms of speedup when the data has a large + number of columns, primarily due to the efficient memory access patterns + (cache-friendly) used when processing the data for each column + sequentially in memory to compute the rolling statistic. +*/ + /* fast rolling mean - router * early stopping for window bigger than input * also handles 'align' in single place diff --git a/src/fsort.c b/src/fsort.c index 12b5861bb..e03c4639b 100644 --- a/src/fsort.c +++ b/src/fsort.c @@ -98,6 +98,10 @@ int qsort_cmp(const void *a, const void *b) { return (xy); // largest first in a safe branchless way casting long to int } +/* + OpenMP is used here to find the range and distribution of data for efficient + grouping and sorting. +*/ SEXP fsort(SEXP x, SEXP verboseArg) { double t[10]; t[0] = wallclock(); diff --git a/src/fwrite.c b/src/fwrite.c index 94f2f2323..898a5a150 100644 --- a/src/fwrite.c +++ b/src/fwrite.c @@ -587,6 +587,14 @@ int compressbuff(z_stream *stream, void* dest, size_t *destLen, const void* sour } #endif +/* + OpenMP is used here primarily to parallelize the process of writing rows + to the output file, but error handling and compression (if enabled) are + also managed within the parallel region. Special attention is paid to + thread safety and synchronization, especially in the ordered sections + where output to the file and handling of errors is serialized to maintain + the correct sequence of rows. 
+*/
 void fwriteMain(fwriteMainArgs args) {
   double startTime = wallclock();
diff --git a/src/gsumm.c b/src/gsumm.c
index f88b546eb..7b6c12b59 100644
--- a/src/gsumm.c
+++ b/src/gsumm.c
@@ -37,6 +37,11 @@ static int nbit(int n)
   return nb;
 }
 
+/*
+  Functions with GForce optimization are internally parallelized to speed up
+  grouped summaries over a large data.table. OpenMP is used here to
+  parallelize operations involved in calculating common group-wise statistics.
+*/
 SEXP gforce(SEXP env, SEXP jsub, SEXP o, SEXP f, SEXP l, SEXP irowsArg) {
   double started = wallclock();
   const bool verbose = GetVerbose();
diff --git a/src/nafill.c b/src/nafill.c
index 0017850bd..0be5fec76 100644
--- a/src/nafill.c
+++ b/src/nafill.c
@@ -87,6 +87,11 @@ void nafillInteger64(int64_t *x, uint_fast64_t nx, unsigned int type, int64_t fi
     snprintf(ans->message[0], 500, _("%s: took %.3fs\n"), __func__, omp_get_wtime()-tic);
 }
 
+/*
+  OpenMP is being used here to parallelize the loop that fills missing values
+  over columns of the input data. This includes handling different data types
+  and applying the designated filling method to each column in parallel.
+*/
 SEXP nafillR(SEXP obj, SEXP type, SEXP fill, SEXP nan_is_na_arg, SEXP inplace, SEXP cols) {
   int protecti=0;
   const bool verbose = GetVerbose();
diff --git a/src/reorder.c b/src/reorder.c
index 04201583d..1f9390628 100644
--- a/src/reorder.c
+++ b/src/reorder.c
@@ -1,5 +1,10 @@
 #include "data.table.h"
+/*
+  OpenMP is used here to reorder a vector or each column in a list of vectors
+  (such as in a data.table) based on a given vector that dictates the new
+  ordering of elements.
+*/
 SEXP reorder(SEXP x, SEXP order) { // For internal use only by setkey().
diff --git a/src/subset.c b/src/subset.c
index 101a08f40..2d962e135 100644
--- a/src/subset.c
+++ b/src/subset.c
@@ -272,8 +272,9 @@ static void checkCol(SEXP col, int colNum, int nrow, SEXP x)
 * 2) Originally for subsetting vectors in fcast and now the beginnings of [.data.table ported to C
 * 3) Immediate need is for R 3.1 as lglVec[1] now returns R's global TRUE and we don't want := to change that global [think 1 row data.tables]
 * 4) Could do it other ways but may as well go to C now as we were going to do that anyway
+*
+* OpenMP is used here to parallelize the loops that perform the subsetting of vectors, with conditional checks and filtering of data.
 */
-
 SEXP subsetDT(SEXP x, SEXP rows, SEXP cols) { // API change needs update NEWS.md and man/cdt.Rd
   int nprotect=0;
   if (!isNewList(x)) internal_error(__func__, "Argument '%s' to %s is type '%s' not '%s'", "x", "CsubsetDT", type2char(TYPEOF(rows)), "list"); // # nocov
diff --git a/src/types.c b/src/types.c
index e0fa20855..c43ee8078 100644
--- a/src/types.c
+++ b/src/types.c
@@ -50,6 +50,13 @@ void testRaiseMsg(ans_t *ans, int istatus, bool verbose) {
   }
   ans->int_v[0] = ans->status;
 }
+/*
+  This caters to internal tests (not user-facing), and OpenMP is being used
+  here to test a message printing function inside a nested loop which has
+  been collapsed into a single loop of the combined iteration space using
+  collapse(2), along with specification of dynamic scheduling for distributing
+  the iterations in a way that can balance the workload among the threads.
+*/
 SEXP testMsgR(SEXP status, SEXP x, SEXP k) {
   if (!isInteger(status) || !isInteger(x) || !isInteger(k))
     internal_error(__func__, "status, nx, nk must be integer"); // # nocov

From 48d7fe7a63f22c4b0090d380df5654218d2b26d8 Mon Sep 17 00:00:00 2001
From: Ani
Date: Tue, 3 Sep 2024 11:59:17 -0700
Subject: [PATCH 21/23] rm extra line


From 473e312c53f6afd2760a678995b0839ef3b2e66d Mon Sep 17 00:00:00 2001
From: Ani
Date: Tue, 3 Sep 2024 12:00:38 -0700
Subject: [PATCH 22/23] Update openmp-utils.Rd

---
 man/openmp-utils.Rd | 1 -
 1 file changed, 1 deletion(-)

diff --git a/man/openmp-utils.Rd b/man/openmp-utils.Rd
index aa5315bf2..61661c3a8 100644
--- a/man/openmp-utils.Rd
+++ b/man/openmp-utils.Rd
@@ -53,7 +53,6 @@
 We endeavor to keep this list up to date, but note that the canonical reference here is the source code itself.
 }
-
 \examples{
 getDTthreads(verbose=TRUE)
 }

From f2eb965002321a615dffaa84c56a58570e5690d4 Mon Sep 17 00:00:00 2001
From: Michael Chirico
Date: Tue, 3 Sep 2024 13:20:23 -0700
Subject: [PATCH 23/23] Mention wiki in comments

---
 .ci/atime/tests.R | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/.ci/atime/tests.R b/.ci/atime/tests.R
index f962b6672..b18c7228c 100644
--- a/.ci/atime/tests.R
+++ b/.ci/atime/tests.R
@@ -1,5 +1,7 @@
 # A list of performance tests.
 #
+# See documentation in https://github.com/Rdatatable/data.table/wiki/Performance-testing for best practices.
+#
 # Each entry in this list corresponds to a performance test and contains a sublist with three mandatory arguments:
 # - N: A numeric sequence of data sizes to vary.
 # - setup: An expression evaluated for every data size before measuring time/memory.