I further optimized the spatial processing part; the calculation of weights is now even faster. Here is the Github Commit Link.
Current times (on my laptop, for Ohio) are: 18 s for weights calculation; 95 s (1.6 min) for timeseries extraction; <0.05 s for AMS and PDS identification; 40 s for clustering; and 2.5 s for plotting, for a total of about 2.5 minutes. That comes down to about 2 minutes when processing Trinity, with 1.5 minutes of it spent on the timeseries extraction. So this is disk bound, and we could probably speed it up by removing the disk I/O on the NetCDF files, but that would require a lot of rewriting.
I’ve looked at the performance profiles, and I don’t think we can optimize it much further in Python, or at least it would take serious effort. The code is still slow if we want to process thousands of basins (e.g. there are 2,413 HUC8 basins, and the number becomes much larger if we go even smaller).
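For reference, a minimal sketch (not the project's actual profiling setup) of the two things behind the numbers above: wall-clock timing of a stage, and a cProfile run sorted by cumulative time to see whether a stage spends its time in computation or in disk I/O. The stage function here is a stand-in, not real project code.

#+begin_src python
import cProfile
import pstats
import time

def stage(n: int = 10**6) -> int:
    """Stand-in for one pipeline stage, e.g. timeseries extraction."""
    return sum(i * i for i in range(n))

# Wall-clock timing of a single stage.
start = time.perf_counter()
stage()
print(f"stage took {time.perf_counter() - start:.1f} s")

# Detailed profile of the same stage, sorted by cumulative time.
with cProfile.Profile() as prof:
    stage()
pstats.Stats(prof).sort_stats("cumulative").print_stats(10)
#+end_src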
INTERACTIVE MAP: https://atreyagaurav.github.io/locus/index.html
I’ll run it for at least HUC01-HUC18 and put it here. Possibly HUC4s too.
I changed this part of the code to normalize each row. I removed the normality assumption (the StandardScaler step), since the precipitation data is heavily skewed (far from normal), and instead divided each cell's precipitation value by the total basin-weighted precipitation for that event (day); a standalone sketch of the new pipeline follows the diff below.
#+RESULTS[63095cb5352fd288f7fbad30f64aa361a4521cdf]:
commit 4c141e264d3ce656772411fec74000456f3ceea4
Author: Gaurav Atreya <allmanpride@gmail.com>
Date: Sun Apr 30 22:30:31 2023 -0400
Normalize by total rain volume before clustering
diff --git a/images/05/ams_1dy.png b/images/05/ams_1dy.png
index e98d9f7..a3ebca6 100644
Binary files a/images/05/ams_1dy.png and b/images/05/ams_1dy.png differ
diff --git a/images/05/pds_1dy.png b/images/05/pds_1dy.png
index ba44651..ecf3ed2 100644
Binary files a/images/05/pds_1dy.png and b/images/05/pds_1dy.png differ
diff --git a/src/cluster.py b/src/cluster.py
index 8b5731c..f9267d4 100644
--- a/src/cluster.py
+++ b/src/cluster.py
@@ -7,7 +7,6 @@ from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from kneed import KneeLocator
-from src.livneh import LivnehData
from src.huc import HUC
import src.precip as precip
@@ -21,7 +20,8 @@ def storm_centers(df: pd.DataFrame):
def dimensionality_reduction(df: pd.DataFrame):
pca = PCA(n_components=20)
- return pca.fit_transform(StandardScaler().fit_transform(df.to_numpy()))
+ df_norm = df.apply(lambda row: row / row.sum(), axis=1)
+ return pca.fit_transform(df_norm.to_numpy())
def clustering(m: np.ndarray):
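Here is a minimal standalone sketch of the new pipeline, row normalization followed by PCA and KMeans, run on dummy data. Only the normalization and PCA lines mirror the diff above; the function name, the cluster count, and the dummy rainfall matrix are illustrative.

#+begin_src python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def storm_center_labels(df: pd.DataFrame, n_components: int = 20, n_clusters: int = 4):
    # Normalize each event (row) by its total precipitation so the clustering
    # responds to the spatial pattern of rain rather than its magnitude.
    df_norm = df.apply(lambda row: row / row.sum(), axis=1)
    reduced = PCA(n_components=n_components).fit_transform(df_norm.to_numpy())
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(reduced)

# Dummy data: 100 events x 500 grid cells of non-negative rainfall.
rng = np.random.default_rng(0)
events = pd.DataFrame(rng.gamma(0.5, 2.0, size=(100, 500)))
labels = storm_center_labels(events)
#+end_src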
The plots of the clusters from before this change can be seen in Figure fig:clus-old, and the ones from the new algorithm in Figure fig:clus-new.
The old clusters show a distinct preference for higher rain magnitudes in some clusters and lower in others, while the new ones do not show that bias; the clustering is no longer overwhelmed by the magnitude of rain the way the previous one was.
Note: the new plots show a few more points. That was originally my mistake: I did not include 2011 because I forgot that Python's range excludes its stop value, i.e. I used range(1915, 2011) instead of range(1915, 2012).
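To make the off-by-one explicit:

#+begin_src python
years = list(range(1915, 2012))  # 1915 .. 2011 inclusive; the stop value 2012 is excluded
assert years[-1] == 2011 and len(years) == 97
#+end_src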
|                         | ohio-region   | trinity        | north-branch-potomac |
| huc                     | 5             | 1203           | 2070002              |
|-------------------------+---------------+----------------+----------------------|
| ams_1day_avg_silhouette | 0.12818       | 0.22129        | 0.16736              |
| ams_1day_cluster_counts | 32 21 21 16 8 | 45 23 21 9     | 38 26 22 12          |
| ams_1day_conversed      | True          | True           | True                 |
| ams_1day_lta_silhouette | 1             | 2              | 3                    |
| ams_1day_neg_silhouette | 0             | 0              | 0                    |
| ams_1day_num_cluster    | 5             | 4              | 4                    |
| pds_1day_avg_silhouette | 0.14791       | 0.15337        | 0.14963              |
| pds_1day_cluster_counts | 120 117 99 96 | 192 140 102 68 | 167 100 91 74 65     |
| pds_1day_conversed      | True          | True           | True                 |
| pds_1day_lta_silhouette | 2             | 3              | 3                    |
| pds_1day_neg_silhouette | 0             | 0              | 0                    |
| pds_1day_num_cluster    | 4             | 4              | 5                    |
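A minimal sketch of how the silhouette-based rows in the table could be computed with scikit-learn, assuming neg_silhouette counts events with a negative silhouette value, lta_silhouette counts clusters whose mean silhouette falls below the overall average, and "conversed" means the KMeans run converged before hitting max_iter; those interpretations are assumptions, not taken from the project code.

#+begin_src python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

def cluster_metrics(m: np.ndarray, n_clusters: int) -> dict:
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(m)
    labels = km.labels_
    samples = silhouette_samples(m, labels)   # one silhouette value per event
    avg = float(silhouette_score(m, labels))  # overall average silhouette
    per_cluster = [samples[labels == c].mean() for c in range(n_clusters)]
    return {
        "avg_silhouette": round(avg, 5),
        "cluster_counts": sorted(Counter(labels).values(), reverse=True),
        "converged": km.n_iter_ < km.max_iter,                # assumed meaning of "conversed"
        "lta_silhouette": sum(c < avg for c in per_cluster),  # clusters below the overall average
        "neg_silhouette": int((samples < 0).sum()),           # events with negative silhouette
        "num_cluster": n_clusters,
    }
#+end_src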