update ssi tutorial

drorlab · Dec 2, 2023 · c5b7d5a · c5b7d5a
1 parent cc41926
commit c5b7d5a
Show file tree

Hide file tree

Showing 2 changed files with 164 additions and 71 deletions.
diff --git a/docs/tut-6-ssi.rst b/docs/tut-6-ssi.rst
@@ -13,113 +13,206 @@ the function ``ssi_feature_analysis()``.
 The Co-SSI statistic, which tells how much the communication between two features 
 depends on the ensemble condition, can be calculated using ``cossi_featens_analysis()``.
 
-Feature - Feature Communication
--------------------------------
+Features and States
+-------------------
+
+.. code:: python
+
+    import os
+    import numpy as np
+    from pensa.features import \
+        read_structure_features, \
+        read_water_features, \
+        get_multivar_res, \
+        sort_features
+    from pensa.statesinfo import \
+        get_discrete_states
+    from pensa.comparison import \
+        ssi_ensemble_analysis, \
+        ssi_feature_analysis, \
+        cossi_featens_analysis
+
 First, load the structural features as described in the previous tutorial:
 
 .. code:: python
 
-    sim_a_rec = get_structure_features("traj/condition-a_receptor.gro", 
-                                       "traj/condition-a_receptor.xtc")
-    sim_b_rec = get_structure_features("traj/condition-b_receptor.gro",
-                                       "traj/condition-b_receptor.xtc")
+    sim_a_rec = read_structure_features(
+        "traj/condition-a_receptor.gro",
+        "traj/condition-a_receptor.xtc"
+    )
+    sim_b_rec = read_structure_features(
+        "traj/condition-b_receptor.gro",
+        "traj/condition-b_receptor.xtc"
+    )
     sim_a_rec_feat, sim_a_rec_data = sim_a_rec
     sim_b_rec_feat, sim_b_rec_data = sim_b_rec
 
 
-Then run the SSI feature-feature analysis in the same manner the other statistics.
+Then combine all backbone torsions from one residue to one multivariate feature. 
+You can also analyze sidechain torsions, just replace ``bb`` by ``sc``. Analysis of 
+sidechain torsions takes much more time to run though so for this tutorial, we look
+only at backbone torsions.
+
+.. code:: python
+
+    bbtors_res_feat_a, bbtors_res_data_a = get_multivar_res(
+        sim_a_rec_feat['bb-torsions'], sim_a_rec_data['bb-torsions'] 
+    )
+    bbtors_res_feat_b, bbtors_res_data_b = get_multivar_res(
+        sim_b_rec_feat['bb-torsions'], sim_b_rec_data['bb-torsions']
+    )
 
+From those residue-based multivariate features, we now determine the state boundaries.
 
 .. code:: python
 
-    names_bbtors, ssi_featfeat_bbtors = ssi_feature_analysis(sim_a_rec_feat, sim_b_rec_feat,
-                                                             sim_a_rec_data, sim_b_rec_data,
-                                                             torsions='bb', verbose=True)
-                                             
-The output of ``ssi_feature_analysis()`` produces an array for the SSI between
-all features, with the names_bbtors referring to both feature names. This result 
-can be visualized in a two-dimensional representation similar to the distances 
-using ``distances_visualization()``
+    bbtors_states = get_discrete_states(
+        bbtors_res_data_a, bbtors_res_data_b
+    )
+
+Let's also featurize the water cavities in the system, using the grid we generated in
+the preprocessing tutorial.
 
 .. code:: python
 
-    distances_visualization(names_bbtors, ssi_featfeat_bbtors,
-                            "plots/receptor_ssi-bbdist.pdf",
-                            vmin = 0.0, vmax = max(ssi_featfeat_bbtors),
-                            cbar_label='SSI')
+    grid = "traj/water_grid_ab_OH2_density.dx"
+    water_feat_a, water_data_a = read_water_features(
+        "traj/condition-a_water.gro", "traj/condition-a_water_aligned.xtc",
+        top_waters = 5, atomgroup = "OH2", grid_input = grid
+    )
+    water_feat_b, water_data_b = read_water_features(
+        "traj/condition-b_water.gro", "traj/condition-b_water.xtc",
+        top_waters = 5, atomgroup = "OH2", grid_input = grid
+    )
 
-The Co-SSI feature-feature-ensemble analysis is done in the same manner. 
+Just like we did for torsions, we can now determine the discrete states for the orientation
+of the water molecules in the pockets.
 
 .. code:: python
 
-    names_bbtors, ssi_featfeat_bbtors, cossi_bbtors = cossi_featens_analysis(sim_a_rec_feat, sim_b_rec_feat,
-                                                                             sim_a_rec_data, sim_b_rec_data,
-                                                                             torsions='bb', verbose=True)
-                                             
-The output of ``cossi_featens_analysis()`` produces an array for the SSI and the 
-Co-SSI between all features, with the names_bbtors producing the same output as 
-the feature-feature SSI. These results can again be visualized in a 2D representation. 
-It is worth noting, as ``cossi_featens_analysis()`` also computes SSI, you need not 
-run ``ssi_feature_analysis()`` in addition.
+    water_states = get_discrete_states(
+        water_data_a['WaterPocket_Distr'],
+        water_data_b['WaterPocket_Distr'],
+        discretize='gaussian', pbc=True
+    )
+
+Water occupancy (is water in the pocket or not?) is described as a binary feature with the 
+values 0 or 1. We can thus define the state boundaries manually.
+
+.. code:: python
+
+    water_occup_states = [[[-0.1, 0.5, 1.1]]] * len(water_states)
 
-Water pocket information transfer
----------------------------------
 
-Internal water pockets and ion/atom pockets can also receive information about 
-the ensemble condition. The State Specific Information (SSI) analysis can be 
-applied to investigate these distributions in the same manner. First we need 
-to featurize the water pockets using the combined ensemble water density explained 
-in the featurization tutorial. 
+Water Pockets
+-------------
+
+We start by comparing the occupancy between the two conditions, similar to what we did in the
+comparison tutorial:
 
 .. code:: python
 
-    struc = "traj/cond-a_water.gro"
-    xtc = "dens/cond-a_wateraligned.xtc"
-    grid = "dens/ab_grid_OH2_density.xtc"
-    water_feat_a, water_data_a = get_water_features(structure_input = struc, 
-                                                xtc_input = xtc,
-                                                top_waters = 5,
-                                                atomgroup = "OH2",
-                                                grid_input = grid)
+    water_names, water_ssi = ssi_ensemble_analysis(
+        water_feat_a['WaterPocket_OccupDistr'], water_feat_b['WaterPocket_OccupDistr'], 
+        water_data_a['WaterPocket_OccupDistr'], water_data_b['WaterPocket_OccupDistr'],
+        water_occup_states, verbose=True, h2o=False, pbc=False
+    )
 
-    struc = "traj/cond-b_water.gro"
-    xtc = "traj/cond-b_water.xtc"
-    grid = "dens/ab_grid_OH2_density.xtc"
-    water_feat_b, water_data_b = get_water_features(structure_input = struc, 
-                                                xtc_input = xtc,
-                                                top_waters = 5,
-                                                atomgroup = "OH2",
-                                                grid_input = grid)
-                                                
-    
-Information shared between water pockets and the ensemble condition is then 
-quantified using ``ssi_ensemble_analysis()``. We set ``torsions=None`` for 
-waters. 
+Except for water site O2, there are not many differences in occupancy, so let's have a closer 
+look and compare the orientation of the water molecules between the two conditions: 
 
 .. code:: python
 
-    data_names, data_ssi = ssi_ensemble_analysis(water_feat_a['WaterPocket_Distr'],water_feat_b['WaterPocket_Distr'],
-                                                 water_data_a['WaterPocket_Distr'],water_data_b['WaterPocket_Distr'], 
-                                                 torsions = None,
-                                                 verbose=True)                                                
- 
+    water_names, water_ssi = ssi_ensemble_analysis(
+        water_feat_a['WaterPocket_Distr'], water_feat_b['WaterPocket_Distr'], 
+        water_data_a['WaterPocket_Distr'], water_data_b['WaterPocket_Distr'],
+        water_states, verbose=True, h2o=True
+    )
 
-Additionally, we can see if the pocket occupancy (i.e. the presence/absence of 
-water at the site) shares SSI - ``pocket_occupancy=True``. Currently this is only enabled with 
-``ssi_ensemble_analysis``. We need to turn off the periodic boundary conditions
-as the distributions are no longer periodic - ``pbc=False``.
+Beyond comparing distributions, we can also quantify the amount of feature-feature communication
+between the water sites. The corresponding function looks similar to the ensemble comarison above,
+however it calculates the amount of shared information between each pair of residues.
 
 .. code:: python
 
-    data_names, data_ssi = ssi_ensemble_analysis(water_feat_a['WaterPocket_OccupDistr'],water_feat_b['WaterPocket_OccupDistr'],
-                                                 water_data_a['WaterPocket_OccupDistr'],water_data_b['WaterPocket_OccupDistr'],
-                                                 pocket_occupancy=True, pbc=False, verbose=True)
+    water_pairs_names, water_pairs_ssi = ssi_feature_analysis(
+        water_feat_a['WaterPocket_Distr'], water_feat_b['WaterPocket_Distr'], 
+        water_data_a['WaterPocket_Distr'], water_data_b['WaterPocket_Distr'],
+        water_states, verbose=True, h2o=True
+    )
+
+This function produces an array for the SSI between all features. This array is two-dimensional,
+so we visualize its values using a heat map:
 
-The same protocol applies for atom/ion pockets.
+.. code:: python
 
+    from pensa.comparison import resnum_heatmap, pair_features_heatmap
 
+    pair_features_heatmap(
+        water_pairs_names, water_pairs_ssi,
+        "plots/water-pairs_ssi.pdf",
+        vmin = 0.0, vmax = 1.,
+        cbar_label='SSI',
+        separator=' & '
+    )
 
 
+Water and Torsions Combined
+---------------------------
 
+Now we want to investigate information flow between more than one type of features.
+To do so, we combine the water sites and the torsions:
+
+.. code:: python
 
+    all_feat_a = water_feat_a['WaterPocket_Distr'] + bbtors_res_feat_a
+    all_feat_b = water_feat_b['WaterPocket_Distr'] + bbtors_res_feat_b
+    all_data_a = np.array(list(water_data_a['WaterPocket_Distr']) + list(bbtors_res_data_a), dtype=object)
+    all_data_b = np.array(list(water_data_b['WaterPocket_Distr']) + list(bbtors_res_data_b), dtype=object)
+    all_states = water_states + bbtors_states
 
+Note that we only use backbone torsions here to minimize computational effort.
+Analysis of sidechain torsions (or of both combined) can often deliver more scientific insigts.
+
+As we did for the water sites above, we now calculate the SSI for all combined feature-feature pairs.
+
+.. code:: python
+
+    all_pairs_names, all_pairs_ssi = ssi_feature_analysis(
+        all_feat_a, all_feat_b, 
+        all_data_a, all_data_b,
+        all_states, verbose=True
+    )
+
+The number of pairs for an entire protein is enormous so we determine those with 
+the highest SSI. To alleviate the computational effort for the sort function, we first 
+filter the pairs by a threshold SSI of 0.5:
+
+.. code:: python
+
+    relevant = np.abs(all_pairs_ssi) > 0.5 
+    not_self = np.array([name.split(' & ')[0] != name.split(' & ')[1] for name in all_pairs_names])
+    relevant *= not_self
+    argrelev = np.argwhere(relevant).flatten()
+    all_relevant_pairs_names = [all_pairs_names[i] for i in argrelev]
+    all_relevant_pairs_ssi = all_pairs_ssi[relevant]
+
+Then we run the actual sorting by SSI.
+
+.. code:: python
+
+    sort_features(all_relevant_pairs_names, all_relevant_pairs_ssi)
+
+
+The Co-SSI feature-feature-ensemble analysis is done in the same manner. 
+
+.. code:: python
+
+    all_new_pairs_names, all_new_pairs_ssi, all_new_pairs_cossi = cossi_featens_analysis(
+        all_feat_a, all_feat_b, all_feat_a, all_feat_b,
+        all_data_a, all_data_b, all_data_a, all_data_b,
+        all_states, all_states, verbose=True
+    )
+                                             
+The output of ``cossi_featens_analysis()`` produces an array for the SSI and the Co-SSI
+between all features. These results can again be visualized in a 2D representation.
diff --git a/tutorial/PENSA_Tutorial_SSI.py b/tutorial/PENSA_Tutorial_SSI.py
@@ -142,7 +142,7 @@
 extract_coordinates(ref_file_b, pdb_file_b, trj_file_b, out_name_b, sel_base_b)
 
 # Extract the coordinates of the ensemble a aligned to ensemble b
-extract_aligned_coords(
+extract_aligned_coordinates(
     out_name_a + ".gro", out_name_a + ".xtc",
     out_name_b + ".gro", out_name_b + ".xtc"
 )