Support parallelization of conf filter #268

Merged

wanghan-iapcm merged 3 commits into deepmodeling:master on Oct 23, 2024

Conversation

@zjgemi zjgemi (Collaborator) commented Oct 21, 2024

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced batch processing capabilities for configuration checks, improving efficiency when handling multiple frames.
    • Added new filter classes (BarFilter, BazFilter) with specific checks for frame coordinates.
  • Bug Fixes

    • Enhanced clarity and efficiency in the configuration filtering process, streamlining logic and reducing complexity.
  • Tests

    • Updated test cases to reflect new filter logic and ensure accurate validation of frame counts and coordinate values.

zjgemi and others added 2 commits October 21, 2024 15:55

coderabbitai bot commented Oct 21, 2024

📝 Walkthrough

The changes involve modifications to several classes across multiple files, primarily focusing on enhancing the configuration filtering process. The TrajRenderLammps class's get_confs method has been simplified for efficiency. The ConfFilter and ConfFilters classes have been updated to support batch processing of configurations, introducing a new batched_check method. Additionally, the distance filters have been enhanced to allow parallel processing through a max_workers parameter. New filter classes have been added in the tests, improving the testing framework to align with the updated filtering logic.
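Below is a minimal sketch of the interfaces described above. It is not the dpgen2 source; the class bodies and the type-map handling are simplified assumptions based on this walkthrough and the review comments further down.

from typing import List

import dpdata


class ConfFilter:
    def check(self, frame: dpdata.System) -> bool:
        """Per-frame criterion; concrete filters override this."""
        return True

    def batched_check(self, frames: List[dpdata.System]) -> List[bool]:
        # Default batch behaviour: apply check() to every frame in order.
        return [self.check(frame) for frame in frames]


class ConfFilters:
    def __init__(self):
        self._filters: List[ConfFilter] = []

    def add(self, conf_filter: ConfFilter) -> "ConfFilters":
        self._filters.append(conf_filter)
        return self  # returning self allows the chained .add() calls used in the tests

    def check(self, ms: dpdata.MultiSystems) -> dpdata.MultiSystems:
        # Keep only the frames that every registered filter accepts.
        ms2 = dpdata.MultiSystems()
        for system in ms:
            frames = [system.sub_system([i]) for i in range(system.get_nframes())]
            verdicts = [f.batched_check(frames) for f in self._filters]
            keep = [i for i in range(len(frames)) if all(v[i] for v in verdicts)]
            if keep:
                ms2.append(system.sub_system(keep))
        return ms2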

Changes

  • dpgen2/exploration/render/traj_render_lammps.py — Simplified the get_confs method of TrajRenderLammps; corrected the indentation of ss = ss.sub_system(id_selected[ii]).
  • dpgen2/exploration/selector/conf_filter.py — Added a batched_check method to ConfFilter; changed ConfFilters.check to accept and return dpdata.MultiSystems instead of a single dpdata.System.
  • dpgen2/exploration/selector/distance_conf_filter.py — Added a max_workers parameter to DistanceConfFilter, BoxSkewnessConfFilter, and BoxLengthFilter; added batched_check methods for concurrent processing (sketched below); updated the args methods for documentation.
  • tests/exploration/test_conf_filter.py — Modified FooFilter.check; added BarFilter and BazFilter classes; removed the faked_filter class; updated test cases to reflect the new filter logic and ensure correct frame preservation.
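As a rough illustration of the max_workers-based parallelism described for the distance filters, here is a standalone sketch. It assumes the frames and the filter object are picklable and that the class is importable by worker processes; it is not the merged dpgen2 code.

from concurrent.futures import ProcessPoolExecutor
from typing import List, Optional

import dpdata


class ParallelFilterSketch:
    def __init__(self, max_workers: Optional[int] = None):
        # None lets ProcessPoolExecutor use every available CPU; 1 forces a serial run.
        self.max_workers = max_workers

    def check(self, frame: dpdata.System) -> bool:
        return True  # placeholder per-frame criterion

    def batched_check(self, frames: List[dpdata.System]) -> List[bool]:
        if self.max_workers == 1:
            return [self.check(frame) for frame in frames]
        with ProcessPoolExecutor(max_workers=self.max_workers) as executor:
            return list(executor.map(self.check, frames))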

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant TrajRenderLammps
    participant ConfFilters
    participant ConfFilter
    participant DistanceConfFilter

    User->>TrajRenderLammps: request get_confs
    TrajRenderLammps->>ConfFilters: call check(ms)
    ConfFilters->>ConfFilter: call batched_check(frames)
    ConfFilter-->>ConfFilters: return validity list
    ConfFilters-->>TrajRenderLammps: return valid configurations
    TrajRenderLammps-->>User: return configurations
sequenceDiagram
    participant User
    participant DistanceConfFilter

    User->>DistanceConfFilter: request batched_check(frames)
    DistanceConfFilter->>ProcessPoolExecutor: execute checks
    ProcessPoolExecutor-->>DistanceConfFilter: return results
    DistanceConfFilter-->>User: return valid frames

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

🧹 Outside diff range and nitpick comments (6)
dpgen2/exploration/render/traj_render_lammps.py (1)

133-133: Approve the simplification with suggestions for improvement

The change simplifies the filtering process and aligns with the PR objective of supporting parallelization. It likely improves performance by leveraging batch processing in ConfFilters.

However, consider the following suggestions:

  1. Add error handling to ensure conf_filters.check(ms) returns the expected MultiSystems object.
  2. Update the method's docstring to reflect the new behavior and requirements of conf_filters.

Consider applying this diff to improve error handling and documentation:

     def get_confs(
         self,
         trajs: Union[List[Path], List[HDF5Dataset]],
         id_selected: List[List[int]],
         type_map: Optional[List[str]] = None,
         conf_filters: Optional["ConfFilters"] = None,
         optional_outputs: Optional[List[Path]] = None,
     ) -> dpdata.MultiSystems:
+        """
+        Get configurations from trajectories and apply filters.
+        
+        :param conf_filters: Must implement a `check` method that accepts and returns a MultiSystems object.
+        """
         # ... (existing code)
         if conf_filters is not None:
-            ms = conf_filters.check(ms)
+            filtered_ms = conf_filters.check(ms)
+            if not isinstance(filtered_ms, dpdata.MultiSystems):
+                raise TypeError("conf_filters.check() must return a MultiSystems object")
+            ms = filtered_ms
         return ms
dpgen2/exploration/selector/conf_filter.py (1)

55-55: Use list comprehension for clarity

Consider using a list comprehension instead of list(map(...)) for better readability:

-    return list(map(self.check, frames))
+    return [self.check(frame) for frame in frames]
tests/exploration/test_conf_filter.py (1)

53-56: Enhance code readability by commenting on coordinate modifications.

The direct manipulation of faked_sys["coords"] modifies specific frames and coordinates, but the intent is not immediately clear. Adding comments to explain the purpose of these assignments will improve code clarity for future maintainers.
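For example, the intent could be recorded alongside the assignments, roughly like this (a sketch with assumed array shapes and values, not the project's test code):

import numpy as np

nframes, natoms = 4, 2
coords = np.zeros((nframes, natoms, 3))

# Frame 1, atom 0: all three components become 1.0, so the x/y/z > 0 filters keep it.
coords[1][0] = 1.0
# Frame 3, atom 0: likewise, with a different magnitude to tell the kept frames apart.
coords[3][0] = 3.0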

dpgen2/exploration/selector/distance_conf_filter.py (3)

214-217: Improve clarity of doc_max_workers documentation

To enhance understanding, consider rephrasing the documentation string for max_workers.

Apply this diff to update the documentation:

doc_max_workers = (
-    "The maximum number of processes used to filter configurations, None represents as many as the processors of the machine, and 1 for serial"
+    "The maximum number of processes used to filter configurations. None uses all available processors, and 1 forces serial execution."
)

300-303: Enhance documentation for max_workers in BoxSkewnessConfFilter.args

Consider updating the documentation string for better clarity.

Apply this diff:

doc_max_workers = (
-    "The maximum number of processes used to filter configurations, None represents as many as the processors of the machine, and 1 for serial"
+    "The maximum number of processes used to filter configurations. None uses all available processors, and 1 forces serial execution."
)

378-381: Refine the doc_max_workers text in BoxLengthFilter.args

Updating the documentation string can provide clearer guidance to users.

Apply this diff:

doc_max_workers = (
-    "The maximum number of processes used to filter configurations, None represents as many as the processors of the machine, and 1 for serial"
+    "The maximum number of processes used to filter configurations. None uses all available processors, and 1 forces serial execution."
)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 08d8d6e and 3b453f4.

📒 Files selected for processing (4)
  • dpgen2/exploration/render/traj_render_lammps.py (1 hunks)
  • dpgen2/exploration/selector/conf_filter.py (3 hunks)
  • dpgen2/exploration/selector/distance_conf_filter.py (10 hunks)
  • tests/exploration/test_conf_filter.py (1 hunks)
🧰 Additional context used
🔇 Additional comments (8)
tests/exploration/test_conf_filter.py (2)

100-104: Confirm that filtering with no matching frames behaves correctly.

The test checks that no frames are selected when none meet the filter criteria. Ensure that sel_ms.get_nframes() correctly returns 0, and that the system handles empty results without errors.


58-61: Verify the correctness of filter chaining and result extraction.

Ensure that chaining the .add() method works as intended with ConfFilters, and that filters.check(ms)[0] accurately retrieves the filtered system. Verify that the filtering logic applies all filters in sequence and produces the expected results.

You can run the following script to confirm the filter chaining behavior:

✅ Verification successful

Filter chaining and result extraction verified to function correctly.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that ConfFilters correctly chains filters and returns expected results.

# Test: Search for the implementation details of ConfFilters.
rg 'class ConfFilters' -A 15

# Test: Check usages of the .add() and .check() methods to confirm chaining.
rg 'filters.add' -A 2

Length of output: 2176


Script:

#!/bin/bash
# Description: Extract the full implementation of the `check` method in the ConfFilters class.

ast-grep --pattern $'class ConfFilters {
  $$$
  check($_, $_) {
    $$$
  }
  $$$
}'

Length of output: 86


Script:

#!/bin/bash
# Description: Use ripgrep to locate the `check` method within the `ConfFilters` class and retrieve its implementation.

# Search for the `ConfFilters` class and extract the class definition context
rg 'class ConfFilters' -A 50

# Specifically search for the `def check` method within the `ConfFilters` class and extract its implementation
rg -C 5 'def check' dpgen2/exploration/selector/conf_filter.py

Length of output: 2930

dpgen2/exploration/selector/distance_conf_filter.py (6)

2-4: Approved: Importing ProcessPoolExecutor for parallel processing

The addition of ProcessPoolExecutor is appropriate for enabling parallel execution in the batched_check methods.


194-203: Verify that dpdata.System instances are serializable for multiprocessing

Since ProcessPoolExecutor leverages multiprocessing, confirm that dpdata.System objects in frames are picklable. If they contain unpicklable attributes, this could result in runtime errors when executing executor.map.
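A quick, self-contained spot-check for picklability could look like this (the input file and format are illustrative assumptions):

import pickle

import dpdata

system = dpdata.System("POSCAR", fmt="vasp/poscar")  # illustrative input
frames = [system.sub_system([i]) for i in range(system.get_nframes())]
restored = pickle.loads(pickle.dumps(frames[0]))
print(type(restored), restored.get_nframes())  # a clean round-trip means the frames can be sent to workers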


246-247: Approved: Added max_workers parameter to BoxSkewnessConfFilter

Including max_workers enhances the filter by enabling parallel processing in the batched_check method.


280-289: Verify multiprocessing compatibility in batched_check of BoxSkewnessConfFilter

Ensure that the frames passed to ProcessPoolExecutor are compatible with multiprocessing. Serialization issues with dpdata.System instances could cause errors during parallel execution.


324-325: Approved: Introduced max_workers to BoxLengthFilter for parallelism

The addition of max_workers allows for parallel processing, improving the performance of the batched_check method.


358-367: Confirm serialization of frames in BoxLengthFilter.batched_check

Verify that the objects in frames are serializable to prevent issues when using ProcessPoolExecutor.

Comment on lines 75 to 77
selected_idx = sum(
[[(i, j) for j in range(s.get_nframes())] for i, s in enumerate(ms)], []
)

🛠️ Refactor suggestion

Optimize list flattening for performance

Using sum(..., []) to flatten a list of lists can be inefficient. Consider using itertools.chain.from_iterable for better performance, especially with large datasets:

import itertools

selected_idx = list(itertools.chain.from_iterable(
    [(i, j) for j in range(s.get_nframes())] for i, s in enumerate(ms)
))

@@ -27,110 +27,78 @@ def check(
     self,
     frame: dpdata.System,
 ) -> bool:
-    return True
+    return frame["coords"][0][0][0] > 0.0

⚠️ Potential issue

Add error handling to prevent potential index errors in FooFilter.check.

The current implementation assumes that frame["coords"][0][0][0] exists, which could lead to an IndexError if the coordinate data doesn't have the expected dimensions. Consider adding validation to ensure the coordinate array is properly structured before accessing it.
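One possible guard, as a sketch (the helper name and the reject-on-bad-shape policy are assumptions, not the project's code):

import numpy as np


def first_atom_component_positive(frame, axis: int = 0) -> bool:
    coords = np.asarray(frame["coords"])
    if coords.ndim != 3 or coords.shape[0] == 0 or coords.shape[1] == 0:
        return False  # reject frames whose coordinate array has an unexpected shape
    return bool(coords[0, 0, axis] > 0.0)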

Comment on lines +33 to +39
class BarFilter(ConfFilter):
    def check(
        self,
        frame: dpdata.System,
    ) -> bool:
        return frame["coords"][0][0][1] > 0.0


⚠️ Potential issue

Add error handling to prevent potential index errors in BarFilter.check.

Similar to FooFilter, this method assumes that frame["coords"][0][0][1] exists. To avoid potential IndexError exceptions, ensure that the coordinate data is validated for the expected dimensions before accessing it.

Comment on lines +41 to +46
class BazFilter(ConfFilter):
    def check(
        self,
        frame: dpdata.System,
    ) -> bool:
        return frame["coords"][0][0][2] > 0.0

⚠️ Potential issue

Add error handling to prevent potential index errors in BazFilter.check.

This method also assumes that frame["coords"][0][0][2] exists. Adding checks to validate the coordinate data's dimensions can prevent runtime errors due to unexpected data structures.

Comment on lines +68 to +69
faked_sys["coords"][1][0] = 1.0
faked_sys["coords"][3][0] = 3.0

🛠️ Refactor suggestion

Consider refactoring repetitive test setup code for consistency.

The setup code for faked_sys, ms, and filters is repeated across multiple test methods. Refactoring this code into a helper method can improve maintainability and reduce duplication.

Also applies to: 71-74
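A sketch of the shared-setup idea (class and helper names are illustrative, not the project's test code):

import unittest

import numpy as np


class ConfFilterTestBase(unittest.TestCase):
    NFRAMES = 4
    NATOMS = 2

    def setUp(self):
        # Start every test from all-zero coordinates; individual tests then
        # flip only the frames they expect the filters to keep.
        self.coords = np.zeros((self.NFRAMES, self.NATOMS, 3))

    def mark_frame(self, frame_idx: int, value: float = 1.0) -> None:
        # Helper used instead of repeating raw index assignments in each test.
        self.coords[frame_idx][0] = value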

Comment on lines +82 to +85
faked_sys["coords"][0][0] = 0.5
faked_sys["coords"][1][0] = 1.0
faked_sys["coords"][2][0] = 2.0
faked_sys["coords"][3][0] = 3.0

🛠️ Refactor suggestion

Refactor repeated code to enhance maintainability.

Similar to previous tests, the repeated setup of systems and filters can be consolidated into a shared method or fixture. This will make the tests cleaner and easier to manage.

Also applies to: 87-90

Comment on lines +139 to +140
    def __init__(self, max_workers=None, custom_safe_dist=None, safe_dist_ratio=1.0):
        self.max_workers = max_workers

⚠️ Potential issue

Ensure default value of max_workers aligns with expected behavior

Using max_workers=None allows ProcessPoolExecutor to utilize all available CPUs, which may lead to resource contention on systems with many cores. Consider setting a sensible default or providing guidance to users on choosing an appropriate value based on their environment.
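One way to address this, shown as a sketch (the cap value is an arbitrary assumption, not a recommendation from the PR):

import os
from concurrent.futures import ProcessPoolExecutor


def resolve_max_workers(max_workers=None, cap=8):
    # Bound the default instead of handing None straight to ProcessPoolExecutor.
    if max_workers is not None:
        return max_workers
    return min(cap, os.cpu_count() or 1)


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=resolve_max_workers()) as executor:
        print(list(executor.map(abs, [-1, -2, -3])))  # placeholder workload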

Comment on lines 75 to 77
It is not clear to me why we need a sum here. You may also consider the suggestion by rabbitAI.


codecov bot commented Oct 22, 2024

Codecov Report

Attention: Patch coverage is 69.76744% with 13 lines in your changes missing coverage. Please review.

Project coverage is 83.62%. Comparing base (08d8d6e) to head (a20cea0).
Report is 1 commit behind head on master.

Files with missing lines:
  • ...pgen2/exploration/selector/distance_conf_filter.py — patch coverage 52.00%, 12 lines missing ⚠️
  • dpgen2/exploration/render/traj_render_lammps.py — patch coverage 0.00%, 1 line missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #268      +/-   ##
==========================================
- Coverage   83.65%   83.62%   -0.04%     
==========================================
  Files         104      104              
  Lines        6028     6057      +29     
==========================================
+ Hits         5043     5065      +22     
- Misses        985      992       +7     

☔ View full report in Codecov by Sentry.

Signed-off-by: zjgemi <liuxin_zijian@163.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
dpgen2/exploration/selector/conf_filter.py (2)

38-55: LGTM: New batched_check method

The new batched_check method is a great addition for enabling batch processing of configurations. The implementation is concise and efficient.

Consider using a list comprehension instead of map for slightly better readability:

def batched_check(self, frames: List[dpdata.System]) -> List[bool]:
    return [self.check(frame) for frame in frames]

This change is optional and doesn't affect performance significantly.


75-89: LGTM with suggestions: Improved check method implementation

The new implementation correctly handles multiple systems and frames, efficiently using the batched_check method of each filter. However, there are a few suggestions for improvement:

  1. Consider using more descriptive variable names, e.g., system_index instead of i and frame_index instead of j.

  2. The nested loops for creating selected_idx can be replaced with a list comprehension for better readability and potentially better performance:

selected_idx = [(i, j) for i, s in enumerate(ms) for j in range(s.get_nframes())]
  3. Consider using enumerate in the final loop for better readability:
for i, system_indices in enumerate(selected_idx_list):
    if system_indices:
        ms2.append(ms[i].sub_system(system_indices))

These changes will improve code readability without affecting functionality.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 3b453f4 and a20cea0.

📒 Files selected for processing (1)
  • dpgen2/exploration/selector/conf_filter.py (3 hunks)
🧰 Additional context used
🔇 Additional comments (2)
dpgen2/exploration/selector/conf_filter.py (2)

9-11: LGTM: Import change for type hinting

The addition of List import from typing is appropriate for the new type hints in the batched_check method.


73-74: Verify impact of signature change

The change in the check method signature from dpdata.System to dpdata.MultiSystems for both input and output is appropriate for the new batched processing approach.

Please ensure that all calls to this method throughout the codebase have been updated to accommodate this change; a quick search for remaining call sites, sketched below, can confirm it.
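A minimal Python stand-in for such a check (the search pattern and directory are assumptions based on the files discussed in this review):

from pathlib import Path

pattern = "conf_filters.check("
for path in Path("dpgen2").rglob("*.py"):
    for lineno, line in enumerate(path.read_text().splitlines(), start=1):
        if pattern in line:
            print(f"{path}:{lineno}: {line.strip()}")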

@wanghan-iapcm wanghan-iapcm merged commit 35e0b97 into deepmodeling:master Oct 23, 2024
9 checks passed