
Jupyter notebook, standalone software infrastructure, sample images, etc., to train a detectron2 Mask R-CNN model and use it to remove complex backgrounds from videos and images. Note: the final model is exported to torchscript (.ts) format so it can be used generically outside of the detectron2 infrastructure, for instance in my standalone infrastructure.


IsabelleBaker/DynamicBackgroundRemoval


Welcome to my dynamic background removal GitHub.

I am an undergraduate student working on a research project at my university. As part of my responsibilities, I was asked to find a method for removing the 'dynamic' background in a set of videos that include animals, leaving behind only the animals. After searching around for solutions, including how to make the Matterport Mask R-CNN code work natively on my Mac M2, I stumbled upon a YouTube video explaining how to work with Meta's detectron2. Detectron2 was immediately appealing and made me realize that the problem statement I was working with needed to be adjusted. I mentally changed my assignment to "given a video with animals in it, return only the animals without any other background content." This may sound the same as my original assignment, but the subtle difference in thought process was significant to me. So I decided to train and use detectron2 as the base for identifying the animals in my videos: creating instance masks, extracting the animals from each frame based on those masks, pasting them onto a blank canvas, and ultimately saving each newly constructed canvas as a frame in a new video.

To make this notebook and learn about machine vision models, I studied many sources scattered around the internet. If you see something that looks like your code without acknowledgment, let me know and I will be happy to include a reference.

I have included the lessons and methods I learned during this process, even if they weren't strictly related to my original goal. I spent A LOT of time researching solutions to what seemed like simple problems with detectron2 and Colab, which ultimately required rather complex solutions. I sincerely hope this notebook helps someone else save time in the future. Let me know if you find something I am missing, and I'll include it as I have time.

Within this repository, I have the following files:

  1. A Jupyter Notebook (Baker_Training_Exporting_Mask_R_CNN.ipynb) that trains a Mask R-CNN model, evaluates it, and then exports it to torchscript, all from within Google Colab.
  2. Installation Instructions (StandAlone-Install-Instructions.docx) for installing my standalone software framework.
  3. A User Interface written in wxpython for training Mask R-CNN models (model_trainer_gui.py).
  4. A User Interface written in wxpython for using these models to remove the background from videos and create behavior animations/contour images compatible with the Ye Lab's LabGym project (dynamic_background_gui.py).
  5. Two support libraries I wrote to enable the above capabilities. The first (dynamic_background_remover.py) handles all of the major code for inferencing and image processing. The second (animal_tracker.py) is an animal-tracking algorithm I wrote that layers IoU tracking on top of a distance-between-centers tracking method.
  6. A User Interface written in wxpython for playing back animal track files saved into pkl files during analysis (Track_player_gui.py). This UI plays back the highly portable pkl file that holds the outline of each animal and its path, without the need for the original video. You should also check out my "FrameGrabber" repository: it provides a way to grab frames from a video AND a way to use the track files to identify frames needed for model refinement. It is an alternate approach to the track player, and I actually prefer to use the frame grabber for model refinement.

Note: I created a utility both to grab frames out of videos AND to refine my training dataset. Like the track player, you are able to load one of my track files and its associated video. Then you can find frames where the model failed to properly identify the animal(s) and save them for later annotation. I use this in a train -> test -> refine -> train loop until I'm satisfied with the performance of my models. The tool is located here if you want to give it a try: https://github.com/IsabelleBaker/FrameGrabber

Problems I had to solve:

  1. Hacking together a static version of the core training/inference code was relatively easy. Making it generic, so that you can train 1 to N classes with zero or minimal code changes, took a lot of effort. This is the main reason the notebook gathers its Global Variables at the start of the code sections. Make sure you understand these variables and their usage before modifying the code in the notebook.
  2. I only found one code sample explaining how to take an instance mask and use it to extract just that object from an image. I ended up creating a new framework for use with my research; more on that later. (A minimal sketch of the mask-extraction idea follows this list.)
  3. I did not find any working examples of subclassing the default trainer to add augmentations. One example was close to working, but its information was incomplete. My functional example is shown in the notebook within the MyTrainer class, and I have added the augmentation list to the Global Variables so it can be easily modified for your purposes. (A generic sketch of the pattern also follows this list.)
  4. Auto-saving the newly created model out of Colab. This may sound minor, but it is vital when trying to conserve compute credits. I save a time/date-stamped version of the model and model configuration to my Google Drive, and at the end of the code there is an auto-terminate function that deletes the runtime and stops consuming credits. You should be able to adjust the Global Variables while offline, then connect the runtime, hit run-all, and walk away knowing that it won't waste your compute credits. You can change one of the Global Variables to stay connected after training if that is your desire. (A sketch of this pattern follows the list as well.)
  5. I could not find any simple method to display only the animals (multiple) I wanted when the model was trained on a large number of classes. The objective is that, regardless of what was detected, only certain classes of things get masked and highlighted. All of the examples I read either filtered to a single class or not at all, but I implemented a method to filter N classes by removing every detected thing whose class is not in the unordered "my_display_things" list. Three lines of code accomplish what seemed very complex initially (see the filtering sketch after this list). However, as this notebook evolved, I have moved that specific functionality over to my framework in GitHub.
  6. I did not locate any complete examples of preparing a dataset from start to finish. The data preparation section of this doc may seem excessively detailed, but it comes from the frustration of not finding a good tutorial. I've recently learned that CVAT, the online annotation tool I use, does not support the Safari browser; if this is an issue for you, please install Google Chrome.
  7. An EASY way to export a pytorch model to torchscript from detectron2. Every tutorial I found made it overly complicated. With a tiny bit of customization in code borrowed from detectron2, it all worked. I think the export function in detectron2 has a bug (actually two now), but my approach makes it very simple. It is critical that you follow my installation instructions if you are using the standalone framework for training: a defect introduced in late August 2022 will cause an error during export if you install from the latest source.
  8. A way to 'batch' load images into an exported torchscript model. This is really subtle in this notebook: the export_scripting method I use is slightly modified from the code in detectron2. I changed Tuple to List in the forward function because I never figured out how to build Tuple[Dict[str, torch.Tensor]]; with List[Dict[str, torch.Tensor]], it worked like magic. This is actually how you batch-load into the pth model, so I'm not sure why the code was different in the included scripted export functions. If anyone reading this understands how to build the original structure and load multiple images/frames simultaneously, please let me know. (A usage sketch follows this list.)
  9. I will add more items here as I remember additional challenges I encountered and solved.
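As a companion to item 2, here is a minimal sketch of how a single instance mask can be used to lift one object out of a frame and place it on a blank canvas. It assumes a boolean HxW mask taken from a detectron2 prediction (e.g. outputs["instances"].pred_masks[i]); the function name and shapes are illustrative, not the exact code from my framework.

```python
import numpy as np

def extract_object(frame: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """frame: HxWx3 image (BGR or RGB); mask: HxW boolean instance mask."""
    canvas = np.zeros_like(frame)   # blank (black) canvas, same size as the frame
    canvas[mask] = frame[mask]      # copy only the pixels that belong to the object
    return canvas

# Example usage with a detectron2 prediction for instance i:
#   mask = outputs["instances"].pred_masks[i].cpu().numpy()
#   composite = extract_object(frame, mask)
```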
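For item 3, the general shape of a trainer subclass with augmentations looks roughly like the sketch below. The transform list here is a placeholder standing in for the Global Variable in my notebook, so adjust it to your own data.

```python
import detectron2.data.transforms as T
from detectron2.data import DatasetMapper, build_detection_train_loader
from detectron2.engine import DefaultTrainer

# Placeholder augmentation list; in the notebook this lives in the Global Variables.
TRAIN_AUGMENTATIONS = [
    T.ResizeShortestEdge(short_edge_length=(640, 800), max_size=1333, sample_style="range"),
    T.RandomFlip(prob=0.5, horizontal=True, vertical=False),
    T.RandomBrightness(0.8, 1.2),
    T.RandomRotation(angle=[-15, 15]),
]

class MyTrainer(DefaultTrainer):
    @classmethod
    def build_train_loader(cls, cfg):
        # Swap in a mapper that runs the training data through our augmentations.
        mapper = DatasetMapper(cfg, is_train=True, augmentations=TRAIN_AUGMENTATIONS)
        return build_detection_train_loader(cfg, mapper=mapper)
```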
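For item 4, the Colab-side idea is roughly the following; the paths, file names, and the AUTO_TERMINATE flag are examples standing in for the notebook's Global Variables, not its exact code.

```python
import datetime
import os
import shutil

from google.colab import drive, runtime

drive.mount('/content/drive')

# Copy the exported model and its configuration with a time/date stamp.
stamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
dest_dir = "/content/drive/MyDrive/models"            # example destination folder
os.makedirs(dest_dir, exist_ok=True)
shutil.copy("output/model.ts", f"{dest_dir}/model_{stamp}.ts")
shutil.copy("output/config.yaml", f"{dest_dir}/config_{stamp}.yaml")

AUTO_TERMINATE = True   # set False to stay connected after training
if AUTO_TERMINATE:
    runtime.unassign()  # disconnect and delete the Colab runtime to stop consuming credits
```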
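For item 5, filtering predictions down to a chosen set of class ids really is only a few lines. This version uses torch.isin rather than the removal loop described above, but the effect is the same.

```python
import torch

my_display_things = [0, 2, 5]                 # example class ids you want to keep
instances = outputs["instances"].to("cpu")    # detectron2 Instances object
keep = torch.isin(instances.pred_classes, torch.tensor(my_display_things))
instances = instances[keep]                   # Instances supports boolean-mask indexing
```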
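And for item 8, batch inference against the exported model looks roughly like this, assuming the forward signature was changed to List[Dict[str, torch.Tensor]] as described and that each input dict carries a CHW image tensor. The file paths and dtype handling are illustrative.

```python
import cv2
import numpy as np
import torch

model = torch.jit.load("model.ts")            # example path to the exported model
model.eval()

def to_input(path: str) -> dict:
    img = cv2.imread(path)                                  # HxWx3 BGR image
    chw = np.ascontiguousarray(img.transpose(2, 0, 1))      # reorder to CHW layout
    return {"image": torch.as_tensor(chw.astype("float32"))}

with torch.no_grad():
    # One dict per frame; the list itself is the 'batch'.
    outputs = model([to_input("frame_000.png"), to_input("frame_001.png")])

# outputs is a list with one dict of prediction tensors per input frame.
```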

About the Jupyter Notebook

Within my notebook I do the following:

  1. Explain how to get your dataset ready to train your model.
  2. Give the code required to train your model, including the flexibility to train an individual thing or many things, automagically, with a simple change to configuration parameters.
  3. Create inference output with images using your new model.
  4. Export the model to torchscript format for portability.

Happy training!

*This was the actual goal of my work when I started.

FYI: A copy of the dataset, annotations, a video input, etc. are available here.

About the Standalone Software Framework

For now, I will not talk much about the standalone framework. I am in the process of creating some additional documentation for the UIs, but for now, run them and experiment with them. Almost every button and text entry has a tip for its usage if you hover over it.

I suggest starting without installing the trainer; that installation is a lot more involved. Begin with dynamic_background_gui.py using the models in my Google Drive and the larva avi file I provide. Once it successfully processes a few seconds of a video to completion, run my FrameGrabber (separate project) or Track_player_gui.py from this repo to get a sense of how the video is being processed. Only then install all the software and libraries required to train your model. I still prefer to train my models in Google Colab and then perform the background removal on my local machine, but with the software I have provided, the choice is up to you.

Notes about the recent addition of Apple Silicon (MPS) support {Experimental}

In the standalone framework I have added the ability to enable MPS support. The framework code is compatible with MPS now, but the pytorch side still has some work to do. If you want to experiment with MPS, just know that you may get some crashes, etc. for a while. Make sure to install the nightly builds to get the latest support from pytorch, instead of what is listed in my basic install instructions.

MPS acceleration is available on macOS 12.3+: pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu
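If you want to confirm that your pytorch build actually sees the MPS backend before trying the framework, a quick check like this (standard pytorch API) works:

```python
import torch

# Prefer MPS on Apple Silicon when available, otherwise fall back to CPU.
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")
print(f"Using device: {device}")
```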

-I have found that training a single animal works fine and is ~40% faster than using only my M1 Max CPU. With multiple animals, the loss explodes to infinity and training errors out.

-Inferencing currently seems slower using MPS than just using my M1 Max CPU. That may be different if you don't have a Max CPU. This is likely because torchvision::nms is not currently implemented for MPS, so every frame must fall back to the CPU for that operation. That's a guess as to what is happening, but right now I still use the CPU for inferencing locally.

-Some videos cause an exception to be thrown from within the model. Again, I think this is due to the immature pytorch MPS support at this time.

I'll update these observations as I have time to do more testing. Let me know if you think anything else should be noted here about MPS.

If you have any feedback on this software, please reach out here or through my LinkedIn.
