updating docs, and adding Batch model to close #3
vsoch committed Jun 9, 2017
1 parent b99570a commit 9fc5666
Showing 6 changed files with 384 additions and 186 deletions.
169 changes: 113 additions & 56 deletions README.md
# Sendit

**under development**

This is a dummy server for testing the sending and receiving of data from an endpoint. The main job of the server is to "sniff" for a complete dicom series folder arriving in a mapped data folder, and then to do the following:


- Add the query, with its images, as objects to the database:
- A folder, the result of a query, is represented as a "Batch"
- A single Dicom image is represented as an "Image"
- A "Series" is a set of dicom images
- A "Study" is a collection of series

Images will be moved around and processed on the level of a Batch, which is typically associated with a single accession number, series, and study; however, there may be exceptions to this. For a high-level overview, continue reading. For module- and modality-specific docs, see our [docs](docs) folder. If anything is missing documentation, please [open an issue](https://www.github.com/pydicom/sendit)


## Download
Before you start, you should make sure that you have Docker and docker-compose installed; a complete script for setting up the dependencies for an instance [is provided](scripts/setup_instance.sh). You should then clone the repo, and we recommend a location like `/opt`:

```
cd /opt
git clone https://www.github.com/pydicom/sendit
cd sendit
```

This means your application base is located at `/opt/sendit`, and we recommend that your data folder (where your system process will add files) be maintained at `/opt/sendit/data`. You don't have to do this, but if you don't, you need to change the mapped folder in [docker-compose.yml](docker-compose.yml) to point where you want it to be. For example, right now we map `data` in the application's directory to `/data` in the container, and it looks like this:

```
uwsgi:
restart: always
image: vanessa/sendit
volumes:
- ./data:/data
```

To change that to `/tmp/dcm`, you would change the volume mapping to:

```
uwsgi:
restart: always
image: vanessa/sendit
volumes:
- /tmp/dcm:/data
```

Instructions for starting and interacting with the instance follow Configuration (the editing of local files), which must be done first.


## Configuration
For [config.py](sendit/settings/config.py) you should configure the following:

```
DEIDENTIFY_RESTFUL=True
```

If this variable is False, we skip this task, and the batch is sent to the next task (or tasks) to send it to different storage. If True, the batch is first put in the queue to be de-identified, and then upon receipt of the identifiers, the batch is put into the same queues to be sent to storage. These functions can be modified to use different endpoints, or to do different replacements in the data. For more details about the de-identify functions, see [docs/deidentify.md](docs/deidentify.md)
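As a rough sketch of that flow: the task names `get_identifiers` and `replace_identifiers` live in [main/tasks.py](sendit/apps/main/tasks.py), but everything else below (the `Batch` and `BatchIdentifiers` models, the `scrub` and `upload_storage` helpers, and the endpoint URL) is an illustrative assumption, not the application's actual API; see [docs/deidentify.md](docs/deidentify.md) for the real functions.

```python
import requests
from celery import shared_task

from sendit.apps.main.models import Batch, BatchIdentifiers  # assumed models

DEIDENTIFY_ENDPOINT = "https://dasher.example.org/identifiers"  # hypothetical URL

@shared_task
def get_identifiers(bid):
    """Request replacement fields for a batch from a RESTful endpoint."""
    batch = Batch.objects.get(id=bid)
    response = requests.post(DEIDENTIFY_ENDPOINT, json={"id": batch.uid})
    # Save the JSON response with a pointer back to the Batch
    BatchIdentifiers.objects.create(batch=batch, response=response.json())
    replace_identifiers.apply_async(kwargs={"bid": bid})

@shared_task
def replace_identifiers(bid):
    """Scrub each image with the saved identifiers, then queue storage."""
    identifiers = BatchIdentifiers.objects.get(batch__id=bid)
    for image in identifiers.batch.image_set.all():
        scrub(image, identifiers.response)  # hypothetical scrubbing helper
    upload_storage.apply_async(kwargs={"bid": bid})  # hypothetical storage task
```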

The next set of variables are specific to [storage](docs/storage.md), which is the final step in the pipeline.

```
# We can turn on/off send to Orthanc. If turned off, the images would just be processed
```

This basically means that the entire site is locked down, or protected for use with a password:

```
LOCKDOWN_PASSWORDS = ('mysecretpassword',)
```

Note that here we will need to add notes about securing the server (https), etc. For now, I'll just mention that it will come down to changing the [nginx.conf](nginx.conf) and [docker-compose.yml](docker-compose.yml) to those provided in the folder [https](https).


## Application
This application lives in a docker-compose orchestration of images running on `STRIDE-HL71`, and it has the following components (each a Docker image):

- **uwsgi**: the main Python application, served with Django
- **nginx**: a web server that provides a status web interface for Research IT
- **worker**: the same image as uwsgi, but configured to run a distributed job queue called [celery](http://www.celeryproject.org/)
- **redis**: the database used by the worker, with serialization in json

The job queue generally processes tasks when the server has available resources. There will likely be 5 workers for a single application deployment. The worker does the following (a sketch of the dispatch follows this list):

1. First, it receives a job from the queue to run the [import dicom](docs/import_dicom.md) task when a finished folder is detected by the [watcher](docs/watcher.md).
2. When the import is done, it hands off to the next task to [de-identify](docs/deidentify.md) images. If the user doesn't want to do this based on [settings](sendit/settings/config.py), a task is fired off to send to storage. If they do, the request is made to the DASHER endpoint, and the identifiers are saved.
   a. In the case of de-identification, the next job does the data scrubbing with the identifiers, and then triggers sending to storage.
3. Sending to storage can be enabled to work with any or none of OrthanC and Google Cloud Storage. If no storage is selected, the application works as static storage.
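The dispatch between steps 1 and 2 might look roughly like the sketch below. `import_dicomdir` is the real task name (see the watcher section later), but `create_batch` and `upload_storage` are hypothetical names, and `get_identifiers` is the task sketched earlier.

```python
from celery import shared_task

from sendit.settings.config import DEIDENTIFY_RESTFUL  # the setting described above

@shared_task
def import_dicomdir(dicom_dir):
    """Step 1: create database objects from a finished dicom directory."""
    batch = create_batch(dicom_dir)  # hypothetical: makes Batch and Image objects
    if DEIDENTIFY_RESTFUL:
        # Step 2: queue the request to the DASHER endpoint for identifiers
        get_identifiers.apply_async(kwargs={"bid": batch.id})
    else:
        # Skip de-identification and go straight to storage (step 3)
        upload_storage.apply_async(kwargs={"bid": batch.id})
```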


### Status
In order to track status of images, we have status states for images and batches.


```
IMAGE_STATUS = (('NEW', 'The image was just added to the application.'),
                ('PROCESSING', 'The image is currently being processed, and has not been sent.'),
                ('DONEPROCESSING', 'The image is done processing, but has not been sent.'),
                ('SENT', 'The image has been sent, and verified received.'),
                ('DONE', 'The image has been received, and is ready for cleanup.'))

BATCH_STATUS = (('NEW', 'The batch was just added to the application.'),
                ('PROCESSING', 'The batch is currently being processed.'),
                ('DONEPROCESSING', 'The batch is done processing.'),
                ('DONE', 'The batch is done, and images are ready for cleanup.'))
```

#### Image Status
Image statuses are updated at each appropriate timepoint, for example:

- All new images are given `NEW` by default.
- When an image starts any de-identification, but before any request to send to storage, it has status `PROCESSING`.
- As soon as the image is done processing, or if it is intended to go right to storage, it gets status `DONEPROCESSING`. This means that an image that is not to be processed is immediately flagged with `DONEPROCESSING`.
- After being sent to storage, the image gets status `SENT`, and only when it is ready for cleanup does it get status `DONE`. Note that this means that if a user has no requests to send to storage, the image will remain with the application (and not be deleted).

#### Batch Status
A batch status is less granular, but more informative for alerting the user about possible errors.

- All new batches are given `NEW` by default.
- `PROCESSING` is added to a batch as soon as the job to de-identify is triggered.
- `DONEPROCESSING` is added when the batch has finished de-identification, or if de-identification is skipped and the batch is intended to go straight to storage.
- `DONE` is added after all images have been sent to storage and are ready for cleanup.


### Errors
The most likely error is an inability to read a dicom file, which could happen for any number of reasons. This, and generally any error triggered during the lifecycle of a batch, flags the batch as having an error. The variable `has_error` is a boolean that belongs to a batch, and a matching JSONField `errors` holds a list of errors for the user. This error flag is most relevant during cleanup.
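A minimal sketch of how an error might be recorded, assuming the field names above (`flag_error` is a hypothetical helper, not an existing function):

```python
def flag_error(batch, message):
    """Flag a batch as errored and keep the message for the user."""
    batch.has_error = True
    batch.errors.append(message)  # "errors" is the JSONField list described above
    batch.save()
```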

For server errors, the application is configured to use Opbeat. @vsoch has an account that can handle Stanford deployed applications; all others should follow the instructions for setup [on the website](opbeat.com/researchapps). It comes down to adding a few lines to the [main settings](sendit/settings/main.py). Opbeat (or a similar service) is essential for being notified immediately when any server error is triggered.


### Cleanup
Upon completion, we will want some level of cleanup of both the database and the corresponding files. It is already the case that the application moves the input files from `/data` into its own media folder (`images`), and cleanup might look like any of the following:

- In the most ideal case, there are no errors, no flags for the batch, and the original data folder was removed by the `dicom_import` task, and the database and media files removed after successful upload to storage. This application is not intended as some kind of archive for data, but a node that filters and passes along.
- Given an error in `dicom_import`, a file will be left in the original folder, and the batch's `has_error` will be true. In this case, we don't delete files, and we rename the original folder to have the extension `.err`.

If any further logging is needed (beyond the watcher), we should discuss (see questions below).
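Putting those cases together, cleanup might look roughly like this (a sketch under the assumptions above, not the application's actual code):

```python
import os
import shutil

def cleanup(batch, original_folder):
    """Delete database records and files on success; keep evidence on error."""
    if batch.has_error:
        # Keep the data for inspection, marked with the .err extension
        os.rename(original_folder, original_folder + ".err")
    else:
        # In the ideal case dicom_import already removed the input folder
        if os.path.exists(original_folder):
            shutil.rmtree(original_folder)
        batch.image_set.all().delete()  # remove media/database records
        batch.delete()
```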


## Deployment
After configuration is done and you have a good understanding of how things work, you are ready to turn it on! First, let's learn about how to start and stop the watcher, and the kind of datasets and location that the watcher is expecting. It is up to you to plop these dataset folders into the application's folder being watched.


### 1. Running the Watcher
This initial setup is stupid in that it simply checks an input folder to find new images. We do this using the [watcher](sendit/apps/watcher) application, which is started and stopped with a manage.py command:

```
python manage.py watcher_start
python manage.py watcher_stop
```

And the default is to watch for files added to [data](data), which is mapped to '/data' in the container. This means that `STRIDE-HL71` will receive DICOM from somewhere. It should use an atomic download strategy, but with folders, into the application data input folder. This means that when a transfer starts, the folder (inside the container) might look something like this (the folder names are hypothetical):


```bash
$ ls /data
12345.tmp4z83    # hypothetical folder name: still copying, note the "tmp" extension
```

Only when all of the dicom files are finished copying will the driving function be triggered:

```bash
$ ls /data
12345            # the same hypothetical folder, finished and ready for processing
```

A directory is considered "finished" and ready for processing when it does **not** have an extension that starts with "tmp". For more details about the watcher daemon, see [its docs](docs/watcher.md). While many examples are provided, for this application we use the celery task `import_dicomdir` in [main/tasks.py](sendit/apps/main/tasks.py) to read in a finished dicom directory from the directory being watched, and this uses the class `DicomCelery` in the [event_processors](sendit/apps/watcher/event_processors.py) file. Other examples are provided, in case you want to change or extend the watcher daemon. For complete details about the import of dicom files, see [docs/dicom_import.md](docs/dicom_import.md)
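The "finished" check itself is simple enough to sketch (illustrative, not the watcher's actual code):

```python
import os

def is_finished(folder):
    """True when the folder's extension does not start with "tmp"."""
    extension = os.path.splitext(folder)[1].lstrip(".")
    return not extension.startswith("tmp")
```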


### 2. Database Models
The Dockerized application constantly monitors the data folder, looking for subfolders that are not in the process of being populated. When a folder is found, the following happens (a model sketch follows the list):

- A new object in the database is created to represent the "Series"
- A new object in the database is created to represent the "Batch"
- Each "Image" is represented by an equivalent object
- Each "Image" is linked to its "Series", and if relevant, the "Series" is linked to a "Study."
- Each "Image" is linked to its "Batch"
- Currently, all uids for each must be unique.
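Under those assumptions, a minimal sketch of the two models might look like the following, using the status tuples shown earlier. Field names beyond those mentioned in this README are illustrative, and the JSONField import is an assumption:

```python
from django.db import models
from jsonfield import JSONField  # assumption: any JSONField implementation works

class Batch(models.Model):
    uid = models.CharField(max_length=250, unique=True)  # e.g., the folder name
    status = models.CharField(max_length=32, choices=BATCH_STATUS, default="NEW")
    has_error = models.BooleanField(default=False)
    errors = JSONField(default=list)  # list of errors shown to the user

class Image(models.Model):
    uid = models.CharField(max_length=250, unique=True)
    status = models.CharField(max_length=32, choices=IMAGE_STATUS, default="NEW")
    batch = models.ForeignKey(Batch, on_delete=models.CASCADE)  # Image links to its Batch
```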


Generally, the query of interest will retrieve a set of images with an associated accession number, and the input folder will be named by the accession number. Since there is variance in the data with regard to `AccessionNumber` and the different series identifiers, we give batches ids based on the folder name.
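In other words, the batch id comes from the folder itself; a one-line sketch (`batch_uid` is a hypothetical helper name):

```python
import os

def batch_uid(dicom_dir):
    """Use the watched folder's name (usually the accession number) as the batch id."""
    return os.path.basename(dicom_dir.rstrip("/"))
```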


# Questions

- When there is an error reading a dicom file (I am using "force", so this will read most dicom files, but will hit a KeyError for a non-dicom file), I am logging the error and moving on, without deleting the file. Is this the approach we want to take?
- I am taking the approach that, if a header field is considered PHI by HIPAA, it will be removed (meaning replaced with an empty string) in the header, unless it's the jittered timestamp or suid, AND the actual data will be sent as `custom_fields`. For all fields that aren't HIPAA, I am not adding them as `custom_fields`, and I am leaving them in the data. I think we can have them searchable via Google Datastore, but it would be redundant to have them in `DASHER` too. Do you agree?
- The `id_source` example (Stanford MRN) is "human friendly", whereas the `SOPInstanceUID` is not. For now, assuming that we are pulling from Stanford PACS, I have it set to use a default `Stanford MRN`. Is there a different field/strategy I should take?
- Right now I am skipping an image given that the Entity id (meaning the identifiers `id` --> `AccessionNumber`) OR the item's `id` --> `InstanceNumber` is missing. Is this the right approach to take? If not, should we default to something else?
- For each item in a request, there is an `id_source` and the example is `GE PACS`. However, it's not clear if this is something that should come from the dicom data (or set by default by us, since we are pulling all from the same PACS) or if it should be one of the following (in the dicom header). Right now I am using `SOPInstanceUID`, but that isn't super human friendly.
- For the fields that we don't need to remove from the dicom images (eg, about the image data) I think it wouldn't be useful to have as `custom_fields`, so I am suggesting (and currently implementing) that we don't send it to dasher with each item. We can send these fields to datastore to be easily searched, if that functionality is wanted.
- I originally had the PatientID being used as the identifiers main id, but I think this should actually be AccessionNumber (and the PatientID represented in the custom_fields, because we don't even always have a patient, but we will have an accession number!) Right now I am using accession number, and I can change this if necessary.
- Given no errors for a batch, we will be cleaning up the database and the media files, which means complete deletion. Is there any desire for a log to be maintained somewhere, and if so, where? Right now, the logs that we have are for the watcher, that logs the name of the folders and when they are complete. If we want more logging, for what actions, under what circumstances?