updating config, tasks, models, docs
vsoch committed Jun 10, 2017
1 parent 9f64402 commit 6ed01e9
Showing 5 changed files with 92 additions and 64 deletions.
7 changes: 5 additions & 2 deletions README.md
@@ -56,14 +56,17 @@ Once you have your `secrets.py`, it needs the following added:
- `DEBUG`: Make sure to set this to `False` for production.


For [config.py](sendit/settings/config.py) you should configure the following:
For [config.py](sendit/settings/config.py) you should first configure settings for the restful API:

```
# If True, we will have the images first go to a task to retrieve fields to deidentify
DEIDENTIFY_RESTFUL=True
# The default study to use
SOM_STUDY="test"
```

If this variable is False, we skip this task, and the batch is sent to the next task (or tasks) to send to different storage. If True, the batch is first put in the queue to be de-identified, and then upon receival of the identifiers, the batch is put into the same queues to be sent to storage. These functions can be modified to use different endpoints, or do different replacements in the data. For more details about the deidentify functions, see [docs/deidentify.md](docs/deidentify.md)
If `DEIDENTIFY_RESTFUL` is False, we skip this task, and the batch is sent to the next task (or tasks) to send to different storage. If True, the batch is first put in the queue to be de-identified, and then, upon receipt of the identifiers, the batch is put into the same queues to be sent to storage. The `SOM_STUDY` is part of the Stanford DASHER API to specify a study, and the default should be set before you start the application. If the study needs to vary between calls, please [post an issue](https://www.github.com/pydicom/sendit) and it can be added to be done at runtime. These functions can be modified to use different endpoints, or to do different replacements in the data. For more details about the deidentify functions, see [docs/deidentify.md](docs/deidentify.md)
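The routing just described can be sketched as follows (a minimal sketch with assumed task names, not the application's actual dispatch code):

```python
# Sketch (assumed names) of how DEIDENTIFY_RESTFUL routes a freshly
# imported batch: de-identify first, or go straight to storage.
def next_task(deidentify_restful):
    if deidentify_restful:
        # batch queued for de-identification; storage happens after
        # the identifiers come back from the API
        return "get_identifiers"
    return "upload_storage"

print(next_task(True))   # get_identifiers
```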

The next set of variables are specific to [storage](docs/storage.md), which is the final step in the pipeline.

53 changes: 23 additions & 30 deletions docs/deidentify.md
@@ -4,9 +4,18 @@ De-identification happens by way of a series of celery tasks defined in [main/ta
- The first task `get_identifiers` under [main/tasks.py](sendit/apps/main/tasks.py) takes in a batch ID, and uses that batch to look up images, and send a RESTful call to some API point to return fields to replace in the data. The JSON response should be saved to an `BatchIdentifiers` object along with a pointer to the `Batch`.
- The second task `replace_identifers` also under [main/tasks.py](sendit/apps/main/tasks.py) then loads this object, does whatever work is necessary for the data, and then puts the data in the queue for storage.

You (the implementer of this application) might want to tweak both of the above functions depending on your call endpoint, the response format (it should be json, as it goes into a jsonfield), and how the response is used to deidentify the data.
The entire process of starting with an image, generating a request with some specific set of variables and actions to take, and then using the response to deidentify the data, lives outside of this application in the [stanford open modules](https://github.com/vsoch/som/tree/master/som/api/identifiers/dicom) for python, via the identifiers api client. SOM is a set of open source python modules (yes, it is purposefully named so that som also implies "School of Medicine" `:)`) that serve the client, and plugins for working with different data types. If you are interested in how this process is done, we recommend reading the [README](https://github.com/vsoch/som/blob/master/som/api/identifiers/dicom/README.md).
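The round-trip those modules implement can be sketched with a stand-in client. The call signatures below mirror those used in `sendit/apps/main/tasks.py` in this commit, but `StubClient` and its return values are illustrative only, not the real DASHER client:

```python
# Stand-in sketch of the round-trip performed by the two tasks. The real
# client comes from som.api.identifiers; only the call shapes are mirrored.
class StubClient:
    def deidentify(self, ids, study):
        # the real response carries fields like suid and jittered_timestamp
        return {"id": ids["id"], "study": study, "suid": "103e"}

def deidentify_batch(ids_by_entity, study="test"):
    """One deidentify call per Entity, keyed by the entity uid."""
    cli = StubClient()
    return {uid: cli.deidentify(ids=identifiers, study=study)
            for uid, identifiers in ids_by_entity.items()}

response = deidentify_batch({"14953772": {"id": "14953772"}})
print(response["14953772"]["suid"])  # 103e
```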


## Customizing De-identification
If you have a different use case, you have several options for customizing this step.

1. You can specify a different `config.json` to the get_identifiers function, in the case that you want a different set of rules applied to your de-identification.
2. You can implement a new module (for example, for a different data type) by submitting a PR to the identifiers repository.
3. If you don't use DASHER, or do something entirely different, you have complete control to not use these som-provided functions at all, in which case you will want to tweak the functions in [tasks.py](../sendit/apps/main/tasks.py).

For the purposes of documentation, we will review what the de-identification provided here looks like:

## 1. Datastructure Generated
The post to DASHER will look something like the following:

@@ -15,7 +24,7 @@ The post to DASHER will look something like the following:
"identifiers":[
{
"id":"14953772",
"id_source":"Stanford MRN",
"id_source":"PatientID",
"id_timestamp":"1961-07-27T00:00:00Z",
"custom_fields":[
{
@@ -45,52 +54,38 @@ The post to DASHER will look something like the following:
}
```

A list of identifiers is given, and we can think of each thing in the list being an Entity, or corresponding to one Patient/Session. Each in this list has a set of id_* fields, and a list of items associated. This matches to our dicom import model, as the identifiers will be associated with one Accession Number, and the items the corresponding dicom files for the series.
A list of identifiers is given, and we can think of each entry in the list as an Entity, corresponding to one Patient/Session. Each entry has a set of `id_*` fields and an associated list of items. This matches our dicom import model, as the identifiers will be associated with one PatientID (and likely one AccessionNumber), and the items with the corresponding dicom files for the series.

**Important** A dicom file that doesn't have an Entity (`AccessionNumber`) OR `InstanceNumber` Item id will be skipped, as these fields are required.
**Important** A dicom file that doesn't have an Entity (`PatientID`) OR `SOPInstanceUID` (Item id) will be skipped, as these fields are required.

While it is assumed that one folder of files corresponds to one accession number, the headers may present different information (e.g., different series/study), so we will post a call to the API for each separate Entity represented in the dataset.
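That per-Entity grouping can be sketched as follows (a hypothetical helper, with plain objects standing in for parsed headers; in practice the attributes would come from reading each dicom file):

```python
from collections import defaultdict
from types import SimpleNamespace

def group_by_entity(parsed):
    """Group (path, header) pairs by PatientID: one API call per Entity."""
    entities = defaultdict(list)
    for path, header in parsed:
        entities[header.PatientID].append(path)
    return dict(entities)

headers = [
    ("a.dcm", SimpleNamespace(PatientID="12345678")),
    ("b.dcm", SimpleNamespace(PatientID="12345678")),
    ("c.dcm", SimpleNamespace(PatientID="87654321")),
]
print(group_by_entity(headers))
# {'12345678': ['a.dcm', 'b.dcm'], '87654321': ['c.dcm']}
```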


### Identifiers
We can only get so much information about an individual from a dicom image, so most of these will be default, or empty. `id`: will correspond to the `PatientID`. The `id_source`, since it is not provided in the data, will always (for now) default to `Stanford MRN`. The `id_timestamp` will be blank, because it's not clear to me how we could derive when the id was generated. Fields that are specific to the patient will be put into `custom_fields` for the patient, so it might look something like the following:
If you look at the [fields parsed](https://github.com/vsoch/som/blob/master/som/api/identifiers/dicom/config.json) in the config, or, even more horrifying, the [several thousand](https://gist.github.com/vsoch/77211a068f45f7255b0d97cf005db572) active DICOM header fields, rest assured that most images will not have most of these. You will notice that, for our default, `id` corresponds to the `PatientID`, and the `id_source` is therefore `PatientID`. The `id_timestamp`, since we are talking about a person, corresponds to the individual's birth date, and we use the date to generate a timestamp ([example here](https://gist.github.com/vsoch/23d6b313bd231cad855877dc544c98ed)). We mostly care about the fields that need to be saved (`custom_fields`) and then blanked or coded in the data that gets sent to storage.

```
"id": 12345678,
"id_source": "Stanford MRN",
"id_source": "PatientID",
"id_timestamp": {},
"custom_fields": [
{
"key": "OtherPatientIDs","value": "value"
},
{
"key": "OtherPatientNames","value": "value"
},
{
"key": "OtherPatientIDsSequence","value": "value"
"key": "OtherPatientIDs","value": "FIRST^LAST"
},
{
"key": "PatientAddress", "value": "value"
"key": "PatientAddress", "value": "222 MICHEY LANE"
},
{
"key": "PatientBirthDate","value": "value"
"key": "PatientName","value": "Mickey^Mouse"
},
{
"key": "PatientBirthName","value": "value"
},
{
"key": "PatientMotherBirthName","value": "value"
},
{
"key": "PatientName","value": "value"
},
{
"key": "PatientTelephoneNumbers","value": "value"
"key": "PatientTelephoneNumbers","value": "111-111-1111"
}
```
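The timestamp generation mentioned above is straightforward, since DICOM stores dates as `YYYYMMDD` and times as `HHMMSS`. A minimal sketch, assuming well-formed values (the module's actual function is in the gist linked above):

```python
from datetime import datetime

def dicom_timestamp(date_str, time_str="000000"):
    """Combine a DICOM date (YYYYMMDD) and time (HHMMSS) into ISO 8601."""
    dt = datetime.strptime(date_str + time_str[:6], "%Y%m%d%H%M%S")
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")

print(dicom_timestamp("19610727"))
# 1961-07-27T00:00:00Z
```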

### Items
A list of items is associated with each Entity (the example above). The id for the item will correspond to the InstanceNumber, and the `id_source` will correspond to the `InstanceCreatorUID`. The timestamp must be derived from `InstanceCreationDate` and `InstanceCreationTime`.
A list of items is associated with each Entity (the example above). The id for the item will correspond to the `SOPInstanceUID`, and thus the `id_source` is `SOPInstanceUID`. The timestamp must be derived from `InstanceCreationDate` and `InstanceCreationTime` using the same function linked above.

```
"items": [
@@ -130,7 +125,7 @@ We will be removing all PHI from the datasets before moving into the cloud, as s
- Any other unique identifying number, characteristic, or code


To be explicitly clear, here are a set of tables to describe **1** the dicom identifier, **2** if relevent, how it is mapped to a field for the DASHER API, **3**, if the data is removed (meaning left as an empty string) before going into the cloud, meaning that it is considered in the HIPAA list above. Not all dicoms have all of these fields, and if the field is not found, no action is taken.
To be explicitly clear, here is a set of tables to describe **1** the dicom identifier, **2** if relevant, how it is mapped to a field for the DASHER API, and **3** whether the data is removed (meaning left as an empty string) before going into the cloud, meaning that it falls under the HIPAA list above. Not all dicoms have all of these fields, and if a field is not found, no action is taken. This is a broad overview - to get exact actions you should look at the [config.json](https://github.com/vsoch/som/blob/master/som/api/identifiers/dicom/config.json).

### PHI Identifiers
For each of the below, a field under `DASHER` is assumed to be given with an Entity, a list of which makes up the identifiers for a `POST`. Removed does not mean that the field is deleted, but that it is made empty. If a replacement is defined, the field from the `DASHER` response is substituted in place of an empty string. For most of the below, we give the PHI data as a `custom_field` (to be stored with `DASHER`) and put an empty string in its spot for the data uploaded to Storage.
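That save-then-blank pattern can be sketched as follows (a hypothetical helper operating on a plain dict of header values, not the module's actual code):

```python
def extract_and_blank(header, phi_fields):
    """Collect PHI into custom_fields and blank it in a copy of the header.

    Removed means emptied, not deleted; fields absent from the header are
    left alone, matching the behavior described above.
    """
    custom_fields = []
    cleaned = dict(header)
    for field in phi_fields:
        if field in cleaned:
            custom_fields.append({"key": field, "value": cleaned[field]})
            cleaned[field] = ""
    return custom_fields, cleaned

fields, cleaned = extract_and_blank(
    {"PatientName": "Mickey^Mouse", "Modality": "CT"},
    ["PatientName", "PatientAddress"],
)
print(fields)   # [{'key': 'PatientName', 'value': 'Mickey^Mouse'}]
print(cleaned)  # {'PatientName': '', 'Modality': 'CT'}
```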
@@ -206,7 +201,7 @@ The response might look like the following:
[
{
"id": 12345678,
"id_source": "Stanford MRN",
"id_source": "PatientID",
"suid": "103e",
"jittered_timestamp": {},
"custom_fields": [
Expand Down Expand Up @@ -234,5 +229,3 @@ The response might look like the following:
]
}
```

**MORE TO COME** not done yet :)
15 changes: 15 additions & 0 deletions sendit/apps/main/models.py
@@ -102,7 +102,22 @@ class Batch(models.Model):
errors = JSONField(default=dict())
modify_date = models.DateTimeField('date modified', auto_now=True)
tags = TaggableManager()

def change_images_status(self,status):
'''change all images to have the same status'''
for dcm in self.image_set.all():
dcm.status = status
dcm.save()


def get_image_paths(self):
'''return file paths for all images associated
with a batch'''
image_files = []
for dcm in self.image_set.all():
image_files.append(dcm.image.path)
return image_files

def get_absolute_url(self):
return reverse('batch_details', args=[str(self.id)])

73 changes: 41 additions & 32 deletions sendit/apps/main/tasks.py
@@ -43,12 +43,16 @@
)

from som.api.identifiers.dicom import (
get_identifiers as get_ids
get_identifiers as get_ids,
replace_identifiers as replace_ids
)

from som.api.identifiers import Client

from sendit.settings import (
DEIDENTIFY_RESTFUL,
SEND_TO_ORTHANC,
SOM_STUDY,
ORTHANC_IPADDRESS,
ORTHANC_PORT,
SEND_TO_GOOGLE,
@@ -140,7 +144,7 @@ def import_dicomdir(dicom_dir):


@shared_task
def get_identifiers(bid):
def get_identifiers(bid,study=None):
'''get identifiers is the celery task to get identifiers for
all images in a batch. A batch is a set of dicom files that may include
more than one series/study. This is done by way of sending one restful call
@@ -149,30 +153,36 @@ def get_identifiers(bid):
'''
batch = Batch.objects.get(id=bid)

if study is None:
study = SOM_STUDY

if DEIDENTIFY_RESTFUL is True:

identifiers = dict()
images = batch.image_set.all()
for dcm in images:

# Returns dictionary with {"id": {"identifiers"...}}
dcm = change_status(dcm,"PROCESSING")
ids = get_ids(dicom_file=dcm.image.path)

for uid,identifiers in ids.items():
# Create an som client
cli = Client()

# STOPPED HERE - I'm not sure why we need to keep
# study given that we represent things as batches of dicom
# It might be more suitable to model as a Batch,
# where a batch is a grouping of dicoms (that might actually
# be more than one series. Then we would store as Batch,
# and use the batch ID to pass around and get the images.
# Stopping here for tonight.
# Will need to test this out:
replacements = BatchIdentifiers.objects.create(series=)
# Process all dicoms at once, one call to the API
dicom_files = batch.get_image_paths()
batch.change_images_status('PROCESSING')

# Returns dictionary with {"id": {"identifiers"...}}
ids = get_ids(dicom_files=dicom_files)

# This should only be for one loop, given a folder with one patient
deids = dict()
for uid,identifiers in ids.items():
bot.debug("som.client making request to deidentify %s" %(uid))
deids[uid] = cli.deidentify(ids=identifiers,
study=study)

batch_ids = BatchIdentifiers.objects.create(batch=batch,
response=deids)
batch_ids.save()

replace_identifiers.apply_async(kwargs={"bid":bid})


replace_identifiers.apply_async(kwargs={"bid":bid})

else:
bot.debug("Restful de-identification skipped [DEIDENTIFY_RESTFUL is False]")
Expand All @@ -189,21 +199,20 @@ def replace_identifiers(bid):
'''
try:
batch = Batch.objects.get(id=bid)
batch_identifiers = BatchIdentifiers.get(batch=batch)
batch_ids = BatchIdentifiers.get(batch=batch)

# replace ids to update the dicom_files (same paths)
dicom_files = batch.get_image_paths()
updated_files = replace_ids(dicom_files=dicom_files,
response=batch_ids.response)
change_status(batch,"DONEPROCESSING")
batch.change_images_status('DONEPROCESSING')

except:
bot.error("In replace_identifiers: Batch %s or identifiers does not exist." %(bid))
return None


for image in batch.image_set.all():

# Do deidentify replacement here
change_status(image,"DONEPROCESSING")


# trigger storage function
change_status(batch,"DONEPROCESSING")
change_status(batch.image_set.all(),"DONEPROCESSING")
# We don't get here if the call above failed
upload_storage.apply_async(kwargs={"bid":bid})


Expand All @@ -226,7 +235,7 @@ def upload_storage(bid):
bot.log("Uploading to Google Storage %s" %(GOOGLE_CLOUD_STORAGE))
# GOOGLE_CLOUD_STORAGE

change_status(dcm,"SENT")
batch.change_images_status('SENT')


change_status(batch,"DONE")
8 changes: 8 additions & 0 deletions sendit/settings/config.py
@@ -1,7 +1,15 @@

#####################################################
# RESTFUL API
#####################################################

# De-identify
# If True, we will have the images first go to a task to retrieve fields to deidentify
DEIDENTIFY_RESTFUL=True

# The default study to use
SOM_STUDY="test"

#####################################################
# STORAGE
#####################################################
