diff --git a/README.md b/README.md index 9f11429..c79574a 100644 --- a/README.md +++ b/README.md @@ -56,14 +56,17 @@ Once you have your `secrets.py`, it needs the following added: - `DEBUG`: Make sure to set this to `False` for production. -For [config.py](sendit/settings/config.py) you should configure the following: +For [config.py](sendit/settings/config.py) you should first configure settings for the RESTful API: ``` # If True, we will have the images first go to a task to retrieve fields to deidentify DEIDENTIFY_RESTFUL=True + +# The default study to use +SOM_STUDY="test" ``` -If this variable is False, we skip this task, and the batch is sent to the next task (or tasks) to send to different storage. If True, the batch is first put in the queue to be de-identified, and then upon receival of the identifiers, the batch is put into the same queues to be sent to storage. These functions can be modified to use different endpoints, or do different replacements in the data. For more details about the deidentify functions, see [docs/deidentify.md](docs/deidentify.md) +If `DEIDENTIFY_RESTFUL` is False, we skip this task, and the batch is sent to the next task (or tasks) to send to different storage. If True, the batch is first put in the queue to be de-identified, and then upon receipt of the identifiers, the batch is put into the same queues to be sent to storage. `SOM_STUDY` specifies the study for the Stanford DASHER API, and the default should be set before you start the application. If the study needs to vary between calls, please [post an issue](https://www.github.com/pydicom/sendit) and support for setting it at runtime can be added. These functions can be modified to use different endpoints, or do different replacements in the data. For more details about the deidentify functions, see [docs/deidentify.md](docs/deidentify.md) The next set of variables are specific to [storage](docs/storage.md), which is the final step in the pipeline. 
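As a rough illustration of how the `DEIDENTIFY_RESTFUL` and `SOM_STUDY` settings described above interact, here is a minimal sketch. The helper names `resolve_study` and `next_step`, and the stage names they return, are hypothetical and not part of sendit; only the two setting names and their default values come from the configuration shown above.

```python
# Hypothetical stand-ins for the values in sendit/settings/config.py
DEIDENTIFY_RESTFUL = True
SOM_STUDY = "test"


def resolve_study(study=None, default=SOM_STUDY):
    """Return the study for a call, falling back to the configured default."""
    return default if study is None else study


def next_step(deidentify=DEIDENTIFY_RESTFUL):
    """Name the stage a new batch enters first (stage names are illustrative)."""
    return "deidentify" if deidentify else "storage"
```

With the defaults above, a batch would first go to de-identification and use the `"test"` study unless a caller passes a different one.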
diff --git a/docs/deidentify.md b/docs/deidentify.md index 82541f6..00621bf 100644 --- a/docs/deidentify.md +++ b/docs/deidentify.md @@ -4,9 +4,18 @@ De-identification happens by way of a series of celery tasks defined in [main/ta - The first task `get_identifiers` under [main/tasks.py](sendit/apps/main/tasks.py) takes in a batch ID, and uses that batch to look up images, and send a RESTful call to some API point to return fields to replace in the data. The JSON response should be saved to an `BatchIdentifiers` object along with a pointer to the `Batch`. - The second task `replace_identifers` also under [main/tasks.py](sendit/apps/main/tasks.py) then loads this object, does whatever work is necessary for the data, and then puts the data in the queue for storage. -You (the implementer of this application) might want to tweak both of the above functions depending on your call endpoint, the response format (should be json as it goes into a jsonfield), and then how it is used to deidentify the data. +The entire process of starting with an image, generating a request with a specific set of variables and actions to take, and then using the response to de-identify the data lives outside of this application, in the [stanford open modules](https://github.com/vsoch/som/tree/master/som/api/identifiers/dicom) (SOM) for python, which provide the identifiers API client. SOM is a set of open source python modules (the name is purposefully chosen so that it also implies "School of Medicine") that provide the client, along with plugins for working with different data types. If you are interested in how this process is done, we recommend reading the [README](https://github.com/vsoch/som/blob/master/som/api/identifiers/dicom/README.md). +## Customizing De-identification +If you have a different use case, you have several options for customizing this step. + +1. 
You can specify a different `config.json` to the `get_identifiers` function if you want a different set of rules applied to your de-identification. +2. You can implement a new module (for example, for a different data type) by submitting a PR to the identifiers repository. +3. If you don't use DASHER, or do something entirely different, you are free to not use these SOM-provided functions at all, in which case you will want to tweak the functions in [tasks.py](../sendit/apps/main/tasks.py). + +For the purposes of documentation, we will review what the de-identification provided here looks like: + ## 1. Datastructure Generated The post to DASHER will like something like the following: @@ -15,7 +24,7 @@ The post to DASHER will like something like the following: "identifiers":[ { "id":"14953772", - "id_source":"Stanford MRN", + "id_source":"PatientID", "id_timestamp":"1961-07-27T00:00:00Z", "custom_fields":[ { @@ -45,52 +54,38 @@ The post to DASHER will like something like the following: } ``` -A list of identifiers is given, and we can think of each thing in the list being an Entity, or corresponding to one Patient/Session. Each in this list has a set of id_* fields, and a list of items associated. This matches to our dicom import model, as the identifiers will be associated with one Accession Number, and the items the corresponding dicom files for the series. +A list of identifiers is given, and we can think of each entry in the list as an Entity, corresponding to one Patient/Session. Each entry has a set of id_* fields and a list of associated items. This matches our dicom import model, as the identifiers will be associated with one PatientID (and likely one AccessionNumber), and the items are the corresponding dicom files for the series. -**Important** A dicom file that doesn't have an Entity (`AccessionNumber`) OR `InstanceNumber` Item id will be skipped, as these fields are required. 
+**Important** A dicom file that doesn't have an Entity (`PatientID`) OR `SOPInstanceUID` (Item id) will be skipped, as these fields are required. While it is assumed that one folder of files corresponds to one accession number, the headers may present different information (e.g., a different series/study), so we will post a call to the API for each separate Entity represented in the dataset. + ### Identifiers -We can only get so much information about an individual from a dicom image, so most of these will be default, or empty. `id`: will correspond to the `PatientID`. The `id_source`, since it is not provided in the data, will always (for now) default to `Stanford MRN`. The `id_timestamp` will be blank, because it's not clear to me how we could derive when the id was generated. Fields that are specific to the patient will be put into `custom_fields` for the patient, so it might look something like the following: +If you look at the [fields parsed](https://github.com/vsoch/som/blob/master/som/api/identifiers/dicom/config.json) in the config, or, even more horrifying, at the [several thousand](https://gist.github.com/vsoch/77211a068f45f7255b0d97cf005db572) active DICOM header fields, rest assured that most images will not have most of these. You will notice that for our default, `id` will correspond to the `PatientID`. The `id_source`, then, is `PatientID`. The `id_timestamp`, since we are talking about a person, corresponds to the individual's birth date, and we use the date to generate a timestamp ([example here](https://gist.github.com/vsoch/23d6b313bd231cad855877dc544c98ed)). We mostly care about the fields that need to be saved (`custom_fields`) but then blanked or coded in the data that gets sent to storage. 
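The date-to-timestamp step mentioned above can be sketched as follows. This is a minimal illustration, not the som implementation or the linked gist: the function name `dicom_date_to_timestamp` is hypothetical, and the sketch assumes the date arrives in the DICOM `YYYYMMDD` form, that an optional time arrives as `HHMMSS` (possibly truncated or with fractional seconds), and that the result should be labeled as UTC.

```python
from datetime import datetime


def dicom_date_to_timestamp(date, time="000000"):
    """Convert a DICOM date (YYYYMMDD) and optional time into a timestamp string.

    Assumption: the value is treated as UTC, matching the trailing "Z"
    seen in the example request.
    """
    # Drop fractional seconds and pad truncated times such as "12" or "1215"
    time = time.split(".")[0].ljust(6, "0")
    parsed = datetime.strptime(date + time, "%Y%m%d%H%M%S")
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")
```

For a birth date only the date is passed, while for an item both `InstanceCreationDate` and `InstanceCreationTime` could be passed to the same function.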
``` "id": 12345678, - "id_source": "Stanford MRN", + "id_source": "PatientID", "id_timestamp": {}, "custom_fields": [ { - "key": "OtherPatientIDs","value": "value" - }, - { - "key": "OtherPatientNames","value": "value" - }, - { - "key": "OtherPatientIDsSequence","value": "value" + "key": "OtherPatientIDs","value": "FIRST^LAST" }, { - "key": "PatientAddress", "value": "value" + "key": "PatientAddress", "value": "222 MICHEY LANE" }, { - "key": "PatientBirthDate","value": "value" + "key": "PatientName","value": "Mickey^Mouse" }, { - "key": "PatientBirthName","value": "value" - }, - { - "key": "PatientMotherBirthName","value": "value" - }, - { - "key": "PatientName","value": "value" - }, - { - "key": "PatientTelephoneNumbers","value": "value" + "key": "PatientTelephoneNumbers","value": "111-111-1111" } ``` ## Items -A list of items is associated with each Entity (the example above). The id for the item will correspond to the InstanceNumber, and the `id_source` will correspond to the `InstanceCreatorUID`. The timestamp must be derived from `InstanceCreationDate` and `InstanceCreationTime`. +A list of items is associated with each Entity (the example above). The id for the item will correspond to the `SOPInstanceUID`, and thus the `id_source` is `SOPInstanceUID`. The timestamp must be derived from `InstanceCreationDate` and `InstanceCreationTime` using the same function linked above. ``` "items": [ @@ -130,7 +125,7 @@ We will be removing all PHI from the datasets before moving into the cloud, as s - Any other unique identifying number, characteristic, or code -To be explicitly clear, here are a set of tables to describe **1** the dicom identifier, **2** if relevent, how it is mapped to a field for the DASHER API, **3**, if the data is removed (meaning left as an empty string) before going into the cloud, meaning that it is considered in the HIPAA list above. Not all dicoms have all of these fields, and if the field is not found, no action is taken. 
+To be explicitly clear, here are a set of tables to describe **1** the dicom identifier, **2** if relevant, how it is mapped to a field for the DASHER API, and **3** whether the data is removed (meaning left as an empty string) before going into the cloud, meaning that it is considered in the HIPAA list above. Not all dicoms have all of these fields, and if the field is not found, no action is taken. This is a broad overview - to get exact actions you should look at the [config.json](https://github.com/vsoch/som/blob/master/som/api/identifiers/dicom/config.json). ### PHI Identifiers For each of the below, a field under `DASHER` is assumed to be given with an Entity, one of which makes up a list of identifiers, for a `POST`. Removed does not mean that the field is deleted, but that it is made empty. If replacement is defined, the field from the `DASHER` response is subbed instead of a ''. For most of the below, we give the PHI data as a `custom_field` (to be stored with `DASHER`) and put an empty string in its spot for the data uploaded to Storage. 
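The difference between a removed (emptied) field and a replaced field can be illustrated with a short sketch. This is not the som implementation: the function `apply_actions`, the action names `BLANK` and `REPLACE`, and the shapes of the inputs are assumptions made for illustration. The exact actions live in the `config.json` referenced above.

```python
def apply_actions(header, actions, response):
    """Return a copy of header with each listed field emptied or replaced.

    header:   dict of dicom field name -> value
    actions:  dict of field name -> "BLANK" or "REPLACE" (hypothetical names)
    response: dict of field name -> replacement value from the API
    """
    cleaned = dict(header)
    for field, action in actions.items():
        if field not in cleaned:
            continue  # field not found in this dicom: no action is taken
        if action == "REPLACE" and field in response:
            cleaned[field] = response[field]
        else:
            # Removed means emptied, not deleted; this sketch also empties
            # a REPLACE field when the response has no value for it
            cleaned[field] = ""
    return cleaned
```

Working on a copy keeps the original values available, which matters here because the PHI is still needed to build the `custom_fields` sent with the request.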
@@ -206,7 +201,7 @@ The response might look like the following: [ { "id": 12345678, - "id_source": "Stanford MRN", + "id_source": "PatientID", "suid": "103e", "jittered_timestamp": {}, "custom_fields": [ @@ -234,5 +229,3 @@ The response might look like the following: ] } ``` - -**MORE TO COME** not done yet :) diff --git a/sendit/apps/main/models.py b/sendit/apps/main/models.py index bf59acd..6d08e09 100644 --- a/sendit/apps/main/models.py +++ b/sendit/apps/main/models.py @@ -102,7 +102,22 @@ class Batch(models.Model): errors = JSONField(default=dict()) modify_date = models.DateTimeField('date modified', auto_now=True) tags = TaggableManager() + + def change_images_status(self,status): + '''change all images to have the same status''' + for dcm in self.image_set.all(): + dcm.status = status + dcm.save() + + def get_image_paths(self): + '''return file paths for all images associated + with a batch''' + image_files = [] + for dcm in self.image_set.all(): + image_files.append(dcm.image.path) + return image_files + def get_absolute_url(self): return reverse('batch_details', args=[str(self.id)]) diff --git a/sendit/apps/main/tasks.py b/sendit/apps/main/tasks.py index 572a82f..c7ef6ff 100644 --- a/sendit/apps/main/tasks.py +++ b/sendit/apps/main/tasks.py @@ -43,12 +43,16 @@ ) from som.api.identifiers.dicom import ( - get_identifiers as get_ids + get_identifiers as get_ids, + replace_identifiers as replace_ids ) +from som.api.identifiers import Client + from sendit.settings import ( DEIDENTIFY_RESTFUL, SEND_TO_ORTHANC, + SOM_STUDY, ORTHANC_IPADDRESS, ORTHANC_PORT, SEND_TO_GOOGLE, @@ -140,7 +144,7 @@ def import_dicomdir(dicom_dir): @shared_task -def get_identifiers(bid): +def get_identifiers(bid,study=None): '''get identifiers is the celery task to get identifiers for all images in a batch. A batch is a set of dicom files that may include more than one series/study. 
This is done by way of sending one restful call @@ -149,30 +153,36 @@ def get_identifiers(bid): ''' batch = Batch.objects.get(id=bid) + if study is None: + study = SOM_STUDY + if DEIDENTIFY_RESTFUL is True: - identifiers = dict() images = batch.image_set.all() - for dcm in images: - - # Returns dictionary with {"id": {"identifiers"...}} - dcm = change_status(dcm,"PROCESSING") - ids = get_ids(dicom_file=dcm.image.path) - for uid,identifiers in ids.items(): + # Create a som client + cli = Client() - # STOPPED HERE - I'm not sure why we need to keep - # study given that we represent things as batches of dicom - # It might be more suitable to model as a Batch, - # where a batch is a grouping of dicoms (that might actually - # be more than one series. Then we would store as Batch, - # and use the batch ID to pass around and get the images. - # Stopping here for tonight. - # Will need to test this out: - replacements = BatchIdentifiers.objects.create(series=) + # Process all dicoms at once, one call to the API + dicom_files = batch.get_image_paths() + batch.change_images_status('PROCESSING') + + # Returns dictionary with {"id": {"identifiers"...}} + ids = get_ids(dicom_files=dicom_files) + + # This should only be for one loop, given a folder with one patient + deids = dict() + for uid,identifiers in ids.items(): + bot.debug("som.client making request to deidentify %s" %(uid)) + deids[uid] = cli.deidentify(ids=identifiers, + study=study) + + batch_ids = BatchIdentifiers.objects.create(batch=batch, + response=deids) + batch_ids.save() + + replace_identifiers.apply_async(kwargs={"bid":bid}) - - replace_identifiers.apply_async(kwargs={"bid":bid}) else: bot.debug("Restful de-identification skipped [DEIDENTIFY_RESTFUL is False]") @@ -189,21 +199,20 @@ def replace_identifiers(bid): ''' try: batch = Batch.objects.get(id=bid) - batch_identifiers = BatchIdentifiers.get(batch=batch) + batch_ids = BatchIdentifiers.objects.get(batch=batch) + + # replace ids to update the dicom_files (same 
paths) + dicom_files = batch.get_image_paths() + updated_files = replace_ids(dicom_files=dicom_files, + response=batch_ids.response) + change_status(batch,"DONEPROCESSING") + batch.change_images_status('DONEPROCESSING') + except: bot.error("In replace_identifiers: Batch %s or identifiers does not exist." %(bid)) return None - - for image in batch.image_set.all(): - - # Do deidentify replcement here - change_status(image,"DONEPROCESSING") - - - # trigger storage function - change_status(batch,"DONEPROCESSING") - change_status(batch.image_set.all(),"DONEPROCESSING") + # We don't get here if the call above failed upload_storage.apply_async(kwargs={"bid":bid}) @@ -226,7 +235,7 @@ def upload_storage(bid): bot.log("Uploading to Google Storage %s" %(GOOGLE_CLOUD_STORAGE)) # GOOGLE_CLOUD_STORAGE - change_status(dcm,"SENT") + batch.change_images_status('SENT') change_status(batch,"DONE") diff --git a/sendit/settings/config.py b/sendit/settings/config.py index 3619a86..00f6bbd 100644 --- a/sendit/settings/config.py +++ b/sendit/settings/config.py @@ -1,7 +1,15 @@ + +##################################################### +# RESTFUL API +##################################################### + # De-identify # If True, we will have the images first go to a task to retrieve fields to deidentify DEIDENTIFY_RESTFUL=True +# The default study to use +SOM_STUDY="test" + ##################################################### # STORAGE #####################################################
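The per-entity request loop in the `get_identifiers` task of `sendit/apps/main/tasks.py` can be exercised without Celery, Django, or DASHER by swapping in a stub. The sketch below is only an illustration under stated assumptions: `StubClient` and `deidentify_entities` are hypothetical names, the stub's response shape is invented, and only the `deidentify(ids=..., study=...)` call and the fallback to a default study mirror the diff above.

```python
class StubClient:
    """Hypothetical stand-in for the som Client; records calls instead of POSTing."""

    def __init__(self):
        self.calls = []

    def deidentify(self, ids, study):
        self.calls.append((study, ids["id"]))
        # Invented response shape, for illustration only
        return {"id": ids["id"], "suid": "stub-" + str(ids["id"])}


def deidentify_entities(ids, client, study=None, default_study="test"):
    """Make one request per entity and collect the responses keyed by entity id."""
    if study is None:
        study = default_study  # plays the role of SOM_STUDY
    return {uid: client.deidentify(ids=identifiers, study=study)
            for uid, identifiers in ids.items()}
```

Keeping the loop in a plain function like this would make it possible to test the study fallback and the one-call-per-entity behavior separately from the task that stores the result in `BatchIdentifiers`.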