
The material in the weekly-meetings-yasmin.md is derived from "Whitaker Lab Project Management" by Dr. Kirstie Whitaker and the Whitaker Lab team, used under CC BY 4.0. weekly-meetings-yasmin.md is licensed under CC BY 4.0 by [Yasmin Sarkhosh].

Yasmin's Weekly Meeting Notes



Date: 06 feb 2024

Who helped you this week?

N/A

What did you achieve?

What did you struggle with?

N/A

What would you like to work on next week?

  • Finalizing research questions
  • Finalizing the methodology for the project
  • Experiments - looking into MICCAI 2023
    • Experiment 1 - on webpage
      • ‘command F’ - ‘cancer’
      • randomly selecting papers

Where do you need help from Veronika?

  • Finalizing research questions
  • Finalizing the methodology for the project!
  • Setting research boundaries (finalizing a scope for the project)

Any other topics

N/A

What are the agreements after this meeting? (to fill in after the meeting)

Looking into MICCAI 2023 ONLY

Write instructions - Experiment 1

  • How did I find the papers?
    • ‘command F’ - ‘cancer’
  • Annotation scheme
    • Categories: why these?
    • How do I examine/annotate papers?
  • Find number of all papers
    • Define the number of rows from the total number of papers
    • Add all papers
    • Filter by ‘cancer’
      • What organ is in focus?
      • What image type?
    • Find the dataset(s)
      • Number of dataset(s)?
      • What demographic information is reported?
      • What other subgroups are mentioned in the paper?
  • Think of an 'analysis code' for my method that makes it applicable for future analysis
  • Remember: insert all experiments into ‘weekly meetings’

Date: 13 feb 2024

Who helped you this week?

N/A

What did you achieve?

Extracting all papers from the MICCAI 2023 webpage into an Excel sheet

  • Experiment 2 - Excel Power Query

    • save page as text file in Arc browser
    • open the file with Excel
      • data - get data - from text file - clean and remove HTML tags
      • result: a whole list of papers in Excel
      • 730 papers in total extracted from the MICCAI 2023 webpage
      • 23 papers with 'cancer' in their titles
      • in progress: check for 'cancer' in the content of the papers
  • Manually found 10 volumes in PDF format with all 730 papers

  • Experiment 3 - search for cancer-related papers (not just by title)

    • Downloaded 10 PDF volumes containing all papers from this url: https://link.springer.com/book/10.1007/978-3-031-43999-5
    • opened the PDFs and searched for 'cancer' by command + F
      • papers with 'cancer' in their title were marked with orange
      • papers with 'cancer' beyond their title were marked with purple
    • No. of papers with 'cancer' in their title: 23
    • No. of papers with 'cancer' beyond their title (in the content): in progress (17 papers)
    • No. of papers in total: 730
  • Annotation scheme

    • Planning how to analyse each paper for demographics
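
The title/content counts in Experiment 2 and 3 could also be produced by a small script instead of command + F. A minimal sketch, assuming each paper is already available as a (title, body text) pair; the `papers` list and the function name `classify_by_keyword` are hypothetical and only serve as illustration:

```python
from collections import Counter


def classify_by_keyword(title: str, body: str, keyword: str = "cancer") -> str:
    """Return where the keyword occurs: 'title', 'content_only' or 'none'."""
    kw = keyword.lower()
    if kw in title.lower():
        return "title"
    if kw in body.lower():
        return "content_only"
    return "none"


# Hypothetical toy data standing in for the extracted papers.
papers = [
    ("Breast Cancer Segmentation in MRI", "We segment tumours ..."),
    ("Robust Registration of CT Volumes", "Evaluated on a lung cancer cohort."),
    ("Self-supervised Retinal Imaging", "No oncology data is used."),
]

# Count how many papers fall into each group (orange / purple / unmarked).
counts = Counter(classify_by_keyword(title, body) for title, body in papers)
print(counts)
```

Note that a plain substring search also matches words such as 'cancerous', which mirrors how command + F behaves.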

What did you struggle with?

Writing code that will:

  • Extract ALL 730 PDFs from the MICCAI 2023 webpage by web scraping
    • Currently managed to extract all HTML from the page, however still struggling with downloading each PDF from the HTML into a folder
      • Managed to extract all direct PDF urls into a list, however still not able to extract text from/download each PDF
  • Extract PDFs with 'cancer' in their title from the webpage
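
One possible way to get from the saved HTML to a folder of PDFs is sketched below with the Python standard library only. It is an illustration, not the code used in the project: the class `PdfLinkParser`, the function `download_pdfs`, the HTML snippet and the `example.org` base URL are all assumptions, and the real page may link to papers in a different way.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin
from urllib.request import urlretrieve


class PdfLinkParser(HTMLParser):
    """Collect the href of every <a> tag that points to a .pdf file."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.pdf_urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".pdf"):
            # Resolve relative links against the page URL.
            self.pdf_urls.append(urljoin(self.base_url, href))


def download_pdfs(urls: list[str], out_dir: str) -> None:
    """Save each URL into out_dir, named after the last path segment."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    for url in urls:
        urlretrieve(url, target / url.rsplit("/", 1)[-1])


# Hypothetical HTML snippet and base URL, for illustration only.
html = '<a href="/papers/0001.pdf">Paper 1</a><a href="/about.html">About</a>'
parser = PdfLinkParser("https://example.org/")
parser.feed(html)
print(parser.pdf_urls)
```

Calling `download_pdfs(parser.pdf_urls, "papers")` would then fetch the files; it is left out of the example because it needs network access.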

What would you like to work on next week?

  • Continuing my analysis of demographics in papers

Where do you need help from Veronika?

  • Web scraping/writing code to help me find demographic information
  • Keep me on the right track
  • Figuring out which steps to take next
    • What is important
    • The boundaries of papers: cancer in the title and/or cancer in the content

Any other topics

  • I need to reschedule meetings if possible due to work (not all Tuesdays, but some)

What are the agreements after this meeting? (to fill in after the meeting)

Meeting notes:

  • Look into variables: pretend that you have the data
  • There are useful papers about MICCAI 2023 in the Notion meeting notes
  • Convert the Excel sheet into csv files
  • flowchart: reasoning for selecting papers and the annotation
  • Working next Tuesday: use weekly meetings for questions

Date: 20 feb 2024

Who helped you this week?

N/A

What did you achieve?

From the latest meeting:

Experiment 3

  • Finished Experiment 3 - search for cancer-related papers (not just by title)
  • From the extracted PDFs (10 volumes, which combined contain 730 papers):
    • look for ‘cancer’ in each paper, select papers that have 'cancer' in their content and not in their title
    • 178 out of 730 papers were found with the search word ‘cancer’ in their content
    • 23 out of 730 papers with 'cancer' in their title (and also content)

Experiment 4

  • Trying to see if I can extract text from PDFs into a dataframe with PyMuPDF:
    • Each row is a paper, identified by its title
    • Columns: text, headings (divide text by section such as Abstract, Introduction etc.)
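
A rough sketch of how Experiment 4 could look: read the text with PyMuPDF and split it on section headings. The heading list, the regular expression and the function names are assumptions for illustration; real papers may number or format their headings differently, so the pattern would need tuning.

```python
import re

# Section headings assumed to appear on their own line, optionally numbered.
HEADING_RE = re.compile(
    r"^\s*(?:\d+\.?\s+)?(Abstract|Introduction|Methods?|Results|Conclusions?)\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def pdf_to_text(path: str) -> str:
    """Concatenate the text of all pages using PyMuPDF (imported as fitz)."""
    import fitz  # PyMuPDF; imported lazily so the rest runs without it

    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def split_sections(text: str) -> dict[str, str]:
    """Map each detected heading to the text up to the next heading."""
    matches = list(HEADING_RE.finditer(text))
    sections = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[m.group(1).title()] = text[m.end():end].strip()
    return sections


# Hypothetical text standing in for one extracted paper.
sample = "Abstract\nWe study X.\n1 Introduction\nCancer imaging ...\n2 Method\nA U-Net."
print(split_sections(sample))
```

Each resulting dictionary can become one row of a dataframe, with the paper title as index and the section names as columns.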

Experiment 5 - extract 730 papers into a csv file by title, author(s) and year

What did you struggle with?

Experiment 5 - extract 730 papers into a csv file by title, author(s) and year

  • Managed to extract volume 1 (part 1 of MICCAI 2023) into a dataframe from the copied HTML file, but not the other 9 parts

What would you like to work on next week?

  • To have a finished dataset of all 730 papers, organised into a csv file by title, author(s), year
  • To establish the best way to work with the text analysis/annotation of the PDF files:
    • Is the group of 23 papers (cancer in title) enough for my analysis or should I go by the 178 papers too (cancer in content)?

Where do you need help from Veronika?

I have done so many experiments and had so many issues with extracting text from PDF files that I need guidance on what to focus on next. However, I have set a deadline for the end of this month (28 Feb), by which a dataset should be ready for the analysis/annotation process.

Any other topics

What are the agreements after this meeting? (to fill in after the meeting)


Date: 27 feb 2024

Who helped you this week?

N/A

What did you achieve?

  • Extracted 730 PDFs into a csv file with the help of https://github.com/purrlab/MICCAI-paper-analysis/
  • Looking further into my annotation scheme and how I want to define it
  • Experimenting with converting the PDFs into docx and txt files in order to extract the text in a Jupyter notebook
    • Converted the 10 PDF volumes into text files with utf-8 encoding
    • Saved all 10 text files as 1 complete text file

What did you struggle with?

What would you like to work on next week?

  • Settling on my annotation scheme
    • defining the properties I want to annotate per paper and
    • how to analyze those properties
  • To ask Théo to help with the text extraction and annotation

Where do you need help from Veronika?

N/A

Any other topics

  • I can't join the meeting today (27 Feb) as I'm working

What are the agreements after this meeting? (to fill in after the meeting)


Date: 05 mar 2024

Who helped you this week?

N/A

What did you achieve?

  • Looked into which key demographic information I want to analyse
    • How do these research papers define and group their data?
      • Dataset: Gender, Age, Geolocation, Gender-specific illness, Subgroups
      • Idea: looking into authors as well? Educational background, Departments, Area?
  • Annotated 23 papers with cancer in their title in Excel
  • Managed to write code that more or less splits the PDF volumes into separate research papers
    • Some minor errors in the splitting, however all volumes are now separated into 730 research papers
  • Wrote Théo a mail asking for help. We will meet Thursday this week
  • Successfully installed PDF annotator
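
The notes do not say how the volumes were split, so the following is only one possible approach: if the first page of each paper in a volume is known (for example from the table of contents), the page ranges follow directly and PyMuPDF can write one file per paper. The function names and the `paper_001.pdf` naming scheme are hypothetical.

```python
def page_ranges(start_pages: list[int], total_pages: int) -> list[tuple[int, int]]:
    """Turn 0-based start pages into inclusive (first, last) page ranges."""
    starts = sorted(start_pages)
    ends = [s - 1 for s in starts[1:]] + [total_pages - 1]
    return list(zip(starts, ends))


def split_volume(volume_path: str, start_pages: list[int], out_dir: str) -> None:
    """Write one PDF per paper using PyMuPDF (imported as fitz)."""
    import fitz  # PyMuPDF; imported lazily so page_ranges runs without it
    from pathlib import Path

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    with fitz.open(volume_path) as volume:
        ranges = page_ranges(start_pages, len(volume))
        for i, (first, last) in enumerate(ranges, start=1):
            paper = fitz.open()  # new, empty PDF
            paper.insert_pdf(volume, from_page=first, to_page=last)
            paper.save(str(Path(out_dir) / f"paper_{i:03d}.pdf"))
            paper.close()


# Three papers starting on pages 0, 10 and 21 of a 30-page volume.
print(page_ranges([0, 10, 21], total_pages=30))
```

Wrong start pages shift every following paper, which is one way the "minor errors in the splitting" mentioned above could arise.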

What did you struggle with?

  • Writing code that reliably splits the PDF volumes into separate research papers

What would you like to work on next week?

  • Annotation and analysis

Where do you need help from Veronika?

N/A

Any other topics

What are the agreements after this meeting? (to fill in after the meeting)

  • Make columns into 0/1
  • clean the experiment from raw to clean by making it more database friendly
  • annotation guidelines: reproducible, add as appendix in the report
  • make script for clean data analysis
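
The "columns into 0/1" agreement could be implemented roughly as follows with pandas. This is a sketch under assumptions: the column names, the example values and the set of "missing" markers are made up and would have to match the actual annotation sheet.

```python
import pandas as pd

# Hypothetical raw annotations; column names are illustrative only.
raw = pd.DataFrame(
    {
        "title": ["Paper A", "Paper B", "Paper C"],
        "age": ["mean age 54", None, "n/a"],
        "sex": ["12 female, 9 male", "", None],
    }
)

# Values that should count as "not reported" (an assumption).
MISSING = {"", "n/a", "na", "none", "not mentioned"}


def to_binary(column: pd.Series) -> pd.Series:
    """1 if the cell holds a real annotation, 0 if empty or a missing marker."""
    cleaned = column.fillna("").astype(str).str.strip().str.lower()
    return (~cleaned.isin(MISSING)).astype(int)


# Keep the raw sheet untouched and build a separate, database-friendly copy.
clean = raw[["title"]].copy()
for col in ["age", "sex"]:
    clean[col] = to_binary(raw[col])
print(clean)
```

Keeping `raw` and `clean` as separate tables follows the raw-to-clean separation agreed above.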


Date: 12 mar 2024

Due to illness there is nothing added for this week



Date: 19 mar 2024

Who helped you this week?

N/A

What did you achieve?

Organizing my notebooks and notes

  • Gathering useful notes from my readings into schemes
  • Organizing code and csv files in GitHub
  • Rewriting and refining the jupyter notebooks

Preprocessing, Data Extraction and Analysis/Data viz

  • Wrote/refined a script that:
    • extracted relevant information (e.g., mentions of demographic data, ethical considerations, methodologies for bias mitigation) from the papers
    • extracted relevant information by a list of keywords
    • structured data according to key indicators
    • used a data structure (e.g., a pandas DataFrame) where each row represents a paper by title and columns represent extracted sentences by key indicators
    • produced data visualisations of key indicators in papers

Outputs can be found here:
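
The keyword-based sentence extraction described above might look roughly like this. It is a simplified sketch, not the project's script: the indicator names, the keyword lists and the naive sentence splitter are assumptions.

```python
import re

import pandas as pd

# Hypothetical key indicators and keyword lists.
KEYWORDS = {
    "demographics": ["age", "sex", "gender", "ethnicity"],
    "bias": ["bias", "fairness", "imbalance"],
}


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_by_indicator(text: str) -> dict[str, list[str]]:
    """Collect, per indicator, the sentences that contain any of its keywords."""
    found = {name: [] for name in KEYWORDS}
    for sentence in split_sentences(text):
        for name, words in KEYWORDS.items():
            # Whole-word match so that e.g. "image" does not count as "age".
            pattern = r"\b(" + "|".join(map(re.escape, words)) + r")\b"
            if re.search(pattern, sentence, flags=re.IGNORECASE):
                found[name].append(sentence)
    return found


# Toy texts standing in for the extracted papers.
papers = {
    "Paper A": "Patients had a mean age of 54. Images were 3D. We note a gender imbalance.",
    "Paper B": "We propose a new loss. No metadata was available.",
}
rows = [{"title": t, **extract_by_indicator(text)} for t, text in papers.items()]
df = pd.DataFrame(rows).set_index("title")
print(df)
```

Each row is a paper and each column holds the list of matching sentences for one indicator, which matches the data structure described in the list above.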

Wrote a script that searches for 'cancer' in text

  • The papers were chosen by:
    • defining the relevant content of a paper as the text from the Abstract up to and including the Conclusion
    • excluding authors and affiliations, Acknowledgement and References from the content
    • then searching for the keyword 'cancer' in the defined text of each of the 730 papers
    • result: 189 papers were selected and added into a dataframe
    • previously, I manually counted the number of papers by searching for the keywords 'cancer', 'tumor', 'tumour'
      • I found 178 papers excluding papers with keywords in their title
      • with the titles it's a total of 201 papers
      • therefore I accept the 189 selected papers as fitting the criteria when searching for 'cancer' only
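
The selection rule above (search only between Abstract and Conclusion, ignoring Acknowledgement and References) can be sketched as below. The regular expressions and function names are assumptions; they rely on the headings starting a line, which may not hold for every extracted paper.

```python
import re

START_RE = re.compile(r"^\s*Abstract\b", re.IGNORECASE | re.MULTILINE)
# The body ends at whichever of these headings comes first.
END_RE = re.compile(
    r"^\s*(Acknowledge?ments?|References)\b", re.IGNORECASE | re.MULTILINE
)


def paper_body(text: str) -> str:
    """Return the text from 'Abstract' up to Acknowledgements/References."""
    start = START_RE.search(text)
    begin = start.start() if start else 0
    end = END_RE.search(text, begin)
    return text[begin:end.start()] if end else text[begin:]


def mentions(text: str, keyword: str = "cancer") -> bool:
    """Case-insensitive substring search within the paper body only."""
    return keyword.lower() in paper_body(text).lower()


# Toy papers: the first mentions cancer only in its reference list.
only_in_refs = (
    "A Title\nA. Author\nAbstract. We segment livers.\n"
    "5 Conclusion\nDone.\nReferences\n1. Smith: Lung cancer screening."
)
in_body = "Abstract. Breast cancer detection.\nReferences\n1. Doe: Other work."
print(mentions(only_in_refs), mentions(in_body))
```

Restricting the search to the body is what keeps papers that only cite cancer-related work out of the selection.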

Annotation experiment: Papers from MICCAI 2023 with ‘cancer’ in their title

  • Categories:
    • Organs
    • Image types
    • Number of datasets
    • Sex-specific cancer
    • Demographic information
      • How do they define their data?
      • Do they use demographic information in their datasets?
      • How do they evaluate their results?
      • Do they consider how the data affects their results?
    • Other subgroups

Findings: annotation of papers with cancer in the title

  • 14 out of 23 papers did not mention any demographic information
  • 8 papers mentioned demographic information
    • 1 paper defined their data collection by age and gender; data was collected from 7 medical centres (geolocation)
    • the 7 others have data collected by geolocation, however this is only vaguely described in the papers
  • 23 out of 23 papers do not mention anything about fairness or bias
    • 1 paper mentioned a “sightly gender imbalance”
    • 1 paper mentioned datasets are unbalanced
  • Organs:
    • Breast/breast tissue: 7 papers
    • Cervix: 1 paper
    • Colorectal: 3 papers
    • Kidney: 1 paper
    • Liver: 1 paper
    • gall bladder: 1 paper
    • prostate gland: 2 papers
    • lungs/lung tissue: 3 papers
    • head and neck: 2 papers

Annotation scheme: reflections

Purpose of the annotation: To examine fairness and bias in research papers

  • How can researchers in medical AI, specifically in medical imaging, incorporate less bias and more fairness into their models?

  • What are the practices and/or methods that can reduce bias and promote fairness when creating models?

  • Do they implement recommendations that address bias and prevent algorithm discrimination?

    • Recommendations from: (image not preserved)
  • How can we use demographic information to analyse papers?

    • What is it?
    • How do we define demographic information?
    • How are they useful for analyzing fairness and bias?
  • Are there any other methods useful for analysing papers?

  • Categories for annotation scheme:

    • Organs
    • Image types
    • Number of datasets
    • Sex-specific cancer
    • Demographic information
  • Inspiration: The Values Encoded in Machine Learning Research

Further details: datasets

  • Do they mention, critique, evaluate, or reflect upon their dataset?
    • Do they evaluate the quality of their dataset?
    • Are there any imbalances in their data collection? and do they consider how these imbalances might affect their model?
    • Do they consider the defined subgroups in their datasets, for example distinguishing data only by patient and not also by sex? Are patients further differentiated by age, ethnicity, and/or geolocation?
    • Do they identify weaknesses within their model?
    • Do they contemplate the potential social impacts of their models?

Others

  • Finalized findings from the annotation scheme experiment
  • Annotation scheme and analysis
    • Annotation guide
    • Annotation scheme
      • Finalize a scheme that is supported by theory, recommendations, literature
      • Make a list of recommendations, with references, to address fairness and bias in medical AI

What did you struggle with?

Writing a script that searches for papers that work with cancer

  • Issues with extracting the titles from the selected papers to merge with the MICCAI 2023 list to get the metadata
  • Solved by hard coding and many, many attempts/refinements of the script
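
Title mismatches between PDF-extracted text and a metadata table often come from line breaks, double spaces, casing or accents. One alternative to hard coding is to merge on a normalised key, sketched below with pandas. The frames, column names and the `normalise_title` helper are hypothetical.

```python
import re
import unicodedata

import pandas as pd


def normalise_title(title: str) -> str:
    """Lower-case, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


# Hypothetical frames: titles extracted from PDFs vs. the metadata table.
selected = pd.DataFrame({"title": ["Deep  Learning for Breast-Cancer\nScreening"]})
metadata = pd.DataFrame(
    {"title": ["Deep Learning for Breast-cancer Screening"], "authors": ["A. Author"]}
)

for frame in (selected, metadata):
    frame["key"] = frame["title"].map(normalise_title)

# indicator=True adds a _merge column that shows which titles found no match.
merged = selected.merge(
    metadata.drop(columns="title"), on="key", how="left", indicator=True
)
print(merged[["title", "authors", "_merge"]])
```

Rows where `_merge` is `left_only` are the titles that still need manual attention.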

What would you like to work on next week?

Where do you need help from Veronika?

  • I need to evaluate my findings and works, see what's useful (since I have a lot by now)
  • Plan next steps

Any other topics

What are the agreements after this meeting? (to fill in after the meeting)

  • Look at what distinguishes the papers that do discuss something about age/gender, from the ones that don't
  • The reflections/recommendations you have seem relevant for the discussion
  • Start putting some structure on the report already to see how it looks in the template

Date: 22 mar 2024

Who helped you this week?

Théo https://github.com/yasminsarkhosh/machine-learning-bsc-thesis-2024/blob/60e1d14d3d03c3e9781dc40e1e4c9600b5c41f7c/meeting_w_theo.md


Date: 02 apr 2024

Who helped you this week?

N/A

What did you achieve?

Annotations (goal): examining the difference between discussing bias and actively reducing bias in medical AI research.

  • From annotating demographics and bias in papers manually (from sentence extraction) I want to examine the difference between mentioning bias vs actively reducing bias

  • Currently, I have data visualisations showing the occurrences of demographic keywords, biases and more in the selected papers. This gives me an idea of IF papers "prioritise" demographics in datasets and biases, but no insight into HOW they prioritise/do not prioritise them, or into the actual cause of their choices (are they actively excluding demographics from their datasets, or do their datasets simply not provide any demographic information?)

  • Annotation scheme

    • Annotated 263 papers by (and in separate csv files):
      • Demographic information 1/2: age, gender/sex, race
      • Demographic information 2/2: geolocations (as hospital, country, city, area)
      • Bias:
        • analysing/annotating types of bias in papers
        • bias mentioned?, bias sentence, algorithmic bias, sub type of algorithmic bias, bias as a technical term, sentence for bias as technical term, reasoning for technical bias, data bias, sub type of data bias, reasoning for data bias, measurement bias, sentence for measurement bias, reasoning for measurement bias
    • Still in progress: dataset information, diseases, fairness

Notebooks

Report

  • Outlines of sections for the report
  • Wrote the Introduction section
  • Methodology in progress
  • Flowchart of work processes and data extractions (still in progress)
    • Extracting papers: (flowchart image not preserved)
    • Extracting data from papers: (flowchart image not preserved)
    • Selecting papers for further analysis: (flowchart image not preserved)

What did you struggle with?

  • Mental health

Annotations: bias

  • Definition and Criteria: How do I define "talking about bias" versus "actively reducing bias"? For example, mentioning bias could be categorized as simply acknowledging its existence, while actively reducing bias might involve specific methodologies or interventions implemented within the research design. Where do I set the boundary?
  • Bias comes in many forms: bias related to algorithms and model performance vs bias related to examining the diversity and representativeness of the dataset used to train AI models.

Forms of Bias

  1. Data Bias: This occurs when the dataset used to train the AI model is not representative of the population it's intended to serve. It can lead to the model performing poorly for certain groups.
    • Subtypes:
      • Selection Bias: Arises when the data collected are not representative of the target population.
      • Sampling Bias: Occurs when the dataset does not accurately reflect the diversity of the population.
      • Label Bias: Happens when the labels used for training the AI model do not accurately represent the true nature of the data points.
  2. Algorithmic Bias: Refers to biases that are introduced by the algorithm itself, often through the underlying assumptions made by the developers.
    • Subtypes:
      • Inductive Bias: The set of assumptions an algorithm makes to predict outputs for inputs it has not encountered.
      • Confirmation Bias: Occurs when an algorithm is developed or tuned in a way that it inadvertently confirms the developers' pre-existing beliefs.
  3. Measurement Bias: Involves errors in the way data are measured or collected, leading to inaccurate representations of reality.
    • Example: Using measurement tools or techniques that are not equally valid across different groups.
  4. Reporting Bias: Occurs when there is a selective revealing or suppression of information by researchers or participants.
    • Example: Overemphasis on successful outcomes over negative or null results.
  5. Sociocultural Bias: Arises from societal stereotypes and cultural assumptions that can be encoded into AI models.
    • Example: An AI system that reflects or amplifies societal stereotypes related to race, gender, or socioeconomic status.

What would you like to work on next week?

  1. Annotations and gathering everything I have found together
  2. Data visualisations
  3. How I can use my current findings

Where do you need help from Veronika?

With all the above

Any other topics

Imposter Syndrome

  • I feel kind of scattered. I have been working hard on extracting data, defining annotation guides and purposes, writing and rewriting code, and running experiments. I work on my BA non-stop, and I have managed to create and gather a lot of material. However, is it enough? What am I missing, and where should I put my focus now that the submission date is nearing? I also find it quite difficult to evaluate the "quality"/"level" of my project: am I living up to the expectations for a BA thesis? Many of the work processes and methods I have used for this project are new to me (not really something that has been prioritised in the curriculum), so I can't quite figure out whether the project lives up to the academic requirements for a BA thesis in data science.

Any other topics

  • Models? Metrics?
  • How much of the notebooks should I add to the report? Like the reasoning for how I created the different notebooks?
  • Annotation scheme

Date: 09 apr 2024

Who helped you this week?

N/A

What did you achieve?

  • Merged findings together in the report
    • Introduction: done
    • Methodology: done
    • Method: in-progress
    • Results/Findings: not started
    • Discussion: notes added
  • Finished a final annotation guide
  • Refined already-made annotations by merging values into categories
    • Annotated 50 papers by sentence extractions

The list of organs from the annotations is divided into main categories of the body, based on anatomical regions and organ systems. No Organ Mentioned: Captures the explicit mention of "no organ mentioned".

| Category | Includes |
|---|---|
| Cranial/Head and Neck | organs located in the head and neck region |
| Thoracic/Chest | the heart, lungs, esophagus, and trachea |
| Abdominal | liver, spleen, stomach, pancreas, gallbladder, small and large bowel, and various sections of the intestine (duodenum) |
| Female Reproductive System | organs related to female reproductive functions, including the cervix and uterus |
| Skeletal System | the spinal cord, mandible |
| Skin and Breast | skin and breast, as these are specified separately from internal organ systems |
| Pelvic | colorectal |
| Male Reproductive System | organs specific to the male reproductive function, such as the penis and prostate |
| Endocrine System | glands like the thyroid |
| Whole Body | terms that refer to the entire body/not specific to one region, such as "whole body" and "cells" |
| Urinary System | kidneys and other urinary tract structures |
| Lymphatic/Immune System | lymph nodes |
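
Merging annotated organ values into the main categories above can be done with a lookup table. The sketch below covers only part of the categories, and the `"Uncategorised"` fallback as well as the function name `categorise_organ` are assumptions added for illustration.

```python
# Partial mapping taken from the organ categories above; extend as needed.
ORGAN_CATEGORIES = {
    "Thoracic/Chest": ["heart", "lungs", "esophagus", "trachea"],
    "Abdominal": ["liver", "spleen", "stomach", "pancreas", "gallbladder"],
    "Female Reproductive System": ["cervix", "uterus"],
    "Male Reproductive System": ["penis", "prostate"],
    "Pelvic": ["colorectal"],
    "Urinary System": ["kidneys"],
    "Lymphatic/Immune System": ["lymph nodes"],
}

# Invert the table so that each organ points to its category.
ORGAN_TO_CATEGORY = {
    organ: category
    for category, organs in ORGAN_CATEGORIES.items()
    for organ in organs
}


def categorise_organ(organ: str) -> str:
    """Map an annotated organ value to its main category."""
    key = organ.strip().lower()
    if not key or key == "no organ mentioned":
        return "No Organ Mentioned"
    # "Uncategorised" is a hypothetical fallback for values not yet mapped.
    return ORGAN_TO_CATEGORY.get(key, "Uncategorised")


print([categorise_organ(o) for o in ["Liver", " prostate ", "no organ mentioned"]])
```

Values that end up as "Uncategorised" show where the annotation vocabulary and the category list still disagree.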

The list of image types from the annotations is divided into main categories based on the imaging technique, purpose, or the type of information they provide

| Category | Subcategory | Examples |
|---|---|---|
| Radiology | MRI (Magnetic Resonance Imaging) | 'mri', 'mr' |
| Radiology | CT (Computed Tomography) | 'ct', 'ct volumes' |
| Radiology | X-rays | 'xrays' |
| Radiology | Ultrasound | 'us volumes', 'ius' |
| Radiology | PET (Positron Emission Tomography) | 'pet' |
| Radiology | fMRI (Functional MRI) | 'fmri' |
| Endoscopy | General Endoscopy | 'endoscopy' |
| Endoscopy | Gastroscopy | 'gastroscopy' |
| Endoscopy | Colonoscopies | 'colonoscopies' |
| Pathology | Pathology Images | 'pathology images' |
| Pathology | Histology Images | 'histology images', 'h&e', 'stained image tiles' |
| Pathology | WSI (Whole Slide Imaging) | 'wsi', 'wsis' |
| Image Processing and Analysis | Segmentations | 'segmentations' |
| Image Processing and Analysis | ROI Masks (Region of Interest Masks) | 'roi masks' |
| Miscellaneous | 3D Imaging | '3d' |
| Miscellaneous | Image Titles | 'image titles' |
| Miscellaneous | No Image Type Mentioned | 'no image type mentioned' |

  • Decided to focus only on dataset details in papers for now (and wait with processing findings/annotations of bias-related sentences as it's more complicated)
  • Worked on data visualisations and refining flowcharts

What did you struggle with?

  • My mental health - not taking any breaks, feeling self-critical of my achievements
  • Finding my focus for this thesis - I feel lost
  • Data visualisations
  • How I should present the findings of the annotations and whether the main categories are sufficient

What would you like to work on next week?

Where do you need help from Veronika?

Any other topics


Date: 16 apr 2024

Who helped you this week?

N/A

What did you achieve?

Report

  • Introduction: semi-done
  • Background: semi-done
  • Methodology: semi-done
  • Methods: in-progress
    • Flowcharts
  • Findings: in-progress
    • Making plots of findings
    • Notebook: analysis and data visualisations
    • Plots are added into a separate .tex file to get an overview of what plots are relevant and interesting
      • Captions are added
      • In overleaf this section is called 'data_visualisations.tex'

Annotations

  • Annotated 62 papers
  • Checked through the annotated papers 2 times
  • Refined scope of categories. Purpose: clear descriptions of what annotators need to annotate
  • Refined the annotation guide
    • Still in progress
    • Some categories need an update

What would you like to work on next week?

  • Finalizing the annotation guide completely for testing
  • Finding the right plots for Findings, maybe some inspiration for plot ideas: both visually and categorically (which columns would be interesting to plot)

Where do you need help from Veronika?

  • To pick the right plots, evaluate which plots make sense, and if there are other good techniques for making data visualisations for categorical data

Any other topics


Date: 30 apr 2024

Who helped you this week?

N/A

What did you achieve?

  • Annotation Guide: Done
    • Folder with 100 articles for annotations
    • Instructions
    • Annotation Scheme for Annotating
    • Provided keyword list for searching for relevant text in articles
  • Annotation Data: Done
    • 100 annotated articles
    • 100 rows with raw data and 15 categories (besides article information)
  • Analysis of annotation data:
    • Notebook: almost done with plots and statistics
    • Overview of findings
    • List of datasets used
  • Report
    • Introduction: Done
    • Background: Almost done
    • Methodology: Done
    • Experimental Setting: Done
    • Findings: Almost Done
    • Discussion: In progress
    • Flowcharts: Almost done
    • Appendices: Almost done
  • Data visualisations
    • Stacked bar charts
    • Sankey plots
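
For categorical annotation data, a stacked bar chart usually starts from a cross-tabulation of two columns. A minimal sketch with pandas follows; the column names and values are made up, and `plot_stacked` is a hypothetical helper that needs matplotlib installed.

```python
import pandas as pd

# Hypothetical annotation data: one row per paper.
annotations = pd.DataFrame(
    {
        "organ_category": [
            "Abdominal", "Abdominal", "Thoracic/Chest", "Thoracic/Chest", "Pelvic",
        ],
        "demographics_reported": [1, 0, 0, 0, 1],
    }
)

# Rows: organ category; columns: 0 (not reported) / 1 (reported); cells: paper counts.
counts = pd.crosstab(
    annotations["organ_category"], annotations["demographics_reported"]
)
print(counts)


def plot_stacked(table: pd.DataFrame):
    """Draw the cross-tabulation as a stacked bar chart."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    ax = table.plot(kind="bar", stacked=True)
    ax.set_xlabel("Organ category")
    ax.set_ylabel("Number of papers")
    plt.tight_layout()
    return ax
```

The same `counts` table can be inspected or exported before plotting, which helps when deciding which category pairs are worth a figure.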

What would you like to work on next week?

  • Finishing report
  • Tidying notebooks and make them ready for submission (script + requirements.txt)
  • Settling on data visualisations (categories and metrics)

Where do you need help from Veronika?

  • Helping me maintain focus in my report:
    • What should I leave out?
    • Where should I put more focus?
    • Should I add more theories to support the purpose of my paper?
    • Does the sectioning make sense?
    • Which plots should I settle on?
  • Feedback on Annotation Guide
  • How I should organise my submission in Github
    • MICCAI papers
    • Notebooks
    • Raw Data
    • Cleaned/Processed Data
    • Packages and Libraries

Any other topics

What are the agreements after this meeting? (to fill in after the meeting)

Submitting all notebooks and organising them into subfolders by relevance

Articles for running the notebooks

  • Submit selected papers
  • Submit ALL csv files (make sure all notebooks output a csv file)
    • Make sure all notebooks have an output as a result/purpose for that notebook
  • Figure out how to submit folders with MICCAI papers (check all paths when running)
  • README.txt: explain the process of running the notebooks and what is in the repo

Report

  • plot: no demographic information vs extension demo and compare organs distribution
    • I already have code that makes plots of organ categories (organ_distribution_no_demo) and individual organs (individual_organ_no_demo): done
  • describe cardiovascular stacked bar (women and men have equal risks until women hit menopause)
    • Screenshots: Skærmbillede 2024-01-20 kl. 16.12.17.png, Skærmbillede 2024-01-20 kl. 18.09.02.png


Date: 07 May 2024

Who helped you this week?

N/A

What did you achieve?

  • Finished GitHub repo-related tasks:
    • readme.md file
    • requirements.txt
    • description of repo, installation, notebooks and data
    • directory structure of the github
    • notebooks and notebook descriptions/purpose/inputs/outputs
    • output files organised into folders and sub folders
    • re-ran all notebooks to check/correct errors
  • Finished sections in report:
    • Methodology
    • Experimental Setups

What would you like to work on next week?

  • Focusing solely on writing/refining my report

Where do you need help from Veronika?

  • To look through my github repo and check if anything is missing?
  • (Or someone else) to re-run my notebooks
  • If possible, a final read on my report

Any other topics

  • Exam
    • What should we prepare/what are the requirements/expectations?

What are the agreements after this meeting? (to fill in after the meeting)

  • V on the report: it could be relevant to mention/discuss papers on bias reduction, fairness etc. within the more machine learning-focused papers in the medical imaging domain. These studies exist but tend to focus on a few datasets, because many datasets do not have this information
  • Look into related work in the MICCAI 2023 papers beyond the selection
  • Background

    • section 2.d: the example of diabetes is a bit far out of the scope of the paper, look into examples within MICCAI 2023
    • sex change in medical imaging as a section (inconsistency in definitions/meanings/terms)
    • gender vs sex: define what you are focusing on and what constitutes gender and/or sex; it is not clear if the papers report biological sex or gender (i.e. how you define yourself)
  • For writing sections

    • recipe and the justification (salt, acid, heat): explain why the steps are important, and why you cannot go to step 4 before going through steps 1, 2 and 3.
  • Exam