course4

aasu14 · Mar 4, 2019 · 2d06f16 · 2d06f16
1 parent 0c9129c
commit 2d06f16
Show file tree

Hide file tree

Showing 14 changed files with 33,103 additions and 0 deletions.
diff --git a/Applied Text Mining in Python/Assignment 1.ipynb b/Applied Text Mining in Python/Assignment 1.ipynb
@@ -0,0 +1,205 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "---\n",
+    "\n",
+    "_You are currently looking at **version 1.1** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._\n",
+    "\n",
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Assignment 1\n",
+    "\n",
+    "In this assignment, you'll be working with messy medical data and using regex to extract relevant infromation from the data. \n",
+    "\n",
+    "Each line of the `dates.txt` file corresponds to a medical note. Each note has a date that needs to be extracted, but each date is encoded in one of many formats.\n",
+    "\n",
+    "The goal of this assignment is to correctly identify all of the different date variants encoded in this dataset and to properly normalize and sort the dates. \n",
+    "\n",
+    "Here is a list of some of the variants you might encounter in this dataset:\n",
+    "* 04/20/2009; 04/20/09; 4/20/09; 4/3/09\n",
+    "* Mar-20-2009; Mar 20, 2009; March 20, 2009;  Mar. 20, 2009; Mar 20 2009;\n",
+    "* 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009\n",
+    "* Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009\n",
+    "* Feb 2009; Sep 2009; Oct 2010\n",
+    "* 6/2008; 12/2009\n",
+    "* 2009; 2010\n",
+    "\n",
+    "Once you have extracted these date patterns from the text, the next step is to sort them in ascending chronological order accoring to the following rules:\n",
+    "* Assume all dates in xx/xx/xx format are mm/dd/yy\n",
+    "* Assume all dates where year is encoded in only two digits are years from the 1900's (e.g. 1/5/89 is January 5th, 1989)\n",
+    "* If the day is missing (e.g. 9/2009), assume it is the first day of the month (e.g. September 1, 2009).\n",
+    "* If the month is missing (e.g. 2010), assume it is the first of January of that year (e.g. January 1, 2010).\n",
+    "* Watch out for potential typos as this is a raw, real-life derived dataset.\n",
+    "\n",
+    "With these rules in mind, find the correct date in each note and return a pandas Series in chronological order of the original Series' indices.\n",
+    "\n",
+    "For example if the original series was this:\n",
+    "\n",
+    "    0    1999\n",
+    "    1    2010\n",
+    "    2    1978\n",
+    "    3    2015\n",
+    "    4    1985\n",
+    "\n",
+    "Your function should return this:\n",
+    "\n",
+    "    0    2\n",
+    "    1    4\n",
+    "    2    0\n",
+    "    3    1\n",
+    "    4    3\n",
+    "\n",
+    "Your score will be calculated using [Kendall's tau](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient), a correlation measure for ordinal data.\n",
+    "\n",
+    "*This function should return a Series of length 500 and dtype int.*"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "0         03/25/93 Total time of visit (in minutes):\\n\n",
+       "1                       6/18/85 Primary Care Doctor:\\n\n",
+       "2    sshe plans to move as of 7/8/71 In-Home Servic...\n",
+       "3                7 on 9/27/75 Audit C Score Current:\\n\n",
+       "4    2/6/96 sleep studyPain Treatment Pain Level (N...\n",
+       "5                    .Per 7/06/79 Movement D/O note:\\n\n",
+       "6    4, 5/18/78 Patient's thoughts about current su...\n",
+       "7    10/24/89 CPT Code: 90801 - Psychiatric Diagnos...\n",
+       "8                         3/7/86 SOS-10 Total Score:\\n\n",
+       "9             (4/10/71)Score-1Audit C Score Current:\\n\n",
+       "dtype: object"
+      ]
+     },
+     "execution_count": 1,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "doc = []\n",
+    "with open('dates.txt') as file:\n",
+    "    for line in file:\n",
+    "        doc.append(line)\n",
+    "\n",
+    "df = pd.Series(doc)\n",
+    "df.head(10)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "def date_sorter():\n",
+    "    \n",
+    "    # Your code here\n",
+    "    # Full date\n",
+    "    global df\n",
+    "    dates_extracted = df.str.extractall(r'(?P<origin>(?P<month>\\d?\\d)[/|-](?P<day>\\d?\\d)[/|-](?P<year>\\d{4}))')\n",
+    "    index_left = ~df.index.isin([x[0] for x in dates_extracted.index])\n",
+    "    dates_extracted = dates_extracted.append(df[index_left].str.extractall(r'(?P<origin>(?P<month>\\d?\\d)[/|-](?P<day>([0-2]?[0-9])|([3][01]))[/|-](?P<year>\\d{2}))'))\n",
+    "    index_left = ~df.index.isin([x[0] for x in dates_extracted.index])\n",
+    "    del dates_extracted[3]\n",
+    "    del dates_extracted[4]\n",
+    "    dates_extracted = dates_extracted.append(df[index_left].str.extractall(r'(?P<origin>(?P<day>\\d?\\d) ?(?P<month>[a-zA-Z]{3,})\\.?,? (?P<year>\\d{4}))'))\n",
+    "    index_left = ~df.index.isin([x[0] for x in dates_extracted.index])\n",
+    "    dates_extracted = dates_extracted.append(df[index_left].str.extractall(r'(?P<origin>(?P<month>[a-zA-Z]{3,})\\.?-? ?(?P<day>\\d\\d?)(th|nd|st)?,?-? ?(?P<year>\\d{4}))'))\n",
+    "    del dates_extracted[3]\n",
+    "    index_left = ~df.index.isin([x[0] for x in dates_extracted.index])\n",
+    "\n",
+    "    # Without day\n",
+    "    dates_without_day = df[index_left].str.extractall('(?P<origin>(?P<month>[A-Z][a-z]{2,}),?\\.? (?P<year>\\d{4}))')\n",
+    "    dates_without_day = dates_without_day.append(df[index_left].str.extractall(r'(?P<origin>(?P<month>\\d\\d?)/(?P<year>\\d{4}))'))\n",
+    "    dates_without_day['day'] = 1\n",
+    "    dates_extracted = dates_extracted.append(dates_without_day)\n",
+    "    index_left = ~df.index.isin([x[0] for x in dates_extracted.index])\n",
+    "\n",
+    "    # Only year\n",
+    "    dates_only_year = df[index_left].str.extractall(r'(?P<origin>(?P<year>\\d{4}))')\n",
+    "    dates_only_year['day'] = 1\n",
+    "    dates_only_year['month'] = 1\n",
+    "    dates_extracted = dates_extracted.append(dates_only_year)\n",
+    "    index_left = ~df.index.isin([x[0] for x in dates_extracted.index])\n",
+    "\n",
+    "    # Year\n",
+    "    dates_extracted['year'] = dates_extracted['year'].apply(lambda x: '19' + x if len(x) == 2 else x)\n",
+    "    dates_extracted['year'] = dates_extracted['year'].apply(lambda x: str(x))\n",
+    "\n",
+    "    # Month\n",
+    "    dates_extracted['month'] = dates_extracted['month'].apply(lambda x: x[1:] if type(x) is str and x.startswith('0') else x)\n",
+    "    month_dict = dict({'September': 9, 'Mar': 3, 'November': 11, 'Jul': 7, 'January': 1, 'December': 12,\n",
+    "                       'Feb': 2, 'May': 5, 'Aug': 8, 'Jun': 6, 'Sep': 9, 'Oct': 10, 'June': 6, 'March': 3,\n",
+    "                       'February': 2, 'Dec': 12, 'Apr': 4, 'Jan': 1, 'Janaury': 1,'August': 8, 'October': 10,\n",
+    "                       'July': 7, 'Since': 1, 'Nov': 11, 'April': 4, 'Decemeber': 12, 'Age': 8})\n",
+    "    dates_extracted.replace({\"month\": month_dict}, inplace=True)\n",
+    "    dates_extracted['month'] = dates_extracted['month'].apply(lambda x: str(x))\n",
+    "\n",
+    "    # Day\n",
+    "    dates_extracted['day'] = dates_extracted['day'].apply(lambda x: str(x))\n",
+    "\n",
+    "    # Cleaned date\n",
+    "    dates_extracted['date'] = dates_extracted['month'] + '/' + dates_extracted['day'] + '/' + dates_extracted['year']\n",
+    "    dates_extracted['date'] = pd.to_datetime(dates_extracted['date'])\n",
+    "\n",
+    "    dates_extracted.sort_values(by='date', inplace=True)\n",
+    "    df1 = pd.Series(list(dates_extracted.index.labels[0]))\n",
+    "    \n",
+    "    return df1# Your answer here"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "coursera": {
+   "course_slug": "python-text-mining",
+   "graded_item_id": "LvcWI",
+   "launcher_item_id": "krne9",
+   "part_id": "Mkp1I"
+  },
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.2"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}