{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "54413574-4b98-4df0-9272-92df0d4b9262",
   "metadata": {},
   "source": [
    "# Get PSEO IPEDS Crosswalk\n",
    "\n",
    "This script creates a crosswalk of data from PSEO member institutions and the Integrated Postsecondary Education Data System (IPEDS). OPEID is used to initially match institutions, after which exceptions are handled individually. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e9cd69d2-f62b-4c34-a992-6a4d4a840d42",
   "metadata": {},
   "source": [
    "### Step 1: Import required libraries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1047,
   "id": "d59b1103-1518-4f8b-9acd-fefd13282406",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import requests\n",
    "import zipfile\n",
    "import os"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ad7a04be-b14d-4884-99a4-05df82a50639",
   "metadata": {},
   "source": [
    "### Step 2: Fetch the list of PSEO member institutions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1049,
   "id": "864b210c-17dc-4379-a391-86656e512dd7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "List of PSEO member institutions successfully downloaded.\n"
     ]
    }
   ],
   "source": [
    "# Define the URL for the list of institutions.\n",
    "url = \"https://lehd.ces.census.gov/data/pseo/latest_release/all/pseo_all_institutions.csv\"\n",
    "\n",
    "# Read the CSV file into a data frame.\n",
    "pseo = pd.read_csv(url, dtype={\"institution\":str})\n",
    "pseo = pseo[[\"institution\", \"label\"]]\n",
    "pseo.rename(columns={\"label\":\"label_institution_pseo\"}, inplace=True)\n",
    "\n",
    "# Print a success message.\n",
    "print(f\"List of PSEO member institutions successfully downloaded.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ecdf437a-3126-4ace-9445-9882e586405c",
   "metadata": {},
   "source": [
    "### Step 3: Fetch the most recent data from the IPEDS Institutional Characteristics survey."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1051,
   "id": "d822efcd-64ac-4cde-866f-8e999cca652b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading HD2023.zip...\n",
      "Downloaded HD2023.zip.\n",
      "Extracting HD2023.zip...\n",
      "Extracted contents of HD2023.zip.\n",
      "Deleted HD2023.zip.\n"
     ]
    }
   ],
   "source": [
    "# Define the parameters for downloading and unzipping the IPEDS survey file.\n",
    "url = 'https://nces.ed.gov/ipeds/datacenter/data/HD2023.zip'\n",
    "zip_filename = 'HD2023.zip'\n",
    "\n",
    "# Download the ZIP file.\n",
    "print(f\"Downloading {zip_filename}...\")\n",
    "response = requests.get(url)\n",
    "response.raise_for_status()\n",
    "\n",
    "# Save the ZIP file to disk.\n",
    "with open(zip_filename, \"wb\") as f:\n",
    "    f.write(response.content)\n",
    "print(f\"Downloaded {zip_filename}.\")\n",
    "\n",
    "# Extract the contents of the ZIP file.\n",
    "print(f\"Extracting {zip_filename}...\")\n",
    "with zipfile.ZipFile(zip_filename, 'r') as zip_ref:\n",
    "    zip_ref.extractall()\n",
    "print(f\"Extracted contents of {zip_filename}.\")\n",
    "\n",
    "# Delete the ZIP file.\n",
    "os.remove(zip_filename)\n",
    "print(f\"Deleted {zip_filename}.\")\n",
    "\n",
    "# Read the data into a data frame.\n",
    "ipeds = pd.read_csv(\"HD2023.csv\", encoding=\"latin1\", dtype={\"UNITID\":str, \"OPEID\":str})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f51eafce-5b99-41ae-94ea-0df00175cbe4",
   "metadata": {},
   "source": [
    "### Step 4: Drop unused columns from the IPEDS data frame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1053,
   "id": "f3c451ff-6c0c-43b3-879e-012a5ec8100b",
   "metadata": {},
   "outputs": [],
   "source": [
    "ipeds = ipeds[[\"UNITID\", \"INSTNM\", \"OPEID\", \"ICLEVEL\"]]\n",
    "ipeds.rename(columns={\"UNITID\":\"ipeds_id\", \"INSTNM\":\"label_institution_ipeds\", \"OPEID\": \"opeid\", \"ICLEVEL\":\"institution_level\"}, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "15987d64-f1aa-494d-92f3-4335128f81b9",
   "metadata": {},
   "source": [
    "### Step 5: Merge the PSEO and IPEDS data frame on the OPEID."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1055,
   "id": "95ff5491-a8fb-4f40-8713-00246245883a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  institution               label_institution_pseo ipeds_id  \\\n",
      "0    00105100                University of Alabama   100751   \n",
      "1    00105200  University of Alabama at Birmingham   100663   \n",
      "2    00105500  University of Alabama in Huntsville   100706   \n",
      "3    00108100             Arizona State University   104151   \n",
      "4    00108200          Northern Arizona University   105330   \n",
      "\n",
      "                     label_institution_ipeds     opeid  institution_level  \n",
      "0                  The University of Alabama  00105100                1.0  \n",
      "1        University of Alabama at Birmingham  00105200                1.0  \n",
      "2        University of Alabama in Huntsville  00105500                1.0  \n",
      "3  Arizona State University Campus Immersion  00108100                1.0  \n",
      "4                Northern Arizona University  00108200                1.0  \n"
     ]
    }
   ],
   "source": [
    "crosswalk = pseo.merge(ipeds, how=\"left\", left_on=\"institution\", right_on=\"opeid\")\n",
    "print(crosswalk.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "073f1755-9416-4134-a1e9-208770c5b2f8",
   "metadata": {},
   "source": [
    "### Step 6: Identify and extract duplicate records for review."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1057,
   "id": "b674c73c-e0a6-45c0-950f-696a0a3a9af3",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Duplicate records exported to file duplicates.csv.\n"
     ]
    }
   ],
   "source": [
    "duplicates = crosswalk[crosswalk.duplicated(subset=\"label_institution_pseo\", keep=False)]\n",
    "duplicates.to_csv(\"duplicates.csv\", index=False)\n",
    "print(f\"Duplicate records exported to file duplicates.csv.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f612063a-f289-4d5b-b29a-7bad0af0dc19",
   "metadata": {},
   "source": [
    "### Step 7: Remove duplicate records where institutions are not matched correctly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1059,
   "id": "e1b15d3d-8bf0-4804-b65a-fb57b2553171",
   "metadata": {},
   "outputs": [],
   "source": [
    "crosswalk = crosswalk[~crosswalk[\"ipeds_id\"].isin([\"128300\", \"479956\", \"492962\", \"443711\", \"439154\", \"491288\", \"491297\", \"228732\", \"440916\", \"160533\"])]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9b36557d-68f9-414f-a6d9-a17efc2aabc4",
   "metadata": {},
   "source": [
    "### Step 8: Confirm duplicate records have been removed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1061,
   "id": "a8282563-52b2-4793-a619-2cfd218c5440",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Empty DataFrame\n",
      "Columns: [institution, label_institution_pseo, ipeds_id, label_institution_ipeds, opeid, institution_level]\n",
      "Index: []\n"
     ]
    }
   ],
   "source": [
    "duplicates = crosswalk[crosswalk.duplicated(subset=\"institution\", keep=False)]\n",
    "print(duplicates)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b943ab44-83e0-49ac-b2d0-4c2ae211ca5e",
   "metadata": {},
   "source": [
    "### Step 9: Manually add institution level to two-year institutions without an IPEDS match."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1063,
   "id": "eb1cb10c-d89f-430f-b45a-89afbe1770ce",
   "metadata": {},
   "outputs": [],
   "source": [
    "# List of 2YR institutions without an IPEDS match.\n",
    "target_institutions = [\n",
    "    \"00139200\", \n",
    "    \"00139800\", \n",
    "    \"00139900\", \n",
    "    \"00177900\", \n",
    "    \"00235000\", \n",
    "    \"00235600\", \n",
    "    \"00311601\", \n",
    "    \"00451300\", \n",
    "    \"00554900\", \n",
    "    \"00677500\", \n",
    "    \"00698100\", \n",
    "    \"00698200\", \n",
    "    \"00803800\", \n",
    "    \"00842300\", \n",
    "    \"00854700\", \n",
    "    \"00976400\", \n",
    "    \"00976500\", \n",
    "    \"00978600\", \n",
    "    \"00992300\", \n",
    "    \"00992400\", \n",
    "    \"00992500\", \n",
    "    \"00992600\", \n",
    "    \"01003700\", \n",
    "    \"01003800\", \n",
    "    \"01003900\", \n",
    "    \"01004000\", \n",
    "    \"01004100\", \n",
    "    \"01010900\", \n",
    "    \"01053000\", \n",
    "    \"01115000\", \n",
    "    \"02074400\", \n",
    "    \"03521300\", \n",
    "    \"00356100\",\n",
    "    \"00400900\",\n",
    "    \"00850400\",\n",
    "    \"00851000\",\n",
    "    \"02100200\",\n",
    "    \"03538300\",\n",
    "    \"00457901\",\n",
    "    \"00178400\",\n",
    "    \"00524600\",\n",
    "    \"00850300\",\n",
    "    \"02077400\",\n",
    "    \"02245505\",\n",
    "    \"00548800\",\n",
    "    \"00552600\"\n",
    "]\n",
    "\n",
    "# Update level to \"2\" where institution level is null and institution is in the list of 2YR institutions.\n",
    "crosswalk.loc[\n",
    "    crosswalk[\"institution_level\"].isnull() & crosswalk[\"institution\"].isin(target_institutions),\n",
    "    \"institution_level\"\n",
    "] = 2"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3301a8a3-1b24-48f9-a0ac-1210fb413b58",
   "metadata": {},
   "source": [
    "### Step 10: Manually add institution level to remaining institutions without an IPEDS match."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1065,
   "id": "e6fa94ad-db1e-4118-abd4-cd99e46347be",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Update level to \"1\" where institution level is null.\n",
    "crosswalk.loc[crosswalk[\"institution_level\"].isnull(), \"institution_level\"] = 1"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dd2d1291-c873-4eba-9d7a-a3bb73f51f66",
   "metadata": {},
   "source": [
    "### Step 11: Add a label for institution level."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1067,
   "id": "5e6d272d-1db8-4d8a-b449-655aa1607855",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame({\"level\": [1, 2, 3, 1, 2]})\n",
    "\n",
    "# Mapping dictionary\n",
    "level_mapping = {\n",
    "    1: \"4YR\",\n",
    "    2: \"2YR\",\n",
    "    3: \"2YR\"\n",
    "}\n",
    "\n",
    "# Create new column with mapped labels\n",
    "crosswalk[\"label_institution_level\"] = crosswalk[\"institution_level\"].map(level_mapping)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9c0c4365-5621-4659-ab2d-5681aeee41b6",
   "metadata": {},
   "source": [
    "### Step 12: Write final crosswalk to a CSV file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1069,
   "id": "593c2a24-b61b-4024-a993-8e3756ca07a4",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Crosswalk data successfully written to institution_crosswalk.csv.\n"
     ]
    }
   ],
   "source": [
    "crosswalk.to_csv(\"institution_crosswalk.csv\", index=False)\n",
    "print(f\"Crosswalk data successfully written to institution_crosswalk.csv.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3bf0a633-87bd-4270-8f07-eae3d80f69e9",
   "metadata": {},
   "source": [
    "### Step 13: Delete the duplicate records file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1071,
   "id": "9b23d6b3-e76f-4b24-ad83-3671f7aabfb7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "duplicates.csv has been deleted.\n"
     ]
    }
   ],
   "source": [
    "file_path = \"duplicates.csv\"\n",
    "\n",
    "if os.path.exists(file_path):\n",
    "    os.remove(file_path)\n",
    "    print(f\"{file_path} has been deleted.\")\n",
    "else:\n",
    "    print(f\"{file_path} does not exist.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4e127dc6-e533-4cab-a7e2-304e26f27f08",
   "metadata": {},
   "source": [
    "### Step 14: Delete the IPEDS Institutional Characteristics survey file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1073,
   "id": "5a207d05-cc0d-4785-9551-d1e672107f8b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "hd2023.csv has been deleted.\n"
     ]
    }
   ],
   "source": [
    "file_path = \"hd2023.csv\"\n",
    "\n",
    "if os.path.exists(file_path):\n",
    "    os.remove(file_path)\n",
    "    print(f\"{file_path} has been deleted.\")\n",
    "else:\n",
    "    print(f\"{file_path} does not exist.\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:base] *",
   "language": "python",
   "name": "conda-base-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}