{ "cells": [ { "cell_type": "markdown", "id": "54413574-4b98-4df0-9272-92df0d4b9262", "metadata": {}, "source": [ "# Get PSEO IPEDS Crosswalk\n", "\n", "This script creates a crosswalk of data from PSEO member institutions and the Integrated Postsecondary Education Data System (IPEDS). OPEID is used to initially match institutions, after which exceptions are handled individually. " ] }, { "cell_type": "markdown", "id": "e9cd69d2-f62b-4c34-a992-6a4d4a840d42", "metadata": {}, "source": [ "### Step 1: Import required libraries." ] }, { "cell_type": "code", "execution_count": 1047, "id": "d59b1103-1518-4f8b-9acd-fefd13282406", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import requests\n", "import zipfile\n", "import os" ] }, { "cell_type": "markdown", "id": "ad7a04be-b14d-4884-99a4-05df82a50639", "metadata": {}, "source": [ "### Step 2: Fetch the list of PSEO member institutions." ] }, { "cell_type": "code", "execution_count": 1049, "id": "864b210c-17dc-4379-a391-86656e512dd7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "List of PSEO member institutions successfully downloaded.\n" ] } ], "source": [ "# Define the URL for the list of institutions.\n", "url = \"https://lehd.ces.census.gov/data/pseo/latest_release/all/pseo_all_institutions.csv\"\n", "\n", "# Read the CSV file into a data frame.\n", "pseo = pd.read_csv(url, dtype={\"institution\":str})\n", "pseo = pseo[[\"institution\", \"label\"]]\n", "pseo.rename(columns={\"label\":\"label_institution_pseo\"}, inplace=True)\n", "\n", "# Print a success message.\n", "print(f\"List of PSEO member institutions successfully downloaded.\")" ] }, { "cell_type": "markdown", "id": "ecdf437a-3126-4ace-9445-9882e586405c", "metadata": {}, "source": [ "### Step 3: Fetch the most recent data from the IPEDS Institutional Characteristics survey." ] }, { "cell_type": "code", "execution_count": 1051, "id": "d822efcd-64ac-4cde-866f-8e999cca652b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading HD2023.zip...\n", "Downloaded HD2023.zip.\n", "Extracting HD2023.zip...\n", "Extracted contents of HD2023.zip.\n", "Deleted HD2023.zip.\n" ] } ], "source": [ "# Define the parameters for downloading and unzipping the IPEDS survey file.\n", "url = 'https://nces.ed.gov/ipeds/datacenter/data/HD2023.zip'\n", "zip_filename = 'HD2023.zip'\n", "\n", "# Download the ZIP file.\n", "print(f\"Downloading {zip_filename}...\")\n", "response = requests.get(url)\n", "response.raise_for_status()\n", "\n", "# Save the ZIP file to disk.\n", "with open(zip_filename, \"wb\") as f:\n", " f.write(response.content)\n", "print(f\"Downloaded {zip_filename}.\")\n", "\n", "# Extract the contents of the ZIP file.\n", "print(f\"Extracting {zip_filename}...\")\n", "with zipfile.ZipFile(zip_filename, 'r') as zip_ref:\n", " zip_ref.extractall()\n", "print(f\"Extracted contents of {zip_filename}.\")\n", "\n", "# Delete the ZIP file.\n", "os.remove(zip_filename)\n", "print(f\"Deleted {zip_filename}.\")\n", "\n", "# Read the data into a data frame.\n", "ipeds = pd.read_csv(\"HD2023.csv\", encoding=\"latin1\", dtype={\"UNITID\":str, \"OPEID\":str})" ] }, { "cell_type": "markdown", "id": "f51eafce-5b99-41ae-94ea-0df00175cbe4", "metadata": {}, "source": [ "### Step 4: Drop unused columns from the IPEDS data frame." ] }, { "cell_type": "code", "execution_count": 1053, "id": "f3c451ff-6c0c-43b3-879e-012a5ec8100b", "metadata": {}, "outputs": [], "source": [ "ipeds = ipeds[[\"UNITID\", \"INSTNM\", \"OPEID\", \"ICLEVEL\"]]\n", "ipeds.rename(columns={\"UNITID\":\"ipeds_id\", \"INSTNM\":\"label_institution_ipeds\", \"OPEID\": \"opeid\", \"ICLEVEL\":\"institution_level\"}, inplace=True)" ] }, { "cell_type": "markdown", "id": "15987d64-f1aa-494d-92f3-4335128f81b9", "metadata": {}, "source": [ "### Step 5: Merge the PSEO and IPEDS data frame on the OPEID." ] }, { "cell_type": "code", "execution_count": 1055, "id": "95ff5491-a8fb-4f40-8713-00246245883a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " institution label_institution_pseo ipeds_id \\\n", "0 00105100 University of Alabama 100751 \n", "1 00105200 University of Alabama at Birmingham 100663 \n", "2 00105500 University of Alabama in Huntsville 100706 \n", "3 00108100 Arizona State University 104151 \n", "4 00108200 Northern Arizona University 105330 \n", "\n", " label_institution_ipeds opeid institution_level \n", "0 The University of Alabama 00105100 1.0 \n", "1 University of Alabama at Birmingham 00105200 1.0 \n", "2 University of Alabama in Huntsville 00105500 1.0 \n", "3 Arizona State University Campus Immersion 00108100 1.0 \n", "4 Northern Arizona University 00108200 1.0 \n" ] } ], "source": [ "crosswalk = pseo.merge(ipeds, how=\"left\", left_on=\"institution\", right_on=\"opeid\")\n", "print(crosswalk.head())" ] }, { "cell_type": "markdown", "id": "073f1755-9416-4134-a1e9-208770c5b2f8", "metadata": {}, "source": [ "### Step 6: Identify and extract duplicate records for review." ] }, { "cell_type": "code", "execution_count": 1057, "id": "b674c73c-e0a6-45c0-950f-696a0a3a9af3", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Duplicate records exported to file duplicates.csv.\n" ] } ], "source": [ "duplicates = crosswalk[crosswalk.duplicated(subset=\"label_institution_pseo\", keep=False)]\n", "duplicates.to_csv(\"duplicates.csv\", index=False)\n", "print(f\"Duplicate records exported to file duplicates.csv.\")" ] }, { "cell_type": "markdown", "id": "f612063a-f289-4d5b-b29a-7bad0af0dc19", "metadata": {}, "source": [ "### Step 7: Remove duplicate records where institutions are not matched correctly." ] }, { "cell_type": "code", "execution_count": 1059, "id": "e1b15d3d-8bf0-4804-b65a-fb57b2553171", "metadata": {}, "outputs": [], "source": [ "crosswalk = crosswalk[~crosswalk[\"ipeds_id\"].isin([\"128300\", \"479956\", \"492962\", \"443711\", \"439154\", \"491288\", \"491297\", \"228732\", \"440916\", \"160533\"])]" ] }, { "cell_type": "markdown", "id": "9b36557d-68f9-414f-a6d9-a17efc2aabc4", "metadata": {}, "source": [ "### Step 8: Confirm duplicate records have been removed." ] }, { "cell_type": "code", "execution_count": 1061, "id": "a8282563-52b2-4793-a619-2cfd218c5440", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Empty DataFrame\n", "Columns: [institution, label_institution_pseo, ipeds_id, label_institution_ipeds, opeid, institution_level]\n", "Index: []\n" ] } ], "source": [ "duplicates = crosswalk[crosswalk.duplicated(subset=\"institution\", keep=False)]\n", "print(duplicates)" ] }, { "cell_type": "markdown", "id": "b943ab44-83e0-49ac-b2d0-4c2ae211ca5e", "metadata": {}, "source": [ "### Step 9: Manually add institution level to two-year institutions without an IPEDS match." ] }, { "cell_type": "code", "execution_count": 1063, "id": "eb1cb10c-d89f-430f-b45a-89afbe1770ce", "metadata": {}, "outputs": [], "source": [ "# List of 2YR institutions without an IPEDS match.\n", "target_institutions = [\n", " \"00139200\", \n", " \"00139800\", \n", " \"00139900\", \n", " \"00177900\", \n", " \"00235000\", \n", " \"00235600\", \n", " \"00311601\", \n", " \"00451300\", \n", " \"00554900\", \n", " \"00677500\", \n", " \"00698100\", \n", " \"00698200\", \n", " \"00803800\", \n", " \"00842300\", \n", " \"00854700\", \n", " \"00976400\", \n", " \"00976500\", \n", " \"00978600\", \n", " \"00992300\", \n", " \"00992400\", \n", " \"00992500\", \n", " \"00992600\", \n", " \"01003700\", \n", " \"01003800\", \n", " \"01003900\", \n", " \"01004000\", \n", " \"01004100\", \n", " \"01010900\", \n", " \"01053000\", \n", " \"01115000\", \n", " \"02074400\", \n", " \"03521300\", \n", " \"00356100\",\n", " \"00400900\",\n", " \"00850400\",\n", " \"00851000\",\n", " \"02100200\",\n", " \"03538300\",\n", " \"00457901\",\n", " \"00178400\",\n", " \"00524600\",\n", " \"00850300\",\n", " \"02077400\",\n", " \"02245505\",\n", " \"00548800\",\n", " \"00552600\"\n", "]\n", "\n", "# Update level to \"2\" where institution level is null and institution is in the list of 2YR institutions.\n", "crosswalk.loc[\n", " crosswalk[\"institution_level\"].isnull() & crosswalk[\"institution\"].isin(target_institutions),\n", " \"institution_level\"\n", "] = 2" ] }, { "cell_type": "markdown", "id": "3301a8a3-1b24-48f9-a0ac-1210fb413b58", "metadata": {}, "source": [ "### Step 10: Manually add institution level to remaining institutions without an IPEDS match." ] }, { "cell_type": "code", "execution_count": 1065, "id": "e6fa94ad-db1e-4118-abd4-cd99e46347be", "metadata": {}, "outputs": [], "source": [ "# Update level to \"1\" where institution level is null.\n", "crosswalk.loc[crosswalk[\"institution_level\"].isnull(), \"institution_level\"] = 1" ] }, { "cell_type": "markdown", "id": "dd2d1291-c873-4eba-9d7a-a3bb73f51f66", "metadata": {}, "source": [ "### Step 11: Add a label for institution level." ] }, { "cell_type": "code", "execution_count": 1067, "id": "5e6d272d-1db8-4d8a-b449-655aa1607855", "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame({\"level\": [1, 2, 3, 1, 2]})\n", "\n", "# Mapping dictionary\n", "level_mapping = {\n", " 1: \"4YR\",\n", " 2: \"2YR\",\n", " 3: \"2YR\"\n", "}\n", "\n", "# Create new column with mapped labels\n", "crosswalk[\"label_institution_level\"] = crosswalk[\"institution_level\"].map(level_mapping)" ] }, { "cell_type": "markdown", "id": "9c0c4365-5621-4659-ab2d-5681aeee41b6", "metadata": {}, "source": [ "### Step 12: Write final crosswalk to a CSV file." ] }, { "cell_type": "code", "execution_count": 1069, "id": "593c2a24-b61b-4024-a993-8e3756ca07a4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Crosswalk data successfully written to institution_crosswalk.csv.\n" ] } ], "source": [ "crosswalk.to_csv(\"institution_crosswalk.csv\", index=False)\n", "print(f\"Crosswalk data successfully written to institution_crosswalk.csv.\")" ] }, { "cell_type": "markdown", "id": "3bf0a633-87bd-4270-8f07-eae3d80f69e9", "metadata": {}, "source": [ "### Step 13: Delete the duplicate records file." ] }, { "cell_type": "code", "execution_count": 1071, "id": "9b23d6b3-e76f-4b24-ad83-3671f7aabfb7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "duplicates.csv has been deleted.\n" ] } ], "source": [ "file_path = \"duplicates.csv\"\n", "\n", "if os.path.exists(file_path):\n", " os.remove(file_path)\n", " print(f\"{file_path} has been deleted.\")\n", "else:\n", " print(f\"{file_path} does not exist.\")" ] }, { "cell_type": "markdown", "id": "4e127dc6-e533-4cab-a7e2-304e26f27f08", "metadata": {}, "source": [ "### Step 14: Delete the IPEDS Institutional Characteristics survey file." ] }, { "cell_type": "code", "execution_count": 1073, "id": "5a207d05-cc0d-4785-9551-d1e672107f8b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "hd2023.csv has been deleted.\n" ] } ], "source": [ "file_path = \"hd2023.csv\"\n", "\n", "if os.path.exists(file_path):\n", " os.remove(file_path)\n", " print(f\"{file_path} has been deleted.\")\n", "else:\n", " print(f\"{file_path} does not exist.\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:base] *", "language": "python", "name": "conda-base-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.7" } }, "nbformat": 4, "nbformat_minor": 5 }