{ "cells": [ { "cell_type": "markdown", "id": "bc287ff6-cc37-4763-ba1d-2f9f92324094", "metadata": {}, "source": [ "# Get PSEO Earnings (All)\n", "\n", "This script retrieves and prepares all earnings data from the Postsecondary Employment Outcomes (PSEO) Explorer tool. The script first downloads and unzips a compressed data file with the raw PSEO Explorer data for all participating states. Next, the script downloads labels for the raw data and merges them with the original data file. Finally, the script exports the complete data set to a CSV file in the working directory.\n", "\n", "Documentation for specific data elements included in the dataset are available at https://lehd.ces.census.gov/data/schema/latest/lehd_public_use_schema.html. " ] }, { "cell_type": "markdown", "id": "d8349040-dd4b-4cc5-b63b-b4588e3ba1e6", "metadata": {}, "source": [ "### Step 1: Import required libraries." ] }, { "cell_type": "code", "execution_count": 2, "id": "f33151f4-45ec-4299-9d0e-4feccc97cd23", "metadata": {}, "outputs": [], "source": [ "import requests\n", "import gzip\n", "import shutil\n", "import os\n", "import pandas as pd\n", "import zipfile\n", "import re" ] }, { "cell_type": "markdown", "id": "d9e7919e-3c2b-49ea-954b-c9f135d451ef", "metadata": {}, "source": [ "### Step 2: Download, unzip, and read in the raw PSEO earnings data." ] }, { "cell_type": "code", "execution_count": 4, "id": "94936c74-849c-4e31-93f3-c1b41584422a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading file from https://lehd.ces.census.gov/data/pseo/latest_release/all/pseoe_all.csv.gz...\n", "File downloaded successfully as pseoe_all.csv.gz.\n", "Reading data from pseoe_all.csv.gz into a data frame ...\n", "Data read successfully into a data frame.\n", "Deleting pseoe_all.csv.gz...\n", "Cleanup complete.\n" ] } ], "source": [ "url = \"https://lehd.ces.census.gov/data/pseo/latest_release/all/pseoe_all.csv.gz\" # Link to the compressed data file.\n", "gz_filename = \"pseoe_all.csv.gz\" # Name of the compressed data file.\n", "unzipped_filename = \"pseoe_all.csv\" # Name of the data file within the compressed data file. \n", "\n", "try:\n", " # Step 1: Download the file.\n", " print(f\"Downloading file from {url}...\")\n", " response = requests.get(url, stream=True)\n", " response.raise_for_status() # Raise an HTTPError for bad responses (4xx and 5xx)\n", "\n", " # Write the downloaded content to a .gz file.\n", " with open(gz_filename, \"wb\") as gz_file:\n", " gz_file.write(response.content)\n", " print(f\"File downloaded successfully as {gz_filename}.\")\n", "\n", " # Step 2: Read the data directly from the .gz file into a data frame.\n", " print(f\"Reading data from {gz_filename} into a data frame ...\")\n", " with gzip.open(gz_filename, \"rt\") as gz_file:\n", " pseoe = pd.read_csv(gz_file, dtype={'institution':'string', 'degree_level':'string', 'cipcode':'string'}, low_memory=False)\n", " print(\"Data read successfully into a data frame.\")\n", "\n", " # Step 3: Clean up - delete the .gz file and the unzipped file if it exists.\n", " print(f\"Deleting {gz_filename}...\")\n", " os.remove(gz_filename)\n", " if os.path.exists(unzipped_filename):\n", " print(f\"Deleting {unzipped_filename}...\")\n", " os.remove(unzipped_filename)\n", "\n", " print(\"Cleanup complete.\")\n", "\n", "except requests.exceptions.RequestException as e:\n", " print(f\"Error during download: {e}\")\n", "except (OSError, IOError) as e:\n", " print(f\"Error during file handling: {e}\")\n", "except Exception as e:\n", " print(f\"An unexpected error occurred: {e}\")" ] }, { "cell_type": "markdown", "id": "f67589b4-7eb7-439d-a819-d25064fd83f4", "metadata": {}, "source": [ "### Step 3: Drop unused columns from the data frame." ] }, { "cell_type": "code", "execution_count": 6, "id": "4bad67ea-e557-4ed2-acfd-f7a59106267c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Unused columns dropped from the data frame.\n" ] } ], "source": [ "pseoe = pseoe.drop(columns=['geo_level', 'geography', 'ind_level', 'industry', 'status_y1_earnings',\t\n", " 'status_y1_grads_earn',\t'status_y5_earnings', 'status_y5_grads_earn', 'status_y10_earnings',\n", " 'status_y10_grads_earn',\t'status_y1_ipeds_count', 'status_y5_ipeds_count', 'status_y10_ipeds_count'])\n", "\n", "print(\"Unused columns dropped from the data frame.\")" ] }, { "cell_type": "markdown", "id": "31c560f6-91bb-4222-812f-1a0243cde3e0", "metadata": {}, "source": [ "### Step 4: Add aggregate level labels to the data frame." ] }, { "cell_type": "code", "execution_count": 8, "id": "79b9d04c-acde-4426-b326-80f263902284", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Aggregate level labels added to the data frame.\n" ] } ], "source": [ "url = \"https://lehd.ces.census.gov/data/schema/latest/label_agg_level_pseo.csv\"\n", "\n", "columns_to_read = ['agg_level_pseo', 'grad_char']\n", "\n", "agg_levels = pd.read_csv(url, usecols=columns_to_read)\n", "\n", "pseoe = pd.merge(pseoe, agg_levels, how='left', on='agg_level_pseo')\n", "\n", "print(\"Aggregate level labels added to the data frame.\")" ] }, { "cell_type": "markdown", "id": "3f45327e-aaeb-4455-8748-060362d62bd9", "metadata": {}, "source": [ "### Step 5: Add institution labels to the data frame." ] }, { "cell_type": "code", "execution_count": 10, "id": "70968e7d-0874-472f-b3d0-75ae6fee6620", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Institution labels added to the data frame.\n" ] } ], "source": [ "url = \"https://lehd.ces.census.gov/data/schema/latest/label_institution.csv\"\n", "\n", "columns_to_read = ['institution', 'label', 'city', 'institution_state']\n", "\n", "institutions = pd.read_csv(url, usecols=columns_to_read)\n", "\n", "pseoe = pd.merge(pseoe, institutions, how='left', on='institution')\n", "pseoe = pseoe.rename(columns={'label': 'label_institution'})\n", "pseoe = pseoe.rename(columns={'institution_state': 'state'})\n", "\n", "print(\"Institution labels added to the data frame.\")" ] }, { "cell_type": "markdown", "id": "95641036-0284-48b1-b625-fc18c85f4433", "metadata": {}, "source": [ "### Step 6: Add state name labels to the data frame." ] }, { "cell_type": "code", "execution_count": 12, "id": "62961910-fe53-4f3e-9287-1444e57a7d69", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "State name labels added to the data frame.\n" ] } ], "source": [ "df = {\n", " 'state': [\n", " 'AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA',\n", " 'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD',\n", " 'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ',\n", " 'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC',\n", " 'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY'\n", " ],\n", " 'label_state': [\n", " 'Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia',\n", " 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland',\n", " 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey',\n", " 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina',\n", " 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming'\n", " ]\n", "}\n", "\n", "states = pd.DataFrame(df)\n", "\n", "pseoe = pd.merge(pseoe, states, how='left', on='state')\n", "\n", "print(\"State name labels added to the data frame.\")" ] }, { "cell_type": "markdown", "id": "f114ebb7-7133-481e-a24d-9e2d9d24f103", "metadata": {}, "source": [ "### Step 7: Add state and state name labels for state-level aggregate records to the data frame." ] }, { "cell_type": "code", "execution_count": 14, "id": "92f92a62-b7ca-4aea-b65c-fb13890f3729", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "State and state name labels added to state-level aggregate records in the data frame.\n" ] } ], "source": [ "state_to_code = {\n", " 'Alabama': 'AL', 'Alaska': 'AK', 'Arizona': 'AZ', 'Arkansas': 'AR',\n", " 'California': 'CA', 'Colorado': 'CO', 'Connecticut': 'CT',\n", " 'Delaware': 'DE', 'Florida': 'FL', 'Georgia': 'GA',\n", " 'Hawaii': 'HI', 'Idaho': 'ID', 'Illinois': 'IL', 'Indiana': 'IN',\n", " 'Iowa': 'IA', 'Kansas': 'KS', 'Kentucky': 'KY', 'Louisiana': 'LA',\n", " 'Maine': 'ME', 'Maryland': 'MD', 'Massachusetts': 'MA', 'Michigan': 'MI',\n", " 'Minnesota': 'MN', 'Mississippi': 'MS', 'Missouri': 'MO',\n", " 'Montana': 'MT', 'Nebraska': 'NE', 'Nevada': 'NV', 'New Hampshire': 'NH',\n", " 'New Jersey': 'NJ', 'New Mexico': 'NM', 'New York': 'NY',\n", " 'North Carolina': 'NC', 'North Dakota': 'ND', 'Ohio': 'OH',\n", " 'Oklahoma': 'OK', 'Oregon': 'OR', 'Pennsylvania': 'PA',\n", " 'Rhode Island': 'RI', 'South Carolina': 'SC', 'South Dakota': 'SD',\n", " 'Tennessee': 'TN', 'Texas': 'TX', 'Utah': 'UT', 'Vermont': 'VT',\n", " 'Virginia': 'VA', 'Washington': 'WA', 'West Virginia': 'WV',\n", " 'Wisconsin': 'WI', 'Wyoming': 'WY',\n", "}\n", "\n", "def update_state_info(row):\n", " match = re.match(r'^Institutions in (.+)$', row['label_institution'])\n", " if match:\n", " state_name = match.group(1).strip()\n", " state_code = state_to_code.get(state_name)\n", " if state_code:\n", " if pd.isna(row['state']) or row['state'] == '':\n", " row['state'] = state_code\n", " if pd.isna(row['label_state']) or row['label_state'] == '':\n", " row['label_state'] = state_name\n", " return row\n", "\n", "pseoe = pseoe.apply(update_state_info, axis=1)\n", "\n", "print(\"State and state name labels added to state-level aggregate records in the data frame.\")" ] }, { "cell_type": "markdown", "id": "f14bfe5a-6624-4862-b124-76e5f512c429", "metadata": {}, "source": [ "### Step 8: Add institution level labels to the data frame." ] }, { "cell_type": "code", "execution_count": 16, "id": "10fca48c-606a-4b04-8066-78b8a6e0f970", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Institution level labels added to the data frame.\n" ] } ], "source": [ "url = \"https://lehd.ces.census.gov/data/schema/latest/label_inst_level.csv\"\n", "\n", "columns_to_read = ['inst_level', 'label']\n", "\n", "inst_levels = pd.read_csv(url, usecols=columns_to_read)\n", "\n", "pseoe = pd.merge(pseoe, inst_levels, how='left', on='inst_level')\n", "pseoe = pseoe.rename(columns={'label': 'label_inst_level'})\n", "\n", "print(\"Institution level labels added to the data frame.\")" ] }, { "cell_type": "markdown", "id": "b1272925-0218-419f-b9bc-09461a0c355c", "metadata": {}, "source": [ "### Step 9: Add degree level labels to the data frame." ] }, { "cell_type": "code", "execution_count": 18, "id": "c0fe367c-4918-45d1-b055-d15b4844fb8f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Degree level labels added to the data frame.\n" ] } ], "source": [ "url = \"https://lehd.ces.census.gov/data/schema/latest/label_degree_level.csv\"\n", "\n", "columns_to_read = ['degree_level', 'label']\n", "\n", "degree_levels = pd.read_csv(url, usecols=columns_to_read, dtype={'degree_level': 'string'})\n", "\n", "pseoe = pd.merge(pseoe, degree_levels, how='left', on='degree_level')\n", "pseoe = pseoe.rename(columns={'label': 'label_degree_level'})\n", "\n", "print(\"Degree level labels added to the data frame.\")" ] }, { "cell_type": "markdown", "id": "f6e3eab4-07a5-4278-87cd-960cc0280c8d", "metadata": {}, "source": [ "### Step 10: Add CIP level labels to the data frame." ] }, { "cell_type": "code", "execution_count": 20, "id": "32be63b5-b5b7-4299-bdee-bffb95793c65", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CIP level labels added to the data frame.\n" ] } ], "source": [ "url = \"https://lehd.ces.census.gov/data/schema/latest/label_cip_level.csv\"\n", "\n", "columns_to_read = ['cip_level', 'label']\n", "\n", "cip_levels = pd.read_csv(url, usecols=columns_to_read)\n", "\n", "pseoe = pd.merge(pseoe, cip_levels, how='left', on='cip_level')\n", "pseoe = pseoe.rename(columns={'label': 'label_cip_level'})\n", "\n", "print(\"CIP level labels added to the data frame.\")" ] }, { "cell_type": "markdown", "id": "5c69b6c3-0b17-46f6-8e17-ffa847da6fbf", "metadata": {}, "source": [ "### Step 11: Add CIP code labels to the data frame." ] }, { "cell_type": "code", "execution_count": 22, "id": "3689ced4-8e24-488e-82d2-22117ca1edbd", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CIP code labels added to the data frame.\n" ] } ], "source": [ "url = \"https://lehd.ces.census.gov/data/schema/latest/label_cipcode.csv\"\n", "\n", "columns_to_read = ['cipcode', 'label']\n", "\n", "cip_codes = pd.read_csv(url, usecols=columns_to_read, dtype={'cipcode': str})\n", "\n", "pseoe = pd.merge(pseoe, cip_codes, how='left', on='cipcode')\n", "pseoe = pseoe.rename(columns={'label': 'label_cipcode'})\n", "\n", "print(\"CIP code labels added to the data frame.\")" ] }, { "cell_type": "markdown", "id": "56df9e76-07e9-4496-8d4e-84b5fd6610ed", "metadata": {}, "source": [ "### Step 12: Add CIP code families to the data frame." ] }, { "cell_type": "code", "execution_count": 24, "id": "bc49a205-f2cb-404c-b2f3-eb604da718c8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CIP code families added to the data frame.\n" ] } ], "source": [ "pseoe['cipcode_family'] = pseoe['cipcode'].astype(str).str[:2]\n", "\n", "print(\"CIP code families added to the data frame.\")" ] }, { "cell_type": "markdown", "id": "f55720bc-ca8c-43a2-9855-45e08e896540", "metadata": {}, "source": [ "### Step 13: Add CIP code family labels to the data frame." ] }, { "cell_type": "code", "execution_count": 26, "id": "b1328e95-369c-40e5-b1d7-e7fbe6cef2d2", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CIP code family labels added to the data frame.\n" ] } ], "source": [ "url = \"https://lehd.ces.census.gov/data/schema/latest/label_cipcode.csv\"\n", "\n", "columns_to_read = ['cipcode', 'label']\n", "\n", "cip_code_families = pd.read_csv(url, usecols=columns_to_read, dtype={'cipcode': str})\n", "cip_code_families = cip_code_families.rename(columns={'cipcode':'cipcode_family'})\n", "\n", "pseoe = pd.merge(pseoe, cip_code_families, how='left', on='cipcode_family')\n", "pseoe = pseoe.rename(columns={'label': 'label_cipcode_family'})\n", "\n", "print(\"CIP code family labels added to the data frame.\")" ] }, { "cell_type": "markdown", "id": "f6021770-5c7f-46a6-abb5-fca4530c134c", "metadata": {}, "source": [ "### Step 14: Update CIP code labels for 2-digit CIP code records." ] }, { "cell_type": "code", "execution_count": 28, "id": "b3be9c38-3fa0-49c2-978d-027879008b4b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CIP code and CIP code names updated for 2-digit CIP code aggregate records.\n" ] } ], "source": [ "pseoe.loc[pseoe['cip_level'] == \"2\", 'cipcode'] = '00'\n", "\n", "pseoe.loc[pseoe['cip_level'] == \"2\", 'label_cipcode'] = (\n", " \"All Instructional Programs in \" + pseoe.loc[pseoe['cip_level'] == \"2\", 'label_cipcode']\n", ")\n", "\n", "print(\"CIP code and CIP code names updated for 2-digit CIP code aggregate records.\")" ] }, { "cell_type": "markdown", "id": "39a4f9ea-57e4-4b59-9ba2-2e36e093eb80", "metadata": {}, "source": [ "### Step 15: Add grad cohort labels to the data frame." ] }, { "cell_type": "code", "execution_count": 30, "id": "b10c4311-85bb-45bc-8b64-672725ab96d5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Grad cohort labels added to the data frame.\n" ] } ], "source": [ "pseoe['label_grad_cohort'] = pseoe.apply(\n", " lambda row: 'All Cohorts' if row['grad_cohort'] == 0 \n", " else f\"{row['grad_cohort']}-{row['grad_cohort'] + row['grad_cohort_years'] - 1}\", \n", " axis=1\n", ")\n", "\n", "print(\"Grad cohort labels added to the data frame.\")" ] }, { "cell_type": "markdown", "id": "a37f266b-fe50-48e9-8d4c-ea9cf3cd2372", "metadata": {}, "source": [ "### Step 16: Reorder the columns in the data frame." ] }, { "cell_type": "code", "execution_count": 32, "id": "947bb4db-52bc-484e-8f8b-586e704c3bab", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Columns in the data frame reordered for export.\n" ] } ], "source": [ "pseoe = pseoe[['agg_level_pseo', 'grad_char', 'inst_level', 'label_inst_level', 'institution', 'label_institution', 'city', 'state', 'label_state', \n", " 'degree_level', 'label_degree_level', 'cip_level', 'label_cip_level', 'cipcode', 'label_cipcode', 'cipcode_family',\n", " 'label_cipcode_family', 'grad_cohort', 'grad_cohort_years', 'label_grad_cohort', 'y1_p25_earnings', 'y1_p50_earnings', \n", " 'y1_p75_earnings', 'y1_grads_earn', 'y5_p25_earnings', 'y5_p50_earnings', 'y5_p75_earnings', 'y5_grads_earn', 'y10_p25_earnings',\n", " 'y10_p50_earnings', 'y10_p75_earnings', 'y10_grads_earn', 'y1_ipeds_count', 'y5_ipeds_count', 'y10_ipeds_count' \n", " ]]\n", "\n", "print(\"Columns in the data frame reordered for export.\")" ] }, { "cell_type": "markdown", "id": "6c1cabd3-2439-42df-b7e2-966bdd20b970", "metadata": {}, "source": [ "### Step 17: Write the data in the data frame to a CSV file." ] }, { "cell_type": "code", "execution_count": 34, "id": "3ef3e363-27c9-48d0-bb52-caeb22efa15e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The file pseoe_all.csv has been written to the working directory.\n" ] } ], "source": [ "pseoe.to_csv('pseoe_all.csv', index=False)\n", "print(\"The file pseoe_all.csv has been written to the working directory.\")" ] }, { "cell_type": "code", "execution_count": null, "id": "b039ff03-d5da-45a7-ad31-5a1359bbca1a", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:base] *", "language": "python", "name": "conda-base-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.7" } }, "nbformat": 4, "nbformat_minor": 5 }