Food Access Final Report

Completed as part of the Springboard Data Science Career Track

Posted by Michael J McFall on Aug 28, 2019

Table of Contents

Introduction

Data Wrangling

Exploratory Data Analysis

Building a Model

Recommendations

Conclusions


Introduction

Table of Contents
Grocery stores and convenience stores are not equally distributed across counties in the United States. Grocery stores typically sell a wide variety of foods and household goods, such as fresh fruits, vegetables, and meat as well as canned and prepared foods. Convenience stores normally carry only a limited variety of prepared foods, at a higher cost than grocery stores. This unequal distribution means that some areas do not have easy access to the fresh and cheaper foods available at grocery stores. In this project I will create a model that uses socioeconomic variables to predict the ratio of convenience stores to grocery stores in a county. A high value for this ratio likely means that it is more difficult to access a grocery store.


# Libraries and functions to be used
import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import ttest_ind
from sklearn.preprocessing import RobustScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split, learning_curve, validation_curve
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, ElasticNetCV
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import make_scorer, mean_squared_error
from yellowbrick.regressor import ResidualsPlot
    

Data Wrangling

Table of Contents
For this project, I identified three datasets which contained variables of interest:
  1. The United States Department of Agriculture’s Food Access Research Atlas
    1. Supplemental Data - County
      1. FIPS (A unique identifier for each county)
      2. County
      3. State
      4. Population Estimate, 2015
    2. STORES
      1. FIPS
      2. GROCPTH14 (Number of grocery stores per thousand population)
      3. SUPERCPTH14 (Number of supercenter stores per thousand population)
      4. CONVSPTH14 (Number of convenience stores per thousand population)
    3. SOCIOECONOMIC
      1. Kept all variables
  2. The Internal Revenue Service’s Statistics of Income
    1. FIPS (columns 0 & 2)
    2. agi_stub (The Adjusted Gross Income Bracket numbered 1-8)
      1. 1 = Under $1
      2. 2 = $1 under $10,000
      3. 3 = $10,000 under $25,000
      4. 4 = $25,000 under $50,000
      5. 5 = $50,000 under $75,000
      6. 6 = $75,000 under $100,000
      7. 7 = $100,000 under $200,000
      8. 8 = $200,000 or more
    3. N1 (The number of returns filed in a county in an income bracket)
  3. The American Community Survey’s 5-year estimates of educational attainment
    1. GEO.id2 (FIPS)
    2. HC02_EST_VC17 (% of population high school graduate or higher)
    3. HC02_EST_VC18 (% of population with a bachelor's degree or higher)

Importing the 3 datasets with only variables of interest


atlas = 'Data/Food environment atlas.xls'
county_columns = ['FIPS', 'County', 'State', 'Population Estimate, 2015']
county_data = pd.read_excel(atlas, 
                            sheet_name='Supplemental Data - County', 
                            usecols=county_columns)
stores_columns = ['FIPS', 'GROCPTH14', 'SUPERCPTH14', 'CONVSPTH14']
stores = pd.read_excel(atlas, sheet_name='STORES', usecols=stores_columns)
socioeconomic = pd.read_excel(atlas, sheet_name='SOCIOECONOMIC')

income = pd.read_csv('Data/15incyallagi.csv', usecols=[0,2,4,5])

education_columns = ['GEO.id2', 'HC02_EST_VC17', 'HC02_EST_VC18']
education = pd.read_csv('Data/ACS_15_5YR_S1501_with_ann.csv', 
                        usecols=education_columns, 
                        skiprows=[1], 
                        encoding='latin-1')

Transforming data before merging

Some state and county names have leading/trailing spaces that need to be removed. I also need to create a dictionary matching state with FIPS code.

county_data['County'] = county_data['County'].str.strip()
county_data['State'] = county_data['State'].str.strip()
state_FIPS = (pd.DataFrame([county_data['State'], 
                            county_data['FIPS'] // 1000])
              .transpose()
              .groupby(by='State').max()
              .to_dict()['FIPS']
              )

Next I create my variable of interest: the ratio of convenience stores to grocery stores and supercenters.


    stores['conv_to_groc'] = ((stores['CONVSPTH14'] /
                               (stores['GROCPTH14'] + stores['SUPERCPTH14']))
                              .replace(np.inf, np.nan))

Because I do not want to duplicate the state and county columns when merging, I drop them here.

socioeconomic.drop(['State', 'County'], axis='columns', inplace=True)

Renaming the education columns to something useful

education.columns = ['FIPS', 'high_school_pct', 'bachelor_pct']

The FIPS used in the other datasets is stored in two columns in the income data. The income data needs to be pivoted so that there is one row per county. The values should be normalized to the total number of returns filed in each county.


income['FIPS'] = ((income['STATEFIPS'].astype(str) +
                   income['COUNTYFIPS'].astype(str)
                                       .apply(str.rjust, args=(3, '0')))
                  .astype(int))
income_pivot = income.pivot_table(index='FIPS', 
                                  columns='agi_stub', 
                                  values='N1', 
                                  margins=True, 
                                  aggfunc=sum)
for col in income_pivot.columns:
    income_pivot[col] = income_pivot[col] / income_pivot['All'] * 100
income_pivot.drop('All', axis='columns', inplace=True)
income_pivot.columns = ['income_pct' + str(x) for x in range(1,9)]
income_pivot = income_pivot.reset_index()

The datasets are ready to be merged


merge1 = county_data.merge(stores, how='left', on='FIPS')
merge2 = merge1.merge(socioeconomic, how='left', on='FIPS')
merge3 = merge2.merge(education, how='left', on='FIPS')
merge4 = merge3.merge(income_pivot, how='left', on='FIPS')
merge4 = merge4.set_index(['State', 'County','FIPS'])
merge4.info()

MultiIndex: 3142 entries, (Alabama, Autauga, 1001) to (Wyoming, Weston, 56045)
Data columns (total 30 columns):
Population Estimate, 2015    3142 non-null object
GROCPTH14                    3140 non-null float64
SUPERCPTH14                  3140 non-null float64
CONVSPTH14                   3140 non-null float64
conv_to_groc                 3072 non-null float64
PCT_NHWHITE10                3140 non-null float64
PCT_NHBLACK10                3140 non-null float64
PCT_HISP10                   3140 non-null float64
PCT_NHASIAN10                3140 non-null float64
PCT_NHNA10                   3140 non-null float64
PCT_NHPI10                   3140 non-null float64
PCT_65OLDER10                3140 non-null float64
PCT_18YOUNGER10              3140 non-null float64
MEDHHINC15                   3139 non-null float64
POVRATE15                    3139 non-null float64
PERPOV10                     3140 non-null float64
CHILDPOVRATE15               3139 non-null float64
PERCHLDPOV10                 3140 non-null float64
METRO13                      3140 non-null float64
POPLOSS10                    3140 non-null float64
high_school_pct              3142 non-null float64
bachelor_pct                 3142 non-null float64
income_pct1                  3141 non-null float64
income_pct2                  3141 non-null float64
income_pct3                  3141 non-null float64
income_pct4                  3141 non-null float64
income_pct5                  3141 non-null float64
income_pct6                  3141 non-null float64
income_pct7                  3141 non-null float64
income_pct8                  3141 non-null float64
dtypes: float64(29), object(1)
memory usage: 791.3+ KB
    
It looks like the comma in the population is keeping it from being interpreted as a number. From looking at the data dictionary, I know that there are four categorical variables in this dataframe.

merge4['Population Estimate, 2015'] = (merge4['Population Estimate, 2015']
                                      .str.replace(',', '')
                                      .astype('float64')
                                      )
merge4['METRO13'] = merge4['METRO13'].astype('category')
merge4['PERCHLDPOV10'] = merge4['PERCHLDPOV10'].astype('category')
merge4['PERPOV10'] = merge4['PERPOV10'].astype('category')
merge4['POPLOSS10'] = merge4['POPLOSS10'].astype('category')
merge4.columns = merge4.columns.str.lower()

Missing Data


merge4[merge4.drop('conv_to_groc', axis=1).isna().sum(axis='columns') > 0]

There are three rows with missing data (besides conv_to_groc), each missing many values, so I will drop them.

merge4.dropna(thresh=29, inplace=True)

There are 67 rows where conv_to_groc is missing because the county has zero grocery stores and supercenters, so the ratio is undefined. Rows where the number of convenience stores is also zero will be set to 0. As will be seen later, one good predictor of this ratio is median household income, so to fill the rest of the missing values I will create 20 bins of the income variable and use the maximum ratio in each bin.


merge4['bins'] = pd.cut(merge4['medhhinc15'], bins=20)
ratio_by_income = merge4.groupby(by='bins')['conv_to_groc'].max()

def fill_ratio(row):
    # No grocery stores and no convenience stores: define the ratio as 0
    if np.isnan(row['conv_to_groc']) and row['convspth14'] == 0:
        return 0
    # Convenience stores but no grocery stores: use the maximum ratio in the row's income bin
    elif np.isnan(row['conv_to_groc']) and row['convspth14'] != 0:
        return ratio_by_income[row['bins']]
    else:
        return row['conv_to_groc']

merge4['conv_to_groc'] = merge4.apply(lambda row: fill_ratio(row), axis=1)
df = merge4.drop('bins', axis=1).copy()
df.to_pickle('Data/Capstone_Milestone.pkl')
    

Exploratory Data Analysis

Table of Contents

Questions to investigate:

  1. How are these data distributed?
  2. Which if any variables have a linear correlation?
  3. Are there strong correlations between pairs of independent variables or between an independent and a dependent variable?
  4. Which variables have the strongest correlations with the variable conv_to_groc?
  5. Are there significant differences between subgroups in your data that may be relevant to your project aim?

df.describe().transpose()

continuous = df.loc[:,df.dtypes == 'float64'].copy()
f, axes = plt.subplots(6, 5, figsize=(20,25))
for k, col in enumerate(continuous.columns):
    continuous.boxplot(column=col, ax=axes[k//5, k%5])

All of these boxplots have counties outside the whiskers, but most show a continuous distribution to the minimum or maximum. income_pct4 has a point very far from the rest.

df.describe().transpose()
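
The cell that traced this extreme income_pct4 value back to a specific county is not shown; a minimal lookup sketch (assuming the point of interest is the column's maximum, and using the column names created during wrangling) would be:

# Hypothetical lookup of the county with the most extreme income_pct4 value
outlier_county = df.loc[df['income_pct4'].idxmax()]
print(outlier_county[['population estimate, 2015', 'income_pct4']])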

There are 115 people in that county, so it is possible all the filed returns are in one bracket.


plt.figure(figsize=(12,10))
sns.heatmap(df.corr(), vmin=-1, vmax=1)
plt.show()

This heatmap mainly shows the obvious: household income is correlated with the poverty rate and with the other income brackets. To investigate the correlations better, I will filter the correlation matrix for |correlation| > 0.5.


corr_mat = df.iloc[:,:22].corr()
corr_mat5 = (corr_mat
              .apply(lambda srs: srs
                     .map(lambda x: 
                          None if ((x > -0.5) and (x < 0.5)) or x==1 else x)))
corr_mat5.iloc[corr_mat5.notnull().sum().values.astype(bool),
               corr_mat5.notnull().sum().values.astype(bool)]

Now we can observe some less trivial correlations, such as median household income increasing with the high school graduation rate, or the poverty rate increasing with the percentage of the population that is black.

As many of these larger correlations involve median household income, let's look first at its histogram and CDF.


df['medhhinc15'].hist(bins=22, range=(20000, 130000))
plt.xlabel('Median Household Income ($)')
plt.ylabel('Number of Counties')
plt.show()

plt.hist(df['medhhinc15'], 
         density=True, 
         cumulative=True, 
         label='CDF',
         histtype='step', 
         alpha=0.8, 
         color='k', 
         bins=22, 
         range=(20000, 130000))
plt.xlabel('Median Household Income ($)')
plt.ylabel('CDF')
plt.show()

These show that 80% of counties have a median household income of $60,000 or less, and that there is a very long tail.
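
This reading of the CDF can be double-checked directly from the data; a quick sketch (exact numbers depend on the cleaned dataset):

# Share of counties with a median household income at or below $60,000
print((df['medhhinc15'] <= 60000).mean() * 100)
# 80th percentile of median household income (quantile ignores any remaining NaNs)
print(df['medhhinc15'].quantile(0.8))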


f, axes = plt.subplots(1, 2, figsize=(15,5))
df.plot.scatter('high_school_pct', 'medhhinc15', alpha=0.5, ax=axes[0])
df.plot.scatter('bachelor_pct', 'medhhinc15', alpha=0.5, ax=axes[1])
axes[0].plot([70, 95], [25000, 40000], 'r-')
axes[0].plot([70, 95], [30000, 80000], 'r-')
axes[1].plot([10, 50], [25000, 65000], 'r-')
axes[1].plot([10, 50], [53000, 93000], 'r-')
plt.show()

While the variation in median household income appears to grow larger as high school graduation percent increases, the variation stays fairly uniform as the percentage with a bachelor's degree changes.


f, axes = plt.subplots(1, 2, figsize=(15,5))
df.plot.scatter('pct_nhblack10', 'medhhinc15', alpha=0.5, ax=axes[0])
df.plot.scatter('pct_nhblack10', 'povrate15', alpha=0.5, ax=axes[1])
plt.show()

(len(df[(df['pct_nhblack10']>50) & (df['medhhinc15']<40000)]) /
 len(df[df['medhhinc15']<40000]) * 100)

11.29

len(df[(df['pct_nhblack10']>50)]) / len(df) * 100

3.058

(len(df[(df['pct_nhblack10']>50) & (df['medhhinc15']<40000)]) /
 len(df[(df['pct_nhblack10']>50)]) * 100)

86.45

np.percentile(df['medhhinc15'], 25)

40438

These graphs and calculations show very disturbing trends in the United States. While majority-black counties make up only 3% of counties, they account for 11% of counties with a median household income under $40,000, and 86% of majority-black counties have a median household income under $40,000. Overall, fewer than 25% of counties are under $40,000 (the 25th percentile is $40,438).

Now that I have looked at the independent variables that are correlated with each other, I will look at which variables are correlated with conv_to_groc, the ratio of convenience stores to grocery stores and supercenters.


df.corr().iloc[:,4].sort_values()

grocpth14                   -0.441742
bachelor_pct                -0.223920
high_school_pct             -0.214420
medhhinc15                  -0.182606
income_pct8                 -0.165836
pct_nhasian10               -0.158044
income_pct7                 -0.139904
income_pct5                 -0.120325
population estimate, 2015   -0.120038
income_pct6                 -0.114188
pct_nhwhite10               -0.104797
income_pct1                 -0.072801
pct_65older10               -0.056630
pct_nhpi10                  -0.054722
supercpth14                 -0.046395
pct_nhna10                  -0.002776
pct_hisp10                   0.000140
income_pct2                  0.022149
pct_18younger10              0.051107
income_pct4                  0.096687
pct_nhblack10                0.182439
childpovrate15               0.189457
povrate15                    0.190641
income_pct3                  0.207669
convspth14                   0.363521
conv_to_groc                 1.000000

The variables most correlated to conv_to_groc are:

  • pct_nhblack10
  • high_school_pct
  • bachelor_pct
  • povrate15
  • medhhinc15

I have seen in previous plots that the first four are all correlated with median household income, so I will look at how it relates to the numbers of stores and to the ratio.


f, axes = plt.subplots(2, 2, figsize=(10,10))
df.plot.scatter('medhhinc15', 'grocpth14', alpha=0.5, ax=axes[0,0])
df.plot.scatter('medhhinc15', 'supercpth14', alpha=0.5, ax=axes[0,1])
df.plot.scatter('medhhinc15', 'convspth14', alpha=0.5, ax=axes[1,0])
df.plot.scatter('medhhinc15', 'conv_to_groc', alpha=0.5, ax=axes[1,1])
axes[0,0].plot([18000, 42000], [.75, -0.1], 'r-')
axes[1,0].plot([18000, 42000], [.75, -0.1], 'r-')
plt.show()

The red lines in the grocery and convenience store plots show that very few low-income counties have a low number of convenience stores, but many do have low numbers of grocery stores. This can also be inferred from the bottom-right graph, which shows that the only counties with a high convenience store to grocery + supercenter ratio are lower-income counties.
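
As a quick check of that last claim, one could look at the incomes of the highest-ratio counties; a sketch, where the cutoff of 10 is my own arbitrary choice rather than something from the original analysis:

# Median household income in counties with a very high convenience-to-grocery ratio
high_ratio = df[df['conv_to_groc'] > 10]   # arbitrary cutoff for "high"
print(len(high_ratio))
print(high_ratio['medhhinc15'].describe())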

Since these points have significant overlap, I will also plot hexbin density maps of these variables. Note that the y-axes have smaller ranges than in the scatter plots.


f, axes = plt.subplots(2, 2, figsize=(10,10))
df.plot.hexbin('medhhinc15', 'grocpth14', ylim=(0, 1.5), ax=axes[0,0], 
               colormap='winter', colorbar=False, gridsize=(20,20))
df.plot.hexbin('medhhinc15', 'supercpth14', ylim=(-0.005, 0.1), ax=axes[0,1], 
               colormap='winter', colorbar=False, gridsize=(20,20))
df.plot.hexbin('medhhinc15', 'convspth14', ylim=(0, 2), ax=axes[1,0], 
               colormap='winter', colorbar=False, gridsize=(20,20))
df.plot.hexbin('medhhinc15', 'conv_to_groc', ylim=(0, 10), ax=axes[1,1], 
               colormap='winter', colorbar=False, gridsize=(20,20))
plt.show()

Subgroup Analysis


# DataCamp bootstrap functions
def bootstrap_replicate_1d(data, func):
    return func(np.random.choice(data, size=len(data)))

def draw_bs_reps(data, func, size=1):
    """Draw bootstrap replicates."""

    # Initialize array of replicates: bs_replicates
    bs_replicates = np.empty(size)

    # Generate replicates
    for i in range(size):
        bs_replicates[i] = bootstrap_replicate_1d(data, func)

    return bs_replicates
    
# My bootstrap function
def compare(data1, data2, func, confidence=95, size=1):
    bs1 = draw_bs_reps(data1, func, size)
    bs2 = draw_bs_reps(data2, func, size)
    bs_diff = bs2 - bs1
    # Percentile bounds for the requested confidence level (e.g. 2.5 and 97.5 for 95%)
    bs_confidence = np.percentile(bs_diff, 
                                  [(100-confidence)/2, 
                                   confidence + (100-confidence)/2])
    # Observed difference on the original samples
    measured_diff = func(data2) - func(data1)
    
    return {'measured': measured_diff, 
            'bs_low': bs_confidence[0], 
            'bs_high': bs_confidence[1], 
            'bootstrap': bs_diff}
    
categorical_columns = df.loc[:, df.dtypes == 'category'].columns
bootstrap_categorical = {}
for col in categorical_columns:
    bootstrap_categorical[col] = compare(df.loc[df[col] == 0, 'conv_to_groc'],
                                         df.loc[df[col] == 1, 'conv_to_groc'],
                                         np.mean,
                                         95,
                                         10000)

bs_cat = pd.DataFrame(bootstrap_categorical).transpose()
errors = (np.array(list(zip(abs(bs_cat['bs_high']-bs_cat['measured']),
                            abs(bs_cat['bs_low']-bs_cat['measured']))))
          .transpose())
plt.figure(figsize=(5,5))
plt.barh(bs_cat.index, bs_cat['measured'], xerr=errors, color='white')
plt.axvline(0)
plt.show()

If the confidence interval overlaps 0, as it does for metro13, there is no significant difference. The subgroups of the other three variables, however, do have a significant difference in the mean ratio of convenience to grocery stores.
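
The same conclusion can be read off numerically from the bs_cat frame built above; a minimal sketch:

# A difference is significant at this confidence level if the interval excludes 0
ci = bs_cat[['measured', 'bs_low', 'bs_high']].astype(float)
ci['significant'] = (ci['bs_low'] > 0) | (ci['bs_high'] < 0)
print(ci)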


for col in categorical_columns:
    print(col, ttest_ind(df.loc[df[col] == 0, 'conv_to_groc'].dropna(),
                         df.loc[df[col] == 1, 'conv_to_groc'].dropna())[1])

perpov10      4.727e-12
perchldpov10  8.176e-18
metro13       0.269
poploss10     1.316e-05

The t-test gives the same result as bootstrapping a 95% confidence interval.


categorical_columns = ['perpov10', 'perchldpov10']
bootstrap_categorical = {}
for col in categorical_columns:
    bootstrap_categorical[col] = compare(df.loc[df[col] == 0, 'povrate15'].dropna(),
                                         df.loc[df[col] == 1, 'povrate15'].dropna(),
                                         np.mean,
                                         95,
                                         10000)
    print(col, list(bootstrap_categorical[col][x] for x in ['measured', 'bs_low', 'bs_high']))

perpov10      [13.21, 12.59, 13.85]
perchldpov10  [10.40, 9.945, 10.88]

categorical_columns = ['perpov10', 'perchldpov10']
bootstrap_categorical = {}
for col in categorical_columns:
    bootstrap_categorical[col] = compare(df.loc[df[col] == 0, 'povrate15'].dropna(),
                                         df.loc[df[col] == 1, 'povrate15'].dropna(),
                                         np.median,
                                         95,
                                         10000)
    print(col, list(bootstrap_categorical[col][x] for x in ['measured', 'bs_low', 'bs_high']))

perpov10      [12.20, 11.4, 13.1]
perchldpov10  [9.799, 9.20, 10.3]

pd.crosstab(df['perpov10'], df['perchldpov10'])

Most often, 'perpov10' and 'perchldpov10' are both 0.
If 'perchldpov10' is 1, there is a ~50% chance that 'perpov10' is also 1.
If 'perpov10' is 1, there is a ~98% chance that 'perchldpov10' is also 1.
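
These conditional percentages can be computed directly by normalizing the crosstab; a quick sketch:

# Row-normalized: distribution of perchldpov10 within each perpov10 group
print(pd.crosstab(df['perpov10'], df['perchldpov10'], normalize='index') * 100)
# Column-normalized: distribution of perpov10 within each perchldpov10 group
print(pd.crosstab(df['perpov10'], df['perchldpov10'], normalize='columns') * 100)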


Building a Model

Table of Contents

def rmse_score(reg, X, y):
    y_pred = reg.predict(X)
    return math.sqrt(mean_squared_error(y, y_pred))
    
X = df.drop(df.columns[1:5], axis=1)
X = X.reset_index(level='State')
X['State'] = X['State'].astype('category')
X = pd.get_dummies(X, drop_first=True)
y = df.iloc[:, 4].copy()

# I ran this once to create the holdout sets and save them. This way rerunning the notebook should give similar results.

# X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.3)
# X_test, X_final, y_test, y_final = train_test_split(X_holdout, y_holdout, test_size=0.5)

# X_train.to_pickle('Data/X_train.pkl')
# X_test.to_pickle('Data/X_test.pkl')
# X_final.to_pickle('Data/X_final.pkl')
# y_train.to_pickle('Data/y_train.pkl')
# y_test.to_pickle('Data/y_test.pkl')
# y_final.to_pickle('Data/y_final.pkl')

X_train = pd.read_pickle('Data/X_train.pkl')
X_test = pd.read_pickle('Data/X_test.pkl')
X_final = pd.read_pickle('Data/X_final.pkl')
y_train = pd.read_pickle('Data/y_train.pkl')
y_test = pd.read_pickle('Data/y_test.pkl')
y_final = pd.read_pickle('Data/y_final.pkl')

# For the linear regressions, the features should be scaled.
scaler = RobustScaler()
X_lin_train = scaler.fit_transform(X_train)
X_lin_test = scaler.transform(X_test)

y_stats = pd.DataFrame({'train': y_train.describe(),
                         'test': y_test.describe(),
                         'final': y_final.describe()})
y_stats
    

Regressions on dataset before feature engineering

I will start by calculating the RMSE of using the mean as the predicted value.


y_pred = np.full_like(y_test, y_train.mean())
print(math.sqrt(mean_squared_error(y_test, y_pred)))

1.879

As baseline models, I will start with simple linear regression and its regularized versions (lasso, ridge, and elastic net), as well as random forest and gradient boosting.


linear = LinearRegression()
lasso = LassoCV(alphas=[1e-2, 1e-1, 1, 10], 
                cv=10)
ridge = RidgeCV(alphas=[1e-3, 1e-2, 1e-1, 1, 10], 
                cv=10)
elastic = ElasticNetCV(l1_ratio=[.1, .5, .7, .9, .95, .99, 1], 
                       alphas=[1e-2, 1e-1, 1, 10], 
                       cv=10, 
                       n_jobs=-1)
regs = {'Linear': linear, 
        'Lasso': lasso, 
        'Ridge': ridge, 
        'Elastic Net': elastic}

for name, reg in regs.items():
    reg.fit(X_lin_train, y_train)
    print(name, rmse_score(reg, X_lin_test, y_test))

Linear       1.775
Lasso        1.789
Ridge        1.778
Elastic Net  1.784

rf = RandomForestRegressor(n_estimators=100, max_features='sqrt')
rf_scores = np.empty(10)
for i in range(10):
    rf.fit(X_train, y_train)
    rf_scores[i] = rmse_score(rf, X_test, y_test)
print(rf_scores.mean())

1.770

gb = GradientBoostingRegressor(max_features='sqrt', max_depth=1)
gb_scores = np.empty(10)
for i in range(10):
    gb.fit(X_train, y_train)
    gb_scores[i] = rmse_score(gb, X_test, y_test)
print(gb_scores.mean())

1.755

Smaller root mean squared errors (RMSE) are better, but there is no absolute scale for what counts as a good RMSE. I have found two ways people assess RMSE scores. The first is to compare it to the standard deviation of the test set. The second is to compare it to the RMSE of using the mean of the train set as the predicted value. For this test set both are ~1.88. After looking at the math of these methods, I realized that they are the same equation except that the first uses the mean of the test set and the second the mean of the train set, so the values should be similar.
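
To make that equivalence concrete, the two quantities can be computed side by side; a sketch (with ddof=0 the standard deviation is exactly the RMSE of predicting the test-set mean):

# RMSE of predicting the train-set mean vs. the population standard deviation of the test set
print(math.sqrt(mean_squared_error(y_test, np.full_like(y_test, y_train.mean()))))
print(y_test.std(ddof=0))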

With this method of determining goodness of fit, none of these models perform all that well, but the best is definitely gradient boosting.

Feature selection and engineering

Income Features

In the dataset there is a feature that contains the median household income in each county, and there is also a set of features that represent the percentage of income tax filings in each of eight brackets. I will investigate whether these interact in a way that affects the fit of the models.


# Remove the household income column
X_tax_train = X_train.loc[:, X_train.columns!='medhhinc15']
X_tax_test = X_test.loc[:, X_test.columns!='medhhinc15']
# Remove the tax bracket columns
X_inc_train = X_train.drop(X_train.columns[14:22], axis=1)
X_inc_test = X_test.drop(X_test.columns[14:22], axis=1)

gb1 = GradientBoostingRegressor(max_features='sqrt', max_depth=1)
gb1_scores = np.empty(10)
for i in range(10):
    gb1.fit(X_tax_train, y_train)
    gb1_scores[i] = rmse_score(gb1, X_tax_test, y_test)
print(gb1_scores.mean())

1.757

gb2 = GradientBoostingRegressor(max_features='sqrt', max_depth=1)
gb2_scores = np.empty(10)
for i in range(10):
    gb2.fit(X_inc_train, y_train)
    gb2_scores[i] = rmse_score(gb2, X_inc_test, y_test)
print(gb2_scores.mean())

1.757

These scores are very close to each other and to the score for the whole dataset, so I will not remove either set of features.

Interaction terms

I will use the PolynomialFeatures function to create interaction terms as well as squares of the features.


poly = PolynomialFeatures(2)
poly_X_train = poly.fit_transform(X_train.iloc[:, :22])
poly_X_test = poly.transform(X_test.iloc[:, :22])
X_poly_train = (pd.DataFrame(poly_X_train, 
                             columns=poly.get_feature_names(X_train.columns), 
                             index=X_train.index)
                  .merge(X_train.iloc[:, 22:], 
                         left_index=True, 
                         right_index=True))
X_poly_test = (pd.DataFrame(poly_X_test, 
                            columns=poly.get_feature_names(X_test.columns), 
                            index=X_test.index)
                 .merge(X_test.iloc[:, 22:], 
                        left_index=True, 
                        right_index=True))

gb3 = GradientBoostingRegressor(max_features='sqrt', max_depth=1)
gb3_scores = np.empty(10)
for i in range(10):
    gb3.fit(X_poly_train, y_train)
    gb3_scores[i] = rmse_score(gb3, X_poly_test, y_test)
print(gb3_scores.mean())

1.777

This seems to have increased the RMSE, so I will not use these features in the hyperparameter tuning.

Hyperparameter tuning


parameters = {'n_estimators': [50, 100, 200],
              'max_features': ['sqrt', 0.5, None],
              'max_depth': [1, 3, 5]}
gb_grid = GridSearchCV(GradientBoostingRegressor(), 
                       scoring={'MSE': make_scorer(mean_squared_error, 
                                greater_is_better=False)},
                       param_grid=parameters, 
                       refit='MSE',
                       cv=10, 
                       n_jobs=-1,
                       iid=False)
gb_grid.fit(X_train, y_train)
print(rmse_score(gb_grid, X_test, y_test))

1.757

gb_grid.best_params_

{'max_depth': 3, 'max_features': 'sqrt', 'n_estimators': 50}

Evaluating model

I will use Yellowbrick's ResidualsPlot visualizer to look at the distribution of residuals versus predicted values.

visualizer = ResidualsPlot(gb_grid.best_estimator_)
visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
visualizer.poof()
    

Even though this model is more likely to over-predict the convenience to grocery store ratio, the largest residuals are from very large under-predictions.
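
One way to quantify this asymmetry on the test set; a sketch where residuals are defined as actual minus predicted, so large positive values are under-predictions:

# Quantiles of the test-set residuals; a long right tail indicates large under-predictions
resid = y_test - gb_grid.best_estimator_.predict(X_test)
print(resid.quantile([0.01, 0.25, 0.5, 0.75, 0.99]))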

Throughout this analysis, I have been using test data to compare models. Now that I am done comparing models, I will use the final holdout set to evaluate the final model.


y_pred = np.full_like(y_final, y_train.mean())
print(math.sqrt(mean_squared_error(y_final, y_pred)))

2.4878948152705957

rmse_score(gb_grid.best_estimator_, X_final, y_final)

2.319

This RMSE appears to be much worse, but as a percentage of the holdout set's standard deviation it is similar to that of the test set; a quick check of this ratio is sketched below. I will then look at the residuals plot for this holdout set as well.
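
A minimal sketch of that comparison, using the population standard deviation (ddof=0) of each holdout set:

# RMSE relative to the spread of each holdout set
print(rmse_score(gb_grid.best_estimator_, X_test, y_test) / y_test.std(ddof=0))
print(rmse_score(gb_grid.best_estimator_, X_final, y_final) / y_final.std(ddof=0))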


visualizer = ResidualsPlot(gb_grid.best_estimator_)
visualizer.fit(X_train, y_train)
visualizer.score(X_final, y_final)
visualizer.poof()
    

Even though the RMSE is much worse for the 'final' holdout set compared to the 'test' holdout, the R2 value and the distribution of residuals are similar. The difference appears to be that the 'final' holdout set contains more counties where this model under-predicts by a wide margin.


Recommendations

Table of Contents

Someone interested in ensuring everyone has access to the cheaper, fresher food available at grocery stores would likely have two main concerns: improving access and maintaining access. Counties where the first is most needed can be identified by looking for the largest values of the conv_to_groc column. My model provides a way of finding counties that may be most at risk of losing access by identifying the counties where the predicted value is much larger than the current value.


y_pred = gb_grid.best_estimator_.predict(X)
y_pred = pd.DataFrame(y_pred, index=y.index)
predictions = pd.DataFrame(y).join(y_pred)
predictions.columns = ['y_true', 'y_pred']
predictions['residuals'] = predictions['y_pred'] - predictions['y_true']
top5perc = np.percentile(predictions['residuals'], 95)
over_predictions = predictions[(predictions['residuals'] > top5perc) &
                               (predictions['y_true'] > 0) &
                               (predictions['y_pred'] > 4)]
over_predictions.sort_values(by='residuals', ascending=False)
    

Conclusions

Table of Contents

I began this project by finding these datasets and transforming them into a usable format. I then explored the data using various types of graphs and statistical techniques. I found multiple features that are correlated with the ratio of convenience stores to grocery stores in a county, including:

  • percent of the population who identifies as non-hispanic black
  • percent of the population who has graduated high school
  • percent of the population who has a bachelor's degree
  • poverty rate
  • median household income

The first four are also correlated with the last. Finally, I trained a predictive model on this dataset. Gradient boosting performed the best, but even after tuning the hyperparameters, it still had a fairly high RMSE and low R2.

Future Work

After working through much of this project, I found data on the percent of a county's population living within a certain distance of a grocery store. Since this directly measures access, it would be interesting to perform this analysis again with that variable as the target and compare it with this indirect measure of access.
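
As a rough illustration of what that could look like, the sketch below assumes the atlas workbook has an ACCESS sheet containing a low-access column; the sheet name and the column name PCT_LACCESS_POP15 are assumptions that would need to be checked against the atlas data dictionary.

# Hypothetical target swap: percent of population with low access to a grocery store
access_columns = ['FIPS', 'PCT_LACCESS_POP15']        # assumed column names
access = pd.read_excel(atlas, sheet_name='ACCESS',    # assumed sheet name
                       usecols=access_columns)
df_access = df.reset_index().merge(access, on='FIPS', how='left')
y_new = df_access['PCT_LACCESS_POP15']                # would replace conv_to_groc as the target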