Urban Health Risk Mapping

Estimating health outcomes at the neighborhood scale is important for promoting urban health, but it has traditionally been costly and time-consuming. The Urban Health Risk Mapping project uses crowdsourced data and machine learning to predict census tract-level health outcomes for ten major US cities: Austin, Baltimore, Boston, Dallas, Washington, D.C., Houston, Los Angeles, New York City, San Antonio, and San Francisco. Compared with traditional survey methods, the machine learning approach requires less time and money.

Austin

Baltimore

Boston

Dallas

Houston

Los Angeles

New York City

San Antonio

San Francisco

Washington D.C.

The project consists of four parts: (1) database development, (2) modeling and analytics, (3) visualization and web development, and (4) community engagement and application. The first two parts cover building, training, and testing the machine learning models.

The targets are health outcomes, namely the prevalence of common non-communicable chronic diseases such as coronary heart disease, cancer, diabetes, poor mental health, obesity, and stroke. The observed health outcomes used to train and test the models come from the CDC’s 500 Cities Project. The features are derived from three data sources: the CDC’s Social Vulnerability Index (SVI) dataset, the EPA’s Smart Location Database (SLD), and the 311 service request datasets published by each municipality. Sixty features (i.e., predictor variables) are considered; they characterize the social environment, the physical environment, and the aspects and degrees of neighborhood disorder.

A variety of machine learning algorithms are applied and compared, including Ridge Regression, Lasso Regression, Elastic Net, Support Vector Machine, Decision Tree, Random Forest, Extra Trees, and Gradient Boosting. To improve model performance, the hyperparameters are tuned using 10-fold cross-validation, and different sets of features are tested.
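As a rough illustration of tuning a hyperparameter with 10-fold cross-validation, the sketch below fits a closed-form Ridge Regression in plain Python and selects the penalty strength from a small grid. It is not the project's code: the data are synthetic, the grid values are arbitrary, and every function name (`solve`, `fit_ridge`, `predict`, `cv_rmse`) is a hypothetical helper introduced here. A machine learning library would normally take the place of these hand-written helpers.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def fit_ridge(X, y, alpha):
    """Closed-form ridge on centered data: (X'X + alpha*I) w = X'y."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    A = [[sum(r[i] * r[j] for r in Xc) + (alpha if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    b = [sum(r[i] * t for r, t in zip(Xc, yc)) for i in range(p)]
    w = solve(A, b)
    intercept = y_mean - sum(wj * mj for wj, mj in zip(w, x_mean))
    return w, intercept

def predict(model, X):
    w, intercept = model
    return [intercept + sum(wj * xj for wj, xj in zip(w, row)) for row in X]

def cv_rmse(X, y, alpha, k=10, seed=0):
    """Root-mean-square error pooled over k held-out folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)          # same folds for every alpha
    folds = [idx[i::k] for i in range(k)]
    sq_err = 0.0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = fit_ridge([X[i] for i in train], [y[i] for i in train], alpha)
        preds = predict(model, [X[i] for i in fold])
        sq_err += sum((p - y[i]) ** 2 for p, i in zip(preds, fold))
    return (sq_err / len(X)) ** 0.5

# Synthetic stand-in for tract-level data: 200 "tracts", 3 made-up features.
rng = random.Random(42)
X = [[rng.uniform(0, 50) for _ in range(3)] for _ in range(200)]
y = [10 + 0.3 * r[0] - 0.2 * r[1] + rng.gauss(0, 1) for r in X]

alpha_grid = [0.01, 0.1, 1.0, 10.0, 100.0]    # arbitrary illustrative grid
scores = {a: cv_rmse(X, y, a) for a in alpha_grid}
best_alpha = min(scores, key=scores.get)
print(best_alpha, round(scores[best_alpha], 3))
```

The same loop structure applies to any of the algorithms listed above: only the fitting function and the hyperparameter grid change.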

The results show that the tract-level prevalence of common non-communicable chronic diseases can be predicted reasonably well from publicly available datasets. The study also yields two major findings: (1) sociodemographic and socioeconomic variables are the strongest predictors of tract-level health outcomes; (2) historical records of 311 service requests are a useful complementary data source, because the information distilled from the 311 data often improves the models’ performance.

The datasets and the predictive models are published online. Users can explore the models interactively through the web tools we developed. These tools help the public and city officials evaluate future scenarios and understand how changes in neighborhood conditions can lead to changes in health outcomes.
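To make the idea of scenario evaluation concrete, here is a minimal sketch that changes one neighborhood condition and compares the predicted outcome before and after. The linear model, its coefficients, the intercept, and the tract values are all made up for illustration; they are not the project's fitted models or data. Only the variable abbreviations (`P_POV`, `P_UNINSUR`, `D_POP`) come from the variable table on this page.

```python
# Hypothetical coefficients for illustration only; NOT the project's fitted model.
COEF = {"P_POV": 0.12, "P_UNINSUR": 0.08, "D_POP": -0.01}
INTERCEPT = 6.0

def predict_outcome(tract):
    """Predicted prevalence (%) from a toy linear model."""
    return INTERCEPT + sum(COEF[name] * tract[name] for name in COEF)

def scenario_delta(tract, changes):
    """Change in the prediction when some tract features are replaced."""
    modified = {**tract, **changes}
    return predict_outcome(modified) - predict_outcome(tract)

# A made-up tract, and a scenario where poverty falls from 25% to 20%.
baseline = {"P_POV": 25.0, "P_UNINSUR": 18.0, "D_POP": 12.0}
delta = scenario_delta(baseline, {"P_POV": 20.0})
print(f"Change in predicted prevalence: {delta:+.2f} percentage points")
```

With a nonlinear model such as Random Forest or Gradient Boosting, the same before-and-after comparison applies, but the size of the change depends on the tract's other features as well.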

Data Sources

The census tract-level health data are drawn from the CDC’s 500 Cities Project dataset (https://chronicdata.cdc.gov/browse?category=500+Cities).

The built environment variables are calculated from the EPA’s Smart Location Database (SLD) (https://www.epa.gov/smartgrowth/smart-location-mapping#SLD).

The socioeconomic and sociodemographic variables are extracted from the CDC’s Social Vulnerability Index (SVI) dataset (https://svi.cdc.gov/data-and-tools-download.html).

The 311 data are accessed from the open data portal of each municipality:

Austin: https://data.austintexas.gov/Utilities-and-City-Services/Austin-311-Public-Data/xwdj-i9he
Baltimore: https://data.baltimorecity.gov/City-Services/311-Customer-Service-Requests/9agw-sxsr
Boston: https://data.boston.gov/dataset/311-service-requests
Dallas (from October 1, 2016): https://www.dallasopendata.com/City-Services/311-Service-Requests-October-1-2016-to-September-3/shgm-yzbp
Dallas (October 1, 2018 to present): https://www.dallasopendata.com/City-Services/311-Service-Requests-October-1-2018-to-Present-/m36q-vtbr
Washington, D.C.: https://opendata.dc.gov/datasets/311-city-service-requests-in-2019
Houston: http://www.houstontx.gov/311/
Los Angeles: https://data.lacity.org/browse?q=311&sortBy=relevance&page=2
New York City: https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9
San Antonio: https://data.sanantonio.gov/dataset/service-calls
San Francisco: https://data.sfgov.org/City-Infrastructure/311-Cases/vw6y-z8j6/data
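One plausible way to turn raw 311 records into tract-level features is to count requests by category within each tract and normalize by population. The sketch below assumes each record already carries a census tract identifier and a request category; the field names (`tract`, `category`), the tract IDs, and the population figures are hypothetical, and the actual portals use different schemas that typically require geocoding first. It is an illustration, not the project's feature pipeline.

```python
from collections import Counter

def requests_per_1000(requests, population):
    """Requests per 1,000 residents, keyed by (tract, category)."""
    counts = Counter((r["tract"], r["category"]) for r in requests)
    return {
        (tract, category): 1000.0 * n / population[tract]
        for (tract, category), n in counts.items()
    }

# Made-up records and populations, for illustration only.
requests = [
    {"tract": "T001", "category": "graffiti"},
    {"tract": "T001", "category": "graffiti"},
    {"tract": "T001", "category": "illegal dumping"},
    {"tract": "T002", "category": "graffiti"},
]
population = {"T001": 4000, "T002": 2500}

rates = requests_per_1000(requests, population)
for key in sorted(rates):
    print(key, rates[key])
```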

Abbreviations and Descriptions of Variables

Variable	Abbreviation	Data Source
Outcome variable
Arthritis among adults aged ≥ 18 years (%)	ARTHRITIS	CDC’s 500 Cities Project
High blood pressure among adults aged ≥ 18 years (%)	BPHIGH
Cancer (excluding skin cancer) among adults aged ≥ 18 years (%)	CANCER
Current asthma prevalence among adults aged ≥ 18 years (%)	CASTHMA
Coronary heart disease among adults aged ≥ 18 years (%)	CHD
Chronic obstructive pulmonary disease among adults aged ≥ 18 years (%)	COPD
Diagnosed diabetes among adults aged ≥ 18 years (%)	DIABETES
High cholesterol among adults aged ≥ 18 years who have been screened in the past 5 years (%)	HIGHCHOL
Chronic kidney disease among adults aged ≥ 18 years (%)	KIDNEY
Mental health not good for ≥ 14 days among adults aged ≥ 18 years (%)	MHLTH
Physical health not good for ≥ 14 days among adults aged ≥ 18 years (%)	PHLTH
Stroke among adults aged ≥ 18 years (%)	STROKE
All teeth lost among adults aged ≥ 65 years (%)	TEETHLOST
Binge drinking prevalence among adults aged ≥ 18 years (%)	BINGE
Current smoking among adults aged ≥ 18 years (%)	CSMOKING
No leisure-time physical activity among adults aged ≥ 18 years (%)	LPA
Obesity among adults aged ≥ 18 years (%)	OBESITY
Sleeping less than 7 hours among adults aged ≥ 18 years (%)	SLEEP
Note: The column names for the predicted health outcome values are made simply by prefixing a lowercase ‘p’ before the variable names shown above. For example, ‘ARTHRITIS’ becomes ‘pARTHRITIS’.
Predictor variable
Percentage of persons below poverty	P_POV	CDC’s SVI data
Percentage of civilian (age 16+) unemployed estimate	P_UNEMP
Per capita income (US$)	PCI
Percentage of persons with no high school diploma (age 25+)	P_NOHSDP
Percentage of persons aged 65 and older	P_AGE65P
Percentage of persons aged 17 and younger	P_AGE17M
Percentage of civilian noninstitutionalized population with a disability	P_DISABL
Percentage of single parent households with children under 18	P_SNGPNT
Percentage minority (all persons except white, non-Hispanic)	P_MINRTY
Percentage of persons (age 5+) who speak English “less than well”	P_LIMENG
Percentage of housing in structures with 10 or more units	P_MUNIT
Percentage of mobile homes	P_MOBILE
Percentage of occupied housing units with more people than rooms	P_CROWD
Percentage of households with no vehicle available	P_NOVEH
Percentage of persons in institutionalized group quarters	P_GROUPQ
Percentage uninsured in the total civilian noninstitutionalized population	P_UNINSUR
Percent of population that is working aged	P_WRKAGE	EPA’s Smart Location Database
Percent of one-car households	P_AO1
Percent of two-plus-car households	P_AO2P
Percentage of low-wage workers (earning $1250/month or less) among total workers (home location)	P_LOWWAGEr
Percentage of low-wage workers (earning $1250/month or less) among total workers (work location)	P_LOWWAGEe
Gross residential density (HU/acre) on unprotected land	D_HH
Gross population density (people/acre) on unprotected land	D_POP
Gross employment density (jobs/acre) on unprotected land	D_EMP
Gross activity density (employment + HUs) on unprotected land	D_HUEMP
Gross retail (5-tier) employment density (jobs/acre) on unprotected land	D_EMP_RET
Gross office (5-tier) employment density (jobs/acre) on unprotected land	D_EMP_OFF
Gross industrial (5-tier) employment density (jobs/acre) on unprotected land	D_EMP_IND
Gross service (5-tier) employment density (jobs/acre) on unprotected land	D_EMP_SVC
Gross entertainment (5-tier) employment density (jobs/acre) on unprotected land	D_EMP_ENT
Jobs per household	JOBSPERHH
5-tier employment entropy (denominator set to observed employment types in the census tract)	EMPMIX
Employment and household entropy	EMPHHMIX
Employment and household entropy (based on vehicle trip production and trip attractions including all 5 employment categories)	TRIPMIX
Trip productions and trip attractions equilibrium index	TRIPEQ
Household workers per job, by census tract	WRKSPERJOB
Household workers per job equilibrium index	HHWRKJOBEQ
Total road network density	D_RD
Network density in terms of facility miles of auto-oriented links per square mile	D_RD_AO
Network density in terms of facility miles of multi-modal links per square mile	D_RD_MM
Network density in terms of facility miles of pedestrian-oriented links per square mile	D_RD_PO
Street intersection density (auto-oriented intersections eliminated)	D_X_EXCLAO
Intersection density in terms of auto-oriented intersections per square mile	D_X_AO
Intersection density in terms of multi-modal intersections having three legs per square mile	D_X_MM3
Intersection density in terms of multi-modal intersections having four or more legs per square mile	D_X_MM4
Intersection density in terms of pedestrian-oriented intersections having three legs per square mile	D_X_PO3
Intersection density in terms of pedestrian-oriented intersections having four or more legs per square mile	D_X_PO4
Proportion of census tract employment within ¼ mile of fixed-guideway transit stop	P_EMP025
Proportion of census tract employment within ½ mile of fixed-guideway transit stop	P_EMP050
Aggregate frequency of transit service per square mile	D_TRANSIT
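
The naming convention described in the note under the outcome variables (a predicted column is the outcome abbreviation prefixed with a lowercase ‘p’) can be applied programmatically, for example to pair observed and predicted columns and compute an error summary. The sketch below is an illustration under that assumption only; the rows are made-up values, and the helper names are hypothetical.

```python
def predicted_name(outcome):
    """Column name of the predicted value for an outcome abbreviation."""
    return "p" + outcome

def mean_abs_error(rows, outcome):
    """Mean absolute difference between observed and predicted values."""
    pred = predicted_name(outcome)
    return sum(abs(r[outcome] - r[pred]) for r in rows) / len(rows)

# Made-up rows, for illustration only.
rows = [
    {"ARTHRITIS": 20.0, "pARTHRITIS": 21.0},
    {"ARTHRITIS": 18.0, "pARTHRITIS": 17.5},
]

print(predicted_name("ARTHRITIS"))
print(mean_abs_error(rows, "ARTHRITIS"))
```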