Project Overview
This is the link for the final paper. Click here.
Motivation For The Project
I want to look at National Park data for my research project. The National Parks are an important part of the United States and function as areas of land that are preserved by the national government. They can have historical elements, or be a place to enjoy the wonderful views they have to offer. I have been to several National Parks myself. I would like to look at the data to :
Find which parks have the most visitors each year and if visitation in the parks is increasing or decreasing.
Identify what could have caused these spikes and peaks in visitation.
Explore which regions of the United States have the most parks.
Determine if certain parks contribute more to the economy.
Look at the impacts COVID-19 had on visitation.
Background Knowledge
The National Park Service is primarily funded by Congress, but also is funded through park entrance fees and some private philanthropies. For years the National Park Service has lacked the funding it needs to maintain and protect these national parks. Since the first National Park, Yellowstone, was established in 1872, there have been an additional 400 National Parks with over 20,000 employees added. This research is important, because it can help determine how much funding the National Parks need and which ones need more attention than others. It can help us discover how different issues have affected these National Parks. It can also help us see if certain parks should be closed during certain times of the year in order to have a more efficient and effective budget for these parks. We know that national parks provide peaceful places to enjoy scenery and give wildlife and native plants a safe home which maintains our ecosystem. Economically, national parks create jobs in tourism, park management and capital works and draw visitors to regional areas where they spend their money in local towns.
Summary of the Modeling Result
In my model I was able to figure out how visitation, park size, and park type correlated to the amount of visitor spending and economic contribution of the parks. If I had more time, I would’ve liked to gather data on climate change, stocks, and financial patterns to see how these factors could have affected visitation at parks. I was able to see that visitation decreased in 2020 due to the COVID-19 pandemic.
Data Summary
I was able to collect 30 years of data from the National Park Service. I collected Public Use Data data from 1979 to 2021. I also collected data on the amount of trails located in each park from Kaggle, All Trails Data. Additionally I was able to find Visitor Spending Data for 2020.
All Trails Data
- This data is from the All Trails website, which is a platform containing all hiking trails located in each National Park. It includes the coordinates of each trail, park name, state name, and city name. Additionally this data provides variables like the average rating, length, elevation gain, and difficulty rating for each trail.
Summary Statistics: All Trails
#Read csv file
alltrails <- read_csv("/Users/gabriellescibelli/Modified2AllTrails.csv")
#first 6 rows
head(alltrails)
rmarkdown::paged_table(alltrails)
#Number of Trails per state
TrailsperPark <-alltrails%>%
group_by(state_name)%>%
summarize(trails = n_distinct(name))
TrailsperPark%>%
arrange(desc(trails))
nvars <- format(round(ncol(alltrails), 0), nsmall = 0, big.mark = ",")
nobs <- format(round(nrow(alltrails), 0), nsmall = 0, big.mark = ",")
The number of variables is 21; the number of observations 3,313.
National Park Service Data
Public Use Statistics (1979-2021)
- Public Use Statistics give the number of visits per month and year for each National Park. This data includes Recreation and Non-Recreation Visits, as well as the hours the park was open for each group and the number of visitors who camped, stayed overnight, etc.
Hmisc::describe(public_use) %>% Hmisc::html()
public_use
19 Variables 173563 Observations
index
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
173563 | 0 | 64657 | 1 | 24513 | 18564 | 2194 | 4416 | 10948 | 21919 | 36072 | 47792 | 55980 |
lowest : 2 3 4 5 6 , highest: 64654 64655 64656 64657 64658
Park
n | missing | distinct |
173563 | 0 | 372 |
lowest : | Abraham Lincoln Birthplace NHP | Acadia NP | Adams NHP | African Burial Ground NM | Agate Fossil Beds NM |
highest: | Wupatki NM | Yellowstone NP | Yosemite NP | Yukon-Charley Rivers NPRES | Zion NP |
Unit Code
n | missing | distinct |
173563 | 0 | 372 |
lowest : ABLI ACAD ADAM AFBG AGFO , highest: WWII YELL YOSE YUCH ZION
Park Type
n | missing | distinct |
173563 | 0 | 19 |
lowest : | International Historic Site | National Battlefield | National Battlefield Park | National Historic Site | National Historical Park |
highest: | National River | National Seashore | National Wild & Scenic River | Park (Other) | Park Type |
Region
n | missing | distinct |
173563 | 0 | 8 |
lowest : | Alaska | Intermountain | Midwest | National Capital | Northeast |
highest: | National Capital | Northeast | Pacific West | Region | Southeast |
Value Alaska Intermountain Midwest National Capital
Frequency 7032 39458 23678 15105
Proportion 0.041 0.227 0.136 0.087
Value Northeast Pacific West Region Southeast
Frequency 33817 25472 3 28998
Proportion 0.195 0.147 0.000 0.167
State
n | missing | distinct |
173563 | 0 | 55 |
lowest : AK AL AR AS AZ , highest: VT WA WI WV WY
Year
n | missing | distinct |
173563 | 0 | 44 |
lowest : 1979 1980 1981 1982 1983 , highest: 2018 2019 2020 2021 Year
Month
n | missing | distinct |
173563 | 0 | 13 |
lowest : 1 10 11 12 2 , highest: 6 7 8 9 Month
Value 1 10 11 12 2 3 4 5 6 7 8 9
Frequency 14458 14467 14468 14467 14458 14458 14460 14462 14463 14466 14466 14467
Proportion 0.083 0.083 0.083 0.083 0.083 0.083 0.083 0.083 0.083 0.083 0.083 0.083
Value Month
Frequency 3
Proportion 0.000
Recreation Visits
n | missing | distinct |
173563 | 0 | 77961 |
lowest : | -20331 | -320 | -39 | -425 | 0 |
highest: | 99986 | 9999 | 99996 | 99998 | Recreation Visits |
Non-Recreation Visits
n | missing | distinct |
173563 | 0 | 24528 |
lowest : | -112 | -16 | -2231 | -32964 | -788 |
highest: | 99989 | 9999 | 99990 | 99991 | Non-Recreation Visits |
Recreation Hours
n | missing | distinct |
173563 | 0 | 96363 |
lowest : | -10068 | -12 | -18 | -192 | -30 |
highest: | 999924 | 99993 | 999971 | 99998 | Recreation Hours |
Non-Recreation Hours
n | missing | distinct |
173563 | 0 | 24528 |
lowest : | -112 | -16 | -2231 | -32964 | -788 |
highest: | 99989 | 9999 | 99990 | 99991 | Non-Recreation Hours |
Concessioner
Lodging
n | missing | distinct |
173563 | 0 | 8336 |
lowest : | 0 | 1 | 10 | 100 | 1000 |
highest: | 99853 | 9986 | 999 | 9993 | Concessioner
Lodging |
Concessioner
Camping
n | missing | distinct |
173563 | 0 | 4019 |
lowest : | 0 | 1 | 10 | 100 | 1000 |
highest: | 997 | 998 | 9983 | 999 | Concessioner
Camping |
Tent Campers
n | missing | distinct |
173563 | 0 | 10240 |
lowest : | 0 | 1 | 10 | 100 | 1000 |
highest: | 9992 | 9993 | 9994 | 9996 | Tent Campers |
RV Campers
n | missing | distinct |
173563 | 0 | 9728 |
lowest : | 0 | 1 | 10 | 100 | 1000 |
highest: | 9995 | 9996 | 9997 | 9998 | RV Campers |
Backcountry Campers
n | missing | distinct |
173563 | 0 | 7570 |
lowest : | 0 | 1 | 10 | 100 | 1000 |
highest: | 9990 | 9994 | 9996 | 9999 | Backcountry Campers |
Non-Recreation Overnight Stays
n | missing | distinct |
173563 | 0 | 2071 |
lowest : | 0 | 1 | 10 | 100 | 1000 |
highest: | 995 | 996 | 9984 | 999 | Non-Recreation Overnight Stays |
Misc. Overnight Stays
n | missing | distinct |
173563 | 0 | 6798 |
lowest : | 0 | 1 | 10 | 100 | 1000 |
highest: | 998 | 99884 | 999 | 9995 | Misc. Overnight Stays |
#Total Visits per Year in each Region
public_use%>%
group_by(Year, Region)%>%
summarize(Total_Visits = n_distinct(Rec_Visits))
Economic Contribution of National Park Visitor Spending
Visitor Spending Data
- Gives us information on the number of visitors in each park and how much they spend in each National Park, the number of jobs provided by parks, and home much income workers make.
Visitor_Spending_2020 <- read_csv("~/Downloads/National Park Project/Visitor Spending State 2020.csv")
head(Visitor_Spending_2020)
skim(Visitor_Spending_2020)
Data summary
|
|
Name
|
Visitor_Spending_2020
|
Number of rows
|
54
|
Number of columns
|
6
|
_______________________
|
|
Column type frequency:
|
|
character
|
1
|
numeric
|
5
|
________________________
|
|
Group variables
|
None
|
Variable type: character
skim_variable
|
n_missing
|
complete_rate
|
min
|
max
|
empty
|
n_unique
|
whitespace
|
State
|
0
|
1
|
2
|
14
|
0
|
54
|
0
|
Variable type: numeric
skim_variable
|
n_missing
|
complete_rate
|
mean
|
sd
|
p0
|
p25
|
p50
|
p75
|
p100
|
hist
|
Visitation
|
0
|
1
|
4390080.22
|
5709542.29
|
4819.0
|
445463.50
|
2300616.50
|
6078418.00
|
28645839.0
|
▇▃▁▁▁
|
Visitor Spending
|
0
|
1
|
269140740.74
|
358821715.61
|
300000.0
|
28650000.00
|
145350000.00
|
324200000.00
|
1716500000.0
|
▇▁▁▁▁
|
Total Output
|
0
|
1
|
381187037.04
|
535369782.64
|
400000.0
|
39325000.00
|
185050000.00
|
474450000.00
|
2693300000.0
|
▇▁▁▁▁
|
spending
|
0
|
1
|
269.14
|
358.82
|
0.3
|
28.65
|
145.35
|
324.20
|
1716.5
|
▇▁▁▁▁
|
output
|
0
|
1
|
381.19
|
535.37
|
0.4
|
39.33
|
185.05
|
474.45
|
2693.3
|
▇▁▁▁▁
|
Research Questions
- Is visitation increasing or decreasing at certain parks?
- How many parks are in the United States? Each Region? Each State?
- How many trails are in each park?
- Are there seasonal trends?
- Does the location of the park correlate to the amount of visitation?
- Does the size and amount of parks correlate with the economic contribution of the park?
Hypothesis
I believe that parks located in the West will get more visitation and result in more economic contribution than parks located in the East.
Visualizations
How many Parks are in the United States?
total_park_count <-public_use%>%
group_by(Year)%>%
summarize(p = n_distinct(Park))
uspark <- ggplot(data = total_park_count,
mapping = aes(x = Year, y = p))+
geom_line(color="#69b3a2", size=2)+
ylim(100, 400)+
ggtitle("Number of National Parks in the United States \n(1979 - 2021)")+
xlab("Year")+
ylab("Total Number of Parks")
uspark
The number of Parks in the United States has increased from 268 in 1979 to 368 in 2021.
Number of Parks Per Region
numparks_region<-public_use%>%
group_by(Region)%>%
summarize(Parks = n_distinct(Park))
numparks_region%>%arrange(desc(Parks))
The number of parks found in each region.
To see a map of the Regions click here.
Visits By Region
regionvis <- public_use%>%
group_by(Region, Year)%>%
summarize(visits = sum(Rec_Visits))
regionvis
ggplot(data = regionvis,
mapping = aes(x = Year, y =visits, fill = Region))+
geom_col()
faceted_region <- ggplot(data = public_use,
mapping = aes(x = Year, y = Rec_Visits))+
geom_point(alpha = 0.2)+
facet_wrap(~Region, ncol = 3)
faceted_region+
theme_bw()+
ggtitle("Recreation Visits By Region")+
ylab("Recreation Visits")+
xlab("")+
scale_y_continuous(labels = comma)
Number of Parks per State and Region
park_state <- public_use%>%
group_by(State, Region)%>%
summarize(park = n_distinct(Park))
park_state%>% arrange(desc(park))%>%ggplot(data = park_state,
mapping = aes(x = State, y = park, color = Region))+
geom_point()+
theme(axis.text.x = element_text(angle = 90))+
labs(x = "State", y = "Park Count",
title = "Amount of Parks by State and Region")
Number of Trails Per Park
TrailsperPark <-alltrails%>%
group_by(state_name)%>%
summarize(trails = n_distinct(name))%>%
arrange(desc(trails))
c <- ggplot(TrailsperPark, aes(x = state_name,
y = trails))+
geom_segment(aes(x = state_name, xend = state_name, y = 0, yend = trails),
color = "skyblue") +
geom_point(color = "blue", size = 4, alpha = 0.6)+
theme_light()+
coord_flip()+
theme(panel.border = element_blank(),
axis.ticks.y = element_blank())+
ggtitle("Number of Trails in each State")+
xlab("")+
ylab("Number of Trails")
c
Visitation in 2021
lastyear <- public_use%>%
filter(Year == 2021)%>%
select(Year, Month, Park, Region, Rec_Visits)
ggplot(data = lastyear,
mapping = aes(x = Month, y = Rec_Visits, fill = Region))+
geom_bar(position = "stack", stat = "identity")+
scale_x_continuous(breaks = breaks_width(1))+
theme_classic()+
scale_y_continuous(labels = label_number(suffix = "K", scale = 1e-6))+
ylab("Recreation Visit (Thousands) ")+
xlab("Month")+
ggtitle("Visitation By Month in 2021")
Visitor Spending Data for 2020
- I made an interactive plot using plotly.
Visitor_Spending_2020 <- read_csv("~/Downloads/National Park Project/Visitor Spending State 2020.csv")
bubble <- Visitor_Spending_2020%>%
arrange(desc(Visitation))%>%
mutate(State = factor(State, State))%>%
mutate(text = paste("State: ", State, "\nVisitation: ", Visitation,
"\n Spending (M):", spending, "\n Output (M):", output, sep = ""))%>%
ggplot(aes(x = spending, y = output, size = Visitation, color = State, text = text))+
geom_point(alpha = 0.7) +
scale_size(name = "Visitation ")+
scale_color_viridis(discrete = TRUE, guide = FALSE)+
theme_ipsum()+
theme(legend.position = "none")
bubble2 <- ggplotly(bubble, tooltip = "text")
bubble2
Conclusion
The main findings were that California has the most amount of parks and trails, though it is not located in the region that has the most parks. California also has the most amount of visitors per year, so this correlates to why they have the most amount of visitor spending and the highest economic contribution. We can conclude that California does not have to rely heavily on government funding. States with the most parks tend to have higher rates of visitation, and therefore the most visitor spending and economic contribution. I would like to look further into why Washington DC has less economic contribution and visitor spending. My thoughts are that a lot of visits are school bus trips, thus the students are not spending as much as a family would on their visits to the parks. If parks are able to increase their visitation, then they would be able to contribute more to the economy and generate more revenue that could in turn increase the funding that goes toward maintenance of these parks. We saw that the COVID-19 pandemic affect the parks, by decreasing visitation in all regions. I would like to look further to see if any specific parks increased in visitation during these times.
I had many limitations while conducting my research, such as not gathering all the data I wanted and having trouble scrapping the specific data. Gathering and cleaning the data took a lot longer than anticipated and therefore set me back in my research and development of economic analysis. I hope to continue this research and look further into variables that made me question the outcomes. I would like to see how visitation changed for each park over the years, and in each month to see if certain parks and park types have become more or less popular over the years. I would also like to look into the climate changes, and travel patterns to see if these factors have to do with visitation rates. I would look further into the average rating of parks and traffic counts in parks. If I can gather all the variables I need, my analysis would be more thorough and be able to make better suggestions to how the parks can change their current mannerisms to make the National Park System better. For now, we can continue to visit these parks and contribute to the maintenance needed. We can do our part by cleaning up parks while we enjoy hikes, to conserve the beauty displayed at these parks.