1. Describir autocorrelación espacial, autocorrelación espacial positiva, y autocorrelación espacial negativa.
La autocorrelación espacial determina la presencia de variación espacial en una variable.
Cuando hay presencia de autocorrelación espacial positiva significa que los espacios cercanos tienen valores similares se puede visualizar con la presencia de clusters en un mapa.
Cuando hay correlación espacial negativa significa que los espacios cercanos tienen valores diferentes entre sí, un ejemplo seria si en un vecindario hay casa de diferentes precios unos siendo muy altos y unos muy bajos.
2. Describir al menos 3-5 diferencias entre análisis exploratorio de datos (EDA) y análisis exploratorio espacial de datos (ESDA).
En el EDA análisis exploratorio de datos se analizan y se busca encontrar patrones entre las variables de una serie de datos por medio de la elaboración de gráficos, correologramas, visualizaciones etc.
En el ESDA se busca encontrar patrones y/o relaciones en una serie de datos espaciales por medio de la elaboración de mapas.
Para elaborar el ESDA es necesario tener datos espaciales ya que se utilizan ubicaciones y atributos geográficos.
3. Describir los conceptos de autocorrelación espacial y no estacionariedad en un contexto de análisis espacial de datos.
La autocorrelación espacial determina la variación espacial o la ausencia de esta entre los espacios cercanos en una variable, es importante determinar la presencia o ausencia de autocorrelación espacial en un análisis espacial de datos ya que permiten un mejor entendimiento del comportamiento de los datos, así mismo nos ayuda a identificar si es necesario a ver un cambio en las variables.
La no estacionariedad se presenta cuando propiedades en una variable o base de datos cambian con el tiempo y no son consistentes a lo largo de un periodo de tiempo. En un contexto espacial de datos la no estacionariedad se presenta cuando existen valores extremos en un área esto puede ser valores muy bajos y muy altos. Un ejemplo de esto podría ser la pobreza en México ya que esta es muy alta en unas regiones y muy baja en otras a lo largo del país.
4. Describir al menos 3-5 diferencias entre GWR y GRF.
GFR utiliza una versión local de machine learning para determinar la existencia de no estacionariedad espacial.
GRF modela la no estacionariedad junto con un modelo no lineal flexible.
El GRF es difícil de overfit debido a sus técnicas de bootstrapping
El propósito de GRF es servir como una conexión entre el machine learning y los modelos geográficos ya que combina la capacidad de hacer inferencias y brindar explicaciones.
5. Describir al menos 3-5 diferencias entre la estimación de modelo de regresión no espacial, espacial global, y espacial local.
El modelo de regresión no espacial no utiliza datos espaciales y se puede conducir con una base de datos regular.
El modelo espacial local ajusta una ecuación de regresión a cada característica presente en el conjunto de datos lo que le permite obtener resultados locales.
El modelo espacial global no diferencia entre las diferentes regiones.
Referencias
Abdishakur. (2019, November 5). What is Exploratory Spatial Data Analysis (ESDA)?. Medium. https://towardsdatascience.com/what-is-exploratory-spatial-data-analysis-esda-335da79026ee
Geographically weighted Random Forest. R Documentation. (n.d.). https://search.r-project.org/CRAN/refmans/SpatialML/html/grf.html
Geographically weighted regression (GWR) (spatial statistics). Geographically Weighted Regression (GWR) (Spatial Statistics)-ArcGIS Pro | Documentation. (n.d.). https://pro.arcgis.com/en/pro-app/latest/tool-reference/spatial-statistics/geographicallyweightedregression.htm
Kanade, V. A. (2022, August 17). What is spatial analysis? definition and examples. Spiceworks. https://www.spiceworks.com/tech/artificial-intelligence/articles/what-is-spatial-analysis/
Regression Analysis Basics. Regression analysis basics-ArcGIS Pro | Documentation. (n.d.). https://pro.arcgis.com/en/pro-app/latest/tool-reference/spatial-statistics/regression-analysis-basics.htm#:~:text=Regression%20analysis%20allows%20you%20to,factors%20behind%20observed%20spatial%20patterns
What is exploratory data analysis?. IBM. (n.d.). https://www.ibm.com/topics/exploratory-data-analysis
Presentación (MODULE 1: SPATIAL DATA ANALYSIS)
library(sf)
library(dplyr)
library(tmap)
library(spdep)
library(rgdal)
library(tidyverse)
library(tigris)
library(mapview)
library(GWmodel)
library(regclass)
library(viridis)
library(grid)
library(RColorBrewer)
library(rgeoda)
library(sjPlot)
library(jtools)
library(dlookr)
library(terra)
library(ggpubr)
library(rstatix)
library(maptools)
library(SpatialML)
library(spgwr)
library(grid)
Importar los datos
###
data(columbus) ### dataset
columbus_shp <- readShapePoly(system.file("etc/shapes/columbus.shp",package="spdep"))
## Warning: shapelib support is provided by GDAL through the sf and terra packages
## among others
### shapefile
col.gal.nb <- read.gal(system.file("etc/weights/columbus.gal", package="spdep"))
### spatial connectivity matrix but it requires to calculate the spatial weights.
##
Limpieza de datos
summary(col.gal.nb)
## Neighbour list object:
## Number of regions: 49
## Number of nonzero links: 230
## Percentage nonzero weights: 9.579342
## Average number of links: 4.693878
## Link number distribution:
##
## 2 3 4 5 6 7 8 9 10
## 7 7 13 4 9 6 1 1 1
## 7 least connected regions:
## 1 6 31 39 42 46 47 with 2 links
## 1 most connected region:
## 20 with 10 links
summary(columbus)
## AREA PERIMETER COLUMBUS. COLUMBUS.I POLYID
## Min. :0.03438 Min. :0.9021 Min. : 2 Min. : 1 Min. : 1
## 1st Qu.:0.09315 1st Qu.:1.4023 1st Qu.:14 1st Qu.:13 1st Qu.:13
## Median :0.17477 Median :1.8410 Median :26 Median :25 Median :25
## Mean :0.18649 Mean :1.8887 Mean :26 Mean :25 Mean :25
## 3rd Qu.:0.24669 3rd Qu.:2.1992 3rd Qu.:38 3rd Qu.:37 3rd Qu.:37
## Max. :0.69926 Max. :5.0775 Max. :50 Max. :49 Max. :49
## NEIG HOVAL INC CRIME
## Min. : 1 Min. :17.90 Min. : 4.477 Min. : 0.1783
## 1st Qu.:13 1st Qu.:25.70 1st Qu.: 9.963 1st Qu.:20.0485
## Median :25 Median :33.50 Median :13.380 Median :34.0008
## Mean :25 Mean :38.44 Mean :14.375 Mean :35.1288
## 3rd Qu.:37 3rd Qu.:43.30 3rd Qu.:18.324 3rd Qu.:48.5855
## Max. :49 Max. :96.40 Max. :31.070 Max. :68.8920
## OPEN PLUMB DISCBD X
## Min. : 0.0000 Min. : 0.1327 Min. :0.370 Min. :24.25
## 1st Qu.: 0.2598 1st Qu.: 0.3323 1st Qu.:1.700 1st Qu.:36.15
## Median : 1.0061 Median : 1.0239 Median :2.670 Median :39.61
## Mean : 2.7709 Mean : 2.3639 Mean :2.852 Mean :39.46
## 3rd Qu.: 3.9364 3rd Qu.: 2.5343 3rd Qu.:3.890 3rd Qu.:43.44
## Max. :24.9981 Max. :18.8111 Max. :5.570 Max. :51.24
## Y AREA NSA NSB
## Min. :24.96 Min. : 1.093 Min. :0.0000 Min. :0.0000
## 1st Qu.:28.26 1st Qu.: 3.193 1st Qu.:0.0000 1st Qu.:0.0000
## Median :31.91 Median : 6.029 Median :0.0000 Median :1.0000
## Mean :32.37 Mean : 6.372 Mean :0.4898 Mean :0.5102
## 3rd Qu.:35.92 3rd Qu.: 7.989 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :44.07 Max. :21.282 Max. :1.0000 Max. :1.0000
## EW CP THOUS NEIGNO
## Min. :0.0000 Min. :0.0000 Min. :1000 Min. :1001
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:1000 1st Qu.:1013
## Median :1.0000 Median :0.0000 Median :1000 Median :1025
## Mean :0.5918 Mean :0.4898 Mean :1000 Mean :1025
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1000 3rd Qu.:1037
## Max. :1.0000 Max. :1.0000 Max. :1000 Max. :1049
## PERIM
## Min. :0.9021
## 1st Qu.:1.4023
## Median :1.8410
## Mean :1.8887
## 3rd Qu.:2.1992
## Max. :5.0775
Cambiar nombre de la variable que se repite.
colnames(columbus)[1] <- "Area"
Cambiar 0 a 0.0001
columbus$OPEN <- replace(columbus$OPEN, columbus$OPEN == 0, 0.0001)
summary(columbus_shp)
## Object of class SpatialPolygonsDataFrame
## Coordinates:
## min max
## x 5.874907 11.28742
## y 10.788630 14.74245
## Is projected: NA
## proj4string : [NA]
## Data attributes:
## AREA PERIMETER COLUMBUS_ COLUMBUS_I POLYID
## Min. :0.03438 Min. :0.9021 Min. : 2 Min. : 1 Min. : 1
## 1st Qu.:0.09315 1st Qu.:1.4023 1st Qu.:14 1st Qu.:13 1st Qu.:13
## Median :0.17477 Median :1.8410 Median :26 Median :25 Median :25
## Mean :0.18649 Mean :1.8887 Mean :26 Mean :25 Mean :25
## 3rd Qu.:0.24669 3rd Qu.:2.1992 3rd Qu.:38 3rd Qu.:37 3rd Qu.:37
## Max. :0.69926 Max. :5.0775 Max. :50 Max. :49 Max. :49
## NEIG HOVAL INC CRIME
## Min. : 1 Min. :17.90 Min. : 4.477 Min. : 0.1783
## 1st Qu.:13 1st Qu.:25.70 1st Qu.: 9.963 1st Qu.:20.0485
## Median :25 Median :33.50 Median :13.380 Median :34.0008
## Mean :25 Mean :38.44 Mean :14.375 Mean :35.1288
## 3rd Qu.:37 3rd Qu.:43.30 3rd Qu.:18.324 3rd Qu.:48.5855
## Max. :49 Max. :96.40 Max. :31.070 Max. :68.8920
## OPEN PLUMB DISCBD X
## Min. : 0.0000 Min. : 0.1327 Min. :0.370 Min. :24.25
## 1st Qu.: 0.2598 1st Qu.: 0.3323 1st Qu.:1.700 1st Qu.:36.15
## Median : 1.0061 Median : 1.0239 Median :2.670 Median :39.61
## Mean : 2.7709 Mean : 2.3639 Mean :2.852 Mean :39.46
## 3rd Qu.: 3.9364 3rd Qu.: 2.5343 3rd Qu.:3.890 3rd Qu.:43.44
## Max. :24.9981 Max. :18.8111 Max. :5.570 Max. :51.24
## Y NSA NSB EW
## Min. :24.96 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:28.26 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :31.91 Median :0.0000 Median :1.0000 Median :1.0000
## Mean :32.37 Mean :0.4898 Mean :0.5102 Mean :0.5918
## 3rd Qu.:35.92 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :44.07 Max. :1.0000 Max. :1.0000 Max. :1.0000
## CP THOUS NEIGNO
## Min. :0.0000 Min. :1000 Min. :1001
## 1st Qu.:0.0000 1st Qu.:1000 1st Qu.:1013
## Median :0.0000 Median :1000 Median :1025
## Mean :0.4898 Mean :1000 Mean :1025
## 3rd Qu.:1.0000 3rd Qu.:1000 3rd Qu.:1037
## Max. :1.0000 Max. :1000 Max. :1049
Cambiar 0 a 0.0001
columbus_shp$OPEN <- replace(columbus_shp$OPEN, columbus_shp$OPEN == 0, 0.0001)
Crear un data set con las variables que se usaran en el corplot
corp <- columbus %>% select(AREA,PERIMETER, NEIG, HOVAL, INC, CRIME, OPEN, PLUMB, PLUMB, DISCBD,
NSA, NSB, EW, CP, NEIGNO)
Corplot
correlate(corp, AREA,PERIMETER, NEIG, HOVAL, INC, CRIME, OPEN, PLUMB, PLUMB, DISCBD,
NSA, NSB, EW, CP, NEIGNO) %>% plot()
En base al corplot las variables que elegí para conducir el análisis debido a la correlación que estas tienen con la variable dependiente (HOVAL) son las siguientes:
HOVAL (variable dependiente): Valor de la vivienda (en $1,000) USD
DISCBD: Distancia al business district.
CRIME: Robos residenciales y robos de vehículos por cada mil hogares en el vecindario.
INC: Ingresos del hogar (en $1,000)
AREA: El área del vecindario en millas cuadradas
plot_normality(columbus, HOVAL, DISCBD, CRIME, INC, AREA)
Queen approach
swm_qn <- poly2nb(columbus_shp, queen = TRUE)
Queen contiguity based neighbors map
plot(columbus_shp, borders = 'lightgrey')
plot(swm_qn, coordinates(columbus_shp), pch = 19, cex = 0.6, add = TRUE, col = "blue")
title(main = "Queen Contiguity", cex.main = 0.9)
rook approach
swm_r <- poly2nb(columbus_shp, queen = FALSE)
rook contiguity based neighbors map
plot(columbus_shp, borders = 'lightgrey')
plot(swm_r, coordinates(columbus_shp), pch = 19, cex = 0.6, add = TRUE, col = "blue")
title(main = "Rook Contiguity", cex.main = 0.9)
rswm_qn <- nb2listw(swm_qn, style = "W", zero.policy = TRUE)
rswm_r <- nb2listw(swm_r, style = "W", zero.policy = TRUE)
Global Moran’s I Test
HOVAL
moran.test(columbus_shp$HOVAL, listw = rswm_qn, zero.policy = TRUE, na.action = na.omit)
##
## Moran I test under randomisation
##
## data: columbus_shp$HOVAL
## weights: rswm_qn
##
## Moran I statistic standard deviate = 2.2071, p-value = 0.01365
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.180093114 -0.020833333 0.008287783
DISCBD
moran.test(columbus_shp$DISCBD, listw = rswm_qn, zero.policy = TRUE, na.action = na.omit)
##
## Moran I test under randomisation
##
## data: columbus_shp$DISCBD
## weights: rswm_qn
##
## Moran I statistic standard deviate = 8.7904, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.800562617 -0.020833333 0.008731396
Autocorrelación espacial positiva
El resultado del Moran I statistic indica una correlación positiva significativa hacia las observaciones cercanas.
CRIME
moran.test(columbus_shp$CRIME, listw = rswm_qn, zero.policy = TRUE, na.action = na.omit)
##
## Moran I test under randomisation
##
## data: columbus_shp$CRIME
## weights: rswm_qn
##
## Moran I statistic standard deviate = 5.5894, p-value = 1.139e-08
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.500188557 -0.020833333 0.008689289
INC
moran.test(columbus_shp$INC, listw = rswm_qn, zero.policy = TRUE, na.action = na.omit)
##
## Moran I test under randomisation
##
## data: columbus_shp$INC
## weights: rswm_qn
##
## Moran I statistic standard deviate = 4.7645, p-value = 9.467e-07
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.415628778 -0.020833333 0.008391926
AREA
moran.test(columbus_shp$AREA, listw = rswm_qn, zero.policy = TRUE, na.action = na.omit)
##
## Moran I test under randomisation
##
## data: columbus_shp$AREA
## weights: rswm_qn
##
## Moran I statistic standard deviate = 1.7634, p-value = 0.03892
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.135232758 -0.020833333 0.007833009
Identificar autocorrelación espacial local
Crear un spatial lag de las variables de interes.
columbus_shp$sp_AREA <- lag.listw(rswm_qn, columbus_shp$AREA, zero.policy = TRUE)
columbus_shp$sp_INC <- lag.listw(rswm_qn, columbus_shp$INC , zero.policy = TRUE)
columbus_shp$sp_CRIME <- lag.listw(rswm_qn, columbus_shp$CRIME , zero.policy = TRUE)
columbus_shp$sp_DISCBD <- lag.listw(rswm_qn, columbus_shp$DISCBD , zero.policy = TRUE)
columbus_shp$sp_HOVAL <- lag.listw(rswm_qn, columbus_shp$HOVAL , zero.policy = TRUE)
Clusters
HOVAL
qtm(columbus_shp, "HOVAL")
## Warning: Currect projection of shape columbus_shp unknown. Long-lat (WGS84) is
## assumed.
qtm(columbus_shp, "sp_HOVAL")
## Warning: Currect projection of shape columbus_shp unknown. Long-lat (WGS84) is
## assumed.
AREA
qtm(columbus_shp, "AREA")
## Warning: Currect projection of shape columbus_shp unknown. Long-lat (WGS84) is
## assumed.
qtm(columbus_shp, "sp_AREA")
## Warning: Currect projection of shape columbus_shp unknown. Long-lat (WGS84) is
## assumed.
En comparación con el mapa normal podemos notar un cluster de valores altos al norte y otro al este de Columbus.
Existe presencia de autocorrelación espacial positiva
INC
qtm(columbus_shp, "INC")
## Warning: Currect projection of shape columbus_shp unknown. Long-lat (WGS84) is
## assumed.
qtm(columbus_shp, "sp_INC")
## Warning: Currect projection of shape columbus_shp unknown. Long-lat (WGS84) is
## assumed.
Podemos notar clusters de valores altos en las áreas del sur, sureste y oeste de Columbus.
Existe presencia de autocorrelación espacial positiva
CRIME
qtm(columbus_shp, "CRIME")
## Warning: Currect projection of shape columbus_shp unknown. Long-lat (WGS84) is
## assumed.
qtm(columbus_shp, "sp_CRIME")
## Warning: Currect projection of shape columbus_shp unknown. Long-lat (WGS84) is
## assumed.
Podemos notar una alta concentración de crimen en el centro y sur de Columbus.
Existe presencia de autocorrelación espacial positiva
DISCBD
qtm(columbus_shp, "DISCBD")
## Warning: Currect projection of shape columbus_shp unknown. Long-lat (WGS84) is
## assumed.
qtm(columbus_shp, "sp_DISCBD")
## Warning: Currect projection of shape columbus_shp unknown. Long-lat (WGS84) is
## assumed.
Los cluster en este mapa son muy notorios, podemos observar clusters de valores altos en las áreas norte, este y oeste de Columbus y un cluster de valores bajos en la zona centro.
Existe presencia de autocorrelación espacial positiva
modelo1_no_s <- lm(log(HOVAL) ~ DISCBD + CRIME + log(INC) + AREA, data = columbus)
summary(modelo1_no_s)
##
## Call:
## lm(formula = log(HOVAL) ~ DISCBD + CRIME + log(INC) + AREA, data = columbus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4150 -0.2323 -0.1298 0.2271 0.9392
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.2551772 0.5792443 5.620 1.22e-06 ***
## DISCBD 0.0410644 0.0528090 0.778 0.4410
## CRIME -0.0102559 0.0052284 -1.962 0.0562 .
## log(INC) 0.2063063 0.1701857 1.212 0.2319
## AREA 0.0009053 0.0127413 0.071 0.9437
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3426 on 44 degrees of freedom
## Multiple R-squared: 0.4246, Adjusted R-squared: 0.3723
## F-statistic: 8.117 on 4 and 44 DF, p-value: 5.419e-05
Multicolinealidad (VIF)
VIF(modelo1_no_s)
## DISCBD CRIME log(INC) AREA
## 2.376053 3.129380 1.831095 1.310533
Lagrange Multiplier Diagnostic for Spatial Dependence (LMlag)
lm.LMtests(modelo1_no_s, rswm_qn, test = c("RLMlag"))
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = log(HOVAL) ~ DISCBD + CRIME + log(INC) + AREA, data
## = columbus)
## weights: rswm_qn
##
## RLMlag = 5.4252, df = 1, p-value = 0.01985
Lagrange Multiplier Diagnostic for Spatial Error Dependence (LMerr)
lm.LMtests(modelo1_no_s, rswm_qn, test = c("RLMerr"))
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = log(HOVAL) ~ DISCBD + CRIME + log(INC) + AREA, data
## = columbus)
## weights: rswm_qn
##
## RLMerr = 8.5066, df = 1, p-value = 0.003539
AIC
AIC(modelo1_no_s)
## [1] 40.80974
Spatial lag de la variable dependiente
columbus$sp_HOVAL <- lag.listw(rswm_qn, columbus$HOVAL, zero.policy = TRUE)
Plot_normality para verificar si es necesario hacer conversión a log.
plot_normality(columbus, sp_HOVAL)
Modelo no espacial con lag en la variable dependiente
modelo2_no_s <- lm(log(sp_HOVAL) ~ DISCBD + CRIME + log(INC) + AREA, data = columbus)
summary(modelo2_no_s)
##
## Call:
## lm(formula = log(sp_HOVAL) ~ DISCBD + CRIME + log(INC) + AREA,
## data = columbus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.41999 -0.10719 -0.00726 0.10495 0.49633
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.9626684 0.3351790 8.839 2.60e-11 ***
## DISCBD 0.1313647 0.0305579 4.299 9.37e-05 ***
## CRIME 0.0039928 0.0030254 1.320 0.194
## log(INC) 0.0563242 0.0984777 0.572 0.570
## AREA -0.0007773 0.0073727 -0.105 0.917
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1983 on 44 degrees of freedom
## Multiple R-squared: 0.4086, Adjusted R-squared: 0.3549
## F-statistic: 7.6 on 4 and 44 DF, p-value: 9.565e-05
VIF
VIF(modelo2_no_s)
## DISCBD CRIME log(INC) AREA
## 2.376053 3.129380 1.831095 1.310533
Lagrange Multiplier Diagnostic for Spatial Dependence (LMlag)
lm.LMtests(modelo2_no_s, rswm_qn, test = c("RLMlag"))
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = log(sp_HOVAL) ~ DISCBD + CRIME + log(INC) + AREA,
## data = columbus)
## weights: rswm_qn
##
## RLMlag = 0.0068409, df = 1, p-value = 0.9341
Lagrange Multiplier Diagnostic for Spatial Error Dependence (LMerr)
lm.LMtests(modelo2_no_s, rswm_qn, test = c("RLMerr"))
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = log(sp_HOVAL) ~ DISCBD + CRIME + log(INC) + AREA,
## data = columbus)
## weights: rswm_qn
##
## RLMerr = 0.6497, df = 1, p-value = 0.4202
AIC
#Modelo original
AIC(modelo1_no_s)
## [1] 40.80974
#Modelo LAG
AIC(modelo2_no_s)
## [1] -12.80212
Spatial Durbin Model
spatial_durb <- lagsarlm(log(sp_HOVAL) ~ DISCBD + CRIME + log(INC) + AREA, data = columbus, rswm_qn, type = "mixed")
summary(spatial_durb)
##
## Call:lagsarlm(formula = log(sp_HOVAL) ~ DISCBD + CRIME + log(INC) +
## AREA, data = columbus, listw = rswm_qn, type = "mixed")
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.32404083 -0.11549675 -0.00012347 0.09149856 0.41122903
##
## Type: mixed
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.59658279 0.79561689 2.0067 0.04478
## DISCBD 0.05921828 0.06613020 0.8955 0.37053
## CRIME 0.00342669 0.00241040 1.4216 0.15514
## log(INC) 0.00723135 0.08128923 0.0890 0.92912
## AREA 0.00094174 0.00617009 0.1526 0.87869
## lag.DISCBD -0.06313781 0.08883258 -0.7108 0.47724
## lag.CRIME -0.00835370 0.00587011 -1.4231 0.15471
## lag.log(INC) -0.00642501 0.16241988 -0.0396 0.96845
## lag.AREA 0.00938143 0.01084016 0.8654 0.38680
##
## Rho: 0.58997, LR test value: 12.045, p-value: 0.00051944
## Asymptotic standard error: 0.12793
## z-value: 4.6118, p-value: 3.9921e-06
## Wald statistic: 21.269, p-value: 3.9921e-06
##
## Log likelihood: 19.50845 for mixed model
## ML residual variance (sigma squared): 0.023979, (sigma: 0.15485)
## Number of observations: 49
## Number of parameters estimated: 11
## AIC: -17.017, (AIC for lm: -6.9724)
## LM test for residual autocorrelation
## test value: 0.0018258, p-value: 0.96592
Autocorrelación Espacial de los residuales estimados
moran.test(exp(spatial_durb$residuals), rswm_qn)
##
## Moran I test under randomisation
##
## data: exp(spatial_durb$residuals)
## weights: rswm_qn
##
## Moran I statistic standard deviate = 0.22394, p-value = 0.4114
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## -0.0003556796 -0.0208333333 0.0083619638
columbus_shp$spatial_durb_res <- exp(spatial_durb$residuals)
qtm(columbus_shp,"spatial_durb_res" )
## Warning: Currect projection of shape columbus_shp unknown. Long-lat (WGS84) is
## assumed.
Kernel bandwidth
kb1 <- bw.gwr(log(sp_HOVAL) ~ DISCBD + CRIME + log(INC) + AREA, approach = "AIC", adaptive = T, data = columbus_shp)
## Adaptive bandwidth (number of nearest neighbours): 37 AICc value: -13.01249
## Adaptive bandwidth (number of nearest neighbours): 31 AICc value: -8.045522
## Adaptive bandwidth (number of nearest neighbours): 42 AICc value: -13.51723
## Adaptive bandwidth (number of nearest neighbours): 44 AICc value: -12.99906
## Adaptive bandwidth (number of nearest neighbours): 39 AICc value: -13.53564
## Adaptive bandwidth (number of nearest neighbours): 39 AICc value: -13.53564
Estimar Geographic Weighted Regression
loc_gwr <- gwr.basic(log(sp_HOVAL) ~ DISCBD + CRIME + log(INC) + AREA, adaptive = T, data = columbus_shp, bw = kb1)
loc_gwr
## ***********************************************************************
## * Package GWmodel *
## ***********************************************************************
## Program starts at: 2023-05-13 00:36:30
## Call:
## gwr.basic(formula = log(sp_HOVAL) ~ DISCBD + CRIME + log(INC) +
## AREA, data = columbus_shp, bw = kb1, adaptive = T)
##
## Dependent (y) variable: sp_HOVAL
## Independent variables: DISCBD CRIME INC AREA
## Number of data points: 49
## ***********************************************************************
## * Results of Global Regression *
## ***********************************************************************
##
## Call:
## lm(formula = formula, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42059 -0.10671 -0.00652 0.10321 0.49592
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.957965 0.326457 9.061 1.28e-11 ***
## DISCBD 0.131521 0.030134 4.365 7.60e-05 ***
## CRIME 0.004019 0.002879 1.396 0.170
## log(INC) 0.058234 0.099986 0.582 0.563
## AREA -0.035104 0.239747 -0.146 0.884
##
## ---Significance stars
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 0.1982 on 44 degrees of freedom
## Multiple R-squared: 0.4088
## Adjusted R-squared: 0.355
## F-statistic: 7.605 on 4 and 44 DF, p-value: 9.519e-05
## ***Extra Diagnostic information
## Residual sum of squares: 1.729
## Sigma(hat): 0.1918
## AIC: -12.81361
## AICc: -10.81361
## BIC: -27.11177
## ***********************************************************************
## * Results of Geographically Weighted Regression *
## ***********************************************************************
##
## *********************Model calibration information*********************
## Kernel function: bisquare
## Adaptive bandwidth: 39 (number of nearest neighbours)
## Regression points: the same locations as observations are used.
## Distance metric: Euclidean distance metric is used.
##
## ****************Summary of GWR coefficient estimates:******************
## Min. 1st Qu. Median 3rd Qu. Max.
## Intercept 2.69111803 2.80613723 2.98142850 3.21448700 4.3285
## DISCBD 0.03961611 0.08341224 0.12148866 0.18931186 0.2027
## CRIME -0.00752674 -0.00051285 0.00142795 0.00420988 0.0075
## log(INC) -0.36051466 -0.04399181 0.05195395 0.23604256 0.2654
## AREA -0.75165394 -0.37419456 -0.04859917 0.09815296 0.3890
## ************************Diagnostic information*************************
## Number of data points: 49
## Effective number of parameters (2trace(S) - trace(S'S)): 15.17891
## Effective degrees of freedom (n-2trace(S) + trace(S'S)): 33.82109
## AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): -13.53564
## AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): -39.39323
## BIC (GWR book, Fotheringham, et al. 2002,GWR p. 61, eq. 2.34): -52.23591
## Residual sum of squares: 0.9948435
## R-square value: 0.6598045
## Adjusted R-square value: 0.5024728
##
## ***********************************************************************
## Program stops at: 2023-05-13 00:36:30
En base a los resultados podemos concluir que la variable que explicativa que tiene un mayor efecto en la variable dependiente es DISCBD.
En base al AICc y AIC podemos concluir que es te es un buen modelo para explicar el costo de las casas en Columbus.
Interpretación de la variable explicativa más significativa en base al p-value.
Por cada unidad de incremento en DISCBD habra un aumento de 0.131521 en HOVAL
DISCBD tiene un p-value de 7.60e-05 por lo cual se puede concluir que es significativa.
Mapeo de resultados de GWR
gwr_sf <- st_as_sf(loc_gwr$SDF)
gwr_sf
## Simple feature collection with 49 features and 21 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 5.874907 ymin: 10.78863 xmax: 11.28742 ymax: 14.74245
## CRS: NA
## First 10 features:
## Intercept DISCBD CRIME log(INC) AREA y yhat
## 0 3.004007 0.05791119 -0.0009324707 0.2383766 -0.5509936 3.568363 3.818592
## 1 3.047234 0.03961611 -0.0013043500 0.2360426 -0.4512176 3.843152 3.796087
## 2 2.956053 0.07087822 -0.0007322015 0.2426707 -0.5630264 3.814735 3.773138
## 3 3.024444 0.04408682 -0.0013030647 0.2408232 -0.4509552 3.490939 3.468537
## 4 2.905689 0.08341224 -0.0004885745 0.2435359 -0.5152399 3.378121 3.454554
## 5 2.804498 0.13254807 0.0005148758 0.2410408 -0.7516539 3.635281 3.774917
## 6 3.024061 0.04428500 -0.0007711186 0.2226175 -0.3388243 3.566888 3.532944
## 7 2.998060 0.04784869 -0.0012672539 0.2457828 -0.4310780 3.568241 3.596074
## 8 2.803997 0.16275008 0.0004189798 0.2147883 -0.7152458 3.677224 3.590358
## 9 4.301910 0.18931186 -0.0075267422 -0.3104348 -0.3649951 3.941907 3.965464
## residual CV_Score Stud_residual Intercept_SE DISCBD_SE CRIME_SE
## 0 -0.25022894 0 -1.9058384 0.3428929 0.03901356 0.003105266
## 1 0.04706448 0 0.3533366 0.3464871 0.04143572 0.003206766
## 2 0.04159787 0 0.2718486 0.3364135 0.03723992 0.003025912
## 3 0.02240253 0 0.2686493 0.3429954 0.04077215 0.003181740
## 4 -0.07643382 0 -0.6780095 0.3289574 0.03603246 0.002951435
## 5 -0.13963596 0 -0.8866642 0.3396405 0.03698552 0.003080601
## 6 0.03394399 0 0.4005234 0.3443429 0.04194285 0.003212458
## 7 -0.02783329 0 -0.1701056 0.3399138 0.04070869 0.003172943
## 8 0.08686670 0 0.6898865 0.3886943 0.03828209 0.003399088
## 9 -0.02355688 0 -0.1689655 0.7398252 0.04776301 0.005811715
## log(INC)_SE AREA_SE Intercept_TV DISCBD_TV CRIME_TV log(INC)_TV AREA_TV
## 0 0.1157388 0.2863561 8.760771 1.4843862 -0.3002869 2.059608 -1.924155
## 1 0.1161966 0.2594950 8.794654 0.9560859 -0.4067493 2.031406 -1.738829
## 2 0.1114812 0.2900881 8.786962 1.9032862 -0.2419772 2.176785 -1.940881
## 3 0.1141928 0.2595064 8.817740 1.0812973 -0.4095447 2.108918 -1.737742
## 4 0.1051221 0.2743601 8.833023 2.3149192 -0.1655380 2.316695 -1.877969
## 5 0.1079146 0.3419024 8.257254 3.5837824 0.1671348 2.233624 -2.198446
## 6 0.1111874 0.2397995 8.782121 1.0558414 -0.2400400 2.002182 -1.412949
## 7 0.1119322 0.2546116 8.820058 1.1753925 -0.3993939 2.195818 -1.693081
## 8 0.1120407 0.3332028 7.213888 4.2513373 0.1232624 1.917057 -2.146578
## 9 0.2186516 0.3642344 5.814766 3.9635668 -1.2950982 -1.419769 -1.002089
## Local_R2 geometry
## 0 0.3644779 POLYGON ((8.624129 14.23698...
## 1 0.3227876 POLYGON ((8.25279 14.23694,...
## 2 0.4538784 POLYGON ((8.653305 14.00809...
## 3 0.3329080 POLYGON ((8.459499 13.82035...
## 4 0.5249904 POLYGON ((8.685274 13.63952...
## 5 0.6863268 POLYGON ((9.401384 13.5504,...
## 6 0.3381373 POLYGON ((8.037741 13.60752...
## 7 0.3782427 POLYGON ((8.247527 13.58651...
## 8 0.6757452 POLYGON ((9.333297 13.27242...
## 9 0.7976905 POLYGON ((10.08251 13.03377...
gwr_sf$y_predicted <- exp(gwr_sf$yhat)
tm_shape(gwr_sf) +
tm_polygons(col = "y_predicted", palette = "RdPu", style = "quantile", n=8, title = "Valor en 1,000 USD") +
tm_layout(title = 'Valor de viviendas', title.position = c('right', 'top'))
## Warning: Currect projection of shape gwr_sf unknown. Long-lat (WGS84) is
## assumed.
Visualización de la predicción local de variables explicativas estadísticamente significativas
Distancia a Business District
tm_shape(gwr_sf) +
tm_polygons(col = "DISCBD_TV", palette = "Reds", style = "quantile" , n=8, title= "t-statistic") +
tm_layout(title = 'Distancia a Business District', title.position = c('right', 'top'))
## Warning: Currect projection of shape gwr_sf unknown. Long-lat (WGS84) is
## assumed.
Taza de Criminalidad
tm_shape(gwr_sf) +
tm_polygons(col = "CRIME_TV", palette = "OrRd", style = "quantile" , n=8, title= "t-statistic") +
tm_layout(title = 'Tasa de Criminalidad', title.position = c('right', 'top'))
## Warning: Currect projection of shape gwr_sf unknown. Long-lat (WGS84) is
## assumed.
R2
tm_shape(gwr_sf) +
tm_polygons(col = "Local_R2", palette = "Blues", style = "quantile" , n=8, title= "R2") +
tm_layout(title = 'Estimación Local R2', title.position = c('right', 'top'))
## Warning: Currect projection of shape gwr_sf unknown. Long-lat (WGS84) is
## assumed.
Regression Residuals
gwr_sf$exp_res <- exp(gwr_sf$residual)
tm_shape(gwr_sf) +
tm_polygons(col = "exp_res", palette="Greens", style="quantile", n=8, title="Residuals") +
tm_layout(title= 'Regression Residuals', title.position = c('right', 'top'))
## Warning: Currect projection of shape gwr_sf unknown. Long-lat (WGS84) is
## assumed.
Autocorrelación Espacial de los residuales estimados
moran.test(gwr_sf$exp_res, rswm_qn)
##
## Moran I test under randomisation
##
## data: gwr_sf$exp_res
## weights: rswm_qn
##
## Moran I statistic standard deviate = 1.4766, p-value = 0.06989
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.113772429 -0.020833333 0.008310139
En base al AIC llegue a la conclusión que el mejor modelo para determinar el precio de las casas en Columbus es el modelo de Regresión espacial local con un AIC de -39.4 en comparación a -17.017 en el modelo global y -12.80212 en el modelo no espacial.
Uno de los principales hayasgos es que la varaible expolicativa con un mayor grado de significancia es DISCBD la cual determina la distancia hasta el Business District sin embargo CRIME no fue una variable significativa en los modelo lo cual me parecio interesante.
Al mapear la variable CRIME en el ESDA se observa una concentración en el centro hacia el oeste de Columbus lo cual tiene sentido ya que las casas mas costosas al analizar la variable dependiente HOVAL se encuentran en su mayoría hacia el este de Columbus.
De acuerdo al modelo espacial local podemos predecir que por cada unidad de incremento en DISCBD habrá un aumento de 0.131521 en el precio de las casas en Columbus.
Al mapear la variable INC Ingresos del hogar (en $1,000) se puede observar la mayor concentración al este de Columbus lo cual tiene sentido en base al precio de las casas en esa región y la cercanía al Business district.
En base al modelo se puede concluir que a lo largo del tiempo habrá un aumento en el precio de las casas que se encuentran en el este de Columbus.
Podemos concluir que el área mas pobre de Columbus es en donde hay una concentración mas alta de crimen esta área se encuentra al centro oeste.
qtm(columbus_shp, "HOVAL")
## Warning: Currect projection of shape columbus_shp unknown. Long-lat (WGS84) is
## assumed.
coordinates
coordss <- coordinates(columbus_shp)
head(coordss)
## [,1] [,2]
## 0 8.827218 14.36908
## 1 8.332658 14.03162
## 2 9.012265 13.81972
## 3 8.460801 13.71696
## 4 9.007982 13.29637
## 5 9.739926 13.47463
Data para el Random Forest
rf_dat <- columbus %>% select(AREA, INC, CRIME, DISCBD, sp_HOVAL)
rf_dat <- rf_dat %>% mutate(across(c(sp_HOVAL, INC), function(x) log(x)))
formula_rf <- "sp_HOVAL ~ DISCBD + CRIME + INC + AREA"
Optimal Bandwidth Selection
Lo decidí hacer con un mínimo de 37 y máximo de 42 porque entre esos intervalos se encontraba el AIC más bajo del Kernel bandwidth
bwgrf <- grf.bw(formula = formula_rf, dataset = rf_dat, kernel = "adaptive", coords = coordss, bw.min = 37, bw.max = 42, step = 1, trees = 500, mtry = NULL, importance = "impurity", forests = FALSE, weighted = TRUE, verbose = TRUE)
## Bandwidth: 37
## R2 of Local Model: 0.205213625103152
## Bandwidth: 38
## R2 of Local Model: 0.204831528519056
## Bandwidth: 39
## R2 of Local Model: 0.21517916847563
## Bandwidth: 40
## R2 of Local Model: 0.212283842745192
## Bandwidth: 41
## R2 of Local Model: 0.192391341963029
## Bandwidth: 42
## R2 of Local Model: 0.192600252539304
## Best Bandwidth (Based on the Local Model): 39
GRF Estimation
grf_model <- grf(formula = formula_rf, dframe = rf_dat, bw = bwgrf$Best.BW, ntree = 500, mtry = 2, kernel = "adaptive", forests = TRUE, coords = coordss)
##
## Number of Observations: 49
## Number of Independent Variables: 4
## Kernel: Adaptive
## Neightbours: 39
##
## --------------- Global Model Summary ---------------
## Ranger result
##
## Call:
## ranger(formula_rf, data = rf_dat, num.trees = 500, mtry = 2, importance = "impurity", num.threads = NULL)
##
## Type: Regression
## Number of trees: 500
## Sample size: 49
## Number of independent variables: 4
## Mtry: 2
## Target node size: 5
## Variable importance mode: impurity
## Splitrule: variance
## OOB prediction error (MSE): 0.04747162
## R squared (OOB): 0.2207997
##
## Importance:
## DISCBD CRIME INC AREA
## 1.1978931 0.4870933 0.5620217 0.4179139
##
## Mean Square Error (Not OBB): 0.009
## R-squared (Not OBB) %: 84.211
## AIC (Not OBB): -218.567
## AICc (Not OBB): -217.171
##
## --------------- Local Model Summary ---------------
##
## Residuals OOB:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.55667 -0.11630 0.03157 0.01131 0.14820 0.66380
##
## Residuals Predicted (Not OBB):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.0909200 -0.0150306 0.0001396 -0.0014477 0.0171505 0.0650780
##
## Local Variable Importance:
## Min Max Mean StD
## DISCBD 0.2995852 1.5901250 0.7205589 0.42539293
## CRIME 0.1683378 0.8473273 0.3717427 0.17635704
## INC 0.2524345 0.7063297 0.4611670 0.11465744
## AREA 0.2201529 0.4300722 0.3247466 0.05528003
##
## Mean squared error (OOB): 0.053
## R-squared (OOB) %: 10.452
## AIC (OBB): -133.529
## AICc (OBB): -132.133
## Mean squared error Predicted (Not OBB): 0.001
## R-squared Predicted (Not OBB) %: 98.322
## AIC Predicted (Not OBB): -328.396
## AICc Predicted (Not OBB): -327
##
## Calculation time (in seconds): 0.9108
Visualización de resultados estimados de GRF
Variable dependiente
columbus_shp$grf_predicteddv <- grf_model$LGofFit$LM_yfitPred
tm_shape(columbus_shp) +
tm_polygons(col = "grf_predicteddv", palette="RdPu", style="quantile", n=8, title="Valor en 1,000 USD") +
tm_layout(title= 'GRF Predicted Valor de viviendas', title.position = c('right', 'top'))
## Warning: Currect projection of shape columbus_shp unknown. Long-lat (WGS84) is
## assumed.
local R2
columbus_shp$grf_localR2 <- grf_model$LGofFit$LM_Rsq100
tm_shape(columbus_shp) +
tm_polygons(col = "grf_localR2", palette="Greens", style="quantile", n=8, title="R2") +
tm_layout(title= 'GRF (Spatial ML) Estimated Local R2', title.position = c('right', 'top'))
## Warning: Currect projection of shape columbus_shp unknown. Long-lat (WGS84) is
## assumed.