{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Manual 03: Handling DANE data - Food insecurity\n", "## Universidad del Norte\n", "### Economía Matemática\n", "\n", "This manual presents a couple of packages for writing efficient code and shows why each step of data handling matters: correctly **setting** the *working directory*, and running filters that follow the DANE methodology. Functions are defined from selected variables, and variable values are recoded for easier interpretation.\n", " \n", "The goal of this guide is to show that, for any dataset, you must understand basic concepts such as the use of *keys*, package functions, and correct parameterization in order to **merge** tables from the same database and build the models you want with properly cleaned data. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 1: Setting the working directory\n", "To guarantee that the files load correctly, set the **working directory** to the folder that contains the DANE files. This can be done with the `os` package.\n", "\n", "Here the assigned path must be the folder holding the documents that the pandas package will read."
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['.DS_Store', 'Filtered_masterdata.csv', 'Clase1.ipynb', 'Datosmodelo.csv', 'Opera1.html', 'EjercicioEVA', 'clase.html', 'Prom01.qmd', 'Opera1.ipynb', 'Opera1_files', '.git', 'Opera2.ipynb']\n" ] } ], "source": [ "import os\n", "os.chdir(\"/Users/carlosandresyanesguerra/Documents/Ecomat\")\n", "# Remember to always use the correct / separator\n", "\n", "print(os.listdir())\n", "# listdir() lets us verify that the files we are about to read actually exist\n", "# in the directory; the names must match exactly, otherwise pandas will not recognize them." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 2: Loading the data\n", "\n", "The DANE files come in `CSV` format and use a semicolon (;) as delimiter. They can be loaded as follows with the `pandas` package, which lets us work with the so-called **DataFrames**. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "tabla1 = pd.read_csv('ejercicioEVA/caracteristicas.CSV', delimiter=';')\n", "tabla2 = pd.read_csv('ejercicioEVA/condicionesdevida.CSV', delimiter=';')\n", "tabla3 = pd.read_csv('ejercicioEVA/datosdelavivienda.csv', delimiter=';')\n", "tabla4 = pd.read_csv('ejercicioEVA/Educacion.CSV', delimiter=';')\n", "tabla5 = pd.read_csv('ejercicioEVA/serviciosdelhogar.CSV', delimiter=';')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 3: Defining keys and intersecting the data. 
\n", "\n", "Think of this process as the start of filtering out possible duplicate records in your analysis. The **DANE** provides several tools to identify the same set of units across the different tables of a single database. To make sure we keep an amount of data that is sensible and relevant, we will use the variable `DIRECTORIO` as our **key** for filtering. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "86063" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We do not want to change the structure of the files defined above; each respondent\n", "# is assigned a value of the variable DIRECTORIO. Remember, this is our KEY\n", "\n", "directorio1 = set(tabla1[\"DIRECTORIO\"])\n", "directorio2 = set(tabla2[\"DIRECTORIO\"])\n", "directorio3 = set(tabla3[\"DIRECTORIO\"])\n", "directorio4 = set(tabla4[\"DIRECTORIO\"])\n", "directorio5 = set(tabla5[\"DIRECTORIO\"])\n", "\n", "# Finding the intersections of the sets\n", "\n", "directorios_comunes = directorio1.intersection(directorio2, directorio3, directorio4, directorio5)\n", "# we assign the values of the intersection to the variable directorios_comunes\n", "\n", "len(directorios_comunes)\n", "# we check the length of the intersection to confirm that the number of records is sensible and relevant" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 4: Filtering by common values with .isin() and removing duplicates. \n", "\n", "Next we use the .isin() method as an efficient way to select records in a **DataFrame**, in this case based on a list of values given by the variable `DIRECTORIO`. 
It can also be used to match records against another table; afterwards, the `drop_duplicates` method removes those duplicate records that shared the same value of the variable DIRECTORIO" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# we use isin to keep the records whose directorios appear in every database\n", "\n", "tabla1_filtrada = tabla1[tabla1[\"DIRECTORIO\"].isin(directorios_comunes)] \n", "tabla2_filtrada = tabla2[tabla2[\"DIRECTORIO\"].isin(directorios_comunes)]\n", "tabla3_filtrada = tabla3[tabla3[\"DIRECTORIO\"].isin(directorios_comunes)]\n", "tabla4_filtrada = tabla4[tabla4[\"DIRECTORIO\"].isin(directorios_comunes)]\n", "tabla5_filtrada = tabla5[tabla5[\"DIRECTORIO\"].isin(directorios_comunes)]\n", "# this looks each value up in the intersection list identified above\n", "\n", "# We drop the duplicated values in each table so that no record is repeated\n", "\n", "tabla1_filtrada = tabla1_filtrada.drop_duplicates(subset=\"DIRECTORIO\")\n", "tabla2_filtrada = tabla2_filtrada.drop_duplicates(subset=\"DIRECTORIO\")\n", "tabla3_filtrada = tabla3_filtrada.drop_duplicates(subset=\"DIRECTORIO\")\n", "tabla4_filtrada = tabla4_filtrada.drop_duplicates(subset=\"DIRECTORIO\")\n", "tabla5_filtrada = tabla5_filtrada.drop_duplicates(subset=\"DIRECTORIO\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 5: Suffixes before merging tables\n", "\n", "Suffixes let us avoid losing columns when merging tables. We use the `.add_suffix(\"\")` method and then make sure the \"key\" column keeps a copy without a suffix, so that when we perform the **merge** everything lines up with the values in each of the tables we prepared above. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "tabla1_filtrada = tabla1_filtrada.add_suffix(\"_archivo1\")\n", "tabla2_filtrada = tabla2_filtrada.add_suffix(\"_archivo2\")\n", "tabla3_filtrada = tabla3_filtrada.add_suffix(\"_archivo3\")\n", "tabla4_filtrada = tabla4_filtrada.add_suffix(\"_archivo4\")\n", "tabla5_filtrada = tabla5_filtrada.add_suffix(\"_archivo5\")\n", "# add_suffix appends a label to every column name\n", "\n", "tabla1_filtrada[\"DIRECTORIO\"] = tabla1_filtrada[\"DIRECTORIO_archivo1\"]\n", "tabla2_filtrada[\"DIRECTORIO\"] = tabla2_filtrada[\"DIRECTORIO_archivo2\"]\n", "tabla3_filtrada[\"DIRECTORIO\"] = tabla3_filtrada[\"DIRECTORIO_archivo3\"]\n", "tabla4_filtrada[\"DIRECTORIO\"] = tabla4_filtrada[\"DIRECTORIO_archivo4\"]\n", "tabla5_filtrada[\"DIRECTORIO\"] = tabla5_filtrada[\"DIRECTORIO_archivo5\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 6: Saving a new document\n", "\n", "To store the whole procedure and obtain the new *merged table* we combine several functions: `pd.concat([...])`, `axis=1`, `join=\"inner\"`, `.set_index()` and `.reset_index()` \n", "\n", "#### The function pd.concat([...], axis=1, join=\"inner\") \n", "\n", "- `pd.concat()` is used to concatenate (join) multiple DataFrames. \n", "- `axis=1` indicates that the DataFrames are combined by columns (horizontally).\n", "- `join=\"inner\"` takes the intersection of the indices, i.e. only rows whose DIRECTORIO is present in every DataFrame are kept.\n", "\n", "#### The functions .set_index() and .reset_index() \n", "\n", "Before being *concatenated*, each table switches its index to the DIRECTORIO column. This turns DIRECTORIO into the index of the **DataFrame** and lets the combination be done precisely, using this key as the reference. \n", "\n", "After the concatenation, the index of the resulting table is still DIRECTORIO. 
If we want DIRECTORIO to become a regular column again instead of the index, we use .reset_index().\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " DIRECTORIO DIRECTORIO_archivo1 SECUENCIA_ENCUESTA_archivo1 \\\n", "0 7910114 7910114 1 \n", "1 7910115 7910115 1 \n", "2 7910119 7910119 1 \n", "3 7910120 7910120 1 \n", "4 7910121 7910121 1 \n", "\n", " SECUENCIA_P_archivo1 ORDEN_archivo1 FEX_C_archivo1 P6016_archivo1 \\\n", "0 1 1 282.370945 1 \n", "1 1 1 3.054217 1 \n", "2 1 1 31.288477 1 \n", "3 1 1 73.577749 1 \n", "4 1 1 48.549967 1 \n", "\n", " P1894_archivo1 P6020_archivo1 P6034_archivo1 ... P3174S1_archivo5 \\\n", "0 3 2 1 ... NaN \n", "1 3 1 1 ... NaN \n", "2 3 2 1 ... NaN \n", "3 3 2 1 ... NaN \n", "4 3 2 1 ... NaN \n", "\n", " P3174S2_archivo5 P3174S3_archivo5 P3174S4_archivo5 P3174S5_archivo5 \\\n", "0 NaN NaN NaN NaN \n", "1 1.0 NaN NaN NaN \n", "2 NaN NaN NaN NaN \n", "3 NaN NaN NaN NaN \n", "4 NaN NaN NaN NaN \n", "\n", " I_HOGAR_archivo5 I_UGASTO_archivo5 PERCAPITA_archivo5 I_OU_archivo5 \\\n", "0 775000.0 775000.0 387500.0 0.0 \n", "1 2200000.0 2200000.0 440000.0 0.0 \n", "2 410000.0 410000.0 205000.0 0.0 \n", "3 720000.0 720000.0 360000.0 0.0 \n", "4 1892500.0 1892500.0 946250.0 0.0 \n", "\n", " CANT_PERSONAS_HOGAR_archivo5 \n", "0 2 \n", "1 5 \n", "2 2 \n", "3 2 \n", "4 2 \n", "\n", "[5 rows x 432 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "masterdata = pd.concat([tabla1_filtrada.set_index(\"DIRECTORIO\"), tabla2_filtrada.set_index(\"DIRECTORIO\"), tabla3_filtrada.set_index(\"DIRECTORIO\"), tabla4_filtrada.set_index(\"DIRECTORIO\"), tabla5_filtrada.set_index(\"DIRECTORIO\")], axis=1, join=\"inner\").reset_index()\n", "\n", "# head() lets us check whether we obtained the expected result\n", "masterdata.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 7: Reading new files for variable selection, and reading documents inside a directory under the main path\n", "\n", "First we verify that the data structures are the same, to know which process to follow for the **merge** and whether keys will be required to merge as at the beginning. To read the files inside a directory, we gather into a list the base paths of all the files from which we will select variables. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Structure of the Atlantico file" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " DIRECTORIO SECUENCIA_ENCUESTA SECUENCIA_P ORDEN DEPARTAMENTO \\\n", "0 7910520 1 1 1 8 \n", "1 7910548 1 1 1 8 \n", "2 7910734 1 1 1 8 \n", "3 7910990 1 1 1 8 \n", "4 7911030 1 1 1 8 \n", "\n", " MUNICIPIO REGION FEX_C CANT_HOG_COMPLETOS CANT_HOGARES_VIVIENDA \\\n", "0 1 1 496.065748 1 1 \n", "1 758 1 4.835424 1 1 \n", "2 1 1 516.738612 1 1 \n", "3 1 1 509.852073 1 1 \n", "4 573 1 20.597219 1 1 \n", "\n", " CLASE P3516 X01 X02 X03 X04 X05 X06 X07 X08 \n", "0 1 NaN 1 1 1 1 1 1 2 2 \n", "1 2 NaN 1 2 1 1 1 1 1 2 \n", "2 1 NaN 2 2 2 2 2 2 2 2 \n", "3 1 NaN 2 2 2 2 2 2 2 2 \n", "4 2 NaN 1 1 1 1 2 2 2 2 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "doc1 = r'/Users/carlosandresyanesguerra/Documents/Ecomat/EjercicioEVA/B data calidad de vida/Base de Calidad de vida Atlantico.dta'\n", "doc11 = pd.read_stata(doc1)\n", "doc11.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Structure of the Bolivar file" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " DIRECTORIO SECUENCIA_ENCUESTA SECUENCIA_P ORDEN DEPARTAMENTO \\\n", "0 7910186 1 1 1 13 \n", "1 7910187 1 1 1 13 \n", "2 7910280 1 1 1 13 \n", "3 7910283 1 1 1 13 \n", "4 7910303 1 1 1 13 \n", "\n", " MUNICIPIO REGION FEX_C CANT_HOG_COMPLETOS CANT_HOGARES_VIVIENDA \\\n", "0 1 1 221.608387 1 1 \n", "1 1 1 143.166971 1 1 \n", "2 1 1 549.758837 1 1 \n", "3 1 1 98.321580 1 1 \n", "4 1 1 399.300546 1 1 \n", "\n", " CLASE P3516 X01 X02 X03 X04 X05 X06 X07 X08 \n", "0 2 NaN 1 1 1 1 1 2 2 2 \n", "1 2 NaN 2 2 2 2 2 2 2 2 \n", "2 1 NaN 2 2 2 2 2 2 2 2 \n", "3 2 NaN 1 1 1 1 1 2 2 2 \n", "4 1 NaN 1 1 1 2 2 2 2 1 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "doc2 = r'/Users/carlosandresyanesguerra/Documents/Ecomat/EjercicioEVA/B data calidad de vida/Base de Calidad de vida Bolivar.dta'\n", "doc22 = pd.read_stata(doc2)\n", "doc22.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the data follow the same structure, no keys are needed for this merge: every `.dta` file contains the same identifier columns DIRECTORIO, SECUENCIA_P and SECUENCIA_ENCUESTA and the same column layout, so the files can simply be stacked. Passing the parameter `ignore_index=True` to the concatenation builds a new sequential index for the combined data, and the records line up correctly because all the files share the same data structure."
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " DIRECTORIO SECUENCIA_ENCUESTA SECUENCIA_P ORDEN DEPARTAMENTO \\\n", "0 7910520 1 1 1 8 \n", "1 7910548 1 1 1 8 \n", "2 7910734 1 1 1 8 \n", "3 7910990 1 1 1 8 \n", "4 7911030 1 1 1 8 \n", "\n", " MUNICIPIO REGION FEX_C CANT_HOG_COMPLETOS CANT_HOGARES_VIVIENDA \\\n", "0 1 1 496.065748 1 1 \n", "1 758 1 4.835424 1 1 \n", "2 1 1 516.738612 1 1 \n", "3 1 1 509.852073 1 1 \n", "4 573 1 20.597219 1 1 \n", "\n", " CLASE P3516 X01 X02 X03 X04 X05 X06 X07 X08 \n", "0 1 NaN 1 1 1 1 1 1 2 2 \n", "1 2 NaN 1 2 1 1 1 1 1 2 \n", "2 1 NaN 2 2 2 2 2 2 2 2 \n", "3 1 NaN 2 2 2 2 2 2 2 2 \n", "4 2 NaN 1 1 1 1 2 2 2 2 \n" ] } ], "source": [ "file_paths = [\n", "    r'/Users/carlosandresyanesguerra/Documents/Ecomat/EjercicioEVA/B data calidad de vida/Base de Calidad de vida Atlantico.dta',\n", "    r'/Users/carlosandresyanesguerra/Documents/Ecomat/EjercicioEVA/B data calidad de vida/Base de Calidad de vida Bolivar.dta',\n", "    r'/Users/carlosandresyanesguerra/Documents/Ecomat/EjercicioEVA/B data calidad de vida/Base de Calidad de vida Cesar.dta',\n", "    r'/Users/carlosandresyanesguerra/Documents/Ecomat/EjercicioEVA/B data calidad de vida/Base de Calidad de vida Cordoba.dta',\n", "    r'/Users/carlosandresyanesguerra/Documents/Ecomat/EjercicioEVA/B data calidad de vida/Base de Calidad de vida La Guajira.dta',\n", "    r'/Users/carlosandresyanesguerra/Documents/Ecomat/EjercicioEVA/B data calidad de vida/Base de Calidad de vida Magdalena.dta',\n", "    r'/Users/carlosandresyanesguerra/Documents/Ecomat/EjercicioEVA/B data calidad de vida/Base de Calidad de vida Sucre.dta'\n", "]\n", "Datadep = []\n", "\n", "# Reading each file and appending the resulting DataFrame to the list\n", "for file_path in file_paths:\n", "    df = pd.read_stata(file_path)\n", "    Datadep.append(df)\n", "\n", "# Finally we concatenate the DataFrames\n", "Datadepartamentos = pd.concat(Datadep, ignore_index=True)\n", "\n", "# Checking that the data make sense\n", "print(Datadepartamentos.head())\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 8: Joining the masterdata DataFrame with the new Datadepartamentos file \n", "\n", "Here we again use the key `DIRECTORIO` and the `.isin` method, verify whether there are residuals after the *merge*, and create a `.csv` file with the \"master\" unit filtered and integrated with the new information from the other databases." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "All records were found\n" ] } ], "source": [ "# Filtering our first database by the directorios used in the first study\n", "\n", "Filtered_masterdata = masterdata[masterdata[\"DIRECTORIO\"].isin(Datadepartamentos[\"DIRECTORIO\"])]\n", "len(Filtered_masterdata)\n", "\n", "# Checking the number of residuals after the merge\n", "\n", "if len(Filtered_masterdata) - len(Datadepartamentos) == 0:\n", " \n", "    print(\"All records were found\")\n", " \n", "else:\n", " \n", "    print(\"Not all records were found\")\n", "\n", "Filtered_masterdata.to_csv(\"Filtered_masterdata.csv\", index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 9: Selecting variables and creating the working DataFrame.\n", "\n", "From here on we select the variables with which we aim to build the Food Security variable for the Colombian Caribbean, coded so that the data can be categorized to obtain interesting interpretations and models; for this, the data must sit in a separate subset or DataFrame." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " P6081_archivo1 P6083_archivo1 P6088_archivo1 P2061_archivo1 \\\n", "3 3 3 1.0 1.0 \n", "63 2 2 9.0 2.0 \n", "64 2 2 4.0 2.0 \n", "140 2 2 2.0 2.0 \n", "143 3 1 NaN 2.0 \n", "\n", " P1070_archivo3 P6160_archivo4 P8586_archivo4 P5010_archivo5 \\\n", "3 1 1 2 1 \n", "63 1 1 2 1 \n", "64 1 1 2 2 \n", "140 2 1 2 2 \n", "143 1 1 2 2 \n", "\n", " P791_archivo5 P3162_archivo5 P3163_archivo5 P5032_archivo5 \n", "3 3 NaN NaN 4 \n", "63 3 NaN NaN 4 \n", "64 1 NaN NaN 4 \n", "140 1 1.0 70000.0 1 \n", "143 1 1.0 41582.0 1 " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "VARIABLES_MODELO = [\"P5010\",\n", " \"P791\",\"P3162\",\"P3163\",\"P5032\",\"P6081\",\"P6083\",\n", " \"P6088\",\"P2061\",\"P1070\",\"P6160\",\"P8586\"]\n", "\n", "DIRECTORIO = \"DIRECTORIO\"\n", "\n", "# Keeping the columns whose names start with a VARIABLES_MODELO prefix followed by a suffix\n", "DATOS_MODELO = [col for col in Filtered_masterdata.columns if any(col.startswith(var + '_') for var in VARIABLES_MODELO)]\n", "\n", "# Filter the DataFrame to keep only those columns\n", "DATOS_MODELO_DF = Filtered_masterdata.loc[:, Filtered_masterdata.columns.isin(DATOS_MODELO)]\n", "\n", "# Show the filtered DataFrame\n", "DATOS_MODELO_DF.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 11: Reassigning the DIRECTORIO key to the variables DataFrame \n", "\n", "We have now created the target DataFrame with the variables from the `.csv` files. The advantage of having created the `filtered_masterdata` file is that the *DIRECTORIO* column is lined up with the columns of the variables we took, so we now create another **DataFrame** that includes it, to be able to work with the variables of the new dataset that came from the `.dta` files." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "if DIRECTORIO in Filtered_masterdata.columns:\n", "    DATOS_MODELO = [DIRECTORIO] + DATOS_MODELO\n", " \n", "DATAF1 = Filtered_masterdata[DATOS_MODELO]\n", "\n", "DATAF1.head()\n", "\n", "DATAF1.to_csv(\"Datosmodelo.csv\", index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 12: Identifying variables from the .dta dataset \n", "\n", "Now that the variables from the `.csv` files are stored in **datosmodelo.csv**, we have to work with the \"calidad de vida\" `.dta` variables to create our food-insecurity variable. We assign the value 1 only when the value of a selected variable is strictly equal to 1; if for any reason the value differs from 1, we assign 0. We are almost there!\n", "\n", "$$Variable_i = \\left\\{\n", "\\begin{array}{ll}\n", "1 & \\text{if the characteristic holds} \\\\\n", "0 & \\text{otherwise}\n", "\\end{array}\n", "\\right.$$\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "modificar_columnas = [\"X01\",\"X02\",\"X03\",\"X04\",\"X05\",\"X06\",\"X07\",\"X08\"]\n", "\n", "# Recode each column: keep 1 as 1, map anything else to 0\n", "for columnas in modificar_columnas:\n", "    Datadepartamentos[columnas] = Datadepartamentos[columnas].apply(lambda x: 1 if x == 1 else 0)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " DIRECTORIO SECUENCIA_ENCUESTA SECUENCIA_P ORDEN DEPARTAMENTO \\\n", "0 7910520 1 1 1 8 \n", "1 7910548 1 1 1 8 \n", "2 7910734 1 1 1 8 \n", "3 7910990 1 1 1 8 \n", "4 7911030 1 1 1 8 \n", "5 7911031 1 1 1 8 \n", "6 7911049 1 1 1 8 \n", "7 7911050 1 1 1 8 \n", "8 7911062 1 1 1 8 \n", "9 7911072 1 1 1 8 \n", "\n", " MUNICIPIO REGION FEX_C CANT_HOG_COMPLETOS CANT_HOGARES_VIVIENDA \\\n", "0 1 1 496.065748 1 1 \n", "1 758 1 4.835424 1 1 \n", "2 1 1 516.738612 1 1 \n", "3 1 1 509.852073 1 1 \n", "4 573 1 20.597219 1 1 \n", "5 573 1 24.094573 1 1 \n", "6 1 1 607.704351 1 1 \n", "7 1 1 497.369340 1 1 \n", "8 1 1 604.370204 1 1 \n", "9 296 1 547.080870 1 1 \n", "\n", " CLASE P3516 X01 X02 X03 X04 X05 X06 X07 X08 \n", "0 1 NaN 1 1 1 1 1 1 0 0 \n", "1 2 NaN 1 0 1 1 1 1 1 0 \n", "2 1 NaN 0 0 0 0 0 0 0 0 \n", "3 1 NaN 0 0 0 0 0 0 0 0 \n", "4 2 NaN 1 1 1 1 0 0 0 0 \n", "5 2 NaN 1 1 0 0 0 0 0 0 \n", "6 1 NaN 0 0 0 1 1 0 0 0 \n", "7 1 NaN 1 1 1 1 1 1 1 0 \n", "8 1 NaN 0 1 0 0 0 0 0 0 \n", "9 1 NaN 0 1 0 0 0 0 0 0 " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Datadepartamentos.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 13: Adding the index column to the DataFrame \n", "\n", "We care about this value because it is how we will be able to define a **function** that creates the food-insecurity variable: the column sums the 1-values of each row from \"X01\" through \"X08\" and stores the result in the INDICE column." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " DIRECTORIO SECUENCIA_ENCUESTA SECUENCIA_P ORDEN DEPARTAMENTO \\\n", "0 7910520 1 1 1 8 \n", "1 7910548 1 1 1 8 \n", "2 7910734 1 1 1 8 \n", "3 7910990 1 1 1 8 \n", "4 7911030 1 1 1 8 \n", "5 7911031 1 1 1 8 \n", "6 7911049 1 1 1 8 \n", "7 7911050 1 1 1 8 \n", "8 7911062 1 1 1 8 \n", "9 7911072 1 1 1 8 \n", "\n", " MUNICIPIO REGION FEX_C CANT_HOG_COMPLETOS CANT_HOGARES_VIVIENDA \\\n", "0 1 1 496.065748 1 1 \n", "1 758 1 4.835424 1 1 \n", "2 1 1 516.738612 1 1 \n", "3 1 1 509.852073 1 1 \n", "4 573 1 20.597219 1 1 \n", "5 573 1 24.094573 1 1 \n", "6 1 1 607.704351 1 1 \n", "7 1 1 497.369340 1 1 \n", "8 1 1 604.370204 1 1 \n", "9 296 1 547.080870 1 1 \n", "\n", " ... P3516 X01 X02 X03 X04 X05 X06 X07 X08 INDICE \n", "0 ... NaN 1 1 1 1 1 1 0 0 6 \n", "1 ... NaN 1 0 1 1 1 1 1 0 6 \n", "2 ... NaN 0 0 0 0 0 0 0 0 0 \n", "3 ... NaN 0 0 0 0 0 0 0 0 0 \n", "4 ... NaN 1 1 1 1 0 0 0 0 4 \n", "5 ... NaN 1 1 0 0 0 0 0 0 2 \n", "6 ... NaN 0 0 0 1 1 0 0 0 2 \n", "7 ... NaN 1 1 1 1 1 1 1 0 7 \n", "8 ... NaN 0 1 0 0 0 0 0 0 1 \n", "9 ... NaN 0 1 0 0 0 0 0 0 1 \n", "\n", "[10 rows x 21 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Datadepartamentos[\"INDICE\"] = Datadepartamentos[modificar_columnas].sum(axis=1)\n", "Datadepartamentos.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 14: Defining a function for food insecurity, categorical and coded.\n", "\n", "Now that the variables are defined on the dataset and the **INDICE** column is in place, we will use the values summed into INDICE, whose minimum value is 0 and maximum value is 8, to create a column that represents the food-insecurity variable and categorize the levels of that food insecurity."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Definición codificada\n" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DIRECTORIOSECUENCIA_ENCUESTASECUENCIA_PORDENDEPARTAMENTOMUNICIPIOREGIONFEX_CCANT_HOG_COMPLETOSCANT_HOGARES_VIVIENDA...X01X02X03X04X05X06X07X08INDICESN_SA
07910520111811496.06574811...1111110061
17910548111875814.83542411...1011111061
27910734111811516.73861211...0000000000
37910990111811509.85207311...0000000000
479110301118573120.59721911...1111000041
579110311118573124.09457311...1100000020
67911049111811607.70435111...0001100020
77911050111811497.36934011...1111111072
87911062111811604.37020411...0100000010
9791107211182961547.08087011...0100000010
107911080111811536.76420711...1011100041
11791108111187581677.03401711...0100000010
127911093111811516.73861211...1111110172
13791109711182961341.68802211...1111111072
14791110711182961496.37033811...0100000010
\n", "

15 rows × 22 columns

\n", "
" ], "text/plain": [ " DIRECTORIO SECUENCIA_ENCUESTA SECUENCIA_P ORDEN DEPARTAMENTO \\\n", "0 7910520 1 1 1 8 \n", "1 7910548 1 1 1 8 \n", "2 7910734 1 1 1 8 \n", "3 7910990 1 1 1 8 \n", "4 7911030 1 1 1 8 \n", "5 7911031 1 1 1 8 \n", "6 7911049 1 1 1 8 \n", "7 7911050 1 1 1 8 \n", "8 7911062 1 1 1 8 \n", "9 7911072 1 1 1 8 \n", "10 7911080 1 1 1 8 \n", "11 7911081 1 1 1 8 \n", "12 7911093 1 1 1 8 \n", "13 7911097 1 1 1 8 \n", "14 7911107 1 1 1 8 \n", "\n", " MUNICIPIO REGION FEX_C CANT_HOG_COMPLETOS CANT_HOGARES_VIVIENDA \\\n", "0 1 1 496.065748 1 1 \n", "1 758 1 4.835424 1 1 \n", "2 1 1 516.738612 1 1 \n", "3 1 1 509.852073 1 1 \n", "4 573 1 20.597219 1 1 \n", "5 573 1 24.094573 1 1 \n", "6 1 1 607.704351 1 1 \n", "7 1 1 497.369340 1 1 \n", "8 1 1 604.370204 1 1 \n", "9 296 1 547.080870 1 1 \n", "10 1 1 536.764207 1 1 \n", "11 758 1 677.034017 1 1 \n", "12 1 1 516.738612 1 1 \n", "13 296 1 341.688022 1 1 \n", "14 296 1 496.370338 1 1 \n", "\n", " ... X01 X02 X03 X04 X05 X06 X07 X08 INDICE SN_SA \n", "0 ... 1 1 1 1 1 1 0 0 6 1 \n", "1 ... 1 0 1 1 1 1 1 0 6 1 \n", "2 ... 0 0 0 0 0 0 0 0 0 0 \n", "3 ... 0 0 0 0 0 0 0 0 0 0 \n", "4 ... 1 1 1 1 0 0 0 0 4 1 \n", "5 ... 1 1 0 0 0 0 0 0 2 0 \n", "6 ... 0 0 0 1 1 0 0 0 2 0 \n", "7 ... 1 1 1 1 1 1 1 0 7 2 \n", "8 ... 0 1 0 0 0 0 0 0 1 0 \n", "9 ... 0 1 0 0 0 0 0 0 1 0 \n", "10 ... 1 0 1 1 1 0 0 0 4 1 \n", "11 ... 0 1 0 0 0 0 0 0 1 0 \n", "12 ... 1 1 1 1 1 1 0 1 7 2 \n", "13 ... 1 1 1 1 1 1 1 0 7 2 \n", "14 ... 
0 1 0 0 0 0 0 0 1 0 \n", "\n", "[15 rows x 22 columns]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def seg_alimentaria(value):\n", "    # 0-3 positive answers -> 0, 4-6 -> 1, 7-8 -> 2\n", "    if 0 <= value <= 3:\n", "        return 0\n", "    elif 4 <= value <= 6:\n", "        return 1\n", "    elif 7 <= value <= 8:\n", "        return 2\n", "    else:\n", "        return \"NA\"\n", "\n", "Datadepartamentos[\"SN_SA\"] = Datadepartamentos[\"INDICE\"].apply(seg_alimentaria)\n", "\n", "Datadepartamentos.head(15)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Categorical definition" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DIRECTORIOSECUENCIA_ENCUESTASECUENCIA_PORDENDEPARTAMENTOMUNICIPIOREGIONFEX_CCANT_HOG_COMPLETOSCANT_HOGARES_VIVIENDA...X02X03X04X05X06X07X08INDICESN_SASA_SA
07910520111811496.06574811...111110061Moderada
17910548111875814.83542411...011111061Moderada
27910734111811516.73861211...000000000Leve
37910990111811509.85207311...000000000Leve
479110301118573120.59721911...111000041Moderada
579110311118573124.09457311...100000020Leve
67911049111811607.70435111...001100020Leve
77911050111811497.36934011...111111072Severa
87911062111811604.37020411...100000010Leve
9791107211182961547.08087011...100000010Leve
\n", "

10 rows × 23 columns

\n", "
" ], "text/plain": [ " DIRECTORIO SECUENCIA_ENCUESTA SECUENCIA_P ORDEN DEPARTAMENTO \\\n", "0 7910520 1 1 1 8 \n", "1 7910548 1 1 1 8 \n", "2 7910734 1 1 1 8 \n", "3 7910990 1 1 1 8 \n", "4 7911030 1 1 1 8 \n", "5 7911031 1 1 1 8 \n", "6 7911049 1 1 1 8 \n", "7 7911050 1 1 1 8 \n", "8 7911062 1 1 1 8 \n", "9 7911072 1 1 1 8 \n", "\n", " MUNICIPIO REGION FEX_C CANT_HOG_COMPLETOS CANT_HOGARES_VIVIENDA \\\n", "0 1 1 496.065748 1 1 \n", "1 758 1 4.835424 1 1 \n", "2 1 1 516.738612 1 1 \n", "3 1 1 509.852073 1 1 \n", "4 573 1 20.597219 1 1 \n", "5 573 1 24.094573 1 1 \n", "6 1 1 607.704351 1 1 \n", "7 1 1 497.369340 1 1 \n", "8 1 1 604.370204 1 1 \n", "9 296 1 547.080870 1 1 \n", "\n", " ... X02 X03 X04 X05 X06 X07 X08 INDICE SN_SA SA_SA \n", "0 ... 1 1 1 1 1 0 0 6 1 Moderada \n", "1 ... 0 1 1 1 1 1 0 6 1 Moderada \n", "2 ... 0 0 0 0 0 0 0 0 0 Leve \n", "3 ... 0 0 0 0 0 0 0 0 0 Leve \n", "4 ... 1 1 1 0 0 0 0 4 1 Moderada \n", "5 ... 1 0 0 0 0 0 0 2 0 Leve \n", "6 ... 0 0 1 1 0 0 0 2 0 Leve \n", "7 ... 1 1 1 1 1 1 0 7 2 Severa \n", "8 ... 1 0 0 0 0 0 0 1 0 Leve \n", "9 ... 1 0 0 0 0 0 0 1 0 Leve \n", "\n", "[10 rows x 23 columns]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def seg_alimentaria_categ(value):\n", " if value == 0:\n", " return \"Leve\"\n", " elif value == 1:\n", " return \"Moderada\"\n", " elif value == 2:\n", " return \"Severa\"\n", " else:\n", " return \"NA\"\n", " \n", "Datadepartamentos[\"SA_SA\"] = Datadepartamentos[\"SN_SA\"].apply(seg_alimentaria_categ)\n", "\n", "Datadepartamentos.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "--------------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "-------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ya tenemos toda la estructura de las bases de datos y hasta acá el trabajo duro que habia por hacer." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Opcional " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Paso 15: Interpretar los datos o crear modelos. \n", "\n", "Ya teniendo todos los componentes que se necesitaban para dar estructura. Ahora podemos graficar que porcentaje de población en el caribe Colombiano que esta con Inseguridad Alimentaria *leve*, *moderada* y *severa*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```python\n", "pip install plotly \n", "```" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "application/vnd.plotly.v1+json": { "config": { "plotlyServerURL": "https://plot.ly" }, "data": [ { "hovertemplate": "Categoría=%{x}
Porcentaje=%{text}", "legendgroup": "Leve", "marker": { "color": "#440154", "pattern": { "shape": "" } }, "name": "Leve", "orientation": "v", "showlegend": true, "text": { "bdata": "82kqKu14TkA=", "dtype": "f8" }, "textposition": "outside", "texttemplate": "%{text:.2f}%", "type": "bar", "x": [ "Leve" ], "xaxis": "x", "y": { "bdata": "82kqKu14TkA=", "dtype": "f8" }, "yaxis": "y" }, { "hovertemplate": "Categoría=%{x}
Porcentaje=%{text}", "legendgroup": "Moderada", "marker": { "color": "#482878", "pattern": { "shape": "" } }, "name": "Moderada", "orientation": "v", "showlegend": true, "text": { "bdata": "OL/OfTp9N0A=", "dtype": "f8" }, "textposition": "outside", "texttemplate": "%{text:.2f}%", "type": "bar", "x": [ "Moderada" ], "xaxis": "x", "y": { "bdata": "OL/OfTp9N0A=", "dtype": "f8" }, "yaxis": "y" }, { "hovertemplate": "Categoría=%{x}
Porcentaje=%{text}", "legendgroup": "Severa", "marker": { "color": "#3e4989", "pattern": { "shape": "" } }, "name": "Severa", "orientation": "v", "showlegend": true, "text": { "bdata": "wNm4W9YhL0A=", "dtype": "f8" }, "textposition": "outside", "texttemplate": "%{text:.2f}%", "type": "bar", "x": [ "Severa" ], "xaxis": "x", "y": { "bdata": "wNm4W9YhL0A=", "dtype": "f8" }, "yaxis": "y" } ], "layout": { "barmode": "relative", "legend": { "title": { "text": "Categoría" }, "tracegroupgap": 0 }, "template": { "data": { "bar": [ { "error_x": { "color": "#2a3f5f" }, "error_y": { "color": "#2a3f5f" }, "marker": { "line": { "color": "white", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "bar" } ], "barpolar": [ { "marker": { "line": { "color": "white", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "barpolar" } ], "carpet": [ { "aaxis": { "endlinecolor": "#2a3f5f", "gridcolor": "#C8D4E3", "linecolor": "#C8D4E3", "minorgridcolor": "#C8D4E3", "startlinecolor": "#2a3f5f" }, "baxis": { "endlinecolor": "#2a3f5f", "gridcolor": "#C8D4E3", "linecolor": "#C8D4E3", "minorgridcolor": "#C8D4E3", "startlinecolor": "#2a3f5f" }, "type": "carpet" } ], "choropleth": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "type": "choropleth" } ], "contour": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "contour" } ], "contourcarpet": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "type": "contourcarpet" } ], "heatmap": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 
0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "heatmap" } ], "histogram": [ { "marker": { "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "histogram" } ], "histogram2d": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "histogram2d" } ], "histogram2dcontour": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "histogram2dcontour" } ], "mesh3d": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "type": "mesh3d" } ], "parcoords": [ { "line": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "parcoords" } ], "pie": [ { "automargin": true, "type": "pie" } ], "scatter": [ { "fillpattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 }, "type": "scatter" } ], "scatter3d": [ { "line": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatter3d" } ], "scattercarpet": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattercarpet" } ], "scattergeo": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": 
"scattergeo" } ], "scattergl": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattergl" } ], "scattermap": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattermap" } ], "scattermapbox": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattermapbox" } ], "scatterpolar": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatterpolar" } ], "scatterpolargl": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatterpolargl" } ], "scatterternary": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatterternary" } ], "surface": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "surface" } ], "table": [ { "cells": { "fill": { "color": "#EBF0F8" }, "line": { "color": "white" } }, "header": { "fill": { "color": "#C8D4E3" }, "line": { "color": "white" } }, "type": "table" } ] }, "layout": { "annotationdefaults": { "arrowcolor": "#2a3f5f", "arrowhead": 0, "arrowwidth": 1 }, "autotypenumbers": "strict", "coloraxis": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "colorscale": { "diverging": [ [ 0, "#8e0152" ], [ 0.1, "#c51b7d" ], [ 0.2, "#de77ae" ], [ 0.3, "#f1b6da" ], [ 0.4, "#fde0ef" ], [ 0.5, "#f7f7f7" ], [ 0.6, "#e6f5d0" ], [ 0.7, "#b8e186" ], [ 0.8, "#7fbc41" ], [ 0.9, "#4d9221" ], [ 1, "#276419" ] ], "sequential": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" 
], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "sequentialminus": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ] }, "colorway": [ "#636efa", "#EF553B", "#00cc96", "#ab63fa", "#FFA15A", "#19d3f3", "#FF6692", "#B6E880", "#FF97FF", "#FECB52" ], "font": { "color": "#2a3f5f" }, "geo": { "bgcolor": "white", "lakecolor": "white", "landcolor": "white", "showlakes": true, "showland": true, "subunitcolor": "#C8D4E3" }, "hoverlabel": { "align": "left" }, "hovermode": "closest", "mapbox": { "style": "light" }, "paper_bgcolor": "white", "plot_bgcolor": "white", "polar": { "angularaxis": { "gridcolor": "#EBF0F8", "linecolor": "#EBF0F8", "ticks": "" }, "bgcolor": "white", "radialaxis": { "gridcolor": "#EBF0F8", "linecolor": "#EBF0F8", "ticks": "" } }, "scene": { "xaxis": { "backgroundcolor": "white", "gridcolor": "#DFE8F3", "gridwidth": 2, "linecolor": "#EBF0F8", "showbackground": true, "ticks": "", "zerolinecolor": "#EBF0F8" }, "yaxis": { "backgroundcolor": "white", "gridcolor": "#DFE8F3", "gridwidth": 2, "linecolor": "#EBF0F8", "showbackground": true, "ticks": "", "zerolinecolor": "#EBF0F8" }, "zaxis": { "backgroundcolor": "white", "gridcolor": "#DFE8F3", "gridwidth": 2, "linecolor": "#EBF0F8", "showbackground": true, "ticks": "", "zerolinecolor": "#EBF0F8" } }, "shapedefaults": { "line": { "color": "#2a3f5f" } }, "ternary": { "aaxis": { "gridcolor": "#DFE8F3", "linecolor": "#A2B1C6", "ticks": "" }, "baxis": { "gridcolor": "#DFE8F3", "linecolor": "#A2B1C6", "ticks": "" }, "bgcolor": "white", "caxis": { "gridcolor": "#DFE8F3", "linecolor": "#A2B1C6", "ticks": "" } }, "title": { "x": 0.05 }, "xaxis": { "automargin": true, "gridcolor": "#EBF0F8", "linecolor": "#EBF0F8", "ticks": "", "title": { 
"standoff": 15 }, "zerolinecolor": "#EBF0F8", "zerolinewidth": 2 }, "yaxis": { "automargin": true, "gridcolor": "#EBF0F8", "linecolor": "#EBF0F8", "ticks": "", "title": { "standoff": 15 }, "zerolinecolor": "#EBF0F8", "zerolinewidth": 2 } } }, "title": { "text": "Distribución Porcentual de Individuos en inseguridad" }, "xaxis": { "anchor": "y", "categoryarray": [ "Leve", "Moderada", "Severa" ], "categoryorder": "array", "domain": [ 0, 1 ], "title": { "text": "Categoría" } }, "yaxis": { "anchor": "x", "domain": [ 0, 1 ], "title": { "text": "Porcentaje (%)" } } } } }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import plotly.express as px\n", "import pandas as pd\n", "\n", "# Calcular los porcentajes de cada categoría en SA_SA\n", "category_counts = Datadepartamentos[\"SA_SA\"].value_counts(normalize=True) * 100\n", "\n", "# Convertir a DataFrame para Plotly\n", "df_plotly = pd.DataFrame({\"Categoría\": category_counts.index, \"Porcentaje\": category_counts.values})\n", "\n", "# Crear la gráfica de barras con Plotly\n", "fig = px.bar(df_plotly, x=\"Categoría\", y=\"Porcentaje\", \n", " text=\"Porcentaje\", # Muestra los valores sobre las barras\n", " color=\"Categoría\", \n", " color_discrete_sequence=px.colors.sequential.Viridis, \n", " title=\"Distribución Porcentual de Individuos en inseguridad\")\n", "\n", "# Mejorar el formato de la gráfica\n", "fig.update_traces(texttemplate='%{text:.2f}%', textposition='outside')\n", "fig.update_layout(xaxis_title=\"Categoría\", \n", " yaxis_title=\"Porcentaje (%)\", \n", " template=\"plotly_white\")\n", "\n", "# Mostrar la gráfica\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "------------------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "------------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# TALLER - INTERMEDIO \n", "## Creemos un modelo de regresión logistica multinomial.\n", "\n", "Vamos a crear una **base de datos** común 
para crear un modelo de entrenamiento y luego crear el modelo real, tendremos que limpiar un poco mas los datos para este objetivo.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### PUNTO 1. Seleccionar columnas de interés \n", "\n", "Necesitamos seleccionar dos culumnas de interes del dataframe Datadepartamentos, estas columnas son la llave y la variable que representa la inseguridad alimenticia codificada.\n", "\n", "Complete el codigo." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Seleccionar las columnas de interés de Datadepartamentos\n", "df_departamentos_subset = Datadepartamentos[['', '']]\n", "\n", "# Mostrar las primeras filas del subset\n", "df_departamentos_subset.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### PUNTO 2. Empalmar la tabla creada con el dataframe DF1 usando la llave correspondiente.\n", "\n", "Debe seleccionar y ubicar una función de pandas que utilice una columna como **llave** para poder empalmar ambas tablas debe elegir la opción correcta y completa el código\n", "\n", "```python\n", "a. pd.isna\n", "b. pd.read_csv\n", "c. pd.melt\n", "d. pd.merge\n", "```\n", "\n", "Ademas escriba la llave donde corresponde ;)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Unir los DataFrames usando la columna\n", "DATAF2 = (DATAF1, df_departamentos_subset, on='', how='left')\n", "\n", "# Mostrar las primeras filas del DataFrame unido para verificar\n", "DATAF2.head(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### PUNTO 3. Rellenar valores NaN \n", "\n", "Resulta que al verificar los valores `NaN` de las variables tomadas y eso es un problema. 
For example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Count the number of NaN values in each column\n", "nan_counts = DATAF2.isna().sum()\n", "\n", "# Show the number of NaN in each column\n", "print(nan_counts)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We wanted to use the variable \"P3163_archivo5\", which records how much the household paid last month (or the last time) for natural gas service, but it has a lot of `NaN` values. We will fix this with a function that computes the median of the variable within each SN_SA group (0, 1 and 2) and fills the NaN values with those group medians. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code still needs you to specify whether values should be combined by columns or by rows, and it needs to know which food-insecurity variable defines the groups for which the median of \"P3163_archivo5\" is computed. \n", "\n", "Complete it..."
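As a cross-check for the exercise, the same group-median fill can also be written with `groupby().transform`, a more compact pandas idiom than the row-wise `apply` the exercise completes. The frame and the `pago` column below are invented placeholders, not the survey data.

```python
import numpy as np
import pandas as pd

# Invented data: a payment column with gaps, grouped by SN_SA
df = pd.DataFrame({
    "SN_SA": [0, 0, 1, 1, 2, 2],
    "pago":  [10.0, np.nan, 20.0, np.nan, 40.0, 44.0],
})

# Median of 'pago' within each SN_SA group, broadcast back to every row
medianas = df.groupby("SN_SA")["pago"].transform("median")

# Fill only the missing entries with their group's median
df["pago"] = df["pago"].fillna(medianas)
print(df["pago"].tolist())  # [10.0, 10.0, 20.0, 20.0, 40.0, 44.0]
```

`transform` returns a Series aligned with the original index, which is what makes the one-line `fillna` possible; the exercise's row-wise function produces the same values.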
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# Suponiendo que DATAF2 es su DataFrame\n", "# Calcule la mediana de P3163_archivo5 para cada grupo de SN_SA\n", "medianas = DATAF2.groupby('')['P3163_archivo5'].median()\n", "print(medianas)\n", "\n", "# Función para aplicar la mediana según el valor de SN_SA\n", "def reemplazar_nan(row):\n", " if pd.isna(row['P3163_archivo5']):\n", " return medianas[row['']]\n", " else:\n", " return row['P3163_archivo5']\n", "\n", "# Aplicar la función para reemplazar los NaN en P3163_archivo5\n", "DATAF2['P3163_archivo5'] = DATAF2.apply(reemplazar_nan, axis=)\n", "\n", "DATAF2.head(20)\n", "\n", "# Eliminar la columna 'P3162_archivo5' del DataFrame ya que tambien tenia muchos valores NaN\n", "DATAF2.drop(columns=['P3162_archivo5'], inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Vamos a terminar de limpiar los valores NaN" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Eliminar filas con valores nulos\n", "df_cleaned = DATAF2.dropna()\n", "\n", "# Mostrar las dimensiones del DataFrame original y del limpio para comparación\n", "print(\"Original:\", DATAF2.shape)\n", "print(\"Limpio:\", df_cleaned.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### PUNTO 4 Definición de variables del modelo y creación del modelo\n", "\n", "Vamos a definir la variable dependiente y las variables explicativas y vamos a crear el modelo.\n", "\n", "a. La primera parte del codigo es simplemente definir las variables, dato importante la función `.drop` esta excluyendo y no incluyendo por lo tanto cuidado." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Definir la variable dependiente y las variables explicativas\n", "y = df_cleaned['']\n", "X = df_cleaned.drop(columns=[]) # Solo excluyendo 'SN_SA' y 'DIRECTORIO'\n", "\n", "# Añadir una constante a las variables explicativas para el modelo logit\n", "import statsmodels.api as sm\n", "X = sm.add_constant(X)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "b. Definamos el modelo, las dimensiones de entrenamiento queremos que sean del 80% por lo tanto del test deben ser del 20%. Por favor ajuste el modelo a las dimensiones dichas y verifique que las dimensiones te den como resultados los siguientes valores: (14465, 12) (3617, 12) (14465,) (3617,).\n", "\n", "c. Luego simplemente ajuste el modelo logit multinomial con la variable independiente de entrenamiento y las variables dependientes de entrenamiento. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "# Dividir los datos en conjuntos de entrenamiento y prueba\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=, random_state=42)\n", "\n", "# Mostrar las dimensiones de los conjuntos de datos\n", "print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)\n", "\n", "# Ajustar el modelo logit multinomial\n", "model = sm.MNLogit()\n", "result = model.fit()\n", "\n", "# Resumen del modelo\n", "print(result.summary())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ¡Listo felicitaciones ya tiene un modelo logit multinomial!\n", "\n", "Le dejo un script que le muestra los efectos marginales de las variable del modelo esta en usted como interpretarlos ;) (Recuerde que los marginales se interpretan como son las utilidades de los consumidores)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Debe Calcular 
efectos marginales promedio\n", "mfx = result.get_margeff(at='overall')\n", "print(\"Efectos marginales promedio:\")\n", "print(mfx.summary())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "-------------------------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Agradecimientos\n", " Al Doctor Jose Luis Ramos y al Estudiante e investigador Junior del OCSA del departamento de Economía de la Universidad del Norte el joven Juan Camilo Pinedo" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.5" } }, "nbformat": 4, "nbformat_minor": 2 }