2 normalizar columnas nuevas #3

complexluise · 2024-10-28T21:40:09Z

Normalization of bibliographic data and expanded cleaning functionality

What?

This PR adds normalization for two new bibliographic fields and improves the normalization of existing fields:

New: normalization of the Dewey classification number
New: normalization of the chronological period to a century in Roman numerals
Improved normalization of places of publication, dates, authors, and titles

How?

The new functionality was implemented using TDD (test-driven development):

Dewey number:
- Only the first three digits of the classification number are extracted
- Different input formats are handled (with periods, slashes, prefixes)
- Tests were added to validate correct extraction
Chronological period:
- Years are converted to centuries in Roman numerals
- Year ranges are handled by taking the most recent year
- Variable input formats are normalized
- Exhaustive tests cover the different period formats
Improvements to existing normalizations:
- Place of publication: better handling of multiple places
- Dates: better detection of years in different formats
- Authors: improved normalization of compound names
- Titles: more robust cleanup of special characters
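The year-to-century step described above can be sketched roughly as follows. This is an illustration, not the code from this PR: the function names `int_to_roman`, `year_to_century`, and `normalize_period` are hypothetical, and the sketch assumes the usual convention that years 1801 to 1900 belong to the 19th century (the script may round differently).

```python
import re
from typing import Optional


def int_to_roman(number: int) -> str:
    """Convert a positive integer to a Roman numeral."""
    pairs = [
        (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
        (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
        (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
    ]
    parts = []
    for value, symbol in pairs:
        count, number = divmod(number, value)
        parts.append(symbol * count)
    return "".join(parts)


def year_to_century(year: int) -> int:
    # Assumption: years 1801-1900 map to century 19.
    return (year - 1) // 100 + 1


def normalize_period(raw_period: str) -> Optional[str]:
    """Hypothetical helper: pick the most recent year and return its century."""
    years = [int(match) for match in re.findall(r"\b\d{3,4}\b", raw_period)]
    if not years:
        return None  # no recognizable year in the input
    return int_to_roman(year_to_century(max(years)))
```

Under these assumptions, a range such as "1800-1900" resolves to the century of its most recent year, and input without any year yields `None`.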

Why?

Data standardization:
- Normalizing the Dewey number allows more consistent classification
- Converting periods to centuries makes temporal analysis easier
- The improvements to existing normalization raise data quality
Better data quality:
- Fewer duplicates caused by format variations
- More precise classification of materials
- Better support for historical and temporal analysis
Maintainability:
- Using TDD ensures the changes are well tested
- The refactoring improves code readability
- The tests document the expected behavior

Additional notes

Compatibility with the existing data format is maintained
The new features are optional and are enabled only if the corresponding columns are present
All transformations are reversible and preserve the original data
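A minimal sketch of how optional, non-destructive normalization could work is shown below. It uses plain dictionaries rather than whatever table library the script relies on. The function `apply_optional_normalizers`, the column names, and the `_normalizado` suffix are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def apply_optional_normalizers(
    rows: List[Row], normalizers: Dict[str, Callable[[str], str]]
) -> List[Row]:
    """Apply each normalizer only to rows that actually contain its column."""
    result = []
    for row in rows:
        new_row = dict(row)  # copy, so the original values stay untouched
        for column, normalize in normalizers.items():
            if column in row:
                # Hypothetical naming: write the result to a sibling column.
                new_row[f"{column}_normalizado"] = normalize(row[column])
        result.append(new_row)
    return result


# Usage with hypothetical column names and toy normalizers.
rows = [{"titulo": "Ficciones", "dewey": "863.4"}]
normalizers = {"dewey": lambda value: value[:3], "periodo": str.upper}
cleaned = apply_optional_normalizers(rows, normalizers)
```

Writing the normalized value to a separate column is one way to keep the original data available; the script itself may achieve this differently.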

This commit includes new test cases for the function `_normalizar_numero_clasificacion_dewey` in the `test_limpiar_tablas.py` file. The new tests cover a range of inputs, including standard, edge, and complex cases to ensure the function can accurately normalize Dewey numbers in various formats.

The method `_normalizar_numero_clasificacion_dewey` was updated to correctly handle raw Dewey numbers that contain a semicolon. It now splits the input string at the semicolon and uses the second part for normalization, if a semicolon is present. Additionally, the argument name was changed from `dewey_number` to `raw_dewey_number` to reflect the fact that the input may not be a normalized Dewey number.
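The behavior described in this commit message could look roughly like the sketch below. It is not the body of `_normalizar_numero_clasificacion_dewey`: the function name `normalize_dewey` is hypothetical, and zero-padding short integer parts (so that "5.1" becomes "005") is an assumption about how leading zeros are treated.

```python
import re
from typing import Optional


def normalize_dewey(raw_dewey_number: str) -> Optional[str]:
    """Hypothetical sketch: return the three-digit Dewey class, or None."""
    text = raw_dewey_number.strip()
    if ";" in text:
        # Per the commit message, the part after the semicolon is used.
        text = text.split(";", 1)[1]
    match = re.search(r"\d+", text)  # first run of digits, skipping prefixes
    if match is None:
        return None
    # Assumption: keep at most three digits and pad short values with zeros.
    return match.group(0)[:3].zfill(3)
```

Taking the first run of digits skips letter prefixes and stops at a period or slash, which covers the input formats listed in the PR description.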

…tion This commit includes two main improvements to the `limpiar_tablas` script related to period normalization:

1. For periods that include multiple years (e.g., "1800-1900"), the script now correctly identifies the most recent year and converts it to a century in Roman numerals.
2. When periods include multiple centuries in Roman numerals (e.g., "XVIII-XXI"), the script now correctly identifies the most recent century. This is achieved by introducing a new helper function `valor_siglo_romano` that converts a Roman numeral century to its numeric value, which is then used to find the max value in the list of centuries.

This enhancement should significantly improve the accuracy of period normalization in the data cleaning process.
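As an illustration of the second improvement, here is one way a helper like `valor_siglo_romano` might be written. Only the helper's name comes from the commit message. Its body, the wrapper `siglo_mas_reciente`, and the choice to recognize only tokens made of I, V, and X are assumptions for this sketch.

```python
import re
from typing import Optional

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def valor_siglo_romano(siglo: str) -> int:
    """Convert a Roman numeral such as 'XVIII' to its integer value."""
    total = 0
    previous = 0
    # Read right to left: a symbol smaller than the one after it is subtracted.
    for symbol in reversed(siglo.upper()):
        value = ROMAN_VALUES[symbol]
        if value < previous:
            total -= value
        else:
            total += value
            previous = value
    return total


def siglo_mas_reciente(raw_period: str) -> Optional[str]:
    """Hypothetical wrapper: return the latest Roman-numeral century found."""
    # Assumption: centuries only use I, V and X (up to XXXIX).
    centuries = re.findall(r"\b[IVX]+\b", raw_period.upper())
    if not centuries:
        return None
    return max(centuries, key=valor_siglo_romano)
```

Using the numeric value as the `max` key avoids comparing Roman numerals as strings, where "XVIII" would sort after "XXI" alphabetically.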

complexluise added 15 commits October 26, 2024 18:38

🧪✨ feat: add test cases for normalizing period in BibliotecaDataProce…

9f29b52

…ssor

📝 feat: add period normalization to cleaning script

398647b

✅✏️ Fix regular expression for century patterns in 'limpiar_tablas.py'

d642d98

🔧 feat: add Dewey number normalization method in limpiar_tablas.py

95ab833

🎨 black formatting

5355ed8

🎨 black formatting

86d4002

🔧✅ test: Refactor tests for better readability and organization

61f6ac9

🐛 ♻️ Fixed issue where empty strings in ciudades_normalizadas were …

173bee6

…not handled correctly, and replaced variable name `anos_encontrados` with `años_encontrados` for better readability.

📝 feat: add dewey and period normalization functions

e846e2d

This commit adds Dewey number and period normalization functions to the data cleaning script. It also checks whether these columns are present in the data and applies the corresponding normalization function when they are.

🎨✨ feat: Add functionality to convert year ranges to Roman numerals i…

b864a1f

…n clean tables script

🐛 🧪 Update test cases for year and century ranges in 'test_limpiar_ta…

8823abd

…blas.py'

🎨 black formatting

55d54bf

complexluise added the enhancement New feature or request label Oct 28, 2024

complexluise self-assigned this Oct 28, 2024

complexluise linked an issue Oct 28, 2024 that may be closed by this pull request

Normalizar columnas nuevas #2

Closed

complexluise added 3 commits October 28, 2024 16:45

🐛📝 : "fix: corrected import statement in __init__.py"

b992a39

📦➕ Added new dependency 'click' to requirements.txt

6b68b29

✅✨ test(limpiar_tablas): Add additional test cases for handling leadi…

f18be46

…ng zeros in Dewey numbers

complexluise merged commit a9071e3 into main Oct 28, 2024
1 check passed

complexluise deleted the 2-normalizar-columnas-nuevas branch October 28, 2024 22:12
