
BOPZ scrapper v1 #49

Merged
merged 6 commits into from
Feb 4, 2024
Conversation

@llop00 (Collaborator) commented Jan 27, 2024

BOPZ scraper v1:

Changes specific to this scraper

  • Created a new Metadata class based on the original one; the BOPZ has no source-specific fields, since it is less enriched than the BOE, so the metadata inserted are a subset of those.
  • Screenshot of the metadata loaded into Qdrant attached.

Tests

  • Tested the daily and batch modules from 2019 onwards
  • fecha_publicación and fecha_disposición have the same value, since it is not possible to tell them apart.

Requirements

  • The requirements file needs to be updated to include the Unstructured library, used via langchain_community for the scraping
    from langchain_community.document_loaders import UnstructuredPDFLoader

This requires:

  • Installing tesseract: "sudo apt-get install tesseract-ocr" on Linux-based systems
  • Updating langchain to "langchain==0.1.4". This raises a series of deprecation warnings about several imports ahead of version 0.2.0.
  • Installing "unstructured[pdf]==0.12.2"
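
Taken together, the pinned dependencies listed above amount to a requirements fragment roughly like this (versions copied verbatim from the list; the tesseract system package is installed separately via apt, not via pip):

```
langchain==0.1.4
unstructured[pdf]==0.12.2
```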

@bukosabino (Owner) left a comment

We could add a file src/etls/bopz/README.md with some documentation. For example:

  • The province the bopz module refers to (Zaragoza in this case).
  • A link to the website
  • An example of one of the PDFs we are scraping

from src.etls.common.metadata import MetadataDocument


class BOPZMetadataReferencia(BaseModel):
bukosabino (Owner):

Remove this class.
It is not used, right?

src/etls/bopz/scrapper.py (outdated; resolved)
return id_links

class BOPZScrapper(BaseScrapper):
BASE_URL = 'http://bop.dpz.es/BOPZ'
bukosabino (Owner):

We should use lowercase for class variables.

initialize_logging()

# POST data to filter retrieved BOPZ documents
data_post = {
bukosabino (Owner):

Use uppercase for global variables. Also, you could move this dict to a utils.py file.
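
A minimal sketch of what that utils.py might hold (the two date-bound keys below come from the POST filter shown later in this review; any other form fields the BOPZ search endpoint needs would be added the same way):

```python
# src/etls/bopz/utils.py -- sketch only, not the PR's actual module.
# DATA_POST carries the form fields sent with each POST to the BOPZ
# search endpoint; the date bounds are filled in per scraped day.
DATA_POST = {
    "fechaPubInf": "",  # publication date, lower bound
    "fechaPubSup": "",  # publication date, upper bound
}
```

The scraper module can then import it with `from src.etls.bopz.utils import DATA_POST` and set both bounds to the same day before each request.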


return metadata_dict

def _list_links_day(url: str, day: date) -> tp.List[str]:
bukosabino (Owner):

Right?

Suggested change
def _list_links_day(url: str, day: date) -> tp.List[str]:
def _list_links_day(url: str, day: date) -> tp.List[BeautifulSoup]:

@bukosabino (Owner):

Modify the requirements.txt file with everything you need.

llop00 and others added 2 commits January 29, 2024 16:40
@llop00 (Collaborator, Author) commented Jan 30, 2024

Removed OCR

  • I found a way to scrape the text without needing OCR.

Included url_html in the metadata (screenshot attached).

Suggested changes

  • Added utils.py with the DATA_POST needed to filter the documents of interest
  • Created README.md with links to the BOPZ and example documents
  • There is no longer any need to update requirements, since the OCR library is not used
  • Applied the rest of the suggested changes

logger.info("Scrapping day: %s", day_str)
DATA_POST['fechaPubInf'] = day_str
DATA_POST['fechaPubSup'] = day_str
response = requests.post(url, data=DATA_POST)

Code scanning (Bandit) warning: call to requests without timeout
"""
logger = lg.getLogger(self.download_document.__name__)
logger.info("Scrapping document: %s", url)
response = requests.get(url)

Code scanning (Bandit) warning: requests call without timeout
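
One way to resolve both Bandit findings is to define a single default timeout and reuse it at every call site (a sketch only; DEFAULT_TIMEOUT is a name and value I am assuming, not something from the PR):

```python
from functools import partial

import requests

# Bandit warns because a requests call without a timeout can block
# forever on a stalled connection. Baking the timeout into partials
# fixes every call site with minimal changes.
DEFAULT_TIMEOUT = 10  # seconds; assumed value, tune for the BOPZ server

post_with_timeout = partial(requests.post, timeout=DEFAULT_TIMEOUT)
get_with_timeout = partial(requests.get, timeout=DEFAULT_TIMEOUT)

# In the scraper, the flagged lines would become, e.g.:
#   response = post_with_timeout(url, data=DATA_POST)
#   response = get_with_timeout(url)
```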
@bukosabino bukosabino self-requested a review February 4, 2024 11:13
@bukosabino (Owner) left a comment:
LGTM

@llop00 llop00 merged commit e80e35b into bukosabino:main Feb 4, 2024
2 checks passed
@llop00 llop00 deleted the develop branch February 17, 2024 18:38