BOPZ scrapper v1 #49
Conversation
We could add a file src/etls/bopz/README.md
with some documentation. For example:
- The province the bopz module refers to (Zaragoza in this case).
- A link to the website.
- An example of a PDF that we are scraping.
src/etls/bopz/metadata.py
Outdated
from src.etls.common.metadata import MetadataDocument

class BOPZMetadataReferencia(BaseModel):
Remove this class.
It is not used, right?
src/etls/bopz/scrapper.py
Outdated
return id_links

class BOPZScrapper(BaseScrapper):
    BASE_URL = 'http://bop.dpz.es/BOPZ'
We should use lowercase for class variables.
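A minimal sketch of the suggested rename. Note that `BaseScrapper` below is a stand-in stub, not the project's real base class, and the attribute name `base_url` is simply the review suggestion applied:

```python
class BaseScrapper:  # stand-in stub for the project's actual base class
    pass

class BOPZScrapper(BaseScrapper):
    # lowercase class attribute, per the review comment above;
    # the value is shared by all instances of the scraper
    base_url = "http://bop.dpz.es/BOPZ"
```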
src/etls/bopz/scrapper.py
Outdated
initialize_logging()

# POST data to filter retrieved BOPZ documents
data_post = {
Use uppercase for global variables. Also, you could move this dict to a utils.py file.
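One way to apply both suggestions, sketched under the assumption that the dict holds the BOPZ date-filter fields that appear later in this diff (`fechaPubInf` / `fechaPubSup`); `build_day_filter` is a hypothetical helper, not code from the PR:

```python
# utils.py (the hypothetical location suggested in the review)

# Uppercase module-level constant: POST data to filter retrieved BOPZ documents
DATA_POST = {
    "fechaPubInf": "",  # lower publication-date bound (field name taken from the diff)
    "fechaPubSup": "",  # upper publication-date bound
}

def build_day_filter(day_str: str) -> dict:
    # Return a fresh copy so callers never mutate the shared constant in place
    data = dict(DATA_POST)
    data["fechaPubInf"] = day_str
    data["fechaPubSup"] = day_str
    return data
```

Returning a copy also avoids the subtle bug of two scraping runs seeing each other's dates in a shared mutable global.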
src/etls/bopz/scrapper.py
Outdated
return metadata_dict

def _list_links_day(url: str, day: date) -> tp.List[str]:
Right?
Suggested change:
- def _list_links_day(url: str, day: date) -> tp.List[str]:
+ def _list_links_day(url: str, day: date) -> tp.List[BeautifulSoup]:
Modify the requirements.txt file with everything you need.
Co-authored-by: Darío López Padial <[email protected]>
Removed OCR
Included url_html in the metadata
Applied suggested changes
logger.info("Scrapping day: %s", day_str)
DATA_POST['fechaPubInf'] = day_str
DATA_POST['fechaPubSup'] = day_str
response = requests.post(url, data=DATA_POST)
Check warning (Code scanning / Bandit): Call to requests without timeout
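The Bandit warning applies to both the `requests.post` call above and the `requests.get` call below: without a `timeout`, a stalled server can hang the scraper indefinitely. A minimal sketch of the fix; the 10-second value and the wrapper names are assumptions for illustration, not code from the PR:

```python
import requests

# An explicit timeout (in seconds) makes requests raise
# requests.exceptions.Timeout instead of blocking forever.
REQUEST_TIMEOUT = 10  # assumed value; tune to the site's responsiveness

def post_with_timeout(url: str, data: dict) -> requests.Response:
    # Same call as in the scraper, with the timeout Bandit asks for
    return requests.post(url, data=data, timeout=REQUEST_TIMEOUT)

def get_with_timeout(url: str) -> requests.Response:
    return requests.get(url, timeout=REQUEST_TIMEOUT)
```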
""" | ||
logger = lg.getLogger(self.download_document.__name__) | ||
logger.info("Scrapping document: %s", url) | ||
response = requests.get(url) |
Check warning (Code scanning / Bandit): Requests call without timeout
LGTM
BOPZ scraper v1:
Changes specific to this scraper
Tests
Requirements
from langchain_community.document_loaders import UnstructuredPDFLoader
For this it is necessary: