green-db table can not be joined with scraping table based on id #78
Comments
Using In the green-db/scraping/scraping/spiders/_base.py Line 205 in 90b631b
and in the
|
A workaround for asos to join both tables is to extract the For example this code does the job:
|
I'm not sure if this is something we want to implement.. I could think of maintaining a "forward dependency" like a Why not query (SQL) for the rows of interest in the scraping database and, if necessary, extract the necessary information from the HTML? I'm assuming the overhead is not the bottleneck here. |
Ok, I see: if we want to keep the option to run another extraction, this wouldn't work. Then the best option might be to create an additional mapping table that maps the

And regarding running the extraction again when someone wants to use the HTML: I think this is not very user-friendly, and for the older data it is also not easily doable, because our extractor code is not backwards compatible, so we cannot extract the information from old HTML using the current extractor implementation. I would really appreciate such a feature, and probably so would everyone else who wants to use the HTML in combination with the extracted data at some point! :) |
Currently it is not possible to relate a row of the scraping table to its corresponding extracted product information in the green-db table via id. If we want to join the tables, we currently have to use timestamp, url and category.

We already use the id to retrieve a specific row in the scraping table, but the id is not used any further when writing the extracted product information into the green-db table, see: green-db/workers/workers/extract.py, Lines 36 to 39 in 90b631b

The green-db table already has an id column, but this is autogenerated, see: green-db/database/database/tables.py, Line 203 in 90b631b

So, integrating this shouldn't be a lot of work and would help whenever we want to use information from the scraping table together with the green-db table, for example using the HTML together with the extracted product information for some ML.