Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

green-db table can not be joined with scraping table based on id #78

Open
BigDatalex opened this issue Jun 24, 2022 · 4 comments
Open
Labels
enhancement New feature or request

Comments

@BigDatalex
Copy link
Contributor

Currently it is not possible to relate information of the scraping table to its corresponding extracted product information in the green-db table via id. If we want to join the tables we currently have to use timestamp, url and category.

We already use the id, to retrieve a specific row in the scraping table, but the id is not used any further when writing the extracted product information into the green-db, see:

scraped_page = CONNECTION_FOR_TABLE[table_name].get_scraped_page(id=row_id)
if product := extract_product(table_name=table_name, scraped_page=scraped_page):
green_db_connection.write(product)

The green-db table already has an id column, but this is autogenerated, see:

id = Column(INTEGER, nullable=False, autoincrement=True, primary_key=True)

So, integrating this shouIdn't be a lot of work and would help whenever we want to use information from scraping table together with green-db table. For example using the HTML together with the extracted product information for some ML.

@BigDatalex BigDatalex added the enhancement New feature or request label Jun 24, 2022
@BigDatalex
Copy link
Contributor Author

Using timestamp, url and category to join the scraping table with green-db table does not work for asos, because the url in scraping table is different from the url in green-db table.

In the scraping table we store the url of the asos API from which we retrieve the product data:

and in the green-db table we store the url of the products website:

url = _get_url(page_json.get("localisedData", []), "fr-FR")

@BigDatalex BigDatalex added the bug Something isn't working label Jun 26, 2022
@BigDatalex
Copy link
Contributor Author

A workaround for asos to join both tables is to extract the product id from the API url (url which is stored in scraping table) and the website url (url that is stored in green-db table) and join based on this product id, timestamp and category.

For example this code does the job:

scraping_asos["product_id"] = scraping_asos["url"].apply(lambda x: x.split("/")[-1].split("?")[0])
greendb_asos["product_id"] = greendb_asos["url"].apply(lambda x: x.split("/")[-1])

@se-jaeger
Copy link
Member

I'm not sure if this is something we want to implement..

I could think of maintaining a "forward dependency" like a created column in the scraping database that has a foreign key to the row of the extracted product (green-db database). However, what if we manually run the extraction again? Overwrite, update, or extend (add an int to an array) this dependency?

Why not query (SQL) for the rows of interest in the scraping database and, if necessary, extract the necessary information from the HTML? I'm assuming the overhead is not the bottleneck here.

@BigDatalex
Copy link
Contributor Author

Ok, I see... - if we want to keep the option to run another extraction this wouldn't work.

Then the best option might be to create an additional mapping table that maps the id of the scraping table to the id in the green-db (both being foreign keys to their respective table). This would not affect our existing table structure at all, but keep track of the corresponding rows and allow for multiple extraction runs.

And regarding running the extraction again when someone wants to use the HTML - I think this is not very user-friendly and for the older data also not easily doable, because our extractor code is not backwards compatible, so we can not extract the information from old HTML's using the current extractor implementation.

I would really appreciate such a feature and probably all others who want to use the HTML in combination with the extracted data at some point too! :)

@se-jaeger se-jaeger removed the bug Something isn't working label Aug 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants