green-db table can not be joined with scraping table based on id #78

BigDatalex · 2022-06-24T14:13:10Z

Currently it is not possible to relate information of the scraping table to its corresponding extracted product information in the green-db table via id. If we want to join the tables we currently have to use timestamp, url and category.

We already use the id to retrieve a specific row in the scraping table, but it is not used any further when writing the extracted product information into the green-db table, see:

green-db/workers/workers/extract.py

Lines 36 to 39 in 90b631b

    
    scraped_page = CONNECTION_FOR_TABLE[table_name].get_scraped_page(id=row_id)
    if product := extract_product(table_name=table_name, scraped_page=scraped_page):
        green_db_connection.write(product)

The green-db table already has an id column, but this is autogenerated, see:

green-db/database/database/tables.py

Line 203 in 90b631b

id = Column(INTEGER, nullable=False, autoincrement=True, primary_key=True)

So, integrating this shouldn't be a lot of work and would help whenever we want to use information from the scraping table together with the green-db table, for example using the HTML together with the extracted product information for some ML.


BigDatalex · 2022-06-26T13:47:56Z

Using timestamp, url and category to join the scraping table with the green-db table does not work for asos, because the url in the scraping table is different from the url in the green-db table.

In the scraping table we store the url of the asos API from which we retrieve the product data:

green-db/scraping/scraping/spiders/_base.py

Line 205 in 90b631b

url=response.url,

and in the green-db table we store the url of the product's website:

green-db/extract/extract/extractors/asos.py

Line 50 in 90b631b

url = _get_url(page_json.get("localisedData", []), "fr-FR")

BigDatalex · 2022-06-27T08:15:19Z

A workaround for asos to join both tables is to extract the product id from the API url (url which is stored in scraping table) and the website url (url that is stored in green-db table) and join based on this product id, timestamp and category.

For example this code does the job:

scraping_asos["product_id"] = scraping_asos["url"].apply(lambda x: x.split("/")[-1].split("?")[0])
greendb_asos["product_id"] = greendb_asos["url"].apply(lambda x: x.split("/")[-1])
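The two lines above assume pandas DataFrames. As a minimal sketch of the same idea in plain Python, the following extracts the product id from both kinds of url and pairs rows on product id, timestamp and category. The helper names, the row layout (dicts with `id`, `timestamp`, `category`, `url`) and the example urls are assumptions for illustration only, not the real asos urls or the project's schema.

```python
from urllib.parse import urlsplit


def product_id_from_url(url: str) -> str:
    """Return the last path segment of a url, ignoring any query string."""
    path = urlsplit(url).path.rstrip("/")
    return path.rsplit("/", 1)[-1]


def join_on_product_id(scraping_rows, greendb_rows):
    """Pair scraping ids with green-db ids via (product id, timestamp, category)."""
    index = {}
    for row in scraping_rows:
        key = (product_id_from_url(row["url"]), row["timestamp"], row["category"])
        index[key] = row["id"]

    pairs = []
    for row in greendb_rows:
        key = (product_id_from_url(row["url"]), row["timestamp"], row["category"])
        if key in index:
            pairs.append((index[key], row["id"]))
    return pairs


# Hypothetical example rows; the urls are made up.
scraping_rows = [
    {"id": 1, "timestamp": "2022-06-20", "category": "SHIRT",
     "url": "https://api.example.com/v1/products/123?store=FR"},
]
greendb_rows = [
    {"id": 10, "timestamp": "2022-06-20", "category": "SHIRT",
     "url": "https://www.example.com/fr/some-shirt/prd/123"},
]

pairs = join_on_product_id(scraping_rows, greendb_rows)
```

This only works as long as the product id stays the last path segment in both urls, which is exactly the kind of fragility an id-based join would avoid.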

se-jaeger · 2022-06-28T13:37:27Z

I'm not sure if this is something we want to implement.

I could think of maintaining a "forward dependency" like a created column in the scraping database that has a foreign key to the row of the extracted product (green-db database). However, what if we manually run the extraction again? Overwrite, update, or extend (add an int to an array) this dependency?

Why not query (SQL) for the rows of interest in the scraping database and, if necessary, extract the necessary information from the HTML? I'm assuming the overhead is not the bottleneck here.

BigDatalex · 2022-06-28T14:26:37Z

Ok, I see... - if we want to keep the option to run another extraction this wouldn't work.

Then the best option might be to create an additional mapping table that maps the id of the scraping table to the id in the green-db (both being foreign keys to their respective table). This would not affect our existing table structure at all, but keep track of the corresponding rows and allow for multiple extraction runs.
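To make the mapping-table idea concrete, here is a minimal sketch using an in-memory SQLite database from the standard library. The table and column names (`scraping`, `green_db`, `scraping_to_green_db`, `html`, `name`) and the helper `record_extraction` are hypothetical; the project's actual tables and database connection classes would differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript(
    """
    -- Simplified stand-ins for the existing tables (assumed columns).
    CREATE TABLE scraping (id INTEGER PRIMARY KEY, html TEXT);
    CREATE TABLE green_db (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT);

    -- Additional mapping table; the existing tables stay untouched.
    CREATE TABLE scraping_to_green_db (
        scraping_id INTEGER NOT NULL REFERENCES scraping(id),
        green_db_id INTEGER NOT NULL REFERENCES green_db(id),
        PRIMARY KEY (scraping_id, green_db_id)
    );
    """
)


def record_extraction(conn, scraping_id, name):
    """Write an extracted product and remember which scraped row it came from."""
    cursor = conn.execute("INSERT INTO green_db (name) VALUES (?)", (name,))
    green_db_id = cursor.lastrowid
    conn.execute(
        "INSERT INTO scraping_to_green_db (scraping_id, green_db_id) VALUES (?, ?)",
        (scraping_id, green_db_id),
    )
    return green_db_id


conn.execute("INSERT INTO scraping (id, html) VALUES (1, '<html>...</html>')")

# Two extraction runs over the same scraped page produce two mapping rows.
record_extraction(conn, 1, "shirt (first run)")
record_extraction(conn, 1, "shirt (re-run)")

rows = conn.execute(
    """
    SELECT s.id, g.id, g.name
    FROM scraping AS s
    JOIN scraping_to_green_db AS m ON m.scraping_id = s.id
    JOIN green_db AS g ON g.id = m.green_db_id
    ORDER BY g.id
    """
).fetchall()
```

Because the mapping is many-to-many, re-running the extraction simply adds another row instead of forcing a decision between overwriting and updating a single foreign key.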

Regarding running the extraction again whenever someone wants to use the HTML: I think this is not very user-friendly, and for older data it is also not easily doable. Our extractor code is not backwards compatible, so we cannot extract the information from old HTML using the current extractor implementation.

I would really appreciate such a feature and probably all others who want to use the HTML in combination with the extracted data at some point too! :)

BigDatalex added the enhancement New feature or request label Jun 24, 2022

BigDatalex added the bug Something isn't working label Jun 26, 2022

se-jaeger removed the bug Something isn't working label Aug 26, 2022
