Web Scraping in Python: From HTML Soup to Tidy Data

A workshop created by Sami Friedrich for the BioData Club Workshop Series.

Overview

The internet is overflowing with data ripe for harvesting. The challenge is that not all of that data is neatly formatted or easily accessible. Enter the web scraping multitool! With web scraping, the contents of virtually any webpage can be transformed into analysis-ready data. During this workshop, you’ll learn how to use Python to:

  1. Scavenge the contents of an HTML webpage
  2. Extract only the data you want
  3. Format the data into a table

Libraries used:

  • requests
  • BeautifulSoup4
  • pandas
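
The three steps in the overview map roughly onto these three libraries. Here is a minimal sketch, assuming a made-up HTML snippet: the `SAMPLE_HTML` string, its `beer` class, and the `name`/`abv` fields are illustrative and are not taken from the workshop notebook or the page it scrapes.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the workshop page is structured differently.
SAMPLE_HTML = """
<ul>
  <li class="beer"><span class="name">Pale Ale</span><span class="abv">5.2%</span></li>
  <li class="beer"><span class="name">Stout</span><span class="abv">6.8%</span></li>
</ul>
"""


def fetch_html(url):
    """Step 1: download a page's HTML (defined but not called, so the sketch runs offline)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def extract_rows(html):
    """Step 2: keep only the elements of interest."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("li", class_="beer"):
        rows.append({
            "name": item.find("span", class_="name").get_text(strip=True),
            "abv": item.find("span", class_="abv").get_text(strip=True),
        })
    return rows


def to_table(rows):
    """Step 3: format the extracted records as a table."""
    return pd.DataFrame(rows)


table = to_table(extract_rows(SAMPLE_HTML))
print(table)
```

In practice, the string passed to `extract_rows` would come from `fetch_html` or from a saved .html file rather than from an inline sample.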

Developer tools used:

Prerequisites

  1. Some basic Python knowledge: looping through list elements, passing arguments to functions, and writing basic functions.
  2. No prior experience with HTML is necessary. However, it will be helpful to have a surface-level understanding of HTML elements, namely their open/close tag structure and how they nest within each other.
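
To make the open/close tag structure concrete, the sketch below prints an indented outline of a small HTML fragment using only the Python standard library. The `NESTED_HTML` string and the `NestingPrinter` class are hypothetical names introduced for this illustration; the workshop itself uses BeautifulSoup4 rather than `html.parser` directly.

```python
from html.parser import HTMLParser

# Illustrative fragment: a <b> nested in a <p> nested in a <div>.
NESTED_HTML = "<div><p>Hello <b>world</b></p></div>"


class NestingPrinter(HTMLParser):
    """Records each opening and closing tag, indented by nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # An opening tag starts a new element one level deeper.
        self.lines.append("  " * self.depth + f"<{tag}>")
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag ends the current element and returns to its parent.
        self.depth -= 1
        self.lines.append("  " * self.depth + f"</{tag}>")


printer = NestingPrinter()
printer.feed(NESTED_HTML)
printer.close()
print("\n".join(printer.lines))
```

Each level of indentation in the printed outline corresponds to one element sitting inside another.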

Files

  • webscraping_workshop.ipynb is the Jupyter Notebook (without solutions) for the workshop. Follow the badge at the top to open it in Google Colab, or download it and run it locally (just make sure you've already installed the libraries listed above).
  • solutions_to_webscraping_workshop.ipynb contains solutions to the Jupyter Notebook exercises.
  • taphunter_belmont_station.html is the downloaded .html file for the webpage this workshop is designed to scrape. If you're running things locally, be sure to place this file in the same folder as webscraping_workshop.ipynb.
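
For a local run, a quick check along these lines may help confirm the setup before opening the notebook. It is only a sketch: the helper names `missing_modules` and `html_file_present` are hypothetical, and it assumes the script is run from the folder containing the notebook.

```python
from importlib.util import find_spec
from pathlib import Path

# Import names for the libraries listed above (BeautifulSoup4 is imported as "bs4").
REQUIRED_MODULES = ["requests", "bs4", "pandas"]
HTML_FILE = "taphunter_belmont_station.html"


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


def html_file_present(folder="."):
    """Check whether the downloaded workshop page sits in the given folder."""
    return (Path(folder) / HTML_FILE).is_file()


print("Missing libraries:", missing_modules(REQUIRED_MODULES) or "none")
print("HTML file found:", html_file_present())
```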

Other materials

The Google Slides presentation accompanying this workshop can be found here.

Author

Sami Friedrich, PhD candidate at Oregon Health and Science University. Please feel free to reach out with questions or comments!

License

This project is licensed under the MIT License (see LICENSE).