Adding in non-Archiver scraping web links #13
Comments
Hm, so I'm not clear on what the distinction between this and web_scraping would be? I'm also not entirely clear on what non-archiving scraping means in this context. Could you clarify those points? I think these resources should definitely be included, but perhaps just under the web_scraping section. Perhaps for each category there should be an issue for suggesting links to include, and those that we think are good resources should be added via PR?
I don't think there should be a distinction between this and web_harvesting. I was thinking this would be a README or a Google Sheet link inside the web_harvesting folder as the location to save this, if that makes sense. The only scraping distinction I was thinking of is between links such as these and the scraping we do with datatogether archiving, which uses archivertools and morph.io. I think those examples should be kept out of this research repo.
You're right, this needs clarification. I hope to get back to this this week.
@ebenp @jeffreyliu Finally looking at this, I now remember the original idea behind the two directories. One is for cataloging software systems that do web archiving/scraping/etc., and the other is meant to be research on approaches to doing that (i.e., overall approach, algorithms, examples of software that does it, etc.). I struggled with how to name the directories, and clearly failed badly. What if web_harvesting were renamed to

Regarding the distinction between scraping and archiving, I might be wrong, but I think there is a difference, because a system that scrapes web pages does not necessarily have to archive or store the results. For example, I've written a system that scrapes pages to get info and stores specific bits of it in a custom database, but it doesn't archive the whole page or harvest the page/site in the way that we talk about those things in Archivers & Data Together. IMHO, the term "harvesting" could mean either scraping or archiving, although looking around, I now see that Wikipedia basically makes "web scraping" synonymous with "web harvesting" and "web data extraction", so I guess it's closer to the meaning of scraping.
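To make the scraping-vs-archiving distinction concrete, here's a minimal sketch of the "specific bits" style of scraping described above: extract one field from a page and discard the rest, rather than storing the page verbatim. This is an illustrative example only; the page content and field are made up, and the stdlib `html.parser` is used so it runs without extra dependencies.

```python
from html.parser import HTMLParser

class TitleScraper(HTMLParser):
    """Collects just the <title> text -- a 'specific bit' -- not the whole page."""
    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title = data.strip()

# Hypothetical page content; a real scraper would fetch this over HTTP.
page = "<html><head><title>EPA Data Catalog</title></head><body>...</body></html>"

scraper = TitleScraper()
scraper.feed(page)
print(scraper.title)  # -> EPA Data Catalog
```

An archiver, by contrast, would store `page` (and its assets) verbatim; the scraper above keeps only the one extracted field.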
Web harvesting makes sense to me, and I also really like the detail given above about what harvesting and archiving are in terms of Data Together. Maybe those definitions could end up in the directory README.
While looking up code syntax, I found the following blog post and referenced GitHub repo. I wondered if links such as the examples below should be tracked as non-Archiver scraping web links under research/web_scraping.
I'm not quite sure what the best format is for folks to add links, comment, and edit, and I don't have any sense of how frequently such a resource would be updated.
I'm interested in people's thoughts on 1) whether this belongs in research/web_scraping or somewhere else, and 2) how to go about a useful PR on the topic, including preferred tracking format and any document organization.
cc @b5 @jeffreyliu @weatherpattern @mhucka
Example links:
http://blog.danwin.com/examples-of-web-scraping-in-python-3-x-for-data-journalists/
https://github.com/stanfordjournalism/search-script-scrape
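One lightweight option for the tracking format question above would be a CSV checked into research/web_scraping/, which stays diffable in PRs and importable into a Google Sheet. This is a sketch, not a settled convention; the column names and descriptions are assumptions, seeded with the two example links.

```python
import csv
import io

# Hypothetical columns for a suggested-links tracker; descriptions are
# paraphrased from the link titles, not authoritative.
rows = [
    {"url": "http://blog.danwin.com/examples-of-web-scraping-in-python-3-x-for-data-journalists/",
     "description": "Blog post: web scraping examples in Python 3 for data journalists",
     "status": "suggested"},
    {"url": "https://github.com/stanfordjournalism/search-script-scrape",
     "description": "Repo of web scraping exercises in Python 3",
     "status": "suggested"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["url", "description", "status"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

In practice the file would live on disk (e.g. a links.csv) and contributors would append a row per PR; a `status` column could track whether a link has been reviewed or merged into the README.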