Add new POST feeds/sitemap mode #132
Comments
THIS IS ALSO A H4F TICKET
As @fqrious pointed out, sitemap is a problematic way to do this. The only real advantage is that it's free.
It seems Bing is the better approach. New pages are indexed almost instantaneously, and the results are often more accurate than a sitemap. https://chatgpt.com/share/67a1eadc-81c8-8004-beaf-c1216fb1fcab
Essentially we can get all the URLs using a targeted search using the blog path and
The only issue with this approach is that dates of results do not exist. As such, on each update we will need to get all the results again and compare them to the existing list in the DB to see if any new entries are present. This is expensive, but within the bounds of cost.
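The "fetch everything again and compare to the DB" step could look roughly like the sketch below. The function names and the normalization rules (lower-casing the host, ignoring trailing slashes and fragments) are assumptions for illustration, not what Obstracts does today.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Normalize a URL so trivial variants compare equal (assumed rules)."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    # Drop the fragment, lower-case scheme and host, keep the query string.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def find_new_urls(search_results, known_urls):
    """Return result URLs that are not in the DB, in order, without duplicates."""
    known = {normalize(u) for u in known_urls}
    seen = set()
    new = []
    for url in search_results:
        key = normalize(url)
        if key in known or key in seen:
            continue
        seen.add(key)
        new.append(url)
    return new
```

Only the URLs returned by `find_new_urls` would need to be fetched and indexed, which keeps the cost of each update tied to the search call rather than to re-processing posts.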
Even though dates of results do not exist, we can still use
@fqrious got the following from Microsoft
They won't budge. Thus, let's use Google instead. Use var
Regarding https://chatgpt.com/share/67a5aae6-90e8-8004-913c-152dc89c09d2
Can add Bing support later on. Also https://serpapi.com/
As @fqrious found out, the Google Search API limits results to 100 each time. This is too low for many blogs.
SERPAPI is very expensive (but probably the best): https://serpapi.com/blog/compare-serpapi-with-the-alternatives-serper-and-searchapi/
However, given our use case is fairly simple (searching for all indexed URLs matching an input URL), an advanced search API with images, etc. is not needed. https://serper.dev/ looks like a good alternative, so let's give this a shot instead.
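As a rough sketch of the "all indexed URLs matching an input URL" use case: build a `site:` query from the blog URL and page through results until nothing new comes back. `fetch_page` is a hypothetical callable that wraps whichever search provider is chosen. No real provider API is assumed here, and the `max_pages` cap is an arbitrary safety limit.

```python
from urllib.parse import urlsplit


def build_site_query(blog_url):
    """Turn https://example.com/blog/ into the query 'site:example.com/blog'."""
    parts = urlsplit(blog_url)
    target = (parts.netloc + parts.path).rstrip("/")
    return f"site:{target}"


def collect_indexed_urls(blog_url, fetch_page, max_pages=50):
    """Page through search results and return unique URLs in first-seen order.

    fetch_page(query, page_number) is a hypothetical provider wrapper that
    returns a list of result URLs for that page (empty when exhausted).
    """
    query = build_site_query(blog_url)
    seen = set()
    ordered = []
    for page in range(1, max_pages + 1):
        added = 0
        for link in fetch_page(query, page):
            if link not in seen:
                seen.add(link)
                ordered.append(link)
                added += 1
        if added == 0:
            # Empty page, or a page of repeats: assume the results are exhausted.
            break
    return ordered
```

Keeping the provider behind `fetch_page` means Bing or another API could be swapped in later without touching the paging logic.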
More and more sites don't have RSS or ATOM feeds.
The sitemap can be used to solve this problem.
I wrote a script (very basic at this point, with much room for improvement) to grab the sitemap
https://github.com/muchdogesec/sitemap2posts
It would be good to introduce this logic into Obstracts whereby
Like RSS and ATOM feeds, we should periodically check sitemap type feeds for updates (and they should be supported by all endpoints).
To do this, you can use the update times found in the sitemap.
Sitemaps aren't perfect (e.g. update times might be updated for all posts when a single post is changed), so we should also implement a check to see if a URL already exists in the database. If it does, we should skip it (a user can manually run a reindex on it if needed).
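A minimal sketch of reading update times from a sitemap and skipping URLs already in the database is shown below. It assumes a plain `urlset` sitemap in the standard sitemaps.org namespace (no sitemap index files). The function names and the `existing_urls` set standing in for the database lookup are hypothetical.

```python
import xml.etree.ElementTree as ET

# Standard sitemap protocol namespace.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return a list of (loc, lastmod) tuples; lastmod is None when absent."""
    root = ET.fromstring(xml_text)
    entries = []
    for url_el in root.findall("sm:url", NS):
        loc = url_el.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = url_el.findtext("sm:lastmod", default=None, namespaces=NS)
        if loc:
            entries.append((loc, lastmod.strip() if lastmod else None))
    return entries


def urls_to_index(entries, existing_urls):
    """Drop entries whose URL is already stored (existing_urls stands in for the DB)."""
    return [(loc, lastmod) for loc, lastmod in entries if loc not in existing_urls]
```

Because the existence check ignores `lastmod`, a sitemap that bumps every timestamp at once does not trigger a full reindex. A manual reindex remains the way to refresh a known URL.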
When creating a new feed, the user should be able to specify
This will index the entire sitemap on the first run. For update requests, the gap between the last blog post time and now should be considered.
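One way to express the update window is sketched below: keep sitemap entries whose update time falls after the last known post and no later than now. The `(loc, lastmod)` tuple shape and the function names are assumptions. Treating timestamps without a timezone as UTC, and keeping entries with a missing or unparseable time so the database check can decide, are design choices of this sketch rather than anything specified above.

```python
from datetime import datetime, timezone


def parse_lastmod(value):
    """Parse a sitemap lastmod string; return None if missing or unparseable."""
    if not value:
        return None
    try:
        # Map a trailing 'Z' to an explicit offset for older Python versions.
        dt = datetime.fromisoformat(value.strip().replace("Z", "+00:00"))
    except ValueError:
        return None
    if dt.tzinfo is None:
        # Assumption: timestamps without a timezone are treated as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt


def entries_in_update_window(entries, last_post_time, now=None):
    """Return URLs updated after last_post_time and no later than now."""
    now = now or datetime.now(timezone.utc)
    selected = []
    for loc, lastmod in entries:
        dt = parse_lastmod(lastmod)
        # Entries without a usable date are kept so a later DB check can decide.
        if dt is None or last_post_time < dt <= now:
            selected.append(loc)
    return selected
```

On the first run there is no last post time, so the window filter would simply be skipped and the whole sitemap indexed.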