Add new POST feeds/sitemap mode #132

Open
himynamesdave opened this issue Feb 3, 2025 · 4 comments
Comments

@himynamesdave
Member

himynamesdave commented Feb 3, 2025

More and more sites don't have RSS or ATOM feeds.

When a site has no feed, its sitemap can be used to discover posts instead.

I wrote a script to grab the sitemap (very basic at this point, with much room for improvement):

https://github.com/muchdogesec/sitemap2posts

It would be good to introduce this logic into Obstracts whereby

  • we crawl sitemap to get URLs (using proxy)
  • we convert posts to text
  • we run checks on whether they should be indexed (e.g. Add AI content check #131)
  • run extraction
  • check periodically

Like RSS and Atom feeds, sitemap-type feeds should be checked periodically for updates (and should be supported by all endpoints).

To do this, you can use the update times found in the sitemap.

Sitemaps aren't perfect (e.g. the update times of all posts might change when a single post is edited), so we should also check whether a URL already exists in the database. If it does, we should skip it (a user can manually run a reindex on it if needed).
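
As a rough illustration of the update check described above, here is a minimal sketch using only the Python standard library. The function names, the `known_urls` set (standing in for the database lookup) and the `last_checked` cutoff are all hypothetical, not existing Obstracts code. It treats `<lastmod>` as a hint only and always skips URLs that are already stored.

```python
from datetime import datetime, timezone
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_lastmod(value):
    """Parse a <lastmod> value; return None if it is missing or unparseable."""
    if not value:
        return None
    try:
        parsed = datetime.fromisoformat(value.strip().replace("Z", "+00:00"))
    except ValueError:
        return None
    if parsed.tzinfo is None:
        # Assumption: date-only or naive values are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def select_new_urls(sitemap_xml, known_urls, last_checked=None):
    """Return sitemap URLs that are not stored yet and are not older than the cutoff."""
    root = ET.fromstring(sitemap_xml)
    selected = []
    for entry in root.findall("sm:url", SITEMAP_NS):
        loc = entry.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        if not loc or loc in known_urls:
            continue  # already indexed: skip, a manual reindex is still possible
        lastmod = parse_lastmod(entry.findtext("sm:lastmod", namespaces=SITEMAP_NS))
        if last_checked and lastmod and lastmod <= last_checked:
            continue  # sitemap says it has not changed since the last run
        selected.append(loc)  # entries without a usable lastmod are kept
    return selected
```

On the first run `last_checked` would be left as `None`, so every URL that is not yet stored is returned.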

When creating a new feed, the user should be able to specify:

{
  "profile_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "title": "string",
  "description": "string",
  "url": "string",
  "pretty_url": "string",
  "include_remote_blogs": false,
  "ignore_domain_paths": ["list of urls"]
}

This will index the entire sitemap on the first run. For update requests, the gap between the last blog post time and now should be considered.
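
The two filter fields in the request body could be applied roughly as follows. This is only a sketch, and the semantics are my assumption since they are not spelled out here: `ignore_domain_paths` is read as a list of URL prefixes to exclude, and `include_remote_blogs: false` is read as "drop URLs hosted on a different domain than the feed". `keep_url` is a hypothetical helper name.

```python
from urllib.parse import urlparse


def keep_url(url, feed_url, include_remote_blogs=False, ignore_domain_paths=()):
    """Decide whether a discovered URL should be indexed for this feed."""
    # Assumed meaning: any URL starting with an ignored prefix is dropped.
    if any(url.startswith(prefix) for prefix in ignore_domain_paths):
        return False
    # Assumed meaning: "remote" is any host other than the feed's own host.
    if not include_remote_blogs:
        return urlparse(url).netloc.lower() == urlparse(feed_url).netloc.lower()
    return True
```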

@himynamesdave himynamesdave added the enhancement New feature or request label Feb 3, 2025
@github-project-automation github-project-automation bot moved this to Todo in Roadmap Feb 3, 2025
@himynamesdave
Member Author

THIS IS ALSO A H4F TICKET

As @fqrious pointed out, a sitemap is a problematic way to do this. The only real advantage is that it's free.

It seems Bing is the better approach. New pages are indexed almost instantaneously, and the results are often more accurate than a sitemap.

https://chatgpt.com/share/67a1eadc-81c8-8004-beaf-c1216fb1fcab

Essentially, we can get all the URLs using a targeted search that combines the blog path and a site filter (as shown in the ChatGPT thread).

The only issue with this approach is that the results carry no dates. As such, on each update we will need to fetch all the results again and compare them to the existing list in the DB to see if any new entries are present. This is expensive, but within the bounds of cost.
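
The comparison step could look something like this sketch. The helper names are hypothetical, and the light normalisation (lowercased scheme and host, trailing slash and fragment removed) is my own assumption so that trivially different spellings of the same URL are not treated as new posts.

```python
from urllib.parse import urlsplit, urlunsplit


def normalise_url(url):
    """Reduce a URL to a comparison key (assumed rules, adjust as needed)."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def find_new_urls(search_results, stored_urls):
    """Return search results whose normalised form is not already stored."""
    seen = {normalise_url(u) for u in stored_urls}
    new = []
    for url in search_results:
        key = normalise_url(url)
        if key not in seen:
            seen.add(key)  # also de-duplicates within the result list
            new.append(url)
    return new
```

The stored side only needs to be loaded once per update, so the extra cost is mostly the repeated search requests rather than the comparison itself.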

@fqrious
Contributor

fqrious commented Feb 5, 2025

Even though the results carry no dates, we can still use freshness to filter on subsequent runs.

@himynamesdave
Member Author

himynamesdave commented Feb 6, 2025

@fqrious got the following from Microsoft

Thanks for reaching out. We are evaluating our sign-up process for Bing APIs. New customers who do not have a Bing resource on their subscriptions, will be unable to add a Bing resource. New customers can learn more about the latest Azure offering here: https://learn.microsoft.com/en-us/azure/ai-services/agents/how-to/tools/bing-grounding?tabs=python&pivots=code-example%22.

They won't budge.

Thus, let's use Google instead.

Use the variable GOOGLE_SEARCH_API_KEY (to keep it separate from the GOOGLE_API_KEY used for AI).

Regarding freshness -- nice find. Google Search has something similar we can employ

https://chatgpt.com/share/67a5aae6-90e8-8004-913c-152dc89c09d2
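
For reference, a sketch of how the request parameters might be assembled, assuming we go through the Custom Search JSON API, whose `dateRestrict` parameter (for example `d7` for the past seven days) plays the role of Bing's freshness filter. `GOOGLE_SEARCH_ENGINE_ID` is a hypothetical variable name for the `cx` search engine ID, and `build_search_params` is a hypothetical helper. No request is sent here.

```python
import os
from urllib.parse import urlsplit


def build_search_params(blog_url, days=None, start=1, env=os.environ):
    """Build query parameters for one page of a site-restricted search."""
    parts = urlsplit(blog_url)
    target = (parts.netloc + parts.path).rstrip("/")  # e.g. example.com/blog
    params = {
        "key": env["GOOGLE_SEARCH_API_KEY"],
        "cx": env["GOOGLE_SEARCH_ENGINE_ID"],  # hypothetical variable name
        "q": f"site:{target}",
        "num": 10,  # the API returns at most 10 results per page
        "start": start,
    }
    if days is not None:
        # Only return pages from the past `days` days on update runs.
        params["dateRestrict"] = f"d{days}"
    return params
```

The first run would omit `days` to get everything, and update runs would pass the number of days since the last check.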

We can add Bing support later on. Also see https://serpapi.com/

@himynamesdave himynamesdave moved this from Todo to Blocked in Roadmap Feb 10, 2025
@himynamesdave
Member Author

As @fqrious found out, the Google Search API limits results to 100 per query. This is too low for many blogs.
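
To make the ceiling concrete, here is a small sketch of the paging arithmetic, assuming pages of 10 results addressed by a 1-based `start` offset. `plan_pages` is a hypothetical helper, not part of any API.

```python
def plan_pages(expected_posts, page_size=10, max_results=100):
    """Return the start offsets we can request and whether results get cut off."""
    reachable = min(expected_posts, max_results)
    starts = list(range(1, reachable + 1, page_size))
    truncated = expected_posts > max_results
    return starts, truncated
```

A blog with 250 indexed posts would need offsets up to 241, but only the ten pages starting at 1 through 91 are reachable, so the remaining posts are never returned.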


SerpApi is very expensive (but probably the best): https://serpapi.com/blog/compare-serpapi-with-the-alternatives-serper-and-searchapi/

However, given our use case is fairly simple (searching for all indexed URLs matching an input URL), an advanced search API with images, etc. is not needed.

https://serper.dev/ looks like a good alternative -- let's give this a shot instead.

@himynamesdave himynamesdave moved this from Blocked to Todo in Roadmap Feb 10, 2025