some links on page not crawled #723

Open
robert-1043 opened this issue Nov 15, 2024 · 2 comments

Comments

@robert-1043

On a page where extra content is loaded when scrolling down, the first block loaded on scroll is captured and displayed in replay, but the links in this block aren't crawled. The second block loaded is also captured and displayed in replay.

I ran into this on a crawl of 42k pages.

To confirm, I tested with only this URL and --scopeType any --depth 1. Adding --postLoadDelay 4 doesn't solve the issue.

Will transfer wacz to info a webrecorder.

@ikreymer
Member

Ah, the link extraction happens before autoscrolling at the moment.
However, the new autoclick behavior, now available in the 1.5.0 beta, might be able to address that.

@robert-1043
Author

robert-1043 commented Jan 27, 2025

I ran a crawl this weekend on a site to be archived and hit the same issue.

I've tested with v150b2, and the result is different: only the "first view" (the page without scrolling down and loading extra content) is captured. So all captured content is clickable, but not all of the content that loads when the page is scrolled to the end is captured.

The links are plain <a href="..."> elements. Autoclick is enabled by default, so do I just have to add --selectLinks 'a[href]->href'? (This is with the browsertrix-crawler command line.)

(Workaround for now: manually load the complete page, collect all links, and put them in a seed file.)
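
For anyone wanting to script that workaround, here is a minimal sketch in Python. It assumes the fully scrolled page has been saved as HTML by hand, and that the seed file is plain text with one URL per line. The names `LinkCollector`, `extract_seed_urls`, and `write_seed_file` are hypothetical helpers made up for illustration and are not part of browsertrix-crawler.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collects <a href> values, resolved against a base URL (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links and drop the #fragment part.
        absolute, _fragment = urldefrag(urljoin(self.base_url, href))
        # Keep only http(s) links, and skip duplicates while preserving order.
        if absolute.startswith(("http://", "https://")) and absolute not in self.links:
            self.links.append(absolute)


def extract_seed_urls(html, base_url):
    """Return the unique absolute http(s) links found in the saved page."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


def write_seed_file(urls, path):
    """Write one URL per line (assumed seed file format)."""
    with open(path, "w", encoding="utf-8") as fh:
        for url in urls:
            fh.write(url + "\n")
```

Usage would be along the lines of reading the saved HTML into a string, calling `extract_seed_urls(html, page_url)`, and passing the result to `write_seed_file`. This only sees links present in the saved HTML, so the page has to be scrolled to the end before saving.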
