some links on page not crawled #723

robert-1043 · 2024-11-15T07:10:29Z

On a page that loads extra content as you scroll down, the first block loaded is captured and displays in replay, but the links in that block aren't crawled. The second block loaded is also captured and displays in replay.

Ran into this on a 42k-page crawl.

Confirmed by crawling only this URL with --scopeType any --depth 1. Adding --postLoadDelay 4 doesn't solve the issue.

Will send the WACZ to info a webrecorder.

ikreymer · 2025-01-27T02:36:36Z

Ah, link extraction currently happens before autoscrolling.
However, the new autoclick behavior, now available in the 1.5.0 beta, might be able to address that.

robert-1043 · 2025-01-27T10:29:59Z

I actually ran a crawl this weekend on a site to be archived and hit the same issue.

I've tested with v150b2 and see a difference: only the "first view" (without scrolling down and loading extra content) is captured. So all captured content is clickable, but not all of the content that loads when the page is scrolled to the end is captured.

The links are actually plain <a href="..."> elements. Autoclick is enabled by default, so do I just have to add --selectLinks 'a[href]->href' (browsertrix-crawler command line)?

(Workaround for now: manually load the complete page, collect all links, and put them in a seed file.)
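The workaround above could be scripted roughly as follows. This is only a sketch, not part of browsertrix-crawler: it assumes the fully scrolled page has been saved as HTML from a browser, and the names `LinkCollector`, `extract_seeds`, and `write_seed_file` are hypothetical helpers. It collects every `a[href]`, resolves it against the page URL, drops fragments, duplicates, and non-http(s) links, and writes one URL per line.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip <a> tags with no href or an empty one
                self.hrefs.append(href)


def extract_seeds(html, base_url):
    """Return absolute, de-duplicated http(s) links found in html."""
    parser = LinkCollector()
    parser.feed(html)
    seeds = []
    seen = set()
    for href in parser.hrefs:
        # Resolve relative links against the page URL and drop "#fragment".
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute.startswith(("http://", "https://")) and absolute not in seen:
            seen.add(absolute)
            seeds.append(absolute)
    return seeds


def write_seed_file(seeds, path):
    """Write one URL per line (assumed seed-file layout)."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(seeds) + "\n")
```

For example, `write_seed_file(extract_seeds(saved_html, page_url), "seeds.txt")` would produce a list that can then be passed to the crawler as a seed file (to my understanding via --seedFile, which reads one URL per line; worth checking against the crawler's own help output).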

github-project-automation bot added this to Webrecorder Projects Nov 15, 2024

github-project-automation bot moved this to Triage in Webrecorder Projects Nov 15, 2024
