Page Scraping Ethics

Discussion in 'Artificial Intelligence' started by Arnie Guitar, Jun 30, 2025 at 10:26 AM.

  1. I've only recently become aware of page scraping.

    So when I ask a...gardening question...I'm getting a summation of a bunch of web pages?

    I understand the argument is that those web pages aren't being compensated for their knowledge.

    Content creators have to know that going in, right?

    That's kinda the debate, right?
     
  2. Baron

    Baron Administrator

    That's pretty much the gist of it.

    Not only are sites getting scraped for content for use by AI, but Google search results now show an AI summary first, so most people just read the summary instead of clicking through to the web pages in the results.

    I learned the other day that the inbound traffic from Google searches to news sites is down almost 50% over the past year since the AI summaries have been running.

    In the past, website owners allowed search engines to scrape their sites because that got the content indexed, so people doing searches could find those pages. It was a kind of reciprocal arrangement. But these AI companies are essentially taking that same data and displaying a summary of it when queried, so the sites that provided the data are cut out of the search and discovery process altogether.
     
  3. If only the content they were providing was original and they didn't copy it from someone else...
    What they are really complaining about is that their ads don't bring in money the way they used to.
    People prefer AIs not only because they offer synthesized information, but also because they don't have to deal with all sorts of stupid ads while reading a text.

    As this post says, just 10% of the internet is likely original.

     
  4. Peter8519

    Just do it with caution, as it may affect the site's bandwidth. Most sites publish their robots policy, e.g. nasdaq.com/robots.txt. Stricter sites will IP-ban you for a certain period if you abuse them. Start simple and go slow, e.g. a 5-second delay between requests. Most sites block bots now, even EDGAR, so automating a browser is the best bet for scraping. Excel VBA IE automation is a good place to start. Having all the ratios for your stock watchlist in a single sheet is handy.
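
    For anyone who wants to see the "check the robots policy and go slow" advice in code, here is a rough Python sketch using only the standard library. The function names (build_robot_parser, polite_scrape), the "example-bot" user agent, and the caller-supplied fetch function are all made up for illustration; only the 5-second delay comes from the post above. Treat it as one possible starting point, not the way to do it.

```python
import time
import urllib.robotparser
from typing import Callable, Dict, Iterable


def build_robot_parser(robots_txt: str) -> urllib.robotparser.RobotFileParser:
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_scrape(
    urls: Iterable[str],
    fetch: Callable[[str], str],  # hypothetical: any function that returns a page body
    parser: urllib.robotparser.RobotFileParser,
    user_agent: str = "example-bot",  # assumed name, pick your own
    delay_seconds: float = 5.0,  # the 5-second pause suggested above
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Fetch only the URLs robots.txt allows, pausing between requests."""
    results: Dict[str, str] = {}
    for url in urls:
        # Skip anything the site's robots policy disallows for this user agent.
        if not parser.can_fetch(user_agent, url):
            continue
        # Wait before every request except the first one actually sent.
        if results:
            sleep(delay_seconds)
        results[url] = fetch(url)
    return results
```

    In practice you would download the site's robots.txt once, pass its text to build_robot_parser, and hand polite_scrape whatever fetch function you use (plain HTTP or an automated browser). The sleep parameter is only there so the delay can be swapped out when testing.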