Amazon Investigates Allegations of Scraping Abuse Leading to Confusion

Amazon’s cloud division is investigating whether Perplexity AI, an artificial intelligence search startup, is violating Amazon Web Services (AWS) rules by scraping websites that have tried to block it. Perplexity appears to be ignoring the Robots Exclusion Protocol, a web standard in which a site publishes a robots.txt file specifying which pages robots and automated crawlers should not access. The protocol is not legally binding, but most companies have traditionally adhered to it. Amazon’s terms of service require AWS customers to comply with the robots.txt standard when crawling websites.
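
To make the robots.txt mechanism concrete, here is a minimal sketch of the check a compliant crawler might run before requesting a page, using Python’s standard urllib.robotparser module. The robots.txt contents, the crawler name ExampleBot, and the example.com URL are hypothetical illustrations; nothing here is taken from Perplexity, AWS, or any publisher named in this article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named crawler site-wide, allow all others.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/story/some-article"  # hypothetical URL
    for agent in ("ExampleBot", "OtherBot"):
        print(agent, "allowed" if may_fetch(ROBOTS_TXT, agent, url) else "blocked")
```

The check is purely advisory: the file tells a crawler what the site owner wants, and nothing in the protocol itself stops a crawler from skipping the check or identifying itself under a different name.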

Forbes accused Perplexity of stealing its content in a June 11 report, and a subsequent WIRED investigation supported the allegation, finding further evidence of scraping abuse and plagiarism by systems linked to Perplexity’s AI-powered search chatbot. Condé Nast engineers had blocked Perplexity’s crawler on the company’s websites, but Perplexity had access to a server that used an unpublished IP address to crawl Condé Nast sites. The Guardian, Forbes, and The New York Times also detected the same IP address on their servers.

The IP address was traced to an Elastic Compute Cloud (EC2) instance hosted on AWS, prompting Amazon to open its investigation. Perplexity CEO Aravind Srinivas initially claimed that the questions raised by WIRED reflected a misunderstanding of how the company operates. He later told Fast Company that the IP address observed on various websites was operated by a third-party company that provides web crawling and indexing services. Srinivas declined to name the company, citing a confidentiality agreement, and said that stopping the crawling was a complicated issue.

Amazon’s investigation into whether Perplexity’s web scraping violates AWS rules is ongoing. The detection of the associated IP address by publishers including The Guardian, Forbes, and The New York Times has raised concerns about the startup’s data gathering methods. If Perplexity is found to have scraped websites that blocked it and to have ignored the Robots Exclusion Protocol, it could face further scrutiny and potential legal consequences.

Article Source
https://www.wired.com/story/aws-perplexity-bot-scraping-investigation/