Feed aggregator

Common Crawl Criticized for 'Quietly Funneling Paywalled Articles to AI Developers'

Slashdot - Sun, 09/11/2025 - 12:34am
For more than a decade, the nonprofit Common Crawl "has been scraping billions of webpages to build a massive archive of the internet," notes the Atlantic, making it freely available for research. "In recent years, however, this archive has been put to a controversial purpose: AI companies including OpenAI, Google, Anthropic, Nvidia, Meta, and Amazon have used it to train large language models. "In the process, my reporting has found, Common Crawl has opened a back door for AI companies to train their models with paywalled articles from major news websites. And the foundation appears to be lying to publishers about this — as well as masking the actual contents of its archives..." Common Crawl's website states that it scrapes the internet for "freely available content" without "going behind any 'paywalls.'" Yet the organization has taken articles from major news websites that people normally have to pay for — allowing AI companies to train their LLMs on high-quality journalism for free. Meanwhile, Common Crawl's executive director, Rich Skrenta, has publicly made the case that AI models should be able to access anything on the internet. "The robots are people too," he told me, and should therefore be allowed to "read the books" for free. Multiple news publishers have requested that Common Crawl remove their articles to prevent exactly this use. Common Crawl says it complies with these requests. But my research shows that it does not. I've discovered that pages downloaded by Common Crawl have appeared in the training data of thousands of AI models. As Stefan Baack, a researcher formerly at Mozilla, has written, "Generative AI in its current form would probably not be possible without Common Crawl." In 2020, OpenAI used Common Crawl's archives to train GPT-3. 
OpenAI claimed that the program could generate "news articles which human evaluators have difficulty distinguishing from articles written by humans," and in 2022, an iteration on that model, GPT-3.5, became the basis for ChatGPT, kicking off the ongoing generative-AI boom. Many different AI companies are now using publishers' articles to train models that summarize and paraphrase the news, and are deploying those models in ways that steal readers from writers and publishers. Common Crawl maintains that it is doing nothing wrong. I spoke with Skrenta twice while reporting this story. During the second conversation, I asked him about the foundation archiving news articles even after publishers have asked it to stop. Skrenta told me that these publishers are making a mistake by excluding themselves from "Search 2.0" — referring to the generative-AI products now widely being used to find information online — and said that, anyway, it is the publishers that made their work available in the first place. "You shouldn't have put your content on the internet if you didn't want it to be on the internet," he said. Common Crawl doesn't log in to the websites it scrapes, but its scraper is immune to some of the paywall mechanisms used by news publishers. For example, on many news websites, you can briefly see the full text of any article before your web browser executes the paywall code that checks whether you're a subscriber and hides the content if you're not. Common Crawl's scraper never executes that code, so it gets the full articles. Thus, by my estimate, the foundation's archives contain millions of articles from news organizations around the world, including The Economist, the Los Angeles Times, The Wall Street Journal, The New York Times, The New Yorker, Harper's, and The Atlantic.... A search for nytimes.com in any crawl from 2013 through 2022 shows a "no captures" result, when in fact there are articles from NYTimes.com in most of these crawls. 
"In the past year, Common Crawl's CCBot has become the scraper most widely blocked by the top 1,000 websites," the article points out...
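
The paywall behavior described above (the full text arrives in the page, and browser-side code hides it afterward) can be illustrated with a toy sketch. Everything here is an assumption for illustration: the `PAGE` markup, the `isSubscriber()` call, and the `TextOnly` helper are hypothetical and are not taken from Common Crawl's crawler or from any publisher's site. The sketch only shows that a client which parses HTML without running scripts still sees text that a browser would later hide.

```python
from html.parser import HTMLParser

# Hypothetical page: the article text ships in the HTML itself,
# and a script (which only a browser would run) hides it afterward.
PAGE = """
<html><body>
<article id="story"><p>Full article text sent to every visitor.</p></article>
<script>
  // Illustrative client-side paywall check; never executed by the parser below.
  if (!isSubscriber()) { document.getElementById("story").remove(); }
</script>
</body></html>
"""


class TextOnly(HTMLParser):
    """Collects text nodes and skips <script> bodies. It never executes them."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Keep non-empty text that is not part of a script body.
        if not self.in_script and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> str:
    """Return the text a non-JavaScript client would read from the page."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


print(extract_text(PAGE))
```

Because the hiding step lives in the script, it has no effect on a client that only parses markup. A paywall enforced on the server, which never sends the text to non-subscribers, would not have this gap.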

Read more of this story at Slashdot.

Scientists Edit Gene in 15 Patients That May Permanently Reduce High Cholesterol

Slashdot - Sat, 08/11/2025 - 11:34pm
A CRISPR-based drug given to study participants by infusion is raising hopes for a much easier way to lower cholesterol, reports CNN: With a snip of a gene, doctors may one day permanently lower dangerously high cholesterol, possibly removing the need for medication, according to a new pilot study published Saturday in the New England Journal of Medicine. The study was extremely small — only 15 patients with severe disease — and was meant to test the safety of a new medication delivered by CRISPR-Cas9, a biological sort of scissor which cuts a targeted gene to modify or turn it on or off. Preliminary results, however, showed nearly a 50% reduction in low-density lipoprotein, or LDL, the "bad" cholesterol which plays a major role in heart disease — the No.1 killer of adults in the United States and worldwide. The study, which will be presented Saturday at the American Heart Association Scientific Sessions in New Orleans, also found an average 55% reduction in triglycerides, a different type of fat in the blood that is also linked to an increased risk of cardiovascular disease. "We hope this is a permanent solution, where younger people with severe disease can undergo a 'one and done' gene therapy and have reduced LDL and triglycerides for the rest of their lives," said senior study author Dr. Steven Nissen, chief academic officer of the Sydell and Arnold Miller Family Heart, Vascular & Thoracic Institute at Cleveland Clinic in Ohio.... Today, cardiologists want people with existing heart disease or those born with a predisposition for hard-to-control cholesterol to lower their LDL well below 100, which is the average in the US, said Dr. Pradeep Natarajan, director of preventive cardiology at Massachusetts General Hospital and associate professor of medicine at Harvard Medical School in Boston... 
People with a nonfunctioning ANGPTL3 gene — which Natarajan says applies to about 1 in 250 people in the US — have lifelong levels of low LDL cholesterol and triglycerides without any apparent negative consequences. They also have exceedingly low or no risk for cardiovascular disease. "It's a naturally occurring mutation that's protective against cardiovascular disease," said Nissen, who holds the Lewis and Patricia Dickey Chair in Cardiovascular Medicine at Cleveland Clinic. "And now that CRISPR is here, we have the ability to change other people's genes so they too can have this protection." "Phase 2 clinical trials will begin soon, quickly followed by Phase 3 trials, which are designed to show the effect of the drug on a larger population, Nissen said." And CNN quotes Nissen as saying "We hope to do all this by the end of next year. We're moving very fast because this is a huge unmet medical need — millions of people have these disorders and many of them are not on treatment or have stopped treatment for whatever reason."

Bank of America Faces Lawsuit Over Alleged Unpaid Time for Windows Bootup, Logins, and Security Token Requests

Slashdot - Sat, 08/11/2025 - 10:34pm
A former Business Analyst reportedly filed a class action lawsuit claiming that for years, hundreds of remote employees at Bank of America first had to boot up complex computer systems before their paid work began, reports Human Resources Director magazine: Tava Martin, who worked both remotely and at the company's Jacksonville facility, says the financial institution required her and fellow hourly workers to log into multiple security systems, download spreadsheets, and connect to virtual private networks — all before the clock started ticking on their workday. The process wasn't quick. According to the filing in the United States District Court for the Western District of North Carolina, employees needed 15 to 30 minutes each morning just to get their systems running. When technical problems occurred, it took even longer... Workers turned on their computers, waited for Windows to load, grabbed their cell phones to request a security token for the company's VPN, waited for that token to arrive, logged into the network, opened required web applications with separate passwords, and downloaded the Excel files they needed for the day. Only then could they start taking calls from business customers about regulatory reporting requirements... The unpaid work didn't stop at startup. During unpaid lunch breaks, many systems would automatically disconnect or otherwise lose connection, forcing employees to repeat portions of the login process — approximately three to five minutes of uncompensated time on most days, sometimes longer when a complete reboot was required. After shifts ended, workers had to log out of all programs and shut down their computers securely, adding another two to three minutes. Thanks to Slashdot reader Joe_Dragon for sharing the article.

Chan Zuckerberg Initiative Shifts Bulk of Philanthropy, 'Going All In on AI-Powered Biology'

Slashdot - Sat, 08/11/2025 - 9:34pm
The Associated Press reports that "For the past decade, Dr. Priscilla Chan and her husband Mark Zuckerberg have focused part of their philanthropy on a lofty goal — 'to cure, prevent or manage all disease' — if not in their lifetime, then in their children's." During that decade they also funded other initiatives (including underprivileged schools and immigration reform), according to the article. But there's a change coming: Now, the billionaire couple is shifting the bulk of their philanthropic resources to Biohub, the pair's science organization, and focusing on using artificial intelligence to accelerate scientific discovery. The idea is to develop virtual, AI-based cell models to understand how they work in the human body, study inflammation and use AI to "harness the immune system" for disease detection, prevention and treatment. "I feel like the science work that we've done, the Biohub model in particular, has been the most impactful thing that we have done. So we want to really double down on that. Biohub is going to be the main focus of our philanthropy going forward," Zuckerberg said Wednesday evening at an event at the Biohub Imaging Institute in Redwood City, California.... Chan and Zuckerberg have pledged 99% of their lifetime wealth — from shares of Meta Platforms, where Zuckerberg is CEO — toward these efforts... On Thursday, Chan and Zuckerberg also announced that Biohub has hired the team at EvolutionaryScale, an AI research lab that has created large-scale AI systems for the life sciences... Biohub's ambition for the next years and decades is to create virtual cell systems that would not have been possible without recent advances in AI. Similar to how large language models learn from vast databases of digital books, online writings and other media, its researchers and scientists are working toward building virtual systems that serve as digital representations of human physiology on all levels, such as molecular, cellular or genome. 
As it is open source — free and publicly available — scientists can then conduct virtual experiments on a scale not possible in physical laboratories. "We will continue the model we've pioneered of bringing together scientists and engineers in our own state-of-the-art labs to build tools that advance the field," according to Thursday's blog post. "We'll then use those tools to generate new data sets for training new biological AI models to create virtual cells and immune systems and engineer our cells to detect and treat disease.... "We have also established the first large-scale GPU cluster for biological research, as well as the largest datasets around human cell types. This collection of resources does not exist anywhere else."

World's Largest Cargo Sailboat Completes Historic First Atlantic Crossing

Slashdot - Sat, 08/11/2025 - 8:34pm
Long-time Slashdot reader AmiMoJo shared this report from Marine Insight: The world's largest cargo sailboat, Neoliner Origin, completed its first transatlantic voyage on 30 October despite damage to one of its sails during the journey. The 136-metre-long vessel had to rely partly on its auxiliary motor and its remaining sail after the aft sail was damaged in a storm shortly after departure... Neoline, the company behind the project, said the damage reduced the vessel's ability to perform fully on wind power... The Neoliner Origin is designed to reduce greenhouse gas emissions by 80 to 90 percent compared to conventional diesel-powered cargo ships. According to the United Nations Conference on Trade and Development (UNCTAD), global shipping produces about 3 percent of worldwide greenhouse gas emissions... The ship can carry up to 5,300 tonnes of cargo, including containers, vehicles, machinery, and specialised goods. It arrived in Baltimore carrying Renault vehicles, French liqueurs, machinery, and other products. The Neoliner Origin is scheduled to make monthly voyages between Europe and North America, maintaining a commercial cruising speed of around 11 knots.

Bombshell Report Exposes How Meta Relied On Scam Ad Profits To Fund AI

Slashdot - Sat, 08/11/2025 - 7:34pm
"Internal documents have revealed that Meta has projected it earns billions from ignoring scam ads that its platforms then targeted to users most likely to click on them," writes Ars Technica, citing a lengthy report from Reuters. Reuters reports that Meta "for at least three years failed to identify and stop an avalanche of ads that exposed Facebook, Instagram and WhatsApp's billions of users to fraudulent e-commerce and investment schemes, illegal online casinos, and the sale of banned medical products..." On average, one December 2024 document notes, the company shows its platforms' users an estimated 15 billion "higher risk" scam advertisements — those that show clear signs of being fraudulent — every day. Meta earns about $7 billion in annualized revenue from this category of scam ads each year, another late 2024 document states. Much of the fraud came from marketers acting suspiciously enough to be flagged by Meta's internal warning systems. But the company only bans advertisers if its automated systems predict the marketers are at least 95% certain to be committing fraud, the documents show. If the company is less certain — but still believes the advertiser is a likely scammer — Meta charges higher ad rates as a penalty, according to the documents. The idea is to dissuade suspect advertisers from placing ads. The documents further note that users who click on scam ads are likely to see more of them because of Meta's ad-personalization system, which tries to deliver ads based on a user's interests... The documents indicate that Meta's own research suggests its products have become a pillar of the global fraud economy. A May 2025 presentation by its safety staff estimated that the company's platforms were involved in a third of all successful scams in the U.S. Meta also acknowledged in other internal documents that some of its main competitors were doing a better job at weeding out fraud on their platforms... 
The documents note that Meta plans to try to cut the share of Facebook and Instagram revenue derived from scam ads. In the meantime, Meta has internally acknowledged that regulatory fines for scam ads are certain, and anticipates penalties of up to $1 billion, according to one internal document. But those fines would be much smaller than Meta's revenue from scam ads, a separate document from November 2024 states. Every six months, Meta earns $3.5 billion from just the portion of scam ads that "present higher legal risk," the document says, such as those falsely claiming to represent a consumer brand or public figure or demonstrating other signs of deceit. That figure almost certainly exceeds "the cost of any regulatory settlement involving scam ads...." A planning document for the first half of 2023 notes that everyone who worked on the team handling advertiser concerns about brand-rights issues had been laid off. The company was also devoting resources so heavily to virtual reality and AI that safety staffers were ordered to restrict their use of Meta's computing resources. They were instructed merely to "keep the lights on...." Meta also was ignoring the vast majority of user reports of scams, a document from 2023 indicates. By that year, safety staffers estimated that Facebook and Instagram users each week were filing about 100,000 valid reports of fraudsters messaging them, the document says. But Meta ignored or incorrectly rejected 96% of them. Meta's safety staff resolved to do better. In the future, the company hoped to dismiss no more than 75% of valid scam reports, according to another 2023 document. A small advertiser would have to get flagged for promoting financial fraud at least eight times before Meta blocked it, a 2024 document states. Some bigger spenders — known as "High Value Accounts" — could accrue more than 500 strikes without Meta shutting them down, other documents say. Thanks to long-time Slashdot reader schwit1 for sharing the article.
