Times Insider explains who we are and what we do, and gives a behind-the-scenes look at how our journalism comes together.

As of this morning, programs written by New York Times developers have made more than 10 million requests for Covid-19 data from websites around the world. The data we collect are daily snapshots of the ebb and flow of the virus, covering every U.S. state and thousands of U.S. counties, cities and ZIP codes.

You may have seen snippets of this data in the daily maps and graphics we publish at The Times. Together, these pages, involving more than 100 journalists and engineers from across the company, are the most-viewed collection in the history of nytimes.com and an integral part of the package of Covid reporting that earned The Times the 2021 Pulitzer Prize for Public Service.

The Times’s coronavirus tracking project was one of several efforts that helped fill the gap in public understanding of the pandemic created by the lack of a coordinated government response. The Johns Hopkins University Coronavirus Resource Center collected both national and international case data. And the Covid Tracking Project at The Atlantic assembled an army of volunteers to collect U.S. state data, including tests, demographics and health facility data.

At The Times, we started with a single spreadsheet.

In late January 2020, Monica Davey, an editor on the National desk, asked Mitch Smith, a Chicago-based correspondent, to begin gathering information on every single U.S. Covid-19 case: one line per case, meticulously reported from public announcements and entered by hand, with details such as age, location, gender and condition.

In mid-March, the virus’ explosive growth proved too much for our workflow. The spreadsheet grew so large it stopped responding, and the reporters didn’t have enough time to manually report and enter data from the ever-growing list of US states and counties we had to track.

At this point, many local and state health officials began putting Covid-19 reporting measures and websites in place to inform residents about the local spread. The federal government, meanwhile, struggled early on to deploy a single, reliable federal dataset.

The local data available was literally and figuratively all over the map. Formatting and methodology varied widely from place to place.

Within The Times, a group of newsroom software engineers was quickly assembled to develop tools to expand the data collection work as much as possible. The two of us (Tiff is a newsroom developer and Josh is a graphics editor) would eventually lead this growing team.

On March 16, the core application was largely working, but we needed help finding many more sources. To tackle this colossal project, we recruited developers from across the company, many of whom had no newsroom experience, to temporarily step in to write scrapers.
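Many of those scrapers amounted to fetching a health department's page and pulling case counts out of an HTML table. Here is a minimal, stdlib-only sketch of that kind of scraper; it is not The Times's actual code, and the table layout and county names are invented for illustration:

```python
# Hypothetical sketch of a table scraper like the ones volunteers wrote.
# Parses <td> cells from an HTML page, grouped by <tr> row.
from html.parser import HTMLParser


class CaseTableParser(HTMLParser):
    """Collects the text of <td> cells, one list per table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = None
        elif tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td and self._row is not None:
            self._row.append(data.strip())


# In production the HTML would come from an HTTP request to a state or
# county site; here we use a hard-coded sample of a typical layout.
SAMPLE_HTML = """
<table>
  <tr><td>Cook</td><td>12</td></tr>
  <tr><td>DuPage</td><td>3</td></tr>
</table>
"""


def scrape_cases(html):
    """Return {county: case_count} parsed from a two-column HTML table."""
    parser = CaseTableParser()
    parser.feed(html)
    return {county: int(cases) for county, cases in parser.rows}
```

Real sources were rarely this tidy, which is part of why each new state or county often needed its own scraper.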

By the end of April, we were programmatically collecting numbers from all 50 states and nearly 200 counties. But the pandemic and our database both seemed to grow exponentially.

Some notable sites also changed their formats multiple times in a matter of weeks, which meant we had to keep rewriting our code. Our newsroom engineers adapted by refining our custom tools, even while those tools were in daily use.

Up to 50 people outside the scraping team were actively involved in the day-to-day management and review of the data we collected. Some data is still entered by hand, and everything is checked manually by reporters and researchers, a process that runs seven days a week. Accuracy in reporting and fluency with the subject have been integral to all of our roles, from reporters to data reviewers to engineers.

In addition to publishing data on The Times website, we made our dataset publicly available on GitHub at the end of March 2020 for anyone to use.
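The published files are plain CSVs; the nytimes/covid-19-data repository's state-level file, for example, uses the columns date, state, fips, cases and deaths. A stdlib-only sketch of reading that format (the sample rows below are made up for illustration; in practice you would download the file from the repository):

```python
# Reads CSV data in the layout of us-states.csv from the public
# nytimes/covid-19-data repository: date,state,fips,cases,deaths.
import csv
import io

# Illustrative sample only; real data comes from the GitHub repository.
SAMPLE_CSV = """date,state,fips,cases,deaths
2021-06-01,Washington,53,430000,5800
2021-06-02,Washington,53,430450,5803
"""


def latest_cases(csv_text):
    """Return {state: cumulative cases} using the most recent date per state."""
    latest = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        state = row["state"]
        # ISO dates sort correctly as strings, so a plain comparison works.
        if state not in latest or row["date"] > latest[state][0]:
            latest[state] = (row["date"], int(row["cases"]))
    return {s: cases for s, (_, cases) in latest.items()}
```

Because the case and death counts are cumulative, the most recent row per state gives the running total.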

As vaccinations contain the toll of the virus across the country (a total of 33.5 million cases have been reported), a number of health departments and other sources are updating their data less often. Conversely, the federal Centers for Disease Control and Prevention has expanded its reporting to include comprehensive figures that were only partially available in 2020.

All of this means some of our own custom data collection efforts can be shut down. Since April 2021, the number of our programmatic sources has decreased by almost 44 percent.

Our goal is to get down to around 100 active scrapers by late summer or early fall, mainly to track potential hot spots.

The dream, of course, is to complete our efforts as the threat from the virus subsides significantly.

A version of this article was originally published on NYT Open, The New York Times’s blog about designing and building products for news.
