Advancing open data

Microsoft aims to close the data divide and help organizations of all sizes to realize the benefits of data and the new technologies it powers.

Data for Society

View a collection of open datasets from Microsoft and how they're being used to address societal challenges.

Industry Data for Society Partnership

The partnership is committed to making private sector data more open and accessible to address societal challenges.

Our approach

Microsoft believes everyone can benefit from collaboration around open and available data.

Enable open innovation

We’re working to promote open innovation and data governance approaches that empower data users and providers to collaborate and create value.

Build partnerships for greater impact

We believe success requires partners—industry, government, and civil society around the world. Together, we promote greater access to data to benefit society and bridge the data gap.

Make data sharing easier

We're committed to investing in the essential assets that will make data sharing easier, including the necessary tools, frameworks, and templates. This is especially important when it comes to opening and collaborating around data to solve important societal issues.

Accelerating access to data

Access to data is a big challenge. We partner with industry and open data leaders to advance open data access and private sector data sharing for societal benefit.

The open data opportunity

Why data sharing matters, explained.

Industry Data for Society Partnership

Working across industry to make private sector data more open and accessible for societal good.

2023 Year in Review

Learn about the Industry Data for Society Partnership’s progress on advancing access to data to tackle societal challenges.

Microsoft Data for Society catalog

Explore datasets, use cases, and more in our Microsoft Data for Society repository.

BankNote-Net

Worldwide, millions of people have low or no vision. BankNote-Net was created as an open dataset for assistive universal currency recognition, to help people with daily tasks that involve identifying currency.

United States broadband usage dataset

Broadband internet access is critical to providing communities with education, employment, and telecare. The broadband usage percentages dataset shows broadband access at the US county level to help address gaps in service availability.

MS-ASL American Sign Language (ASL) dataset

In the US, over 500,000 people use ASL for communication. This ASL dataset of over 25,000 annotated videos with sign and action recognition can help researchers build machine learning models to advance sign language recognition.

Tagged hands dataset

Development of a rich hand-gesture-based interface is currently a tedious process. This dataset of 3,500 labeled depth frames of various hand poses and 140 gesture clips helps enable easy development of a gesture-based interface.

Generative Neural Visual Artist (GeNeVA)

Intelligent systems can generate images and video for a range of applications, from education to accessibility. This dataset has sequences of images, associated instructions and linguistic feedback, and a modified version of the Compositional Language and Elementary Visual Reasoning (CLEVR) dataset.

Learning from analog pen use to improve digital ink experiences

To help researchers understand the gaps between analog and digital pens and improve digital experiences, this dataset contains 493 entries from a diary study with 26 participants using analog pens and 178 entries from 30 participants using digital pens.

Microsoft Machine Reading Comprehension (MS MARCO)

AI and automated assistants need strong machine reading comprehension (MRC) and question answering (QA) capabilities to understand real-world dialog. This dataset contains 1,010,916 questions and 182,669 answers to improve QA and MRC.

Digital Civility Gender Equality Dataset

Microsoft recognizes the importance of advocating for and advancing the release of gender-disaggregated data to realize gender equality and to close the data divide. This dataset can be leveraged by researchers and organizations to advance better gender data policies and solutions.

Solar farms mapping

The solar farms mapping data can help researchers identify factors driving land suitability for solar projects and help public agencies better plan siting of solar energy development in India.

HKH glacier mapping

Glacier mapping is key to ecological monitoring in the Hindu Kush Himalaya (HKH) region, where climate change poses a risk to those dependent on the health of glacier ecosystems. The HKH glacier mapping dataset includes imagery with locations of glaciers.

Chesapeake land cover

The Chesapeake Conservancy created a land cover dataset for conservation efforts. The same data, which contains high-resolution aerial imagery and land cover labels, can be used to train ML models to map an even wider area of land cover.

Concentrated Animal Feeding Operations (CAFO)

The poultry CAFO GitHub repository contains US-wide datasets of predicted poultry barn locations to help researchers identify CAFOs for conservation groups to address water and air quality issues.

TorchGeo

TorchGeo is a PyTorch domain library that includes several geospatial benchmark datasets, such as CDL, Landsat7, and Landsat8, to help support research tasks like image classification, semantic segmentation, object detection, instance segmentation, change detection, and more.

Bing COVID-19 data

Bing COVID-19 data includes confirmed, fatal, and recovered cases from all regions, updated daily from multiple reliable sources. This data is reflected in the Bing COVID-19 Tracker.

NCI-PID-PubMed Genomics KB

The NCI-PID-PubMed Genomics Knowledge Base Completion Dataset is derived from the National Cancer Institute Pathway Interaction Database and contains textual mentions extracted from co-occurring pairs of genes in PubMed abstracts. It helps support the cancer research community and others interested in cellular pathways.

Exercise recognition from wearable sensors

Exercise is an important part of maintaining good health. This dataset contains accelerometer and gyroscope recordings from over 200 participants performing various gym exercises, which researchers developing exercise devices can leverage.