NLM Dataset Catalog

Performance and Evaluation Portal

Discover the data behind the Dataset Catalog


Connecting Researchers to Datasets

The Dataset Catalog helps researchers and medical librarians search, discover, and connect with biomedical datasets.
Unlike existing independent and disconnected repositories in use today, the Dataset Catalog is a metadata repository built on a standard semantic metadata model and adheres to FAIR data management principles.
This approach provides a single interface that enables researchers to search across multiple repositories at once, explore related information that otherwise may not be apparent, and quickly navigate to the most appropriate dataset to meet their needs.
The Dataset Catalog is currently in beta, a public preview and test phase, where the core functionality of the product is tested and its effectiveness is measured and evaluated.
For more information, visit the Dataset Catalog About page.
A series of increasingly smaller concentric circles of varying colors, illustrating usage of the Dataset Catalog. From largest to smallest, the circles are labeled: 7,632 Users, 6,673 Searches, 3,239 Citations Viewed, 1,954 Datasets Accessed.

Accelerating Biomedical Discovery

Helping researchers discover biomedical datasets contributes to NLM being a platform for biomedical discovery and data-powered health as was envisioned in the "National Library of Medicine Strategic Plan 2017-2027".  That plan encourages the development of an integrated digital ecosystem to keep pace with the rapidly increasing scope of the research enterprise and provide a wide range of audiences with access to the information necessary to translate research outputs to insights that improve clinical care, public health practices, and personal wellness.

Top Search Terms as of 9/28/24

A word cloud of varying colored text, illustrating the most frequently searched terms in the Dataset Catalog.  The most prominent terms are: socioeconomic, covid-19, covid, health, treatment, characteristic, cancer, patient, disease, brain.
Search terms from the previous 4 weeks are grouped by individual term rather than by full search query. The top 80 terms are selected for the word cloud, with each term's size weighted by its frequency.
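The grouping described above can be sketched roughly as follows. This is only an illustrative example, not the Dataset Catalog's actual pipeline: the function names `top_terms` and `font_sizes`, the tokenization rule, the absence of stop-word filtering, and the linear size scaling are all assumptions made for the sketch.

```python
import re
from collections import Counter


def top_terms(queries, limit=80):
    """Count individual terms across search queries; return the most frequent.

    Assumption: terms are lowercase runs of letters, digits, and hyphens,
    so "covid-19" stays a single term.
    """
    counts = Counter()
    for query in queries:
        counts.update(re.findall(r"[a-z0-9][a-z0-9-]*", query.lower()))
    return counts.most_common(limit)


def font_sizes(term_counts, min_size=10, max_size=48):
    """Map each term's frequency to a font size by linear scaling (assumed)."""
    if not term_counts:
        return {}
    frequencies = [count for _, count in term_counts]
    lowest, highest = min(frequencies), max(frequencies)
    if highest == lowest:
        # All terms are equally frequent, so give them the same size.
        return {term: max_size for term, _ in term_counts}
    scale = (max_size - min_size) / (highest - lowest)
    return {term: min_size + (count - lowest) * scale
            for term, count in term_counts}


# Hypothetical queries, for illustration only.
sample_queries = ["covid-19 treatment", "Cancer treatment", "covid-19 brain"]
sizes = font_sizes(top_terms(sample_queries))
```

Counting terms rather than whole queries means a term such as "treatment" accumulates weight from every query that contains it, which is what lets common themes stand out in the cloud.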

Evaluating Engagement

The goal of the Dataset Catalog beta Assessment Program is to gain a better understanding of the needs and behaviors of those using the product.  These findings will help us make better content, scalability, and sustainability decisions.  The process involves systematically collecting information about planned strategies and activities, monitoring beta testing progress, and reporting and communicating the expected outcomes to NLM leadership and other stakeholders who have interest in the results of this beta product launch.

Access metrics as of the week ending 9/28/24

Understanding Users

Understanding the challenges and behaviors of those using the Dataset Catalog provides the insight necessary to anticipate and meet the evolving needs of the biomedical research community. By comprehending user expectations and staying abreast of research trends, we can more accurately evolve the Dataset Catalog to become an indispensable tool in the pursuit of scientific excellence and innovation.
The insights gained from these metrics shed light on the user journey, engagement, and the most sought-after features and repositories, enabling us to effectively refine the Dataset Catalog and our communication with those it serves. This focused approach not only enhances user experience but also positions the Dataset Catalog as a pivotal resource in the global effort to advance biomedical research and achieve breakthrough discoveries.

Repository metrics as of the week ending 9/28/24

Listening to Users

“It's a great way to make datasets in repositories more discoverable and re-usable.”
― Claire Castle, Research Data Manager, UK
“Can you give some examples of data repositories that will be included in the future?”
― Lesley Skalla, Duke University, USA


“Congratulations on getting the DATMM effort launched, this is exciting and important work.”
― Susan Gregurick, NIH, USA

Adding Repositories to the Dataset Catalog

Supporting Scalability

In addition to understanding our users, it is important to understand system performance and responsiveness. As the corpus grows and usage increases, we want to maintain reasonable response times for our users. By monitoring system characteristics like corpus size, response times, and processing requirements, we can ensure that the Dataset Catalog environment is sized appropriately to meet demand.
A graph with icons of the repositories currently in the Dataset Catalog running along an arrow moving up from the bottom-left. Months labeled on the bottom axis are January 2024, April 2024, July 2024. Repository icons from the bottom-left to top-right are: Harvard Dataverse, Dryad, dbGaP, ImmPort, Borealis, UCLA Dataverse, Texas Data Repository.