AI at Meta: Transparency about our training data

Meta is excited to bring artificial intelligence (AI) experiences to people and businesses across Meta Products worldwide.

What is AI at Meta?

AI at Meta is our collection of generative AI features and experiences, such as Meta AI and AI creative tools, along with the models that power them. This also includes models we make available through an open platform to support researchers, developers and others in the AI community. AI at Meta helps people solve complex problems, be more imaginative and create something never seen before. From bringing real-time answers to chat, to helping people organize and plan for their next vacation, to giving them more ways to express themselves, AI at Meta helps people enhance their everyday activities, experiences and moments.
Here’s more about the information we use for AI at Meta, where it comes from and how it works.
Data sources
Where does the data come from?
The information we use to develop and improve AI at Meta includes a mix of publicly available data, licensed data and information from Meta’s products and services. This includes publicly shared posts from Instagram and Facebook and people’s interactions with AI at Meta features.
Intended purpose
How does this data help AI at Meta do what it’s designed to do?
The datasets used to train AI at Meta are selected to help ensure the models can understand and generate language, audio, images and video in a way that is accurate, relevant and safe for the people who use our products.
  • Public and licensed data provide the breadth and quality needed for general intelligence.
  • Public content from Meta products helps ensure the models are contextually aware and behave in a way that’s aligned with people’s expectations.
  • Information from people’s interactions allows for continuous improvement of AI features.
Together, these datasets enable AI at Meta to fulfill its intended purpose of providing helpful, engaging and responsible AI features and experiences.
Amount of data
How much information or how many data points were used to create AI at Meta?
AI at Meta uses various model types and sizes to optimize for different performance and use case needs. Generally, models are trained on trillions of tokens, which are pieces of information of different types. A token can be a word or part of a word, a piece of an image or milliseconds of video.
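To make the idea of a token concrete, here is an illustrative sketch only. It is a toy tokenizer that splits text into word fragments; it is not Meta's actual tokenization, and real systems typically use learned subword schemes such as byte-pair encoding.

```python
def toy_tokenize(text: str) -> list[str]:
    """Toy illustration of tokenization: split on whitespace, then break
    words longer than 6 characters into rough subword pieces.
    (Hypothetical example; production tokenizers learn splits from data.)"""
    tokens = []
    for word in text.split():
        while len(word) > 6:
            tokens.append(word[:6])
            word = word[6:]
        tokens.append(word)
    return tokens

print(toy_tokenize("transparency about training data"))
```

Even this toy version shows why token counts exceed word counts: a single long word can contribute several tokens, which is part of why training corpora are measured in trillions of tokens rather than words.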
Types of data
What kind of information or data points were used?
We train AI at Meta mostly using unlabeled data that includes raw text, images, audio, video and related information from publicly available and licensed sources. Where labels are applied to the data, they might show things like sentiment, topic, safety, relevance, image captions or human ratings.
Protected data
Is any of the data we use protected by copyright, trademark or patent?
The datasets used for AI at Meta include a mix of publicly available, licensed and user-generated data. The copyright, trademark and patent status of data will depend upon, among other things, the nature of the specific data and the application of relevant intellectual property laws.
Purchased or licensed data
Were datasets purchased or licensed?
Some datasets are purchased or licensed from third parties, including commercial datasets and partner content.
Personal information
Do datasets include personal information?
Datasets used to train AI at Meta may include information such as names or other details that people share publicly on our products or provide through their interactions with AI at Meta.
Aggregated consumer information
Do the datasets include information combined from many people that can’t be used to identify them?
Datasets used to train AI at Meta may include aggregate (i.e., combined) consumer information, such as group-level statistics or de-identified, aggregated data.
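As a simple illustration of what "aggregate consumer information" means (a hypothetical sketch, not Meta's actual pipeline), group-level statistics can be computed so that only counts remain and individual identifiers are dropped:

```python
from collections import Counter

# Hypothetical per-person records (illustrative only)
events = [
    {"user": "u1", "country": "US"},
    {"user": "u2", "country": "US"},
    {"user": "u3", "country": "BR"},
]

# Aggregate away the identifiers: only group-level counts survive
by_country = Counter(e["country"] for e in events)
print(dict(by_country))
```

The resulting counts describe the group as a whole and cannot, on their own, be traced back to any one person in the input.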
Data processing
If data was processed or modified, what was the purpose?
Meta at times reformats data and/or applies labels for training and evaluating models. Meta applies extensive cleaning, processing and modification to datasets used for AI systems. These efforts may include de-identifying and aggregating certain data, removing duplicates and low-quality content, and applying privacy and safety safeguards. This is intended to protect people’s privacy, help ensure data quality and safety, and support the responsible and effective development of AI at Meta.
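One of the steps above, duplicate removal, can be sketched in miniature. This is an illustrative assumption about how exact-duplicate filtering might work, not a description of Meta's actual processing:

```python
import hashlib

def deduplicate(records: list[str]) -> list[str]:
    """Keep the first occurrence of each text record, treating records
    that differ only in case or surrounding whitespace as duplicates.
    (Hypothetical sketch; real pipelines also handle near-duplicates.)"""
    seen = set()
    unique = []
    for text in records:
        digest = hashlib.sha256(text.strip().lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

docs = ["Hello world", "hello world ", "Different text"]
print(deduplicate(docs))
```

Hashing a normalized form of each record keeps memory bounded even for very large corpora, since only digests, not full documents, need to be retained for comparison.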
Synthetic data
Is or was any synthetic data used?
AI at Meta generates and uses synthetic data to build and improve its generative artificial intelligence systems and services. Synthetic data is generated and used for several purposes, including:
  • To augment existing data with additional context like labels
  • To help when there isn’t enough data to work with or when we have more of some types of data than others
  • To help ensure that AI at Meta is reliable, safe and efficient
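The first bullet above, augmenting existing data with labels, can be illustrated with a deliberately simple sketch. The rule-based labeler below is hypothetical; in practice, synthetic labels would come from a trained model rather than keyword lists:

```python
def label_sentiment(text: str) -> str:
    """Hypothetical rule-based sentiment labeler, for illustration only.
    A real pipeline would generate labels with a model, not keyword sets."""
    positive = {"great", "good", "love"}
    negative = {"bad", "terrible", "hate"}
    words = set(text.lower().split())
    if words & positive:
        return "positive"
    if words & negative:
        return "negative"
    return "neutral"

corpus = ["I love this feature", "This is terrible", "It exists"]
labeled = [{"text": t, "label": label_sentiment(t)} for t in corpus]
```

Attaching machine-generated labels like these turns unlabeled text into training signal, which is one way synthetic data helps when hand-labeled data is scarce.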
Start of data use
When was the data first used?
AI at Meta is developed and improved using datasets that are collected and incorporated on a rolling, ongoing basis. Because data collection and integration are continuous, there is no single date on which all datasets were first used. To improve AI at Meta, new data is regularly added from sources like:
  • Public sources
  • Licensed partners
  • Public content from Meta’s products
  • People’s interactions with AI at Meta features
  • Synthetic data
This ongoing process helps ensure that AI at Meta continues to develop and improve.
Data collection
What is the time frame during which the data was collected?
Meta collects and uses datasets to train and improve AI at Meta on an ongoing basis. There isn’t just one window of time when all the data was gathered. Instead, Meta keeps updating its data with new information from sources like:
  • Public sources
  • Licensed partners
  • Public content from Meta’s products
  • People’s interactions with AI at Meta features
  • Synthetic data
This ongoing process helps ensure that AI at Meta is developed and improved using the most current and relevant data available.