08.22.25
Can Conversational AI Improve Survey Research?
Early Insights from Testing a Chat-Based Survey
For decades, the "gold standard" in social science survey research has required asking every respondent the same question, worded identically, delivered through the same mode. This approach prioritizes comparability: if everyone receives identical input, researchers can cleanly compare answers across people, places, and time.
While the gold standard approach has enabled valuable research, it assumes people interpret identical questions identically. In reality, people interpret questions through the lens of their own experiences. Questions that seem clear to researchers may carry vastly different meanings for respondents, shaped by cultural background, economic circumstances, and personal history. Fixed response options can obscure important nuance, flattening complex lived experiences into predefined categories. Over-standardization can lead to oversimplification, resulting in data that give a false sense of comparability while obscuring real differences in interpretation and context.
Meanwhile, declining participation rates and rising costs threaten the sustainability of large-scale survey research. Traditional surveys often feel impersonal and burdensome, contributing to survey fatigue and reduced data quality. Many standard instruments were developed decades ago and fail to reflect the evolving complexity of modern life. Conventional employment categories, for example, struggle to capture the reality of gig work, multiple part-time jobs, or the blurred boundaries between work and caregiving that characterize many people's economic lives today. Income questions designed for traditional employment relationships struggle with irregular payments, shared resources, and multiple income streams. Housing stability measures may not adequately distinguish between individuals who are technically housed but living in precarious conditions and those in stable, long-term housing. These blunt tools risk missing or misrepresenting the very experiences we most need to understand.
Part of the solution lies in rethinking not only what is asked, but how the people conducting the research engage with the people providing the data. Surveys should treat respondents not as passive data points but as active partners whose perspectives shape the research process. When questions feel extractive, alienating, or disconnected from people’s priorities, respondents are less likely to feel that the research matters or that their contributions are valued. Improving the participant experience—making surveys feel relevant, respectful, and responsive—is both an ethical and a methodological imperative. When respondents feel understood and have space to share their experiences and perspectives, the resulting data is richer, more accurate, and more meaningful.
A paradigm shift in survey research would unlock tremendous potential. Instead of treating identical questions as the only path to reliable data, we should clearly define what we want to understand, then use the best available tools to capture that information accurately. This might mean adapting questions to the ways participants communicate—how they naturally structure thoughts, the level of detail they prefer, or the contexts in which they're most comfortable sharing—probing for clarification when responses are contradictory or incomplete, or using different modalities to make questions accessible and engaging.
The idea of focusing on meaning over procedural uniformity is not new. Many researchers have called for more adaptive, participant-centered approaches, and some have piloted promising methods in specific contexts. However, scaling these approaches while preserving rigor has historically been difficult. Recent advances in conversational AI are beginning to shift what is possible, enabling adaptive probing and participant-led clarification at a scale that was previously out of reach. The aim is not to replace human interviewers or in-depth qualitative work, but to bring some of the richness of qualitative methods into structured surveys in a way that is scalable.
Early research suggests that conversational approaches can improve data quality and increase engagement, but they also raise valid concerns about maintaining comparability, safeguarding privacy, preventing bias, ensuring transparency, and developing shared analytic standards. The field stands at a formative moment where norms, best practices, and guardrails have yet to be established. We approach this work with both curiosity and caution. The potential benefits are substantial: methods that more accurately capture people’s lived experiences, strengthen the partnership between researchers and participants, and expand the boundaries of what survey research can achieve. But realizing this potential will require collaborative experimentation, openness about successes and failures, and intentional development of standards that ensure these methods are both trustworthy and sustainable. The alternative—continuing with methods that drift further from the realities we’re trying to measure—risks steadily losing both the accuracy of data and the relevance of our research.
Learning from the Unconditional Cash Study #
High-quality survey data has been foundational to the OpenResearch unconditional cash study. We achieved remarkably high response rates—even three to four years after enrolling, 96% of participants completed the endline survey—and we were repeatedly struck by how well survey responses aligned with administrative data on income, spending, and employment. These parallels gave us confidence in the accuracy of our measures and the robustness of our results.
Yet even with high-quality data, the limitations of traditional survey methods were unmistakable. Some participants reached out to tell us that fixed-response options did not accurately represent their situations. Others requested clarification, unsure whether they were interpreting the questions correctly. Even our research team debated the meaning of well-validated survey questions and scales.
From the outset, we paired surveys with in-depth qualitative interviews involving 156 participants, each interviewed six times over the course of the program. These conversations revealed the “how” and the “why” behind survey responses: why someone prioritized caregiving over income, how financial stress influenced day-to-day decision-making, and how people understood their own progress and goals. Sometimes the narratives confirmed the patterns observed in survey data; other times they challenged our interpretations or revealed entirely new dimensions.
While qualitative interviews offer rich insights, they are time-consuming and expensive to conduct at scale. Open-ended survey questions add nuance but suffer from substantially higher nonresponse rates, wide variation in the length and depth of responses, and systematic differences in who responds and how thoroughly.1 This leaves a gap: how can we preserve the scalability of surveys while capturing some of the depth, responsiveness, and partnership that qualitative methods provide?
Contributing to the Evolution of AI-Powered Conversational Surveys #
Recognizing both the importance of survey data and the shortcomings of traditional methods, we began exploring whether artificial intelligence could help create a fundamentally new kind of survey that retains the structure and scalability of conventional instruments while incorporating some of the nuance and depth of qualitative inquiry.
We envision AI-powered, conversational surveys that adapt dynamically to respondents' answers, probe for clarification when responses are incomplete, unclear, or contradictory, and allow people to express themselves in their own words and at their own pace. Rather than forcing experiences into rigid categories, this method would let participants shape the conversation while maintaining the systematic approach that makes survey data valuable for research and policy.
We are not the first to test this idea. Since 2019, researchers have demonstrated that chatbot-based surveys can produce higher quality responses with reduced satisficing behavior compared to traditional web surveys.2 Recent work by Wuttke and colleagues shows that large language models can conduct adaptive interviews that potentially resolve the traditional depth-scale tradeoff in survey research.3 Commercial platforms like CloudResearch, SurveySparrow, and Conjointly report 40% higher completion rates and significantly improved data quality through conversational interfaces.
Although these early efforts are encouraging, most remain limited in scope, and other uses of AI largely focus on generating synthetic data or on post-survey data analysis rather than transforming the data collection experience itself. We aim to build on this early work to help advance the field.
Testing the Approach: Conversational Survey on Stress #
We chose stress as the first test case because it is both widespread and deeply personal, with causes and consequences that vary across people and contexts. In the unconditional cash study, we found that stress often emerged from intersecting factors like unstable housing, caregiving demands, health crises, unpredictable work schedules, and persistent financial insecurity.
Existing stress measures fail to capture specific sources, why those sources matter to particular individuals, or how stress affects daily life and decision-making. They don't measure perceived coping ability, actual coping mechanisms, or how experiences of stress change over time in response to different life circumstances. While these gaps could theoretically be addressed by expanding survey modules, doing so would lengthen the survey considerably and increase participant burden. That trade-off is especially problematic for a concept like stress, where details matter but relevance can vary greatly between respondents.
Implementation #
To test the feasibility of adaptive conversational surveys with a lower-income population, we embedded an AI-enabled chatbot stress module in our Q3 2025 follow-up survey for the unconditional cash study. Participation is voluntary; all respondents receive an explanation of how the AI module works and what to expect. Those who decline are routed to a conventional survey module on stress and also asked a set of questions about what led them to opt out.
For those who consent to participate in the pilot, the interaction occurs within the Qualtrics platform through a call to the OpenAI API. The architecture combines structure for analytic rigor with flexibility for relevance and engagement; a simplified sketch of the turn logic appears after the list below.
- Conversation Structure: A rotating three-prompt system maintains flow without drifting off-topic. Prior chat history is loaded to preserve continuity across turns, avoiding abrupt topic shifts.
- Model and Tuning: We selected GPT-4o with a high temperature (0.8) to maximize conversational flexibility while constraining responses to the stress domain.
- Participant Control: Commands like “SKIP” and “STOP” allow participants to bypass questions or exit to the standard survey instantly.
- Reliability Engineering: We implemented multiple fallback layers for network, API, and JSON parsing errors. Frequent saves ensure no data loss. Failures are logged and categorized for post-field troubleshooting.
- Data Privacy: All interactions are recorded as embedded data within the Qualtrics platform. No personally identifiable information or study IDs are shared with the model, and data is not retained by OpenAI or used to train models.
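To make the turn-by-turn mechanics concrete, here is a minimal Python sketch of a single conversational exchange under the design described above. This is an illustration only, not the production Qualtrics integration: the system prompt wording, the helper name run_turn, the fallback question, and the sentinel return values are assumptions, and the actual module wraps each call in additional fallback, logging, and save logic.

```python
# Minimal sketch of one chatbot turn (illustrative only; not the production code).
# Assumes the OPENAI_API_KEY environment variable is set. No PII or study IDs are
# included in the messages sent to the model.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (  # hypothetical wording; the real prompt is more detailed
    "You are conducting a survey conversation about sources of stress. "
    "Stay on the topic of stress, acknowledge what the participant shares, "
    "and ask one short, clear follow-up question at a time."
)

FALLBACK_QUESTION = "Thanks for sharing that. Could you tell me a bit more?"


def run_turn(history: list[dict], participant_text: str) -> tuple[str, list[dict]]:
    """Take the participant's latest answer and return the next question plus updated history."""
    text = participant_text.strip()

    # Participant-control commands bypass the model entirely.
    if text.upper() == "STOP":
        return "__EXIT_TO_STANDARD_SURVEY__", history
    if text.upper() == "SKIP":
        return "__SKIP_QUESTION__", history

    # Prior chat history is replayed each turn to preserve continuity.
    history = history + [{"role": "user", "content": text}]

    try:
        resp = client.chat.completions.create(
            model="gpt-4o",
            temperature=0.8,  # high temperature for conversational flexibility
            messages=[{"role": "system", "content": SYSTEM_PROMPT}] + history,
        )
        next_question = resp.choices[0].message.content
    except Exception:
        # Reliability fallback: on network, API, or response-parsing errors, fall back
        # to a generic probe so the participant is never left without a prompt.
        next_question = FALLBACK_QUESTION

    history = history + [{"role": "assistant", "content": next_question}]
    return next_question, history
```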
Following the chatbot interaction, participants complete a brief set of survey questions to share their reflections on the experience, including comparisons to previous survey experiences.
Early Feedback #
Data collection launched in early July and will continue through early October. We will not conduct formal analyses until the dataset is complete, but preliminary participation and engagement patterns offer early insight into how respondents are experiencing the conversational format.
Participation #
Participation rates have been high: as of August 15, we had received 1,918 responses, with about 78% opting to try the conversational format. This high uptake is likely influenced by the trust built with our panel over five years and may not fully extend to first-time respondents or other settings. As chatbots become more ubiquitous and AI familiarity grows, comfort levels may increase naturally—among opt-outs, 55% reported never knowingly using an AI tool or chatbot. Conversely, heightened privacy and data security concerns could dampen uptake.
Opt-in rates are broadly consistent across demographic groups. Within our sample's 26–46 age range, we see no significant differences by age, suggesting that older participants are just as willing to try the chatbot format. Participation is modestly higher among Hispanic and female respondents, and the education gap is small (under three percentage points between college graduates and those with a high school diploma or less)—noteworthy given ongoing polling challenges engaging lower-education respondents.4
Among the 22% who opted out:
- Why they opted out: A preference for traditional surveys was the top reason by a wide margin, with 44% indicating it was the main reason they opted out. AI distrust was a distant second at 15%.
- Future willingness to opt in: When asked if they are willing to try an AI-supported conversational survey in the future, 37% say they are “very willing” or “somewhat willing,” while 13% are “somewhat unwilling” and 27% are “very unwilling.” This suggests a subgroup with firmly held reservations that will likely require more targeted strategies to engage, such as additional transparency, trust-building, or hybrid designs.
- What would increase likelihood of opting in: 42% of opt-outs said they would be more likely to participate if the chat offered multiple-choice responses. This aligns with the stated preference for traditional surveys and suggests that a blended conversational/structured design might increase uptake without sacrificing flexibility.
Completion and Engagement #
Among those who opted in, 88% completed the full module. Most non-completions were due to technical issues (network failures were the most common, with a much smaller share of API errors), rather than participant exits. Engagement levels vary but are generally encouraging. While roughly one quarter offered only brief responses, the majority engaged in sustained dialogue with the chatbot—often providing more detail than a standard open-ended survey field. Participants spent a median of 16 minutes in the conversation, completing a median of 24 back-and-forth turns.
We also examined the depth of responses more systematically, calculating the share of each participant’s answers that were minimal or superficial (e.g., two words or fewer, dismissive language, or repetitive non-answers). On average, about half of a participant’s responses fell into this category. The distribution shows notable variation: about 8% of participants were fully engaged throughout, with no responses tagged as minimal, while roughly 11% gave only minimal responses to every question. The rest were spread in a roughly normal distribution around the 50% average.
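As a point of reference for how such a depth measure could be computed, the sketch below tags minimal responses and calculates a participant-level share. The two-word threshold and the list of dismissive phrases are illustrative assumptions, not the exact coding rules used in the study, and the repetitive non-answer check is omitted for brevity.

```python
# Illustrative sketch of a per-participant "minimal response" share.
# The threshold and dismissive-phrase list below are assumptions for illustration.
DISMISSIVE_PHRASES = {"idk", "i don't know", "nothing", "none", "no", "n/a"}


def is_minimal(answer: str) -> bool:
    """Tag an answer as minimal if it is very short or a dismissive non-answer."""
    text = answer.strip().lower()
    return len(text.split()) <= 2 or text in DISMISSIVE_PHRASES


def minimal_share(answers: list[str]) -> float:
    """Fraction of a participant's turns tagged as minimal or superficial."""
    if not answers:
        return 0.0
    return sum(is_minimal(a) for a in answers) / len(answers)


# Example: two of the four turns are minimal, so the share is 0.5.
print(minimal_share([
    "Bills",
    "no job Kids",
    "I lost my job because my car broke down and my bills are piling up",
    "idk",
]))
```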
Participant Experience and Satisfaction #
So far, the overall experience has been generally positive, with most participants indicating they would be open to trying the format again. 67% rated their overall experience as “excellent” or “good,” while only 8% described it as “not good” or “terrible.” When asked if they would choose this type of survey again, 68% said they would (vs. 15% who marked “probably not” or “definitely not”), and the majority (58%) felt it was a better experience than the traditional survey format.
Participants widely agreed that the conversational module enhanced the depth and quality of what they could share. Large majorities agreed or strongly agreed that the chat format:
- Let them share more specific details about their stress (82%)
- Gave researchers a better understanding of their stress and wellbeing (77%)
- Made them feel comfortable enough during the conversation to honestly describe their experiences with stress (76%)
At the same time, about 21% either strongly agreed or somewhat agreed that interacting with the AI model felt creepy. In follow-up open-ended responses, most of these participants cited some version of one of two reasons: discomfort with being probed on a deeply personal topic like stress, or the uncanny feeling of having what felt like a human-like conversation while knowing it was actually a computer.
Subgroup patterns are modest but consistent: women, older participants, those with lower incomes, and those with less formal education on average rated the experience slightly more favorably. Ratings by race generally tracked with opt-in rates. Across subgroups, willingness to choose the format again closely mirrors overall satisfaction, suggesting that positive experiences translate directly into willingness to participate in the future.
Designing for a sensitive topic like stress presented unique challenges. Early testing revealed that when the difficult experiences participants shared were not met with acknowledgement, the interaction felt cold or alienating. To address this, we programmed explicit empathy instructions, balancing validation of participant experiences with research focus. However, participant feedback revealed highly diverse preferences for empathetic responses. Some participants appreciated this feature, while others specifically disliked it. This divergence extended throughout the feedback: nearly every element criticized by some was praised by others, including the balance between warmth and formality, the depth of probing, and the human-like qualities of the interaction.
These early results demonstrate that the conversational module is generally well-received and broadly accessible across diverse demographic groups, but there is no universal “best” design. Features that make the experience better for some can make it worse for others. Future implementations should offer greater customization so that the conversational style can adapt to individual preferences. Building adaptability not only into the questions we ask but also into the conversational style itself may be key to maximizing the potential of conversational data collection tools across varied participant populations.
What We’re Learning #
Looking beyond satisfaction scores and engagement metrics to the substance of what participants shared, early findings are similarly encouraging. The adaptive, conversational approach tailors follow-up questions in real time based on participants' responses, surfacing deeper narratives. This is not simply about eliciting more words—it’s about revealing how stressors connect, why they matter in the participant’s life, and clarifying ambiguous or incomplete statements.
Three patterns have emerged consistently in early review of the data.
Expanding on Brief Responses #
Even when participants begin with a one-word or very short answer, adaptive follow-ups can draw out complex, multi-dimensional accounts of their experiences. In the example below, the participant began with a one-word response, but follow-ups helped uncover a nuanced, overlapping set of stressors.
- Survey: "Let's start: What are the top three sources of stress in your life lately?"
- Participant: “Bills”
- Survey: "Thank you for sharing that bills have been stressful. Are there any other things that have been feeling stressful for you recently?"
- Participant: "no job Kids"
- Survey: "It sounds like bills, not having a job, and kids are all weighing on you right now. Can you please rank these from most to least stressful?"
- Participant: "No job bills kids"
- Survey: “Thank you for ranking them. Let’s start with “no job,” since you listed that first. Can you tell me a little more about what’s been going on with not having a job?”
- Participant: "I lost my job because my car broke down and no way there. Now my bills are piling up and my rent went up and I have no help with my kids"
In a traditional survey, a one-word answer like “bills” might have been the end of the response. Here, adaptive follow-up surfaced interrelated dimensions of stress tied to job loss, transportation barriers, basic needs insecurity, and caregiving stress.
Understanding Context #
Allowing the conversation to follow the respondent’s cues (rather than rigidly adhering to a fixed order) can reveal how different stressors interact and which ones feel most urgent in the moment. One participant initially listed “finances,” “family,” and “time management” as top sources of stress and was trying to communicate that these stressors were all interrelated. Rather than forcing a fixed sequence, the adaptive survey followed the participant’s conversational cues, allowing them to clarify, elaborate, and provide context.
- Survey: "Thanks for sharing those. Can you please rank these stressors—finances, family, and time management—from most to least stressful?"
- Participant: "Finance”
- Survey: "Thank you for ranking that. Let’s start with finances since you ranked it as most stressful. Can you tell me a little more about what's been going on with your finances?"
- Participant: "Time mgmt"
- Survey: "Got it—if time management feels more pressing to talk about, we can begin there instead. Can you tell me a little more about what's been stressful with managing your time?"
- Participant: "Family"
- Survey: "Understood—if family feels like the bigger stressor right now, we can focus on that first. Can you share a bit more about what's been going on with family?"
- Participant: "Sorry ranking is: finance, time mgmt, family. No changes with our financial situation just bad at budgeting and impulses. Time mgmt is a stressor due to my own person. Wanting to get xy done but at the same time not wanting to touch xyz. I'd much rather read a book than do any household chores or prep work. Family is a stressor just due to having 4 kiddos and the challenges that come with sibling arguments, sports, and summer."
The conversation revealed deeper context: financial strain was linked to impulse control, time stress was shaped by internal conflict and procrastination, and family pressure stemmed from the logistics of parenting multiple children. This richer understanding helps explain not just what is stressful, but why it matters in the participant’s life.
Clarifying Ambiguous Language #
Many participants used vague, shorthand, or socially acceptable phrases to describe stress (“tired,” “it doesn’t affect me”), which can obscure its true impact. Words like “anxiety” may feel unfamiliar or irrelevant, leading some to skip questions or downplay their stress. Adaptive probing starts with participants’ own words and gently follows up to uncover what they really mean.
- Survey: "Now I’d like to understand how those stressors and responses have affected different areas of your life—how do you think stress has affected your daily life overall?"
- Participant: "it doesn’t affect my daily life because I still have to continue to work. I still have to take care of my daughter and I still have to continue my daily duties."
- Survey: “That sounds like a lot to manage, staying consistent with work, taking care of your daughter, and your daily duties despite the stress. Thinking about this week, was there a moment when stress made any of those responsibilities feel harder?”
- Participant: "no, not really well actually yes I was worried that my light bill would get cut off and I couldn’t really focus at work"
- Survey: “That sounds really challenging, worrying about your light bill while trying to focus at work. Can you share more about how that worry affected your work performance or tasks that day?”
- Participant: "I can’t focus on documenting, which is something that’s really important at work"
Many people, especially those juggling multiple responsibilities, come to normalize stress as something they just have to push through. Saying “it doesn’t affect my daily life” may reflect that survival mindset, rather than an absence of stress. Without adaptive follow-up, the emotional and cognitive toll of financial pressure would likely have gone unnoticed.
The examples show how adaptive probing can transform minimal or vague responses into richer narratives that capture nuance, context, and interconnections. While our analysis is still in its early stages, these insights illustrate some of the limitations of traditional surveys and the potential of conversational, adaptive approaches to address them.
Reimagining Social Science Survey Research #
While stress is our first pilot topic, the underlying goal is to test a research model that adapts to how people interpret and communicate about their lives. In this model, the gold standard is not rigid uniformity, but precision in capturing meaning. We begin by defining what we want to measure, then select the most effective combination of tools—conversational AI, traditional fixed-response questions, adaptive follow-ups—to elicit accurate, nuanced, and complete information.
In Q4 we plan to launch a second pilot focused on employment, a topic that presents different challenges and opportunities. Unlike the purely conversational stress module, the employment module will integrate fixed-response options with adaptive dialogue. This hybrid approach recognizes that not every question benefits from conversation and not every participant prefers open-ended dialogue. Some information is efficiently captured through structured options; other details emerge only through exploration and follow-up. The key is learning when and how to use each approach and giving participants the agency to share their experiences in their own way.
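As a rough illustration of how such a hybrid flow might route respondents (not the finalized employment module; the question wording, options, and trigger rule here are hypothetical), a structured item could hand off to adaptive dialogue only when the fixed options don't fit cleanly:

```python
# Hypothetical sketch of hybrid routing: keep clear-cut cases on the structured path
# and invoke the conversational module only when a fixed option is ambiguous.
EMPLOYMENT_OPTIONS = [
    "Working one full-time job",
    "Working multiple jobs or gigs",
    "Not working for pay right now",
    "Something else / it's complicated",
]

AMBIGUOUS_CHOICES = {"Working multiple jobs or gigs", "Something else / it's complicated"}


def next_step(choice: str) -> str:
    """Decide whether a fixed-response answer needs adaptive follow-up."""
    if choice in AMBIGUOUS_CHOICES:
        # Hand off to the conversational module for adaptive probing.
        return "Can you tell me a bit more about how your work is set up right now?"
    # Clear-cut answers continue along the efficient structured path.
    return "__CONTINUE_TO_NEXT_STRUCTURED_QUESTION__"


print(next_step("Working multiple jobs or gigs"))
```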
Addressing the Challenges #
Delivering on the promise of adaptive, AI-enabled surveys requires grappling with a set of methodological, ethical, and practical questions that the field is only beginning to address. Key questions include:
- Reproducibility: How do we ensure findings can be replicated when conversations are dynamic and follow-up questions vary by participant?
- Bias: How do we prevent AI systems from introducing or amplifying existing biases in who is understood, how responses are interpreted, or which narratives are prioritized?
- Comparability: How do we compare results across hybrid approaches and different conversational paths while still honoring the value of personalized questioning?
- Analysis: How do we combine conversational and structured responses in a way that preserves complexity but yields interpretable, comparable measures? How do we validate new measures developed from conversational data?
- Privacy and Ethics: How do we safeguard participant data, ensure truly informed consent, and be transparent about how AI processes and stores information — including potential risks?
- Model Consistency: How do we account for changes in AI models’ capabilities, behaviors, and training data over time so that findings remain valid?
- Accountability and Transparency: Who is responsible for errors, misinterpretations, or harm caused by AI-mediated interactions, and how do we make those processes visible to participants and the research community?
- Equity and Access: How do we ensure adaptive surveys work equally well for participants with different languages, literacy levels, technological access, and comfort with AI?
Some of these questions will be answered through experimentation and iterative design. Others will require shared norms—reporting standards, validation frameworks, and ethical guidelines—that only a coordinated community effort can create. No single team, discipline, or industry can address these challenges alone.
We aim to work in partnership with other researchers, technologists, and participants themselves to help chart the path forward. That means learning from others testing similar approaches, sharing our own successes and failures openly, and co-developing standards and best practices that can keep pace with both rapid technological change and the evolving realities of people’s lives. We are also open to more direct collaboration with others interested in pursuing similar work, whether through joint projects or coordinated efforts to develop common tools and frameworks.
The future of survey research lies not in choosing between traditional and emerging approaches, but in weaving them together with intention—starting from the questions we most need to answer and building methods that align with how people naturally share their experiences. If successful, we will generate data that are both methodologically rigorous and reflective of the complexity of human lives, strengthening the partnership between researchers and the participants whose experiences give data meaning.