Can AI Help Surveys Keep Up with How People Work Today?

Lessons from a Second Conversational Survey Pilot

Last year, we published Can Conversational AI Improve Survey Research?, describing our first test of an adaptive, AI-powered conversational survey module: a chatbot on the topic of stress, embedded in a follow-up survey of participants in the OpenResearch Unconditional Cash Study. About 79% of respondents opted in, engagement was sustained, and adaptive probing often drew out complex, layered accounts of how stressors connect and compound in people’s lives. Participants who began with a one-word answer frequently ended up describing interrelated dimensions of financial strain, caregiving, health, and employment.

The stress pilot showed us that many participants were willing to engage with this format and that adaptive probing could surface depth that structured questions often miss. At the same time, it raised practical questions about when conversational exploration adds value and when structured responses are more effective. Some types of information are captured more efficiently and comparably through predefined options, and several participants indicated they would be more likely to engage if the experience included multiple-choice responses.

That insight shaped our second pilot focused on employment. Unlike the stress module, which was fully conversational, the employment pilot used a hybrid design. A conversational opening allowing respondents to describe their situation in their own terms was followed by structured multiple-choice questions on dimensions where predefined response options work well, with targeted AI follow-up when participants selected “other” or gave answers that did not fit the available categories. 

This piece describes why we chose employment as the test case, how the hybrid design differed from the stress version, what kinds of information the conversations surfaced, how we are approaching analysis, and what new methodological questions emerged. 

Why Employment #

Across earlier waves of the Unconditional Cash Study, participants frequently struggled with standard employment classifications. Respondents who combined contract work, gig platforms, occasional informal work, and caregiving responsibilities often could not identify a category that accurately described their situation. Many selected “other,” even when their descriptions suggested that existing categories might apply. We also observed inconsistencies and contradictions within the same survey. For example, some participants reported full-time employment at a single job but said they typically work less than 20 hours per week, while others identified as employees rather than contractors despite driving for rideshare platforms that classify their drivers as independent contractors. In other cases, participants did not recognize certain income-generating activities as jobs at all. 

These are not challenges unique to our study. Many labor surveys were designed around an economy where most people held a single, stable primary job. While the Current Population Survey includes questions about multiple jobs and the Bureau of Labor Statistics added the Contingent Worker Supplement in May 2017 to identify platform-based work that might not appear in the standard job structure, these additions only go so far in capturing the complexity of people’s work arrangements. In practice, employment classification is often interpretive rather than purely categorical: respondents must translate their lived experience into categories that may not align with how they think about their own work. Employment, therefore, offered a clear opportunity to test whether conversational methods could improve measurement.

The Hybrid Design #

For this pilot, we designed a hybrid survey that combined conversational and structured elements within a single chat-based experience. Participants were routed into one of five paths based on their response to an employment status question. A single-job holder was first asked to describe the work they currently do. A gig worker was asked about the types of gig or occasional work they had done in the past three months. A job seeker was asked about the kinds of work they were currently looking for. Someone not seeking employment was asked how they typically spend their time. A participant with multiple jobs was asked to begin by describing the job they considered most important. Each path used a tailored opening prompt followed by adaptive follow-ups designed to build on the respondent’s answers and situation.
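
To make the routing concrete, here is a minimal sketch of the path selection as a lookup from the structured employment-status answer to a tailored opening prompt. The path keys and prompt wording are illustrative paraphrases of the descriptions above, not the exact prompts used in the pilot.

```python
# Illustrative sketch of the five-path routing described above.
# Path keys and prompt wording are paraphrases, not the pilot's exact prompts.
OPENING_PROMPTS = {
    "single_job": "To start, could you describe the work you currently do?",
    "gig_worker": "What types of gig or occasional work have you done in the past three months?",
    "job_seeker": "What kinds of work are you currently looking for?",
    "not_seeking": "How do you typically spend your time?",
    "multiple_jobs": "Let's begin with the job you consider most important. Could you describe it?",
}

def opening_prompt(employment_status: str) -> str:
    """Return the tailored conversational opener for a respondent's path."""
    if employment_status not in OPENING_PROMPTS:
        raise ValueError(f"Unrecognized employment status: {employment_status!r}")
    return OPENING_PROMPTS[employment_status]
```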

After the conversation, participants received predetermined multiple-choice questions specific to their path. These were designed to complement the conversational data with structured, comparable measures on topics where fixed categories are efficient and effective.

When participants selected "other" or gave responses outside the predefined options, the system allowed a fixed number of LLM-based follow-up questions to help them identify which existing category fit, or to capture why none did. This is a small-scale version of something we want to explore further: AI-assisted multiple choice, where the model draws from a larger bank of questions to provide more adaptive structured questioning at scale. What we tested in the employment pilot was deliberately constrained, but it let us observe whether participants would engage with this kind of interaction and whether it produced useful data.
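
As a rough sketch of how this constrained interaction can work, the following shows a bounded follow-up loop triggered by an "other" selection. This is not our production implementation: the `ask_llm` and `ask_participant` callables, the cap of two follow-ups, and the prompt wording are all assumptions for illustration.

```python
from dataclasses import dataclass, field

MAX_FOLLOW_UPS = 2  # assumed cap; the pilot allowed a fixed, small number of follow-ups

@dataclass
class OtherResolution:
    resolved_category: str | None = None  # existing category the answer was mapped to, if any
    exchanges: list[tuple[str, str]] = field(default_factory=list)  # (follow-up, answer) pairs

def resolve_other(question: str, categories: list[str], ask_llm, ask_participant) -> OtherResolution:
    """Bounded clarification loop after an 'other' selection.

    ask_llm(prompt) -> str and ask_participant(question) -> str are stand-ins for
    the model call and the chat interface.
    """
    result = OtherResolution()
    context = f"Question: {question}\nOptions: {', '.join(categories)}\nRespondent chose 'other'."
    for _ in range(MAX_FOLLOW_UPS):
        follow_up = ask_llm(
            context + "\nWrite one short, neutral follow-up question that would help "
            "determine whether an existing option fits, or capture why none does."
        )
        answer = ask_participant(follow_up)
        result.exchanges.append((follow_up, answer))
        context += f"\nQ: {follow_up}\nA: {answer}"
        verdict = ask_llm(
            context + "\nIf one of the listed options now clearly fits, reply with its "
            "exact text. Otherwise reply NONE."
        ).strip()
        if verdict in categories:
            result.resolved_category = verdict
            break
    return result  # if no category fit, the exchanges themselves record why
```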

Implementation #

Following the chatbot interaction, participants completed a brief set of survey questions reflecting on the experience, including comparisons with previous survey experiences.

Participant Experience and Preferences #

Of 2,120 respondents to the employment survey, 1,396 (65.8%) opted in to the conversational format and 724 (34.2%) opted out.1 This opt-in rate is lower than in the previous pilot, where approximately 79% chose to try the conversational module on stress. Several factors plausibly contributed to the difference. Employment may simply be a less engaging conversation topic than stress. Some participants who tried the format previously may also have made a more informed decision to opt out this time. Beyond the survey itself, the broader context around AI shifted between the two surveys: public discourse about AI grew more polarized, and some participants may have opted out based on privacy concerns or evolving perceptions of the technology rather than anything about the survey experience. 

Despite the lower overall opt-in rate, interest in the conversational format remains strong among those who tried it. When asked whether they would want to use the format for other survey topics, 50.3% said yes, 40.9% said maybe, and only 8.8% said no. Once participants experience the format, most see potential in it. To understand where participants see that potential, we asked respondents to select all topics where they would want to use a conversational format. Although selecting all topics was the most common response, interest varied meaningfully across domains. As the chart shows, some topics appear to invite conversational engagement more readily than others, suggesting that the value of a conversational approach may depend on the nature of the topic itself.

Beyond participation rates, the feedback data offers insight into what people want from the format. Among opted-in respondents, preferences spanned the full spectrum: about 46% preferred an experience that is somewhat or mostly like texting with a person, while roughly 24% wanted something somewhat or mostly like filling out a form. The largest single group, at 30%, preferred an experience roughly halfway between the two. There is no one-size-fits-all design.


Design Implications #

This variation in what people want from the format has direct design implications, since features that work well for some participants can alienate others. Emotional validation is a clear example. In the stress pilot, explicitly allowing the model to include brief acknowledgment and validation in its responses improved the experience for many, but about 21% found it at least somewhat "creepy." We adjusted the prompt for employment partly in response to feedback and partly because employment conversations carry a different emotional register than stress. Even so, some participants used the format to share difficult experiences like job loss, financial strain, or discrimination. What feels supportive to one person can feel intrusive or performative to another, and designing for this range within a single system remains an open question that warrants further testing.

A related challenge involves the safeguards governing the interaction. Participants describing difficult experiences may use blunt or emotionally charged language that automated safeguards are not always designed to interpret in context. If these safeguards are too restrictive, they can prevent participants from describing their experiences in their own terms—and potentially from sharing the experiences at all. But without sufficient constraints, conversations can drift off track or into territory that is unproductive for research purposes. The prompts and safeguards that shape how the model conducts the interaction are central to the methodology, yet we still lack a systematic framework for evaluating whether they are well calibrated. This is an area that would benefit from greater collaboration and more systematic testing across topics and populations.

A different kind of design challenge emerged from the structured portion of the instrument. Routing errors affected a subset of respondents across the employment paths, placing single-job workers in the multiple-jobs flow or non-seekers in the job-seeker path. The routing was based on participants' responses to a standard multiple-choice employment question earlier in the survey, so the misclassification originated in the structured portion of the instrument itself. Standard surveys that rely on early classification to branch respondents face the same vulnerability: an incorrect response at the outset can send them down the wrong path for the rest of the instrument. This is precisely the kind of measurement problem conversational approaches are designed to address, and in our pilot the conversational data made these issues visible, helping us detect incorrect answers and identify respondents who were simply clicking through without engaging.

Some participants also raised a concern not about the experience itself, but about what happens afterward. How would their responses be analyzed? How would researchers move from a conversation to findings? As conversational surveys ask more of participants, participants may reasonably ask more of researchers in return. Greater transparency about the analytic process (how conversations are coded, what frameworks are applied, and how human review is involved) may be essential for maintaining the trust that makes this kind of data collection possible. In future tests, we plan to provide such transparency while preserving the opt-in design that gives participants an explicit choice between conversational and traditional formats, with command-based controls that allow them to skip questions or exit the interaction at any point. These are not incidental features. Participant control and analytic transparency are foundational to the trust this approach requires.

Early Observations from the Employment Data #

Initial review of the employment conversations has surfaced several recurring patterns.

One of the clearest is the gap between nominal and effective earnings. Across the gig work and multiple-jobs paths, participants frequently described unpaid time, vehicle costs, platform fees, and other expenses required to perform their work. Several delivery workers identified unpaid waiting time as a primary challenge. Others described shifts where the cost of fuel exceeded their earnings. Traditional labor surveys typically ask about wages or total earnings. They rarely ask about the costs required to earn those wages. For participants working through platforms or irregular contracts, those costs can be substantial, and the distinction between gross and effective earnings is central to understanding the economic realities of gig work.
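
A simple illustration of why the distinction matters analytically: netting out expenses and counting unpaid time can cut an apparently reasonable hourly rate roughly in half. The figures below are hypothetical, not drawn from the pilot data.

```python
def effective_hourly_rate(gross_pay: float, expenses: float,
                          paid_hours: float, unpaid_hours: float) -> float:
    """Net earnings divided by total time worked, including unpaid waiting time."""
    total_hours = paid_hours + unpaid_hours
    if total_hours <= 0:
        raise ValueError("total hours must be positive")
    return (gross_pay - expenses) / total_hours

# Hypothetical delivery shift: $80 gross, $22 in fuel and platform fees,
# 4 hours delivering plus 1.5 hours waiting unpaid at restaurants.
nominal = 80 / 4                                   # $20.00/hr, as a wage question would record it
effective = effective_hourly_rate(80, 22, 4, 1.5)  # about $10.55/hr once costs and unpaid time count
print(f"nominal ${nominal:.2f}/hr vs effective ${effective:.2f}/hr")
```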

A second pattern emerges from the job-seeker path. In many conversations, participants described receiving job offers that they ultimately could not accept. One respondent reported receiving an offer but being unable to take the position because their car broke down before the start date. Another declined an offer because the required schedule conflicted with childcare responsibilities. Others described offers where the pay was insufficient to cover transportation or existing financial obligations. Standard labor surveys do not typically track job offers directly, instead classifying individuals based on employment status and job search activity. These conversations reveal an intermediate category: offers that were technically received but practically impossible to accept. 

When offers are out of reach #

Transportation, childcare, health, and background constraints often affected interview attendance, offer acceptance, schedule fit, or the ability to work regularly.

  • Survey: "In the past month, about how many job applications have you submitted?"
  • Participant: "More than 1,000 applications."
  • Survey: "That’s an incredible number of applications. Have you had any interviews in the past month?"
  • Participant: "No."
  • Survey: "Thank you for letting me know. Have you received any job offers during your search?"
  • Participant: "Once, for a job at the airport."
  • Survey: "Thanks for sharing that. Did you end up turning down that offer, and if so, would you mind sharing why?"
  • Participant: "My child's schedule cannot work with a rotating schedule and I have no one to pick him up in the morning to drop him off or for afternoon pick up and caring for him in the afternoon or evening."

Barriers bind at specific points #

This participant spends 9 hours per day searching for work, yet lack of reliable transportation remains a persistent barrier to employment:

  • Survey: "Have you had any interviews in the past month? If so, roughly how many?"
  • Participant: "Yes my car broke down."
  • Survey: "I'm sorry to hear about your car—it sounds like that added difficulty for you. Were you able to attend any interviews despite that? If so, about how many?"
  • Participant: "Yes a couple."
  • Survey: "I see. Have you received any job offers recently?"
  • Participant: "Yes but did not have transportation."

This participant describes a benefits cliff that prevents them from accepting full-time work unless immediate health coverage is available:

  • Survey: "What do you think has been the biggest challenge or barrier in your job search so far?"
  • Participant: "I would lose my health insurance if I took anything full-time, but most jobs that I found don’t offer health insurance for 90 days if at all."
  • Survey: "That sounds like a really difficult situation to navigate. What would your ideal work situation look like?"
  • Participant: "Something full-time, that paid a livable wage with health insurance and retirement."

A third pattern, related but broader, is that barriers operate at specific points in the employment process, and that specificity matters. Transportation did not simply constrain job search in the abstract; it restricted viable job geographies, prevented interview attendance, and blocked acceptance of offers already in hand. Childcare operated similarly, shaping not only whether participants could work, but whether they could accept particular schedules or remain in jobs once hired. Health limitations also surfaced in this way, constraining the types of roles participants could accept or their ability to maintain consistent schedules. In other cases, the underlying constraint was financial: wages that were nominally sufficient did not cover the combined costs of transportation, childcare, and existing obligations. Without capturing where and how these constraints bind, measures of employment barriers risk overlooking the conditions that actually shape outcomes.

A fourth pattern involves employment classification itself. Across multiple paths, participants described work arrangements that do not fit cleanly into standard employment categories. One participant was paid through a state program to care for a disabled family member but described the activity simply as "taking care of family." Another occasionally charged relatives for hair styling "if they have it or not." A third reported being trained for two under-the-table jobs but never receiving payment. Another runs a nonprofit full time without pay and would likely be classified as "not working" by standard definitions despite performing full-time labor. These are not idiosyncratic edge cases. They reflect the complexity of work arrangements that standard instruments often compress into broad categories like "self-employed" or "not in the labor force."

Work that’s hard to classify #

  • Survey: "Are your clients mostly individuals, or do you work through other arrangements?"
  • Participant: "I only do my family members hair, locs to be specific. I do not cut or color only style and resist. "
  • Survey: "Got it. So you focus on styling and retwisting locs for your family members. Is this something you charge them for, or is it more an unpaid arrangement?"
  • Participant: "Most of the time they pay me but sometimes I do free of charge."
  • Survey: "Understood. How do you typically decide when to charge for your work versus when to do it for free?"
  • Participant: "If they have it or not."
  • Survey: "You mentioned you sometimes do work for pay. In the past 3 months, what kinds of occasional work have you done?"
  • Participant: "I trained it two jobs that I never got paid for that were under the table."
  • Survey: "Thank you for sharing that. To clarify, were these training opportunities meant to lead to paid work, or are they examples of unpaid time spent working?"
  • Participant: "They were supposed to be paid jobs. They didn’t pay me."
  • Survey: "I'm sorry to hear that you weren't paid for those jobs. That’s a very difficult situation. Were these the only types of jobs you’ve done in the past three months, or have you had other gigs or occasional work during this time?"
  • Participant: "I’ve done cleaning jobs. There’s not a lot of side work because the economy sucks and people don’t have money."

The cost of flexibility #

This participant left stable employment to provide elder care and sought flexible work that could accommodate medical appointments:

  • Survey: "What do you like least or find most challenging about the food delivery work?"
  • Participant: "The fact that with current inflation people are not tipping as much so earnings have reduced significantly. I also don’t like the wait times at restaurants."
  • Survey: "What makes these challenges—lower tips and long wait times—particularly difficult for you?"
  • Participant: "Because of the longer wait times at restaurants it reduces my earnings per hour as I am able to deliver less orders. And when people tip less it also reduces my earnings. This is stressful as my income has reduced significantly and I am barely able to afford paying bills or food."
  • Survey: "That does sound stressful, and I can understand how reduced earnings would make it difficult to manage expenses. How would you describe the pacing and stress level of your work overall? For instance, does it feel manageable, overwhelming, or something in between?"
  • Participant: "There are days when it is more stressful than others. There are weeks where I feel very overwhelmed. Most of the time it’s manageable but when I can’t afford food or bills I feel an overwhelming sense of dread."
  • Survey: "It sounds like the financial strain adds significant pressure. How secure do you feel this work is in terms of being able to rely on it for income?"
  • Participant: "Not very secure. Sometimes I’d like to apply for a regular job with more secure earnings but as a caregiver I can’t do that and I feel stuck."
  • Survey: "I understand how that could leave you feeling stuck. Are there any trade-offs you feel you're making by doing this work as a food delivery driver?"
  • Participant: "I could be earning more and securing my own future for when I’m older but nothing matches the flexibility of this job."
  • Survey: "Would you say you're getting about the hours you want from this work, or would you prefer to work more hours, fewer hours, or is it hard to say because your hours vary."
  • Participant: "It’s hard to say because some days are busier than others and those days I work longer hours."
  • Survey: "How is your pay calculated as a food delivery driver? For example, do you earn per delivery, receive tips, or have other components to your earnings?"
  • Participant: "$2 per delivery plus tips"
  • Survey: "On average, about how much do you earn in a typical month from this work before taxes."
  • Participant: "My earnings used to be $2,000 a month give or take but this year I struggle to make $800 a month."

Finally, conversational probing consistently surfaced income streams that respondents did not initially frame as jobs. Activities such as brand ambassador work, content creation, occasional cleaning, market vending, and informal caregiving often appeared only after follow-up questions. In several cases, respondents initially reported a single job but later described additional sources of income when prompted. Standard survey routing typically begins with a question such as "How many jobs do you have?" and does not recover income that respondents do not think to report without probing.

These observations are preliminary. But each points to a specific kind of information—the cost of earning, the gap between receiving an offer and being able to accept it, work that resists standard classification, income that goes unreported without prompting—that has direct policy implications.

How We’re Approaching Analysis #

We are developing analytic approaches designed to extract structured, comparable data from the conversational transcripts while preserving the nuance that makes this format valuable. Our approach combines LLM-assisted coding with human review. An LLM classifier processes each conversation transcript, assigns codes from a predefined schema, and extracts verbatim evidence from the participant's own language for each classification. Human reviewers then examine those assignments, confirming them, correcting them, flagging false positives, or adding codes the model missed. The process anchors classification to a documented schema and uses human review to ensure that each code is supported by what the participant actually said.
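
In outline, that pipeline looks something like the sketch below. The schema, field names, and `classify_with_llm` call are placeholders rather than our actual tooling; the properties that matter are that every assigned code comes from the documented schema, is backed by a verbatim quote present in the transcript, and passes through human review.

```python
from dataclasses import dataclass

@dataclass
class CodeAssignment:
    code: str                      # code from the predefined schema, e.g. "transportation_barrier"
    evidence: str                  # verbatim quote from the participant's own language
    confirmed: bool | None = None  # set during human review
    reviewer_note: str = ""        # reason for a correction, used to track recurring model errors

def code_transcript(transcript: str, schema: dict[str, str], classify_with_llm) -> list[CodeAssignment]:
    """First pass: LLM assigns schema codes with verbatim evidence.

    classify_with_llm(transcript, schema) is a stand-in for the model call and is
    assumed to return a list of {"code": ..., "evidence": ...} dicts.
    """
    assignments = []
    for item in classify_with_llm(transcript, schema):
        # Discard anything outside the schema or not literally present in the transcript,
        # so classification stays anchored to the documented codes and to what was said.
        if item["code"] in schema and item["evidence"] in transcript:
            assignments.append(CodeAssignment(code=item["code"], evidence=item["evidence"]))
    return assignments

def apply_human_review(assignments: list[CodeAssignment],
                       decisions: dict[int, tuple[bool, str]]) -> list[CodeAssignment]:
    """Second pass: reviewers confirm or reject each assignment, with a note explaining why."""
    for idx, (confirmed, note) in decisions.items():
        assignments[idx].confirmed = confirmed
        assignments[idx].reviewer_note = note
    return assignments
```

The reviewer notes double as the correction log described below: tallying them across transcripts is what lets us see where the classifier repeatedly misreads certain kinds of language.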

A central challenge in this process is distinguishing meaningful patterns from artifacts of response style. Participants who write briefly may yield fewer coded items simply because less textual evidence is available, while participants who provide longer responses may generate more codes even when their underlying experiences are not more complex. The review process tracks the types and reasons for corrections across reviewed transcripts, allowing us to identify recurring patterns in how the model interprets certain kinds of language and refine the classifier accordingly.

Additionally, for the employment survey, we are comparing what the conversational and structured components captured for the same respondents. Because the hybrid design collected both conversational and multiple-choice data on related topics, we can examine where they converge and where they diverge. 
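
A minimal sketch of that comparison, assuming both components have already been reduced to one value per respondent for a given item (the record structure here is hypothetical):

```python
def convergence_counts(records):
    """records: iterable of (respondent_id, structured_value, conversational_value)
    for the same item. Counts how often the two components agree or disagree,
    and how often only one of them captured a value."""
    counts = {"agree": 0, "disagree": 0, "conversation_only": 0, "structured_only": 0}
    for _, structured, conversational in records:
        if structured is None and conversational is None:
            continue  # neither component captured the item
        if structured is None:
            counts["conversation_only"] += 1
        elif conversational is None:
            counts["structured_only"] += 1
        elif structured == conversational:
            counts["agree"] += 1
        else:
            counts["disagree"] += 1
    return counts
```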

We will share our coding framework and findings in the future, alongside a fuller discussion of the analytic methodology and its limitations.

Challenges #

These pilots are exploratory, designed to help us understand what conversational and hybrid survey approaches can and cannot do. In our previous piece, we outlined a set of open questions raised by this work: reproducibility, bias, comparability, model consistency, privacy, accountability, and equity. Those questions remain open, and answering them will require shared experimentation and more open exchange across teams working with these methods. Our pilots have made one question increasingly salient: who engages with this format, how deeply they do so, and how that shapes whose experiences are represented in the data.

Technical and access barriers #

Not everyone who encounters this format can participate equally. Conversational survey formats require sustained connectivity, a device capable of supporting an interactive chat experience, and a basic level of comfort with the technology itself. These resources are not evenly distributed. Lower-income participants, those in rural areas, and those using older devices may face structural disadvantages that have nothing to do with their willingness to engage. In our surveys, this appeared in part through technical failures, which were more common in the employment pilot than in the stress pilot and prevented some participants from completing the interaction. In the employment pilot, about 29% of opt-in participants encountered an error; approximately 57% of recorded errors were attributed to network failures, affecting 16% of all opt-in participants. While some errors likely reflect connectivity constraints, others may relate to browser or device differences, and two were identified as OpenAI server errors. Overall, the pattern of errors points to technical friction rather than a single source of failure.

Engagement within the conversation #

A second challenge is that engagement varies widely across participants. Participating in an extended written exchange requires time, attention, and a willingness to share and elaborate in this medium. Some participants provide detailed responses, while others respond briefly or disengage before the interaction is complete. The resulting dataset is not simply a record of people’s experiences. It is also a record of who was willing and able to elaborate in this format. 

Tailoring the interaction to different communication styles is one of the core promises of conversational methods, but we are not there yet. Model behavior varies in ways that affect data quality. Probing differs in depth and relevance across participants: some received sustained, well-targeted follow-up, while others with similarly complex situations received minimal probing. 

Another promise of conversational methods is the opportunity to probe inconsistencies and clarify responses that appear implausible. In our experience, however, this did not occur reliably. We observed several income responses that were almost certainly on the wrong time scale, yet the model generally accepted them without seeking clarification.

Several design adjustments could help address this. A lightweight approach would extend the prompt or surrounding logic to flag possible inconsistencies, including mismatches in assumed units, time periods, or expected ranges. When a potentially inconsistent response is detected, the system could generate a concise follow-up that presents the prior context and current answer side by side and invites the participant to confirm whether it accurately reflects their situation. For repeated surveys with the same participants, prior responses could be loaded to support more participant-specific plausibility checks, such as identifying unusually large changes from earlier surveys or inferring the respondent’s typical units of response. Such checks could help identify and clarify likely errors, although we have not yet tested these approaches.
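
As an illustration of the lightweight version, the sketch below flags a monthly income answer whose magnitude suggests the wrong time scale and drafts a side-by-side confirmation question. The thresholds and wording are untested assumptions; the point is how little machinery a basic plausibility check requires.

```python
def monthly_income_flag(amount: float, low: float = 200.0, high: float = 50_000.0) -> str | None:
    """Return a reason string if a reported monthly income looks like the wrong time scale.

    The cutoffs are illustrative, not validated thresholds.
    """
    if amount < low:
        return "the amount looks more like an hourly or daily figure than a monthly total"
    if amount > high:
        return "the amount looks more like an annual figure than a monthly total"
    return None

def confirmation_prompt(question: str, answer: str, reason: str) -> str:
    """Draft a concise side-by-side confirmation instead of silently accepting the answer."""
    return (
        f'Earlier you were asked: "{question}"\n'
        f'You answered: "{answer}"\n'
        f"Just to make sure we record this correctly ({reason}): "
        "does that figure reflect a typical month?"
    )

reason = monthly_income_flag(25.0)
if reason:
    print(confirmation_prompt(
        "About how much do you earn in a typical month from this work, before taxes?",
        "$25",
        reason,
    ))
```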

A more structural approach would introduce a supervisory layer that reviews responses as the conversation unfolds and prompts the interviewing model to resolve inconsistencies, fill gaps, or probe areas that warrant more depth before moving on. Separating the task of asking questions from the task of monitoring substantive coverage may yield more consistent results than current single-model setups, like ours, which appeared to handle both unevenly. The tradeoff, however, is added system complexity. Depending on implementation, supervisory checks could increase latency or introduce additional failure points. 
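
One way such a layer could be structured, purely as a sketch: after each participant turn, a supervisory prompt reviews the running transcript against a coverage checklist and either lets the interviewing model proceed or returns a directive to probe further or reconcile an inconsistency. The function, checklist, and prompt wording are hypothetical.

```python
def supervise_turn(transcript: list[dict], coverage_checklist: list[str], ask_llm) -> str | None:
    """Return a directive for the interviewing model, or None if the conversation may proceed.

    transcript: [{"role": "survey" | "participant", "text": ...}, ...]
    ask_llm(prompt) -> str is a stand-in for the supervisory model call.
    """
    rendered = "\n".join(f'{turn["role"]}: {turn["text"]}' for turn in transcript)
    prompt = (
        "You are monitoring a survey interview, not conducting it.\n"
        f"Topics that must be covered: {', '.join(coverage_checklist)}\n"
        f"Transcript so far:\n{rendered}\n"
        "If an answer is inconsistent with an earlier answer, or a required topic has not "
        "been probed despite enough context to do so, reply with one short instruction for "
        "the interviewer (for example, 'Ask the respondent to reconcile their reported hours "
        "with their full-time status'). Otherwise reply OK."
    )
    directive = ask_llm(prompt).strip()
    return None if directive == "OK" else directive
```

Keeping the supervisor stateless over the running transcript keeps the added machinery small, but each extra model call adds latency, which is part of the tradeoff noted above.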

What’s Next #

We are completing our analysis of the employment data and will share full results in a future publication. Analysis of the stress data, using the coding framework we developed for classifying stress sources and impacts from conversational narratives, is also underway and will be shared separately.

Methodologically, the employment experience has pointed us toward several concrete directions. One is expanding AI-assisted multiple choice — designing interactions where the model draws from a larger bank of questions to provide more adaptive structured questioning at scale, rather than the deliberately constrained version tested in the employment survey. A second is moving toward narrower conversational tasks with clearer analytic targets: instead of extended open-ended exchanges, shorter segments focused on a single classification or measurement goal. A third is developing better tools for evaluating probing quality, understanding when and why follow-up succeeds or fails across different communication styles and participant situations. 

Across two pilots and nearly four thousand survey interactions, we are seeing that conversational, adaptive approaches can surface information that traditional instruments often miss: the structural reasons behind non-standard work arrangements, the cascading relationships between stressors, the gap between a job offer received and one that can actually be accepted, and income streams people do not initially think to report. These methods have the potential to improve categorization accuracy and generate richer data at scale, but realizing that potential requires careful design regarding who engages and how, along with analytical approaches capable of handling the volume and variability of responses without sacrificing consistency or comparability. Without this, conversational systems risk amplifying the voices they reach most easily while muting others. Differences in technological access, communication style, and willingness to engage with AI-mediated research all shape whose experiences are represented in the data. These are not footnotes to the methodology. They are design challenges that must be addressed for this approach to deliver on its promise.

We remain committed to sharing our findings openly, reporting both what worked well and what we would refine in future iterations. The questions raised by this work are best addressed collaboratively, and we are eager to learn from others testing similar approaches in different contexts and with different populations. Ultimately, the goal is not to replace traditional surveys, but to extend them—developing tools that are rigorous enough to rely on, flexible enough to capture how people actually live and work, and transparent enough to earn the trust of those who share their experiences.

  1. Excludes 19 of 2,139 respondents who began the survey but exited before reaching the opt-in question.