Tracking AI Citation Performance

Tracking AI Citation Performance refers to the systematic monitoring, measurement, and analysis of how artificial intelligence systems attribute, reference, and utilize source materials when generating responses or content [1][2]. This emerging discipline addresses the critical need for transparency and accountability in AI-generated outputs, particularly as large language models and generative AI systems become increasingly integrated into research, content creation, and information retrieval workflows [3]. The primary purpose is to establish reliable metrics for evaluating whether AI systems properly acknowledge sources, maintain attribution accuracy, and provide verifiable references that users can trace back to original materials [1][4]. This matters because citation integrity directly impacts the trustworthiness of AI systems, influences their adoption in academic and professional contexts, and determines whether these technologies can meet scholarly standards for attribution and intellectual property recognition [2][5].

Overview

The emergence of tracking AI citation performance as a distinct field stems from the rapid proliferation of large language models (LLMs) and their deployment in knowledge-intensive applications beginning in the early 2020s [1][3]. As organizations integrated AI systems into research workflows, content generation pipelines, and decision-support tools, a fundamental challenge became apparent: these systems frequently generated plausible-sounding content without reliable attribution to source materials, and in some cases, fabricated entirely fictitious citations that appeared legitimate but referenced non-existent sources [2][7].

The fundamental problem this practice addresses is the tension between the probabilistic nature of neural language generation and the deterministic requirements of scholarly attribution [1][10]. Traditional citation practices evolved over centuries within human scholarly communities, governed by established conventions and ethical norms. AI systems, however, generate text through statistical patterns learned from training data, creating outputs that may synthesize information from multiple sources in ways that defy straightforward attribution [3][11]. This challenge intensified as retrieval-augmented generation (RAG) systems emerged, combining language models with external knowledge retrieval to ground responses in verifiable sources [5][9].

The practice has evolved significantly from early ad-hoc evaluations to sophisticated frameworks incorporating automated verification, human evaluation protocols, and continuous monitoring systems [1][7]. Initial approaches focused primarily on citation accuracy—whether cited sources existed and contained attributed information. Contemporary frameworks now encompass multidimensional assessment including citation relevance, source diversity, temporal consistency, and the appropriateness of attribution density for different content types and user contexts [2][10].

Key Concepts

Citation Recall and Precision

Citation recall measures the proportion of relevant sources that an AI system actually cites when generating content, while citation precision measures the proportion of emitted citations that accurately represent the sources they reference [1][2]. These metrics, borrowed from information retrieval, provide fundamental quantitative indicators of citation performance. High recall ensures comprehensive attribution, while high precision prevents misleading or fabricated references.

Example: A medical AI assistant generating a summary about diabetes treatment might consult 15 relevant clinical studies in its retrieval process. If the system cites 12 of these studies in its response, the citation recall is 80%. If 11 of those 12 citations accurately represent information from the cited sources while one citation misattributes findings, the citation precision is approximately 92%. An organization tracking this system's performance would monitor these metrics across thousands of queries to identify patterns—for instance, discovering that citation precision drops to 75% for queries about emerging treatments where source materials are more ambiguous.
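The sketch below shows one way these two metrics might be computed from per-query logs; the set-based bookkeeping and the source identifiers are illustrative assumptions rather than a standard implementation.

```python
def citation_recall(relevant_sources: set[str], cited_sources: set[str]) -> float:
    """Fraction of relevant sources that the system actually cited."""
    if not relevant_sources:
        return 1.0
    return len(cited_sources & relevant_sources) / len(relevant_sources)

def citation_precision(cited_sources: set[str], accurate_citations: set[str]) -> float:
    """Fraction of emitted citations that accurately represent their sources."""
    if not cited_sources:
        return 1.0
    return len(accurate_citations & cited_sources) / len(cited_sources)

# The diabetes example above: 15 relevant studies, 12 cited, 11 of those accurate.
relevant = {f"study-{i}" for i in range(15)}
cited = {f"study-{i}" for i in range(12)}
accurate = {f"study-{i}" for i in range(11)}
print(f"recall = {citation_recall(relevant, cited):.0%}")       # 80%
print(f"precision = {citation_precision(cited, accurate):.0%}")  # 92%
```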

Retrieval-Augmented Generation (RAG)

Retrieval-augmented generation combines neural language models with external knowledge retrieval systems, enabling AI to ground responses in specific, verifiable sources rather than relying solely on patterns learned during training [5][9]. RAG architectures retrieve relevant documents or passages in response to queries, then condition the language model's generation on this retrieved context, creating natural opportunities for citation.

Example: A legal research AI using RAG receives a query about precedents for employment discrimination cases. The retrieval component searches a database of case law, identifying and ranking relevant decisions such as Griggs v. Duke Power Co. and McDonnell Douglas Corp. v. Green. The system retrieves specific passages from these cases, feeds them as context to the language model, and generates a response that synthesizes the legal principles while citing the specific cases and even pinpointing relevant page numbers or paragraph references. Tracking systems monitor whether the retrieval component consistently surfaces the most relevant precedents and whether the generation component accurately attributes legal principles to the correct cases.
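A minimal sketch of how a RAG pipeline can carry source identifiers from retrieval through to the cited answer, assuming a toy keyword retriever and a placeholder generation step (a real system would call an actual retriever and a language model):

```python
from dataclasses import dataclass

@dataclass
class Passage:
    doc_id: str    # e.g. "Griggs v. Duke Power Co., 401 U.S. 424 (1971)" or a DOI
    text: str
    score: float   # retriever relevance score

def retrieve(query: str, index: list[Passage], k: int = 3) -> list[Passage]:
    """Toy retriever: rank passages by keyword overlap with the query."""
    terms = set(query.lower().split())
    scored = [Passage(p.doc_id, p.text, len(terms & set(p.text.lower().split()))) for p in index]
    return sorted(scored, key=lambda p: p.score, reverse=True)[:k]

def answer_with_citations(query: str, index: list[Passage]) -> dict:
    """Condition generation on retrieved passages and keep their ids as citations."""
    passages = retrieve(query, index)
    context = "\n".join(p.text for p in passages)
    draft = f"[answer generated from {len(passages)} retrieved passages]"  # stand-in for an LLM call
    return {"answer": draft, "citations": [p.doc_id for p in passages], "context": context}
```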

Attribution Fidelity

Attribution fidelity measures the strength and accuracy of the connection between generated content and cited sources, going beyond simple verification to assess whether citations appropriately support the claims they're meant to substantiate [1][10]. This concept addresses subtle misattribution where a source is cited correctly but doesn't actually support the specific claim being made.

Example: An AI system generating content about climate change might cite a 2023 IPCC report when stating "global temperatures have risen 1.1°C since pre-industrial times." While the IPCC report exists and discusses temperature increases, tracking systems verify not just that the citation is real, but that the specific figure and timeframe match what the report actually states. If the system cited the report for a claim about sea level rise when that particular report focused on atmospheric temperatures, attribution fidelity would be low despite the citation being technically valid. Organizations tracking this metric might discover that their system maintains 95% attribution fidelity for direct factual claims but only 70% fidelity for interpretive or synthesized statements.
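One common way to approximate attribution fidelity is to score semantic similarity between a generated claim and the passage it cites. The sketch below assumes the sentence-transformers package and the all-MiniLM-L6-v2 model; the 0.6 threshold is an arbitrary illustration, and similarity is only a proxy for true entailment.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def supports_claim(claim: str, cited_passage: str, threshold: float = 0.6) -> tuple[bool, float]:
    """Score how well the cited passage matches the claim it is supposed to support."""
    embeddings = model.encode([claim, cited_passage], convert_to_tensor=True)
    similarity = float(util.cos_sim(embeddings[0], embeddings[1]))
    return similarity >= threshold, similarity

claim = "Global surface temperatures have risen about 1.1 degrees Celsius since pre-industrial times."
passage = "Global surface temperature was 1.1 C higher in 2011-2020 than in 1850-1900."
supported, score = supports_claim(claim, passage)
print(supported, round(score, 2))
```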

Hallucinated Citations

Hallucinated citations occur when AI systems generate plausible-looking references to sources that don't exist or misrepresent actual sources in ways that make verification impossible [2][7]. This represents one of the most serious citation performance failures, as it undermines trust and can mislead users who assume cited sources are legitimate.

Example: A research assistant AI asked about recent advances in quantum computing might generate a response citing "Johnson, M. et al. (2023). 'Breakthrough in Room-Temperature Quantum Coherence.' Nature Quantum Information, 8(4), 234-247." The citation appears professionally formatted with realistic journal name, volume, and page numbers. However, tracking systems attempting to verify this citation discover that no such article exists, the journal volume doesn't match 2023 publication patterns, and no author named Johnson published on this topic that year. Organizations implement detection systems that flag citations for verification before presenting them to users, maintaining databases of known valid sources, and training models with explicit penalties for citation fabrication.
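A first line of defense is simply checking whether a cited identifier resolves anywhere. The sketch below checks a DOI against the public CrossRef REST API using the requests library; this illustrates one possible check, and production systems would combine it with PubMed, publisher APIs, and internal allowlists.

```python
import requests

def doi_exists(doi: str, timeout: float = 2.0) -> bool:
    """True if CrossRef resolves the DOI; False if it is unknown or unreachable."""
    try:
        resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=timeout)
    except requests.RequestException:
        return False  # treat network failures as "not verified" rather than "fabricated"
    return resp.status_code == 200

print(doi_exists("10.1038/nature14539"))    # a real DOI resolves
print(doi_exists("10.99999/fabricated.1"))  # a fabricated DOI does not
```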

Source Authority and Ranking

Source authority refers to the credibility, reliability, and appropriateness of cited sources for particular claims, while ranking determines the priority and ordering of multiple potential citations [4][5]. Tracking systems assess whether AI systems preferentially cite authoritative sources and appropriately rank citations when multiple references are available.

Example: An AI system answering questions about vaccine efficacy might have access to peer-reviewed studies in medical journals, government health agency reports, news articles, and social media posts. A well-performing system would prioritize citations from randomized controlled trials published in journals like The Lancet or JAMA, followed by CDC or WHO guidance, while deprioritizing or excluding social media sources. Tracking reveals that when the system cites three sources for a claim about vaccine effectiveness, it ranks a peer-reviewed meta-analysis first, a CDC report second, and a news article third—demonstrating appropriate authority-based ranking. Organizations monitor whether this ranking behavior remains consistent across topics and whether the system ever inappropriately elevates low-authority sources.
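A simple way to encode this behavior is an authority-weighted re-ranking step; the source types and weights below are assumptions chosen for illustration, not a published standard.

```python
AUTHORITY_WEIGHTS = {
    "peer_reviewed_meta_analysis": 1.0,
    "randomized_controlled_trial": 0.9,
    "government_health_agency": 0.8,
    "news_article": 0.4,
    "social_media": 0.1,
}

def rank_citations(candidates: list[dict]) -> list[dict]:
    """Order candidate sources by authority weight first, retrieval relevance second."""
    return sorted(
        candidates,
        key=lambda c: (AUTHORITY_WEIGHTS.get(c["source_type"], 0.0), c["relevance"]),
        reverse=True,
    )

candidates = [
    {"id": "news-123",         "source_type": "news_article",                "relevance": 0.95},
    {"id": "cdc-report-2023",  "source_type": "government_health_agency",    "relevance": 0.80},
    {"id": "lancet-meta-2022", "source_type": "peer_reviewed_meta_analysis", "relevance": 0.85},
]
print([c["id"] for c in rank_citations(candidates)])
# ['lancet-meta-2022', 'cdc-report-2023', 'news-123']
```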

Temporal Consistency

Temporal consistency measures whether citation performance remains stable over time as models are updated, training data evolves, and source materials change [1][11]. This concept addresses the challenge that AI systems may exhibit citation performance degradation as they're modified or as the information landscape shifts.

Example: A financial analysis AI system demonstrates 88% citation accuracy in January 2024 when first deployed. By June 2024, after several model updates and changes to the underlying document database, tracking systems detect that citation accuracy has declined to 79%. Investigation reveals that recent updates prioritized response speed over retrieval thoroughness, and that several previously reliable financial data sources changed their URL structures, breaking citation links. The organization's tracking infrastructure flagged this degradation through statistical process control, triggering alerts when performance dropped below established thresholds and enabling the team to diagnose and address the underlying causes before users experienced significant impact.

Citation Density and Appropriateness

Citation density refers to the number and distribution of citations relative to content length and complexity, while appropriateness assesses whether citation levels match user needs and content type [2][10]. Tracking systems evaluate whether AI systems provide sufficient attribution without overwhelming users with excessive references.

Example: An educational AI tutoring system generates explanations for high school students learning about photosynthesis. For a 200-word explanation of basic concepts, the system includes two citations to authoritative biology textbooks—a density of one citation per 100 words. When the same system responds to a graduate student's query about the quantum efficiency of photosystem II, it generates a 300-word response with eight citations to recent research papers—approximately one citation per 37 words. Tracking systems monitor whether citation density appropriately scales with content complexity and user expertise, discovering through A/B testing that high school students engage more with moderately cited content (1-2 citations per explanation) while graduate students expect and value higher citation density (5-8 citations for technical topics).
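Citation density is straightforward to track; the sketch below computes citations per 100 words and compares the result to audience-specific target ranges, which are illustrative values rather than established norms.

```python
DENSITY_TARGETS = {          # citations per 100 words: (min, max), illustrative values
    "novice": (0.5, 1.5),
    "expert": (2.0, 4.0),
}

def citation_density(text: str, num_citations: int) -> float:
    """Citations per 100 words of generated text."""
    words = max(len(text.split()), 1)
    return num_citations / words * 100

def density_in_range(text: str, num_citations: int, audience: str) -> bool:
    lo, hi = DENSITY_TARGETS[audience]
    return lo <= citation_density(text, num_citations) <= hi

# The 200-word explanation with two citations from the example above: density 1.0.
print(density_in_range("word " * 200, 2, "novice"))  # True
```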

Applications in AI-Powered Research and Content Systems

Academic Research Assistance

AI systems supporting academic research require rigorous citation tracking to meet scholarly standards and prevent academic integrity violations [1][3]. These applications implement comprehensive tracking frameworks that monitor citation accuracy, source authority, and attribution completeness across literature reviews, research summaries, and hypothesis generation.

A university deploys an AI research assistant to help doctoral students conduct literature reviews in biomedical engineering. The tracking system monitors every generated summary, verifying that cited papers exist in databases like PubMed and Google Scholar, confirming that attributed findings actually appear in the cited papers, and assessing whether the system appropriately cites seminal works versus recent publications. Over six months, tracking reveals that the system maintains 94% citation accuracy for established research areas but only 81% accuracy for emerging topics published within the past year, leading to targeted improvements in the retrieval system's handling of recent publications.

Legal Research and Case Analysis

Legal AI applications demand exceptional citation performance due to the profession's strict attribution requirements and the consequences of citing incorrect precedents [4][7]. Tracking systems in this domain verify not only that cases are cited correctly but that citations include proper legal citation format, pinpoint references to relevant sections, and accurately represent legal holdings.

A law firm implements an AI system for case law research, with tracking infrastructure that validates every citation against official court databases, verifies that quoted language exactly matches court opinions, and confirms that case holdings are accurately represented. The system tracks citation performance across different legal domains, discovering that citation accuracy reaches 97% for well-established areas like contract law but drops to 89% for rapidly evolving areas like cryptocurrency regulation, where precedents are sparse and legal principles are still developing.

Medical Information and Clinical Decision Support

Healthcare AI systems require meticulous citation tracking because inaccurate medical information can directly harm patients [2][5]. These applications implement multi-layered verification that confirms citations reference peer-reviewed medical literature, clinical guidelines from recognized authorities, and evidence-based treatment protocols.

A hospital system deploys an AI clinical decision support tool that provides treatment recommendations with supporting citations. The tracking framework verifies that all cited studies are indexed in medical databases, confirms that cited treatment protocols come from organizations like the American Medical Association or specialty-specific professional societies, and monitors whether the system appropriately indicates evidence levels (randomized controlled trials versus observational studies versus case reports). Tracking over 10,000 clinical queries reveals that the system maintains 96% citation accuracy for established treatments but requires human review for novel therapies where evidence is limited.

Journalism and Fact-Checking Applications

AI systems supporting journalism must balance speed with accuracy, requiring tracking systems that verify source credibility, detect potential misinformation, and ensure proper attribution of quotes and factual claims [7][10]. These applications often implement real-time citation verification that flags questionable sources before content is published.

A news organization uses an AI system to draft initial versions of breaking news stories, with citation tracking that verifies every factual claim against trusted news sources, official statements, and verified social media accounts. The tracking system maintains a database of source authority scores, prioritizing citations from established news agencies, government sources, and verified eyewitness accounts while flagging citations from unverified social media or known misinformation sources. Analysis of 500 AI-drafted stories reveals that the system appropriately cites primary sources 87% of the time but occasionally relies on secondary reporting when primary sources aren't immediately available in the retrieval database.

Best Practices

Implement Multi-Layered Verification

Organizations should deploy multiple complementary verification methods rather than relying on single approaches, combining automated checks with human evaluation and user feedback mechanisms [1][7]. The rationale is that different verification methods catch different types of errors—automated systems excel at scale and consistency but miss nuanced misattributions, while human evaluators catch subtle errors but can't review every output.

Implementation Example: A research AI platform implements three verification layers: (1) automated real-time checks that verify cited sources exist and are accessible, flagging any citations that can't be retrieved within 2 seconds; (2) daily batch processing that uses text similarity algorithms to confirm that attributed information actually appears in cited sources, sampling 5% of all citations; and (3) weekly human expert review of 100 randomly selected responses, assessing citation appropriateness and completeness. This multi-layered approach catches 98% of citation errors compared to 76% when using only automated verification.

Establish Domain-Specific Benchmarks

Citation performance standards should reflect the specific requirements and conventions of different domains rather than applying universal metrics across all contexts [2][4]. Different fields have distinct citation expectations—medical literature requires peer-reviewed sources, legal research demands precise case citations, and general knowledge queries may appropriately cite encyclopedic sources.

Implementation Example: A multi-domain AI platform establishes separate citation benchmarks for each major application area: medical queries require 95% citation accuracy with sources limited to peer-reviewed journals and clinical guidelines; legal queries require 98% accuracy with proper legal citation format; general knowledge queries accept 90% accuracy with broader source types including reputable encyclopedias and educational websites. The tracking system applies appropriate benchmarks based on query classification, preventing inappropriate evaluation of general knowledge responses against medical research standards.
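A minimal sketch of such benchmark routing, with a placeholder keyword classifier and hypothetical thresholds mirroring the figures above:

```python
DOMAIN_BENCHMARKS = {
    "medical": {"min_accuracy": 0.95, "allowed_sources": {"peer_reviewed", "clinical_guideline"}},
    "legal":   {"min_accuracy": 0.98, "allowed_sources": {"court_opinion", "statute"}},
    "general": {"min_accuracy": 0.90, "allowed_sources": {"encyclopedia", "educational_site", "peer_reviewed"}},
}

def classify_query(query: str) -> str:
    """Placeholder keyword classifier; a real system would use a trained classifier."""
    q = query.lower()
    if any(term in q for term in ("dose", "treatment", "diagnosis", "vaccine")):
        return "medical"
    if any(term in q for term in ("precedent", "statute", "liability", "plaintiff")):
        return "legal"
    return "general"

def evaluate_against_benchmark(query: str, measured_accuracy: float, source_types: set[str]) -> dict:
    """Check a measured result against the benchmark for the query's domain."""
    domain = classify_query(query)
    bench = DOMAIN_BENCHMARKS[domain]
    return {
        "domain": domain,
        "accuracy_ok": measured_accuracy >= bench["min_accuracy"],
        "sources_ok": source_types <= bench["allowed_sources"],
    }

print(evaluate_against_benchmark("What is the recommended vaccine schedule?", 0.93, {"peer_reviewed"}))
# {'domain': 'medical', 'accuracy_ok': False, 'sources_ok': True}
```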

Monitor Temporal Trends and Trigger Alerts

Organizations should implement continuous monitoring with statistical process control that detects performance degradation and triggers alerts when metrics deviate from established baselines [1][11]. This practice enables rapid response to emerging issues before they significantly impact users.

Implementation Example: A content generation platform establishes baseline citation accuracy of 91% based on three months of initial deployment data, with control limits set at ±3 standard deviations. The monitoring system tracks daily citation accuracy across 1,000 randomly sampled responses. When accuracy drops to 87% for three consecutive days, breaching the lower control limit, the system automatically alerts the engineering team and increases human review sampling from 5% to 20% until the issue is diagnosed. Investigation reveals that a recent update to the retrieval system inadvertently changed ranking parameters; the misconfiguration is corrected within 48 hours.
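The control-limit logic itself can be quite small. The sketch below derives limits from a baseline window and alerts after several consecutive out-of-control days; the baseline values and alerting parameters are hypothetical.

```python
from statistics import mean, stdev

def control_limits(baseline: list[float], k: float = 3.0) -> tuple[float, float]:
    """Lower/upper control limits at baseline mean plus or minus k standard deviations."""
    mu, sigma = mean(baseline), stdev(baseline)
    return mu - k * sigma, mu + k * sigma

def should_alert(recent: list[float], lower_limit: float, consecutive: int = 3) -> bool:
    """Alert when the last `consecutive` daily accuracy values all breach the lower limit."""
    tail = recent[-consecutive:]
    return len(tail) == consecutive and all(a < lower_limit for a in tail)

baseline_daily_accuracy = [0.91, 0.92, 0.90, 0.91, 0.92, 0.90, 0.91]  # abridged baseline window
lower, upper = control_limits(baseline_daily_accuracy)
recent_daily_accuracy = [0.90, 0.88, 0.87, 0.87, 0.87]
print(f"lower limit = {lower:.3f}, alert = {should_alert(recent_daily_accuracy, lower)}")
```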

Incorporate User Feedback Loops

Tracking systems should capture and systematically analyze user reports of citation errors, integrating this feedback into performance metrics and improvement priorities [7][10]. Users often identify citation problems that automated systems miss, particularly nuanced misattributions or contextually inappropriate citations.

Implementation Example: An AI research assistant implements a citation feedback mechanism where users can flag problematic citations with categories: "source doesn't exist," "source doesn't support claim," "better source available," or "citation formatting error." The system tracks feedback rates (percentage of responses where users flag citations) and categorizes issues. Analysis reveals that 3% of responses receive citation feedback, with "source doesn't support claim" representing 60% of reports. This insight drives development of improved attribution fidelity verification, reducing this category of errors by 40% over three months.
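Aggregating such feedback is largely bookkeeping; the sketch below tallies flags by category and computes the feedback rate, with category names assumed to match the hypothetical flag options described above.

```python
from collections import Counter

FEEDBACK_CATEGORIES = {
    "source_missing",     # "source doesn't exist"
    "claim_unsupported",  # "source doesn't support claim"
    "better_source",      # "better source available"
    "format_error",       # "citation formatting error"
}

def feedback_summary(flags: list[str], total_responses: int) -> dict:
    """Compute the feedback rate and each category's share of all flags."""
    counts = Counter(f for f in flags if f in FEEDBACK_CATEGORIES)
    flagged = sum(counts.values())
    return {
        "feedback_rate": flagged / max(total_responses, 1),
        "category_share": {c: counts[c] / max(flagged, 1) for c in sorted(FEEDBACK_CATEGORIES)},
    }

# Hypothetical month: 30 flags over 1,000 responses -> 3% feedback rate,
# with "claim_unsupported" accounting for 60% of reports.
flags = ["claim_unsupported"] * 18 + ["source_missing"] * 6 + ["better_source"] * 4 + ["format_error"] * 2
print(feedback_summary(flags, total_responses=1000))
```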

Implementation Considerations

Tool and Technology Selection

Organizations must evaluate and integrate appropriate tools for citation verification, source validation, and performance analytics [5][9]. Choices include academic citation databases (CrossRef, PubMed, Google Scholar APIs), fact-checking services, text similarity engines, and custom-built verification systems. The selection depends on domain requirements, scale, budget, and integration complexity.

Example: A biomedical AI company implements citation tracking using PubMed APIs for verifying medical literature citations, CrossRef for DOI validation, and a custom-built text similarity service using sentence transformers to confirm that attributed findings appear in cited papers. For 100,000 monthly queries, API costs total approximately $2,000 monthly, while the custom similarity service runs on dedicated infrastructure costing $5,000 monthly. This investment is justified by the criticality of citation accuracy in medical applications and the reputational risk of citation errors.

Audience-Specific Customization

Citation tracking and presentation should adapt to different user audiences, recognizing that experts require different citation density and detail than general users [2][10]. Implementation involves user profiling, adaptive citation systems, and A/B testing to optimize citation presentation for different segments.

Example: An educational AI platform implements user profiling that categorizes users as "novice" (high school students), "intermediate" (undergraduate students), or "expert" (graduate students and researchers). For novice users, the system provides 1-2 citations per response with simplified formatting and explanatory context about why sources are trustworthy. For expert users, the system provides comprehensive citations with full bibliographic details, DOIs, and direct links to source materials. Tracking reveals that novice users engage with cited sources 12% of the time, while experts engage 47% of the time, validating the differentiated approach.
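Presentation differences of this kind can be expressed as a small rendering function keyed on user tier; the field names, tiers, and sample record below are placeholders.

```python
def render_citation(citation: dict, user_tier: str) -> str:
    """Render a citation for a given user tier ('novice' or 'expert')."""
    if user_tier == "novice":
        return f"Learn more: {citation['title']} ({citation['publisher']})"
    # Expert tier: full bibliographic detail plus a DOI link.
    return (f"{citation['authors']} ({citation['year']}). {citation['title']}. "
            f"{citation['venue']}. https://doi.org/{citation['doi']}")

sample = {
    "title": "Photosynthesis and the light-dependent reactions",
    "publisher": "an introductory biology textbook",
    "authors": "Doe, J., et al.",
    "year": 2021,
    "venue": "Example Biology Press",
    "doi": "10.0000/placeholder",
}
print(render_citation(sample, "novice"))
print(render_citation(sample, "expert"))
```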

Organizational Maturity and Resource Allocation

Implementation scope should align with organizational maturity, available resources, and risk tolerance [1][4]. Early-stage implementations might focus on basic citation verification, while mature organizations with high-stakes applications require comprehensive tracking frameworks.

Example: A startup developing a general-purpose AI assistant begins with basic citation tracking: automated verification that cited URLs are accessible and manual review of 2% of responses. As the company grows and enters enterprise markets, it progressively enhances tracking: implementing text similarity verification (month 6), establishing domain-specific benchmarks (month 12), deploying real-time monitoring dashboards (month 18), and creating a dedicated citation quality team (month 24). This phased approach aligns investment with business growth and increasing customer expectations.

Integration with Model Development Lifecycle

Citation performance tracking should integrate with model development, training, and deployment processes rather than being treated as a separate post-deployment concern [3][11]. This involves incorporating citation metrics into model evaluation during development, using citation performance data to inform training objectives, and establishing citation quality gates for production deployment.

Example: An AI research lab developing a new language model incorporates citation performance into the development lifecycle: establishing citation accuracy benchmarks that new model versions must meet before deployment (minimum 90% accuracy on standard test sets), using reinforcement learning from human feedback (RLHF) with citation quality as an explicit reward signal, and conducting citation-focused red-teaming where evaluators specifically attempt to elicit hallucinated citations. Models that fail to meet citation benchmarks are not deployed to production, ensuring that citation performance is treated as a core capability rather than an optional feature.
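Such a gate can be expressed as a simple pre-deployment check; the metric names and thresholds below are assumptions (only the 90% accuracy floor appears in the example above).

```python
GATES = {
    "citation_accuracy_min": 0.90,     # from the example above
    "attribution_fidelity_min": 0.85,  # assumed
    "hallucination_rate_max": 0.02,    # assumed
}

def passes_quality_gate(eval_results: dict) -> tuple[bool, list[str]]:
    """Return (deployable?, list of failed gates) for a candidate model."""
    failures = []
    if eval_results["citation_accuracy"] < GATES["citation_accuracy_min"]:
        failures.append("citation_accuracy below 0.90")
    if eval_results["attribution_fidelity"] < GATES["attribution_fidelity_min"]:
        failures.append("attribution_fidelity below 0.85")
    if eval_results["hallucination_rate"] > GATES["hallucination_rate_max"]:
        failures.append("hallucination_rate above 0.02")
    return (not failures), failures

ok, reasons = passes_quality_gate(
    {"citation_accuracy": 0.92, "attribution_fidelity": 0.83, "hallucination_rate": 0.01}
)
print(ok, reasons)  # False ['attribution_fidelity below 0.85']
```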

Common Challenges and Solutions

Challenge: Hallucinated Citations

AI systems frequently generate plausible-looking but entirely fabricated citations, particularly when asked about topics where genuine sources are sparse or when the model lacks relevant information [2][7]. These hallucinated citations appear professionally formatted with realistic author names, publication years, journal titles, and page numbers, making them difficult for users to identify without verification. The challenge intensifies because hallucination rates vary unpredictably across topics and query types, making it difficult to predict when the system will fabricate references.

Solution:

Implement multi-stage verification before presenting citations to users, combining real-time source validation with confidence scoring and selective withholding [1][10]. Organizations should maintain databases of known valid sources (DOIs, verified URLs, academic database identifiers) and check generated citations against these databases before display. For citations that can't be immediately verified, implement confidence scoring based on factors like source retrievability, text similarity between generated content and retrieved sources, and model uncertainty indicators. Establish thresholds where low-confidence citations are flagged with warnings ("This source could not be verified—please confirm before relying on this information") or withheld entirely, with the system instead indicating "I don't have verified sources for this claim."

A practical implementation involves integrating citation verification into the generation pipeline: after the model generates a response with citations, a verification service attempts to retrieve each cited source within a 2-second timeout. Successfully retrieved sources are validated (confirming they contain attributed information), while unretrievable sources are flagged. The system then presents verified citations normally, flags uncertain citations with warnings, and removes citations that fail verification entirely, replacing them with acknowledgment of uncertainty. Organizations implementing this approach report reducing hallucinated citations reaching users by 85-95%.
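A condensed sketch of that pipeline stage, assuming the requests library; the support check is a deliberate placeholder for a similarity or entailment model, and the warning text echoes the wording suggested above.

```python
import requests

UNVERIFIED_NOTE = "This source could not be verified - please confirm before relying on this information."

def claim_supported(claim: str, page_text: str) -> bool:
    """Placeholder support check requiring rough lexical overlap; swap in a similarity or entailment model."""
    terms = set(claim.lower().split())
    return len(terms & set(page_text.lower().split())) >= max(1, len(terms) // 2)

def verify_citation(url: str, claim: str, timeout: float = 2.0) -> str:
    """Return 'verified', 'unverified', or 'failed' for one citation."""
    try:
        resp = requests.get(url, timeout=timeout)
    except requests.RequestException:
        return "unverified"            # unreachable within the time budget
    if resp.status_code != 200:
        return "failed"                # nothing exists at the cited location
    return "verified" if claim_supported(claim, resp.text) else "unverified"

def filter_citations(citations: list[dict]) -> list[dict]:
    """Keep verified citations, attach a warning to unverified ones, drop failed ones."""
    kept = []
    for c in citations:
        status = verify_citation(c["url"], c["claim"])
        if status == "failed":
            continue                   # caller should acknowledge the missing source
        if status == "unverified":
            c = {**c, "warning": UNVERIFIED_NOTE}
        kept.append(c)
    return kept
```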

Challenge: Source Access and Licensing Constraints

Verifying citations requires accessing cited sources, but many valuable sources are behind paywalls, subject to licensing restrictions, or dynamically changing [4][5]. Organizations may lack legal access to verify citations from academic journals, proprietary databases, or subscription-based news sources. Even when access exists, sources may be updated or removed, breaking citation links and making historical verification impossible.

Solution:

Establish strategic partnerships with content providers, implement caching strategies for verification purposes, and develop tiered verification approaches based on source accessibility [1][9]. Organizations should negotiate API access agreements with major content providers (academic publishers, news organizations, professional databases) that enable citation verification without violating licensing terms. For sources where direct access isn't feasible, implement metadata-based verification that confirms sources exist and match expected patterns (correct journal volume/issue for publication year, author names matching known researchers) even when full-text verification isn't possible.

Implement caching systems that store snapshots of cited sources at the time of citation, enabling future verification even if sources change or become unavailable. This requires careful attention to copyright and fair use considerations, typically limiting caching to metadata and small excerpts sufficient for verification rather than full documents. For sources that can't be accessed or cached, implement transparent disclosure, indicating to users which citations have been fully verified versus those confirmed only through metadata checks.

A research AI platform implements this through tiered verification: Tier 1 (full verification) for sources with API access, confirming attributed information appears in the source; Tier 2 (metadata verification) for paywalled sources, confirming the source exists and metadata matches; Tier 3 (format verification) for inaccessible sources, confirming citation formatting is plausible. The system displays verification status to users, enabling them to assess citation reliability appropriately.
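The tier assignment can be captured in a small dispatch function; the helper fields below stand in for real publisher-API and metadata integrations and are assumptions for illustration.

```python
import re

def verification_tier(source: dict) -> str:
    """Assign a verification tier based on what access and evidence are available."""
    if source.get("api_accessible") and source.get("fulltext_match") is not None:
        return "tier1_full" if source["fulltext_match"] else "tier1_failed"
    if source.get("metadata_found"):
        return "tier2_metadata"
    # Tier 3: only check that the identifier looks like a plausible DOI.
    looks_like_doi = bool(re.match(r"^10\.\d{4,9}/\S+$", source.get("doi", "")))
    return "tier3_format" if looks_like_doi else "tier3_unverifiable"

sources = [
    {"doi": "10.1000/abc123", "api_accessible": True, "fulltext_match": True},
    {"doi": "10.1000/def456", "metadata_found": True},
    {"doi": "not-a-doi"},
]
print([verification_tier(s) for s in sources])
# ['tier1_full', 'tier2_metadata', 'tier3_unverifiable']
```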

Challenge: Balancing Citation Thoroughness with User Experience

Comprehensive citation supports transparency and verification but can overwhelm users with excessive references, particularly for complex topics requiring multiple sources [2][10]. Users may ignore or distrust responses that appear cluttered with citations, while insufficient citation undermines credibility and prevents verification. The optimal citation density varies significantly across user expertise levels, content types, and use cases, making one-size-fits-all approaches ineffective.

Solution:

Implement adaptive citation systems that adjust density and presentation based on user context, content type, and interaction patterns [7][11]. Use progressive disclosure where initial responses include primary citations (typically 2-4 key sources) with options for users to expand and view additional references. Develop user profiling that infers expertise levels from interaction history and adjusts citation presentation accordingly—novice users receive fewer, more carefully selected citations with explanatory context, while expert users receive comprehensive citations with full bibliographic details.

Conduct A/B testing to optimize citation density for different user segments and content types. Test variations including citation count (1-2 vs. 3-5 vs. 6+ citations), presentation format (inline vs. footnotes vs. expandable sections), and detail level (simple URLs vs. full bibliographic citations). Measure engagement metrics including time spent reading responses, click-through rates on citations, and user satisfaction ratings.

An educational AI platform implements this through user-adaptive citation: new users receive responses with 1-2 carefully selected citations presented as "Learn more" links with brief source descriptions. As users demonstrate engagement with citations (clicking through, spending time on source materials), the system progressively increases citation density and detail. Expert users who consistently engage with citations receive comprehensive references with full bibliographic information. This approach increases overall citation engagement by 34% compared to fixed citation density.

Challenge: Domain-Specific Citation Requirements

Citation conventions, expectations, and standards vary dramatically across domains—medical literature requires peer-reviewed sources, legal research demands precise case citations with pinpoint references, journalism prioritizes primary sources and verification, while general knowledge queries may appropriately cite encyclopedic sources [2][4]. Applying uniform citation standards across domains results in either overly restrictive requirements for general queries or insufficiently rigorous standards for specialized domains.

Solution:

Develop domain-specific citation frameworks with tailored metrics, source authority hierarchies, and verification requirements [1][5]. Implement query classification that identifies the domain and stakes level of each request, routing queries to appropriate citation frameworks. For medical queries, enforce requirements that sources must be peer-reviewed publications, clinical guidelines from recognized authorities, or government health agencies, with verification confirming that cited studies support specific medical claims. For legal queries, require proper legal citation format (Bluebook or jurisdiction-specific standards), verification against official court databases, and confirmation that case holdings are accurately represented.

Create domain-specific source authority databases that rank potential sources by credibility within each field. Medical authority rankings prioritize randomized controlled trials over observational studies over case reports; legal rankings prioritize binding precedent over persuasive authority over secondary sources; news rankings prioritize primary sources and verified eyewitness accounts over secondary reporting. Implement domain-specific verification: medical citations undergo checks against PubMed and clinical trial registries; legal citations are verified against court databases and legal citation validators; news citations are checked against fact-checking databases and verified source lists.

A multi-domain AI platform implements this through modular citation frameworks: a medical citation module enforcing peer-review requirements and evidence-level classification; a legal citation module implementing Bluebook formatting and case law verification; a general knowledge module accepting broader source types with appropriate authority weighting. Query classification routes requests to appropriate modules, with tracking systems monitoring performance separately for each domain and applying domain-appropriate benchmarks.

Challenge: Temporal Drift and Performance Degradation

Citation performance often degrades over time due to model updates, training data drift, changes in source availability, or shifts in the information landscape [1][11]. Organizations may deploy systems with strong initial citation performance only to discover months later that accuracy has declined significantly. This degradation can occur gradually, making it difficult to detect without systematic monitoring, or suddenly following system updates.

Solution:

Implement continuous monitoring with statistical process control, establishing baseline performance metrics and control limits that trigger alerts when performance deviates significantly [7][10]. Deploy automated daily or weekly sampling that evaluates citation performance on representative query sets, tracking metrics including citation accuracy, source authority scores, hallucination rates, and user feedback rates. Use time-series analysis to detect gradual trends that might indicate emerging issues before they become severe.

Establish citation performance regression testing as part of the deployment pipeline, requiring that new model versions or system updates maintain citation performance within acceptable ranges of current production systems. Create holdout test sets representing diverse domains and query types, evaluating citation performance on these sets before deploying updates. Implement staged rollouts where updates are initially deployed to small user percentages while monitoring citation metrics, expanding deployment only after confirming performance stability.

Develop root cause analysis protocols for performance degradation, systematically investigating potential causes including retrieval system changes, ranking algorithm modifications, source database updates, or model behavior shifts. Maintain detailed change logs correlating system modifications with performance metrics to enable rapid diagnosis.

A content generation platform implements this through comprehensive monitoring: daily automated evaluation of 1,000 sampled responses measuring citation accuracy, weekly human review of 200 responses assessing attribution quality, and monthly comprehensive audits comparing current performance to baseline. Statistical process control charts track citation accuracy with control limits at ±2 standard deviations from baseline. When accuracy drops below the lower control limit for two consecutive days, automated alerts trigger investigation. This system detected a 6% accuracy decline within 48 hours of a retrieval system update, enabling rapid rollback before significant user impact.

References

  1. Gao, L., et al. (2023). Enabling Large Language Models to Generate Text with Citations. arXiv. https://arxiv.org/abs/2305.14627
  2. Menick, J., et al. (2022). Teaching language models to support answers with verified quotes. arXiv. https://arxiv.org/abs/2211.09110
  3. Bohnet, B., et al. (2023). Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models. arXiv. https://arxiv.org/abs/2302.07842
  4. Lieberum, T., et al. (2023). Does ChatGPT have a scientific citation problem? Nature. https://www.nature.com/articles/s41586-023-06291-2
  5. Metzler, D., et al. (2023). Rethinking Search: Making Domain Experts out of Dilettantes. Google Research. https://research.google/pubs/pub52158/
  6. Liu, N. F., et al. (2023). Evaluating Verifiability in Generative Search Engines. arXiv. https://arxiv.org/abs/2310.01558
  7. Rashkin, H., et al. (2023). Measuring Attribution in Natural Language Generation Models. ACL Anthology. https://aclanthology.org/2023.acl-long.386/
  8. Gao, T., et al. (2023). Enabling Large Language Models to Generate Text with Citations. arXiv. https://arxiv.org/abs/2304.09848
  9. Anthropic. (2023). Measuring Faithfulness in Chain-of-Thought Reasoning. Anthropic. https://www.anthropic.com/index/measuring-faithfulness-in-chain-of-thought-reasoning
  10. Nakano, R., et al. (2023). WebGPT: Browser-assisted question-answering with human feedback. arXiv. https://arxiv.org/abs/2301.00234
  11. Olah, C., et al. (2021). Multimodal Neurons in Artificial Neural Networks. Distill. https://distill.pub/2021/multimodal-neurons/