
ClinArena Orientation Guide

Thank you for participating in this important research. This guide will help you understand how to complete evaluations effectively and ensure your contributions are properly tracked for co-authorship.

Getting Started

Login Options

IMPORTANT: You can log in using either your National Provider Identifier (NPI) or your email address. Both methods ensure your evaluations are tracked and counted toward co-authorship.

Option 1: NPI Login

  1. Navigate to https://clinarena.com/
  2. Select "NPI" as your verification method
  3. Enter your 10-digit NPI in the "National Provider Identifier" field
  4. Click "Verify Identity"
  5. Your identity will be verified against the CMS NPI Registry
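
The platform verifies your NPI against the CMS NPI Registry. As background, the last digit of a 10-digit NPI is a Luhn check digit computed over the number prefixed with 80840, so a quick local format check can catch typos before you submit. The sketch below is illustrative only and is not the platform's verification code:

    def npi_format_ok(npi: str) -> bool:
        """Return True if a 10-digit NPI passes its Luhn check digit.

        NPIs are issued under the card-issuer prefix 80840, so the Luhn
        algorithm runs over '80840' + NPI. This checks format only; actual
        verification happens against the CMS NPI Registry.
        """
        if len(npi) != 10 or not npi.isdigit():
            return False
        total = 0
        # Double every second digit from the right; subtract 9 if the result exceeds 9.
        for i, d in enumerate(int(c) for c in reversed("80840" + npi)):
            if i % 2 == 1:
                d *= 2
                if d > 9:
                    d -= 9
            total += d
        return total % 10 == 0

    # Example: npi_format_ok("1234567893") -> True (the CMS sample NPI)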

Option 2: Email Login

  1. Navigate to https://clinarena.com/
  2. Select "Email" as your verification method
  3. Enter your email address and full name
  4. Click "Verify Identity"
  5. Your identity will be verified via email

Note: If you do not have an NPI or prefer not to log in, you can vote as a guest. However, guest votes are recorded anonymously and will not be counted toward co-authorship.

How ClinArena Works

The Evaluation Interface

For each case, you will see:

  1. Ground Truth Explanation: A brief patient summary at the top
  2. FHIR Data (expandable): Complete EHR data (a sample resource is sketched after this list), including:
    • Lab results
    • Active conditions and diagnoses
    • Current medications
    • Family history
    • Vital signs
    • Other clinical data
  3. Two Model Responses (Model A and Model B): Anonymized AI-generated analyses displayed side-by-side
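
The expandable panel presents this data in FHIR format. For orientation, the sketch below shows roughly what a single lab result (a FHIR Observation resource) looks like and how its value could be read; the field names follow the FHIR R4 Observation structure, while the specific code, values, and reference range are illustrative only:

    # One lab result as a FHIR Observation (illustrative values).
    observation = {
        "resourceType": "Observation",
        "status": "final",
        "code": {"coding": [{"system": "http://loinc.org", "code": "2160-0",
                             "display": "Creatinine [Mass/volume] in Serum or Plasma"}]},
        "valueQuantity": {"value": 2.4, "unit": "mg/dL"},
        "referenceRange": [{"high": {"value": 1.2, "unit": "mg/dL"}}],
    }

    # Read the value and flag it against the upper reference limit.
    name = observation["code"]["coding"][0]["display"]
    qty = observation["valueQuantity"]
    high = observation["referenceRange"][0]["high"]["value"]
    flag = "HIGH" if qty["value"] > high else "normal"
    print(f'{name}: {qty["value"]} {qty["unit"]} ({flag})')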

Your Task

Compare the two model responses and select which one is better. The evaluation pool contains 500 pairwise comparisons in total (50 cases × 10 comparisons per case); each evaluator is asked to complete at least 50-100 of them (see Progress Tracking below).

Evaluation Criteria

When comparing two model responses, consider the following dimensions:

1. Clinical Summary Quality

  • Does the response accurately summarize the patient's clinical picture?
  • Are the most important findings highlighted?
  • Is the summary clear and well-organized?

2. Identification of Clinically Significant Findings

  • Does the response identify critical lab abnormalities (e.g., acute kidney injury, hyperkalemia, severe anemia)?
  • Are dangerous drug interactions or contraindications flagged?
  • Are disease progression patterns or complications recognized?

3. Actionable Clinical Insights

  • Does the response provide recommendations that would be useful to a clinician?
  • Are the suggested next steps appropriate and evidence-based?
  • Does the response prioritize urgent issues appropriately?

4. Evidence and Citations

  • Are claims supported by citations from the medical literature?
  • Are the citations relevant and from reputable sources?
  • Is the evidence appropriately applied to this specific patient?

5. Depth and Completeness

  • Does the response address the full complexity of the patient's condition?
  • Are important comorbidities and their interactions considered?
  • Is the analysis thorough without being unnecessarily verbose?

Making Your Selection

After reviewing both responses, you have several options:

  • Model A if the left response is superior
  • Model B if the right response is superior
  • Tie if both responses are truly equivalent in quality
  • Skip if the case is outside your specialty or you're not comfortable evaluating it

Skip Functionality

If a case falls outside your area of expertise or you're not comfortable evaluating it, you can use the "Skip" button. This allows you to:

  • Move to the next case without submitting a vote
  • Focus on cases within your specialty or comfort zone
  • Ensure evaluations are completed by clinicians with appropriate expertise

Skipped cases are not recorded as votes, so you can skip as many cases as needed.

Optional Feedback

When making your selection, you can optionally provide feedback about the comparison. Click the "Feedback" button in the voting interface to expand a text area where you can:

  • Share your thoughts on why you chose one model over another
  • Note any concerns or observations about the responses
  • Provide context about your clinical reasoning
  • Highlight particularly strong or weak aspects of the responses

Feedback is completely optional but helps improve the quality of our research. Your feedback will be stored with your vote and can provide valuable insights for model development.

Important Guidance on Ties

Use the "Tie" button sparingly. We recognize that in some cases, two responses may be very similar, but we encourage you to make a choice whenever possible. Ask yourself:

  • If I had to choose one response to present to a colleague, which would it be?
  • Which response would I trust more in a real clinical scenario?
  • Even if both are good, which has a slight edge in any dimension?

Only use "Tie" if:

  • Both responses are truly indistinguishable in quality
  • Both responses have equivalent strengths and weaknesses
  • You absolutely cannot determine a preference after careful review

In practice, ties should be rare (ideally <10% of comparisons).

Best Practices

Review the Full EHR Data

While the ground truth explanation provides a summary, expand and review the full FHIR data to understand the complete clinical picture. The model responses may reference findings that are not in the summary but are present in the detailed EHR.

Take Your Time

Each comparison should take 1-2 minutes. Don't rush—your clinical judgment is what makes this evaluation valuable.

Work at Your Own Pace

You can complete evaluations in multiple sessions. The platform will save your progress. We recommend completing at least 50-100 comparisons, but more is always better.

Stay Objective

The models are anonymized (labeled only as "Model A" and "Model B") to prevent bias. You won't know which model is which, and the same model may appear as "Model A" in one comparison and "Model B" in another.
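
As an illustration of how this kind of blinding typically works (a sketch, not the platform's actual implementation), each comparison can randomly shuffle which response fills the "Model A" slot:

    import random

    def assign_sides(response_1: str, response_2: str) -> dict:
        """Randomly map two anonymized responses to the 'Model A'/'Model B'
        slots, so the same underlying model can appear on either side."""
        pair = [response_1, response_2]
        random.shuffle(pair)  # in-place random ordering
        return {"Model A": pair[0], "Model B": pair[1]}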

Trust Your Clinical Judgment

You are the expert. If something in a model response seems wrong, clinically inappropriate, or potentially harmful, that should heavily influence your decision.

Skip When Appropriate

Don't hesitate to skip cases that are outside your specialty or where you don't feel comfortable making an evaluation. It's better to skip than to provide an evaluation without appropriate expertise.

Provide Feedback When Helpful

While feedback is optional, it can be valuable for understanding your reasoning and improving the models. Consider providing feedback when you notice something particularly noteworthy or when your clinical judgment differs significantly from what might be expected.

Technical Tips

If You Encounter Issues

  • Comparison won't load: Refresh the page and log in again (with your NPI or email)
  • Can't expand FHIR data: Try a different browser (Chrome or Firefox recommended)
  • Lost your place: The platform tracks your progress automatically

Browser Recommendations

  • Recommended: Chrome, Firefox, Safari (latest versions)
  • Screen size: Desktop or laptop recommended for optimal viewing

Progress Tracking

The platform will show you how many comparisons you've completed. We're asking for at least 50-100 comparisons per evaluator, but you're welcome to complete as many as you'd like. More evaluations mean more robust data and a stronger paper.

Thank You!

Your participation in ClinArena is essential to advancing the safe and effective deployment of AI in clinical practice. By contributing your expertise, you're helping to:

  • Validate a new synthetic dataset for the research community
  • Assess the accuracy and safety of current AI systems
  • Guide the development of next-generation clinical AI tools

We're grateful for your time and expertise, and we look forward to collaborating with you on this important work.