AI · 23 April 2026 · 9 min read

How We Used Claude API to Extract Physician Data for 7 US Health Systems

The AI pipeline we built to parse PDFs, Word docs, legacy spreadsheets, and HTML tables into structured physician profiles at scale, with human review baked into the workflow.

By Jay



The problem with physician directory data is that no two health systems store it the same way. When we started the Commonwealth Health project, we were looking at seven separate systems, each with physician records spread across PDFs, Microsoft Word documents, old HTML tables scraped from legacy intranets, and spreadsheets where the column labelled "Specialty" in one file was "Clinical Area" in another, and "Primary Discipline" in a third. Manual data entry was the default assumption from the client side. We pushed back immediately.

At hundreds of physicians per system with ongoing staff changes, manual entry was not a one-time cost. It was a recurring debt. Every time a physician changed locations, accepted new patients, or added a board certification, someone would need to update a spreadsheet, and that person would make mistakes. The data would drift from the source documents within weeks.

Why Manual Entry Was Never Viable

The scale alone ruled it out. Across seven systems, we were looking at over 1,400 physician profiles on initial ingestion. Each profile required at minimum eight fields: full name, primary specialty, secondary specialties, board certifications, affiliated locations, accepting new patients status, gender, and contact information. That is 11,200 individual data points before you factor in variation and ambiguity.

Then there is the document quality problem. Some PDFs were well-structured forms. Others were scanned paper documents with handwritten annotations. Some Word docs had merged table cells. Some spreadsheets had headers in row three because someone had used rows one and two for a title banner. A human could read all of these, but the time cost was significant, and the error rate on repetitive data entry is well documented across industries.

The ongoing update requirement was the deciding factor. Commonwealth Health needed a workflow that non-technical staff could operate to keep the directory current. That meant the extraction pipeline had to produce output in a format editors could work with directly, not a developer artifact.

The Four-Step Pipeline

We structured the pipeline in four sequential steps.

Step one was document ingestion. Every source file was converted to plain text or structured HTML before touching Claude. PDFs went through a two-stage process: first attempting programmatic text extraction via PyMuPDF, then falling back to OCR via Tesseract for scanned documents. Word documents were converted to HTML using LibreOffice headless. Spreadsheets were read with pandas, with column normalisation applied to map variant header names to a canonical schema before the data reached the model. HTML tables from legacy intranets were parsed with BeautifulSoup.
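The column normalisation in step one can be sketched as a crosswalk from variant header labels to the canonical schema. A minimal version, with illustrative variants rather than the full production mapping:

```python
# Map variant spreadsheet headers to the canonical schema before the data
# reaches the model. The variants listed here are illustrative examples.
CANONICAL_HEADERS = {
    "specialty": "primary_specialty",
    "clinical area": "primary_specialty",
    "primary discipline": "primary_specialty",
    "physician name": "full_name",
    "provider": "full_name",
}

def normalise_headers(headers: list[str]) -> list[str]:
    """Lower-case and trim each header, then map it to its canonical name.

    Unmapped headers fall through as snake_cased versions of themselves.
    """
    return [
        CANONICAL_HEADERS.get(h.strip().lower(),
                              h.strip().lower().replace(" ", "_"))
        for h in headers
    ]
```

Unrecognised headers pass through rather than raising, so a new source file with an unexpected column surfaces in review instead of breaking ingestion.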

Step two was Claude API extraction. Each normalised document chunk was sent to Claude with a structured extraction prompt. We used tool use to enforce JSON output, passing a physician profile schema as the tool definition in the tools parameter. This is the most reliable way to get consistent structured output from Claude: the model treats the schema as a function signature and fills in the fields rather than generating free-form text that then needs to be parsed.

Step three was a validation pass. The extracted JSON went through a Python validation layer before any data touched the CMS. Required fields were checked for presence and basic type correctness. Specialty values were matched against a controlled vocabulary. Physician names were cross-referenced against a master list provided by the client's HR system.

Step four produced the Excel output. Validated records were written to a structured Excel file that matched the format the client's administrative staff already worked with. This was deliberate. The goal was to fit the AI pipeline into an existing workflow, not replace it with a new one.

Prompt Engineering for Structured Extraction

The extraction prompt followed a consistent structure across all seven systems. The system message defined the model's role as a medical data extraction specialist with specific knowledge of physician credentialing terminology. The user message provided the document text and the task.

The core of the approach was the tool definition. We defined an extract_physician_profile tool with a JSON schema specifying every field:

{
  "name": "extract_physician_profile",
  "description": "Extract structured physician data from document text",
  "input_schema": {
    "type": "object",
    "properties": {
      "full_name": { "type": "string" },
      "primary_specialty": {
        "type": "string",
        "enum": ["Internal Medicine", "Cardiology", "Orthopaedic Surgery"]
      },
      "board_certifications": {
        "type": "array",
        "items": { "type": "string" }
      },
      "locations": {
        "type": "array",
        "items": { "type": "string" }
      },
      "accepting_new_patients": {
        "type": "boolean",
        "nullable": true
      },
      "confidence_flags": {
        "type": "array",
        "items": { "type": "string" }
      }
    },
    "required": ["full_name", "primary_specialty"]
  }
}

Temperature was set to 0 throughout. Physician data extraction is not a creative task. Variance in output is a defect, not a feature.
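Wiring that tool definition into a request can be sketched as below. The model name, prompt wording, and function name are illustrative assumptions, not the project's actual code; the tool_choice shape is the Anthropic API's way of forcing a response through a specific tool.

```python
# Assemble the parameters for a structured-extraction call.
# Model name and prompt text here are illustrative assumptions.
def build_extraction_request(document_text: str, tool_schema: dict) -> dict:
    return {
        "model": "claude-sonnet-4-5",
        "max_tokens": 2048,
        "temperature": 0,  # extraction is not creative work; variance is a defect
        "tools": [tool_schema],
        # Force the model to answer via the tool, never free-form text.
        "tool_choice": {"type": "tool", "name": tool_schema["name"]},
        "system": ("You are a medical data extraction specialist familiar "
                   "with physician credentialing terminology."),
        "messages": [{
            "role": "user",
            "content": f"Extract the physician profile from:\n\n{document_text}",
        }],
    }
```

The returned dict is what gets splatted into the SDK's messages.create call; keeping it as a plain function makes the request shape easy to unit-test without touching the API.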

The confidence_flags field was central to the review workflow. Claude was explicitly instructed to populate this array when a field value was inferred rather than stated directly, when two conflicting values appeared in the same document, or when document quality made a field uncertain. A record with any confidence flags went into a human review queue rather than passing directly to validation.
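The routing rule is simple enough to state in a few lines. A sketch of the split, assuming records are plain dicts carrying the extracted fields:

```python
def split_for_review(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate clean records from those carrying any confidence flags."""
    clean, review_queue = [], []
    for record in records:
        # Any non-empty confidence_flags array diverts the record to humans.
        if record.get("confidence_flags"):
            review_queue.append(record)
        else:
            clean.append(record)
    return clean, review_queue
```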

Handling Inconsistency Across Systems

The specialty vocabulary was the biggest inconsistency problem. "Cardiothoracic Surgery" appeared as "CT Surgery," "Cardiac Surgery," and "Thoracic and Cardiac Surgery" across different source systems. We could not enumerate every variant in the schema enum, so we took a two-step approach: Claude extracted the value as written in the source document, and a post-extraction normalisation layer mapped it to the canonical vocabulary using a controlled crosswalk table.
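The crosswalk itself was a lookup table keyed on the value as written. A minimal sketch, showing only the cardiothoracic entries from the example above (the production table covered every variant observed across the seven systems):

```python
# Illustrative slice of the controlled crosswalk table.
SPECIALTY_CROSSWALK = {
    "ct surgery": "Cardiothoracic Surgery",
    "cardiac surgery": "Cardiothoracic Surgery",
    "thoracic and cardiac surgery": "Cardiothoracic Surgery",
    "cardiothoracic surgery": "Cardiothoracic Surgery",
}

def normalise_specialty(raw: str) -> str:
    """Map a specialty as written in the source to the canonical vocabulary.

    Unknown values pass through unchanged so the downstream vocabulary
    check can flag them instead of silently dropping them.
    """
    return SPECIALTY_CROSSWALK.get(raw.strip().lower(), raw.strip())
```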

For accepting new patients status, the language variation was significant. "Currently accepting," "open panel," "accepting referrals," "limited availability," and simple "Yes" all appeared across documents. We gave Claude explicit instructions on how to interpret each variant and instructed it to set the field to null rather than guess when the document was ambiguous. Null values triggered a human review flag automatically.
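The interpretation rules the prompt spelled out can be summarised as a lookup of this shape. This is a summary for illustration, not the prompt text itself, and the specific variants and their mappings here are assumptions beyond the examples named above:

```python
# Summary of the prompt's interpretation rules for accepting-patients status.
# Anything unrecognised stays null, which triggers the human review flag.
ACCEPTING_PATIENTS_RULES = {
    "currently accepting": True,
    "accepting referrals": True,
    "open panel": True,
    "yes": True,
    "no": False,
    "limited availability": None,  # treated as ambiguous (an assumption here)
}

def interpret_accepting(raw: str):
    """Return True, False, or None (ambiguous) for a source-document phrase."""
    return ACCEPTING_PATIENTS_RULES.get(raw.strip().lower(), None)
```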

Location data was the messiest field. Some documents listed full addresses. Others listed facility names only. Others listed both but inconsistently formatted. We broke location into two sub-fields: location_name and location_address, with neither required. Claude filled what was present and flagged gaps.

The Validation Layer

Validation ran in Python before any record touched the output Excel file. Required fields were checked for presence and non-empty string values. Specialty values were matched against canonical vocabulary after normalisation. Name format went through a basic regex check for at least two words. Accepting patients was restricted to boolean or null only.
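Those checks can be sketched as a single pass that accumulates errors rather than failing fast, so a bad record surfaces every problem at once. The vocabulary below reuses the three specialties from the schema example; the real list was much longer:

```python
import re

CANONICAL_SPECIALTIES = {"Internal Medicine", "Cardiology", "Orthopaedic Surgery"}
REQUIRED_FIELDS = ("full_name", "primary_specialty")

def validate_record(record: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the record passes."""
    errors = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            errors.append(f"missing required field: {field}")
    if record.get("primary_specialty") not in CANONICAL_SPECIALTIES:
        errors.append("specialty not in controlled vocabulary")
    # Name must look like at least two whitespace-separated words.
    if not re.fullmatch(r"\S+(\s+\S+)+", record.get("full_name") or ""):
        errors.append("name format check failed")
    if record.get("accepting_new_patients") not in (True, False, None):
        errors.append("accepting_new_patients must be boolean or null")
    return errors
```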

The cross-reference check was the most valuable step. The client provided a list of active physician NPIs from their credentialing system. Any extracted name that could not be fuzzy-matched to the NPI list within a 0.85 similarity threshold was flagged for review. This caught extraction errors where Claude had picked up a referenced physician from a biography paragraph instead of the subject physician of the document, which happened more than once.
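A minimal version of that fuzzy match can be written with the standard library's difflib; whether the production code used this or a dedicated fuzzy-matching library, the shape is the same:

```python
from difflib import SequenceMatcher

def best_match(name: str, roster: list[str], threshold: float = 0.85):
    """Return the closest roster name, or None if nothing clears the threshold."""
    best, best_score = None, 0.0
    for candidate in roster:
        score = SequenceMatcher(None, name.lower(), candidate.lower()).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None
```

A None result is exactly the review-flag condition: either the extraction grabbed the wrong name, or the physician is missing from the credentialing roster, and both deserve a human look.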

What the Pipeline Replaced

Before the pipeline, the client's internal team estimated two to three weeks of data entry per health system for initial ingestion. Across seven systems, that was a potential 14 to 21 weeks of manual work, excluding quality review and corrections. The pipeline reduced initial ingestion to roughly two days per system: one day for document processing and extraction, one day for human review of flagged records.

Ongoing updates were more significant. A single physician update previously required locating the source document, finding the correct spreadsheet row, making the change, and notifying the web team. With the pipeline, an updated document re-runs through extraction, produces a new record, validation checks it against the existing record and flags differences for review, and the editor sees a clean diff in the Excel output.
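The diff step is just a field-by-field comparison of the existing and re-extracted records. A sketch, assuming both are dicts keyed by canonical field names:

```python
def diff_records(existing: dict, updated: dict) -> dict:
    """Fields whose values differ, as (old, new) pairs for the review sheet."""
    fields = set(existing) | set(updated)
    return {
        f: (existing.get(f), updated.get(f))
        for f in fields
        if existing.get(f) != updated.get(f)
    }
```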

The Staff Editing Workflow

The Excel output was structured specifically for non-technical editors. Each row was one physician. Columns matched the canonical field names. A colour-coded status column indicated whether the record was auto-validated, in review, or had a confidence flag requiring human input.

Editors worked in Excel, corrected or confirmed flagged fields, and saved the file. A lightweight Python script converted the edited Excel to structured JSON and ran the same validation layer before pushing to the CMS API. The script rejected any row that still had unresolved confidence flags, which forced editors to make explicit decisions rather than leaving ambiguous data in the output.

The net result was a directory that reflected source documents accurately, updated on a reliable cadence, and required no developer involvement for routine maintenance. That last point mattered most to the client. The pipeline's value was not just speed on initial ingestion. It was removing the technical bottleneck from an ongoing business process.

Tags: Claude API, data extraction, physician directory, healthcare, structured output