Jurij Tokarski

Errors That Never Left The Device

Eleven silent failure paths in a clinical iOS app. None of them surfaced as a backend alert.

The note that vanished

A clinician recorded an audio note. The app said it uploaded. The note wasn't there the next morning.

That's how the audit started. No crash report, no stack trace — a support ticket and a missing document. The clinician was certain they'd recorded it. The UI had shown the completion animation. The backend had no record of it.

I pulled the relevant service and started reading AudioUploadService.verifyServerCompletion(). After a successful upload, the app calls a status endpoint to confirm the backend processed the note — defense-in-depth, the kind of thing you'd add after a prior incident. The function calls the endpoint in a retry loop. Five attempts. If all five fail, it reaches this block:

} else {
    // Fallback: If status check fails repeatedly, assume completion after reasonable time
    Logger.audio("⚠️ Status check failed, but assuming completion for document: \(documentId)")
    let finalResult = UploadResult(success: true, documentId: documentId, error: nil)
    self.delegate?.uploadStageChanged(.completed)
    self.delegate?.uploadCompleted(finalResult)
    completion(finalResult)
    self.notificationService.showUploadCompletionNotification(documentId: documentId)
}

success: true. Completion notification fired. The UI animates closed.

The comment says "handles cases where the server returns empty responses but the upload was successful." What it doesn't say is that if the server is unreachable — overloaded, degraded, briefly down — there is no way to distinguish that from a transient empty response. The function guesses success, marks the document done, and exits. No retry scheduled. No error state. No signal leaving the device. The distinction between "upload succeeded, verification failed" and "upload is gone" is permanently lost.

The clinician who filed the ticket couldn't have known this was happening. The app was designed to tell them it worked, and nothing left the device to say otherwise.
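
Given the reporter built later in this post, the eventual shape of the fix keeps the UX decision but ends the silence. A sketch of the patched branch, with the caveat that the reporter's exact signature is my assumption; the error code and the dedup bypass are the ones described below:

} else {
    // Same optimistic UX, but the failure now leaves the device
    Logger.audio("⚠️ Status check failed, assuming completion for document: \(documentId)")
    // Reporter call is a sketch; the code name and bypassDedup
    // are covered later in this post
    CriticalErrorReporter.shared.report(
        code: .falseCompletionAssumed,
        context: ["attempts": 5],
        documentId: documentId,
        bypassDedup: true
    )
    let finalResult = UploadResult(success: true, documentId: documentId, error: nil)
    self.delegate?.uploadStageChanged(.completed)
    self.delegate?.uploadCompleted(finalResult)
    completion(finalResult)
    self.notificationService.showUploadCompletionNotification(documentId: documentId)
}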

How "assume success" became the house style

Once I found the verification fallback, I went looking for the same shape elsewhere. Twenty minutes, nine more.

AudioRecorderViewModel.startRecording() sets isRecording = true and calls delegate?.recordingStateChanged(.recording) early in the function — before validation. Then, if engine initialization fails:

guard let audioEngine = audioEngine else {
    Logger.debug("❌ Failed to create audio engine")
    return  // Early return — isRecording still true
}

The function returns. isRecording is still true. The UI shows a recording in progress. Nothing is being recorded. The AVAudioSession configuration earlier in the same function can also fail silently — catches, logs to debug, continues. Audio file creation can fail and the function continues. Four failure paths in one function, each one moving on without rollback.
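
The shape of the fix is the inverse of the bug: roll back the optimistic state before every early return. A sketch against the guard above, where the delegate state and the error code name are mine rather than the shipped identifiers:

guard let audioEngine = audioEngine else {
    Logger.debug("❌ Failed to create audio engine")
    // Undo the state that was set optimistically at the top
    isRecording = false
    delegate?.recordingStateChanged(.idle)  // state name is illustrative
    // And say so off-device; the code name here is hypothetical
    CriticalErrorReporter.shared.report(code: .audioEngineInitFailed)
    return
}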

deleteNote has a variant:

do {
    try FileManager.default.removeItem(at: audioFilePath)
} catch {
    Logger.warning("Failed to delete offline audio file: \(error)")
    // Continue anyway — no return, no error thrown
}

do {
    try FileManager.default.removeItem(at: jsonFilePath)
} catch {
    Logger.error("Error deleting note: \(error)")
}

The audio file deletion throws. The function logs and keeps going. The metadata deletion runs. Now the audio file exists on disk with no metadata pointing to it — an orphan that nothing will clean up. From the outside, the delete succeeded.
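
A version that keeps the attempt-both behavior but refuses to call a partial delete a success might look like this; the error type is my sketch, not the shipped fix:

enum NoteDeletionError: Error {
    case partialDelete(remaining: [String])
}

func deleteNoteFiles(audioFilePath: URL, jsonFilePath: URL) throws {
    var failures = [String]()
    // Still attempt both: one permissions issue shouldn't strand the other file
    for (label, url) in [("audio", audioFilePath), ("metadata", jsonFilePath)] {
        do {
            try FileManager.default.removeItem(at: url)
        } catch {
            Logger.warning("Failed to delete \(label) file: \(error)")
            failures.append(label)
        }
    }
    // But a partial delete is not a delete: surface it to the caller
    if !failures.isEmpty {
        throw NoteDeletionError.partialDelete(remaining: failures)
    }
}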

Each of these is defensible in isolation. Don't block the entire recording flow because the audio session threw. Don't fail the whole delete because one file had a permissions issue. The problem isn't any single decision — it's that the same decision, letting the device decide what counts as success, was made repeatedly until it became the default.

Logged locally, reported nowhere

Here's the part that ties the verification fallback, the state leak, and the orphaned files together: none of them left the device.

The codebase had eight-plus critical catch blocks. Every one called Logger.debug, Logger.warning, or Logger.error. None of them called out. No endpoint. No signal to the backend. No correlation ID that support could use to pull a backend log.

The web version of this product already had a /feedbackV2 endpoint — auth-gated, accepts errors, stores them for support triage. iOS never wired it up.

Crash reporters like Sentry or Crashlytics don't help here. They catch uncaught exceptions and fatal crashes. Every error in this codebase was a handled error. From a crash reporter's perspective, the app was healthy. All those catch blocks were doing their job — the errors were caught, they weren't going anywhere.

When a support ticket came in — "my note disappeared," "the recording seemed to stop" — the support team had nothing. No timestamp. No error code. No HTTP status. No indication of which of the eight failure paths had triggered. They could ask the clinician to reproduce it with a device attached to Xcode (not realistic), or guess.

A complete logging setup and an absent reporting setup feel like the same thing when you're writing Logger.error(...). They're not.

The same bug at the configuration layer

Halfway through the audit I hit something that wasn't about error handling at all — same failure mode, different layer.

In Constants.swift, around line 322:

static var current: AppEnvironment {
    // #if DEBUG
    // return .dev
    // #elseif PREPROD
    // return .preprod
    // #else
    // return .prod
    // #endif
    return .dev  // every build, including App Store, takes this path
}

The conditional compilation guards were commented out. Every build — debug, TestFlight, App Store release — unconditionally set the environment to .dev. Production clinician traffic had been routed to the development API.

Consequences beyond missing notes. PHI in a HIPAA-adjacent product, routed to the development environment, means audit segregation is broken. Production data contaminating dev logs. The infrastructure isolation that compliance depends on isn't there.

The compiler had no objection. Tests passed — they ran against the dev environment too. The app functioned normally because there was nothing to notice.

The underlying problem is a property of Swift compiler directives. #if DEBUG is compile-time, not runtime. When it's commented out, it's not disabled — it's erased:

// Runtime conditional: comment out the `if` line on its own and the
// orphaned brace breaks the build (isDebugBuild is a hypothetical flag)
if isDebugBuild {
    environment = .dev
}

// Compile-time: comment the directives out as a pair and the code
// still compiles; the branching is erased, not disabled
// #if DEBUG
// environment = .dev
// #endif

A half-commented if is broken code that fails to compile. A commented-out #if is syntactically valid code that runs whatever fallback follows, unconditionally. The distinction is invisible in code review. Commented code looks intentional. Reviewers assume it was deliberate. Nobody asks why #if DEBUG is wrapped in comments.
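
One cheap tripwire, my suggestion rather than anything from the audit: a startup check that refuses to run a release build against dev. It still leans on one #if, but it fails loudly instead of silently. Note that precondition, unlike assert, survives release optimization:

#if !DEBUG
// In a release build, a dev environment is always a configuration
// bug. Crash at launch rather than route clinician traffic to dev.
precondition(AppEnvironment.current != .dev,
             "Release build configured against the dev API")
#endif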

Wiring the endpoint without leaking PHI

The fix for the missing reporting is not "send more to the server." In a HIPAA-adjacent product, that sentence is where you stop and think.

The first instinct in Swift is to include error.localizedDescription. Don't. OS-generated error strings drag in context you don't control — file paths, error chains, localized strings from third-party libraries, any of which can contain document names, template names, or identifiers with patient context:

// UNSAFE — never do this:
report(code: .uploadFailed, context: ["error": error.localizedDescription])

The sanitizer that ended up in CriticalErrorReporter is strict by design:

private func sanitizeContext(_ context: [String: Any]) -> [String: Any] {
    var safe = [String: Any]()
    for (key, value) in context {
        guard key.range(of: "^[a-zA-Z0-9_]+$", options: .regularExpression) != nil else { continue }
        switch value {
        case let bool as Bool:
            safe[key] = bool
        case let int as Int where int >= -1_000_000 && int <= 1_000_000:
            safe[key] = int
        case let string as String:
            if string.range(of: "^[a-zA-Z0-9_:-]{1,48}$", options: .regularExpression) != nil {
                safe[key] = string
            }
        default:
            continue
        }
    }
    return safe
}

Primitives only. Strings must match a tight alphanumeric pattern under 48 characters. That allows "5xx", "http_401", "attempt_3". It blocks "HTTP/1.1 401 Unauthorized", anything with spaces or slashes, anything that could encode an identifier.

The error codes themselves are stable enums, not free-text strings. .falseCompletionAssumed, .audioSessionConfigFailed, .uploadFailed. A code the support team can look up in a table, not a prose description that might contain whatever was in memory when the error fired.
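
As a type, that is a small string-backed enum. The case names below are the ones this post references; the raw values are illustrative, but string raw values are what lets the dedup logic later key on code.rawValue:

enum MobileErrorCode: String {
    case falseCompletionAssumed = "false_completion_assumed"
    case audioSessionConfigFailed = "audio_session_config_failed"
    case uploadFailed = "upload_failed"
    // One stable case per failure path; no free text
}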

documentId is safe to include — the endpoint is auth-gated, the document ID is the same UUID the user already sees in their note list, and support needs it to correlate a client error with backend logs.

The cross-platform contract you didn't know you signed

When I wired up the iOS reporter, the first pass used a pipe-delimited comment for the body:

let comment = "audio_error:code|\(code)|ctx:\(context)|doc:\(documentId)|platform:ios"
let body = ["type": "audio_error", "comment": comment]

Looked reasonable. The endpoint accepted it — 200 OK. Test passed.

Then I cross-referenced the actual merged web code:

const body = {
    feedback: false,
    comment: JSON.stringify({ code, platform, context }),
    type: "audio_error",
    document_id,
    template_type,
    user_agent
};

Different body shape entirely. document_id at the top level, not inside the comment. comment as a JSON string, not pipe-delimited. The analytics consumer downstream was parsing document_id from the top-level field — the iOS version was sending it inside the comment string, where the parser wasn't looking.

Both versions compiled. Both got 200 OK back. Both "worked" locally. The iOS version would have created malformed records: missing document_id in the analytics table, broken parsing, phantom errors with no associated document.

This is the kind of bug that doesn't fail locally because each platform is tested against its own understanding of the contract, not the other platform's implementation. No crashes. Quietly wrong data — missing fields in dashboards, broken analytics joins, records that look valid but don't correlate with anything.

The only way I found it was by reading the actual web code. The spec said "send error data to /feedbackV2." The spec didn't define the exact body shape. I assumed I understood it. I was wrong about one field's location and the entire encoding. In a multi-platform product, the primary platform's merged code is the contract. Not the docs, not a conversation from six months ago. The code.
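
Mirroring the merged web code, the corrected iOS body comes out roughly like this. The field names come from the web snippet above; the variable names and the JSON encoding are my sketch:

let commentPayload: [String: Any] = [
    "code": code.rawValue,
    "platform": "ios",
    "context": sanitizeContext(context)
]
let commentJSON = (try? JSONSerialization.data(withJSONObject: commentPayload))
    .flatMap { String(data: $0, encoding: .utf8) } ?? "{}"

let body: [String: Any] = [
    "feedback": false,
    "comment": commentJSON,        // a JSON string, not pipe-delimited
    "type": "audio_error",
    "document_id": documentId,     // top level, where the parser looks
    "template_type": templateType, // assumed to be in scope
    "user_agent": userAgent
]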

Don't replace silence with noise

Once the reporter existed, the next problem was obvious. processAudioBuffer() runs per audio frame. A 2-minute recording generates thousands of frames. If there's a persistent write failure, every frame fires the catch block.

Without dedup, that's potentially 7,000+ error reports from one bad recording session. Support inbox floods. DynamoDB bill climbs. Signal becomes useless — you can't tell if this was one bad recording or seven thousand separate users.

A per-session Set in the reporter:

private var reportedCodes = Set<String>()

func report(code: MobileErrorCode, ...) {
    guard !reportedCodes.contains(code.rawValue) else { return }
    reportedCodes.insert(code.rawValue)
    // Send the report
}

First occurrence sent. Every subsequent occurrence in the same session silently dropped. Support gets one record that the buffer write failed — the signal they need. The set resets on app restart. If the bug is persistent, you get one report per session — the right signal for systematic vs. transient failures.

A bypassDedup flag covers the cases where one code needs to fire even if a related one already did — the false-completion path uses it, because .falseCompletionAssumed is a distinct failure mode from .uploadFailed even when both fire in the same session.
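
In the reporter, that is a defaulted parameter on the report function sketched earlier; other parameters omitted here:

func report(code: MobileErrorCode, bypassDedup: Bool = false) {
    // Dedup unless the caller explicitly opts out
    guard bypassDedup || !reportedCodes.contains(code.rawValue) else { return }
    reportedCodes.insert(code.rawValue)
    // Build, sanitize, and send the report
}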

Reports that fail to send queue in memory, capped at 20, and flush when the network returns. That matters because the most common time to see these errors is during degraded network conditions — the same conditions that would drop the report itself.
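
A sketch of that queue. The cap of 20 and the flush-on-reconnect behavior come from the paragraph above; the NWPathMonitor wiring and the rest are illustrative:

import Foundation
import Network

final class PendingReportQueue {
    private var pending = [[String: Any]]()
    private let monitor = NWPathMonitor()
    private let queue = DispatchQueue(label: "error-report-queue")

    init() {
        monitor.pathUpdateHandler = { [weak self] path in
            // Connectivity is back: try to drain the queue
            if path.status == .satisfied { self?.flush() }
        }
        monitor.start(queue: queue)
    }

    func enqueue(_ body: [String: Any]) {
        queue.async {
            // Hard cap: past 20 pending reports, drop rather than grow
            guard self.pending.count < 20 else { return }
            self.pending.append(body)
        }
    }

    private func flush() {
        // Send each pending body to /feedbackV2, removing it on success.
        // Transport is omitted in this sketch.
    }
}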

What it added up to

Eleven bugs by my count: the verification fallback assuming success, four silent paths in startRecording, two in deleteNote, eight-plus catch blocks with no outbound signal, the commented-out environment guards, and the cross-platform contract mismatch. Some are variants of the same thing depending on how you count.

They weren't eleven independent bugs. They were one decision — or rather the absence of one — repeated across the codebase. Nobody sat down and decided "let's make sure errors don't reach the backend." The web app had an answer: /feedbackV2. The iOS port never carried it across. Without it, every catch block defaulted to device-only logging, not because anyone chose that, but because there was nothing else to reach for.

A catch block without an outbound signal is a feature flag set to "silent." Every time you write one, you're making a product decision — that this failure mode is something support doesn't need to know about, that users affected by it will either not notice or won't need help. Sometimes that's the right call. It should be a call, not a default.

Eleven path-of-least-resistance decisions, one structural gap underneath them.
