TODO: Internet Archive Integration Documentation

Priority: Critical
Status: Not Started
Estimated Effort: 6-8 hours

Problem

The app's core functionality revolves around streaming Grateful Dead concerts from the Internet Archive, but there is no documentation explaining:
- Which Internet Archive collections are used
- How the data is discovered and fetched
- The data normalization process mentioned in the app description
- Metadata structure and how it's parsed
- Audio streaming implementation
- How the app handles Internet Archive's specific quirks

This makes it difficult to:
- Understand what data is available
- Debug issues with specific recordings
- Add support for additional collections
- Handle changes in Internet Archive's data structure
- Explain to users why certain shows appear or don't appear

What Needs Documentation

1. Internet Archive Collections

Document which specific collections are used:
- GratefulDead collection?
- etree (Live Music Archive)?
- Other related collections?

For each collection, document:
- Collection identifier
- Number of recordings
- Audio quality standards
- Metadata quality
- Update frequency
- Access restrictions (if any)

Example:

Collection: GratefulDead
Identifier: GratefulDead
URL: https://archive.org/details/GratefulDead
Contains: ~13,000 recordings
Formats: FLAC, MP3, OGG
Soundboard: Yes/No
Audience: Yes/No

2. Metadata Structure

The Internet Archive has complex metadata. Document:

Core Metadata Fields

identifier: Unique show ID
title: Show title
date: Performance date (format?)
venue: Venue name
coverage: Location (city, state)
creator: Band members
year: Year of performance
collection: Parent collection
subject: Tags and keywords
description: Show notes

Audio File Metadata

format: File formats available
track: Track number and title
length: Duration
size: File size
bitrate: Audio quality

How to Access

Document the API calls to retrieve metadata:

GET https://archive.org/metadata/{identifier}
Response: Full metadata JSON

Show examples of actual responses and how they're parsed.
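
As a starting point for those examples, here is a minimal Python sketch of parsing a metadata response. The sample JSON is hand-written and heavily abbreviated, not a captured response; the identifier is a placeholder, and the helper names (`as_list`, `summarize_item`) are hypothetical. Treating "has a `length` field" as the marker for an audio file is an assumption that should be checked against what ArchiveMetadataResponse actually parses.

```python
import json

METADATA_URL = "https://archive.org/metadata/{identifier}"

# Hand-written, abbreviated sample for illustration only (NOT a real response).
SAMPLE_RESPONSE = """
{
  "metadata": {
    "identifier": "example-show-identifier",
    "title": "Grateful Dead Live at Fillmore West on 1970-02-28",
    "date": "1970-02-28",
    "venue": "Fillmore West",
    "coverage": "San Francisco, CA",
    "collection": ["GratefulDead", "etree"]
  },
  "files": [
    {"name": "track01.flac", "format": "Flac", "length": "312.45", "size": "31245678"},
    {"name": "track01.mp3", "format": "VBR MP3", "length": "312.45", "size": "7123456"},
    {"name": "info.txt", "format": "Text", "size": "2048"}
  ]
}
"""


def as_list(value):
    """Metadata fields may hold one string or a list; always return a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def summarize_item(raw_json):
    """Pull out the fields a show screen would likely need (assumed selection)."""
    doc = json.loads(raw_json)
    meta = doc.get("metadata", {})
    files = doc.get("files", [])
    return {
        "identifier": meta.get("identifier"),
        "date": meta.get("date"),
        "venue": meta.get("venue"),
        "collections": as_list(meta.get("collection")),
        # Assumption: only audio files carry a "length" entry.
        "audio_files": [f["name"] for f in files if "length" in f],
    }
```

Note that numeric values such as `length` and `size` arrive as strings in this sample, so the real parser may need explicit conversion.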

3. Data Normalization Process

The app description claims "normalized data for seamless searching". Document:

What Gets Normalized

Dates: Different formats → Standard ISO date
Venues: Various spellings → Canonical names
Band Members: Name variations → Standard names
Locations: City/state variations → Consistent format
Song Titles: Different spellings → Standard names
Audio Quality: Different descriptors → Standard ratings
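
To make the first row concrete, the following Python sketch normalizes a few date spellings to an ISO date. The list of input formats is an assumption for illustration; the formats that actually occur in the collections need to be confirmed during the research step, and `normalize_date` is a hypothetical name.

```python
from datetime import datetime

# Candidate input formats, tried in order (assumed, not exhaustive).
DATE_FORMATS = (
    "%Y-%m-%d",
    "%Y-%m-%dT%H:%M:%SZ",
    "%m/%d/%Y",
    "%B %d, %Y",
)


def normalize_date(raw):
    """Return an ISO 8601 date (YYYY-MM-DD), or None if no format matches."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    return None
```

Returning `None` rather than raising keeps incomplete metadata from failing a whole import; whether the app prefers that behavior is an open question for the documentation.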

Where Normalization Happens

Is there a preprocessing step?
Does it happen in the mapper layer?
Is normalized data cached in the local database?
Is there a manual curation component?

Example Transformations

Raw: "Grateful Dead Live at Fillmore West on 1970-02-28"
Normalized: {
  band: "Grateful Dead",
  venue: "Fillmore West",
  city: "San Francisco",
  state: "CA",
  date: "1970-02-28",
  venue_id: "fillmore-west-sf"
}
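
The transformation above could be sketched in Python as follows. This assumes titles follow the "<band> Live at <venue> on <date>" pattern shown in the raw example; the `KNOWN_VENUES` table is a hypothetical stand-in for whatever curated venue data the app uses, and `normalize_title` is an illustrative name, not the app's ArchiveMapper.

```python
import re

# Assumed title pattern, taken from the raw example above.
TITLE_PATTERN = re.compile(
    r"^(?P<band>.+?) Live at (?P<venue>.+?) on (?P<date>\d{4}-\d{2}-\d{2})$"
)

# Hypothetical lookup table; real data would come from curation or a database.
KNOWN_VENUES = {
    "Fillmore West": {
        "city": "San Francisco",
        "state": "CA",
        "venue_id": "fillmore-west-sf",
    },
}


def normalize_title(title):
    """Split a raw title into fields; enrich with venue data when known."""
    match = TITLE_PATTERN.match(title.strip())
    if match is None:
        return None  # title does not follow the assumed pattern
    record = match.groupdict()
    record.update(KNOWN_VENUES.get(record["venue"], {}))
    return record
```

Titles that do not match return `None`, which makes the unhandled cases easy to count when auditing a collection.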

4. Search Implementation

Document how search works across Internet Archive:

Search Methods

Advanced Search API: Using advancedsearch.php
Full-Text Search: Searching show descriptions
Field-Specific Search: By venue, date, song, etc.

Search Query Construction

Show how user searches translate to API queries:
- Search by song name → ?q=subject:"Dark Star"
- Search by venue → ?q=venue:*Fillmore* (coverage holds the city and state, not the venue name)
- Search by date range → ?q=date:[1970-01-01 TO 1980-12-31]
- Search by band member → ?q=creator:"Jerry Garcia"
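
A minimal Python sketch of how such a query might be turned into a full request URL. The parameters `q`, `fl[]`, `rows`, `page`, and `output` belong to advancedsearch.php; the helper names, the default field list, and the idea of always scoping to one collection are assumptions for illustration, not the app's confirmed behavior.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://archive.org/advancedsearch.php"


def field_query(collection, field, value):
    """Scope a phrase search on one field to a single collection."""
    escaped = value.replace('"', '\\"')  # keep embedded quotes from ending the phrase
    return f'collection:{collection} AND {field}:"{escaped}"'


def build_search_url(query, fields=("identifier", "title", "date"), rows=50, page=1):
    """Assemble the request URL; urlencode handles all escaping."""
    params = [("q", query)]
    params += [("fl[]", name) for name in fields]  # one fl[] entry per returned field
    params += [("rows", rows), ("page", page), ("output", "json")]
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"
```

For example, `build_search_url(field_query("GratefulDead", "subject", "Dark Star"))` yields a URL that requests the first 50 matches as JSON.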

Search Results Processing

How results are ranked
Pagination handling
Result deduplication
Filtering applied
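
For pagination and deduplication, one possible approach is sketched below in Python. It assumes the JSON search response nests its results under `response.docs` with a `response.numFound` total, and that duplicates are identified by `identifier`; both function names are hypothetical, and the app may well deduplicate on other criteria (for example, one entry per show date).

```python
import math


def total_pages(num_found, rows):
    """Number of pages needed to cover num_found results at rows per page."""
    return math.ceil(num_found / rows) if rows > 0 else 0


def merge_pages(pages):
    """Flatten result pages, keeping the first occurrence of each identifier."""
    seen = set()
    merged = []
    for page in pages:
        for doc in page["response"]["docs"]:
            if doc["identifier"] not in seen:
                seen.add(doc["identifier"])
                merged.append(doc)
    return merged
```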

5. Audio Streaming

Document how audio is streamed from Internet Archive:

File Discovery

How to find audio files for a show
Preferred format selection (FLAC > MP3 > OGG?)
Multiple versions handling (soundboard vs audience)
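
A small Python sketch of preferred-format selection over an item's file list. The preference order mirrors the "FLAC > MP3 > OGG?" question above and is unconfirmed; the format labels ("Flac", "VBR MP3", "Ogg Vorbis") are assumed values of each file's `format` field and should be verified against real responses. `pick_tracks` is a hypothetical helper.

```python
# Assumed format labels and preference order; verify both against real items.
FORMAT_PREFERENCE = ("Flac", "VBR MP3", "Ogg Vorbis")


def pick_tracks(files, preference=FORMAT_PREFERENCE):
    """Return the files of the first preferred format present, sorted by name."""
    for fmt in preference:
        matches = [f for f in files if f.get("format") == fmt]
        if matches:
            return sorted(matches, key=lambda f: f["name"])
    return []  # no playable format found
```

Picking a single format for the whole show, rather than per track, avoids mixing encodings inside one playlist.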

Streaming URLs

https://archive.org/download/{identifier}/{filename}
Example:
https://archive.org/download/gd70-02-28.sbd.miller.91178.flac16/gd70-02-28d1t01.flac
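
The URL pattern above can be expressed as a short Python helper. Percent-encoding the filename is an added precaution for names containing spaces or other reserved characters; whether the app does this is not confirmed, and `download_url` is a hypothetical name.

```python
from urllib.parse import quote


def download_url(identifier, filename):
    """Build a download URL, percent-encoding characters that are unsafe in paths."""
    # Slashes in the filename are kept so files in subdirectories still resolve.
    return f"https://archive.org/download/{quote(identifier, safe='')}/{quote(filename)}"
```

Called with the identifier and filename from the example, it reproduces the example URL unchanged.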

Media Player Integration

How Media3/ExoPlayer is configured
Buffering strategy
Format switching (quality levels)
Playlist generation from setlists

Offline Capabilities

Can shows be downloaded?
How is offline playback implemented?
Storage management

6. Setlist and Track Information

Document how setlists are extracted:

Setlist Sources

Internet Archive metadata
External sources (setlist.fm?)
Manual curation?

Track Parsing

How individual songs are identified
Track timing extraction
Set breaks identification
Encore handling
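
One way track order might be recovered is from filenames. The Python sketch below assumes the "d<disc>t<track>" suffix seen in the example filename gd70-02-28d1t01.flac; many recordings will not follow that convention, so this is only a fallback illustration, and `parse_disc_track` is a hypothetical name.

```python
import re

# Matches a trailing "d<disc>t<track>.<ext>", as in "gd70-02-28d1t01.flac".
TRACK_PATTERN = re.compile(r"d(?P<disc>\d+)t(?P<track>\d+)\.[A-Za-z0-9]+$")


def parse_disc_track(filename):
    """Return (disc, track) as integers, or None if the name does not match."""
    match = TRACK_PATTERN.search(filename)
    if match is None:
        return None
    return int(match["disc"]), int(match["track"])
```

Sorting files by the returned tuple gives playback order across discs without relying on alphabetical filename order.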

Song Database

Is there a local song database?
How are songs matched across shows?
Handling of song variations ("Dark Star" vs "Dark Star >")
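
The "Dark Star" vs "Dark Star >" case could be handled along these lines. This Python sketch assumes segues are marked by a trailing ">" or "->" and that the marker is worth keeping as a flag; `canonical_song_title` is a hypothetical helper, not a function from the codebase.

```python
import re

# Trailing segue markers such as ">", ">>", or "->" (assumed conventions).
SEGUE_MARKER = re.compile(r"\s*-?>+\s*$")


def canonical_song_title(raw):
    """Return (title, has_segue) with whitespace collapsed and the marker removed."""
    text = " ".join(raw.split())
    title = SEGUE_MARKER.sub("", text)
    return title, title != text
```

Keeping the segue as a separate flag lets both spellings match the same song while preserving the setlist detail for display.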

7. Caching Strategy

Document how Internet Archive data is cached:

What Gets Cached

Metadata for recently viewed shows
Search results
Favorite shows
Playlist data
Album art/imagery

Cache Invalidation

How often is data refreshed?
Forced refresh mechanism
Handling stale data
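
If the app uses time-based expiry, the staleness check might look like the Python sketch below. The TTL values and cache categories are placeholders invented for illustration; the real refresh intervals are exactly what this section needs to document.

```python
import time

# Placeholder TTLs in seconds; the actual values are unknown.
TTL_SECONDS = {
    "metadata": 7 * 24 * 3600,
    "search": 3600,
}


def is_stale(kind, fetched_at, now=None, force_refresh=False):
    """True if a cached entry of this kind should be refetched."""
    if force_refresh:
        return True
    current = time.time() if now is None else now
    return (current - fetched_at) >= TTL_SECONDS[kind]
```

Passing `now` explicitly keeps the check deterministic in tests.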

Local Database

What's stored in Room database?
Sync strategy
Offline-first approach?

8. Performance Optimization

Document optimizations for Internet Archive integration:

Request Optimization

Batch metadata requests?
Parallel fetching strategy
Request prioritization

Data Compression

Are responses compressed?
Parsing optimization

Image Loading

Show imagery/posters
Thumbnail generation
Image caching

9. Error Handling

Document specific Internet Archive errors:

Common Issues

Show no longer available
Audio files missing or corrupted
Metadata incomplete
Rate limiting (if any)
Archive.org downtime

User-Facing Messages

How errors are communicated
Retry mechanisms
Fallback behaviors
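
A retry mechanism for transient failures could follow the exponential-backoff pattern sketched below in Python. The set of retryable status codes, the attempt count, and the delays are all assumptions; `fetch` stands in for whatever network call the app makes and is expected to return a `(status, body)` pair.

```python
import time

# Assumed set of transient HTTP statuses worth retrying.
RETRYABLE_STATUS = {429, 500, 502, 503, 504}


def backoff_delays(max_attempts, base=1.0, cap=30.0):
    """Delays between attempts: base, 2*base, 4*base, ... capped at cap."""
    return [min(cap, base * 2 ** i) for i in range(max_attempts - 1)]


def fetch_with_retry(fetch, max_attempts=4, sleep=time.sleep):
    """Call fetch() until it returns a non-retryable status or attempts run out."""
    delays = backoff_delays(max_attempts)
    for attempt in range(max_attempts):
        status, body = fetch()
        if status not in RETRYABLE_STATUS:
            return status, body
        if attempt < len(delays):
            sleep(delays[attempt])  # wait before the next attempt
    return status, body  # last retryable response after exhausting attempts
```

Injecting `sleep` makes the retry timing testable without real waiting.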

10. Internet Archive Compliance

Document compliance with Internet Archive terms:

Attribution

Is attribution displayed?
Where and how?

Terms of Service

Any restrictions on usage
Rate limiting compliance
Caching limitations
Commercial use considerations

Licensing

Content licensing
Display of licensing info
User expectations

Structure

Create: docs/docs/developer/internet-archive.md

Suggested outline:

# Internet Archive Integration

## Overview
[What is Internet Archive, why it's perfect for this app]

## Collections
[Which collections are accessed]

## Metadata
[Structure, fields, examples]

## Data Normalization
[Process, examples, where it happens]

## Search
[How search is implemented]

## Audio Streaming
[Streaming implementation]

## Setlists and Tracks
[Track information extraction]

## Caching
[What's cached and why]

## Performance
[Optimization strategies]

## Error Handling
[Common issues and solutions]

## Compliance
[Terms of service, attribution, licensing]

## Testing
[How to test with Internet Archive data]

## References
[Links to Internet Archive docs]

Research Required

To write this documentation, investigate:

API calls: Search codebase for archive.org URLs
ArchiveApiService: What endpoints are actually used?
ArchiveMetadataResponse: What fields are parsed?
ArchiveMapper: What transformations occur?
Database schema: What Internet Archive data is stored locally?
Search implementation: How are queries constructed?
Media player: How are streaming URLs generated?
Normalization code: Where does data cleanup happen?
Error handling: How are IA-specific errors handled?
Caching: What's cached and for how long?

Code References

Key files to examine:
- androidApp/v2/core/network/archive/api/ArchiveApiService.kt
- androidApp/v2/core/network/archive/model/ArchiveMetadataResponse.kt
- androidApp/v2/core/network/archive/mapper/ArchiveMapper.kt
- androidApp/v2/core/database/ (database entities)
- androidApp/v2/core/search/ (search implementation)
- androidApp/v2/core/media/ or androidApp/v2/core/player/ (streaming)
- Any normalization utilities

External Resources

Research and link to:
- Internet Archive API Documentation
- Internet Archive Metadata
- Advanced Search
- Live Music Archive
- GratefulDead Collection

Checklist

Success Criteria

A developer should be able to:
- Understand exactly which Internet Archive collections are used
- Explain how data flows from IA to the UI
- Add support for new collections
- Debug issues with specific recordings
- Understand the normalization process
- Modify search functionality
- Troubleshoot streaming problems
- Explain to users why certain features work the way they do

Notes

This documentation is critical because the Internet Archive is not a typical API. It has:
- Unique metadata structure
- Multiple audio formats and quality levels
- Complex search capabilities
- Specific terms of service
- Occasional data quality issues

Understanding how the app works with the Internet Archive is essential for maintaining and extending its core functionality.

This documentation should serve both developers (technical implementation) and product people (capabilities and limitations).