Common Issues

This page covers the most frequently encountered problems when developing with the AEGIS platform, along with diagnosis steps and solutions.

Service Won’t Start

Symptoms

  • uvicorn exits immediately with an import error
  • Service starts but crashes during lifespan initialization
  • Port already in use error

Diagnosis

# Check if the port is already occupied
lsof -i :8001

# Try starting the service manually with verbose output
cd services/orchestration-engine
poetry run uvicorn orchestration.main:app --port 8001 --host 0.0.0.0

Solutions

Import error / ModuleNotFoundError:

# Reinstall dependencies, including the shared library
cd services/{service-name}
poetry install

The shared library (aegis-shared) is a path dependency at ../../shared, so the poetry install above also installs it. If you see ModuleNotFoundError: No module named 'aegis_shared', the package is missing from the service’s virtualenv and the same command fixes it.

Port already in use:

# Kill the process on the port
lsof -ti:8001 | xargs kill -9

# Or use the stop script
./infrastructure/scripts/start-all.sh stop
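If lsof is not available (for example inside a slim container), the same question can be answered from Python. The sketch below is a minimal stdlib probe, not part of AEGIS; the function name port_in_use is made up for illustration.

```python
import socket


def port_in_use(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise,
        # so a refused connection does not raise.
        return probe.connect_ex((host, port)) == 0
```

Calling port_in_use("127.0.0.1", 8001) before starting the orchestration engine tells you whether another process already holds the port. It does not tell you which process, so lsof is still needed to find the PID.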

Lifespan crash (database/Redis not reachable):

Make sure infrastructure is running:

docker compose up -d
docker compose ps   # Verify all containers are healthy

Database Connection Errors

Symptoms

  • asyncpg.exceptions.ConnectionDoesNotExistError
  • connection refused on port 5432
  • password authentication failed for user "aegis"
  • Service hangs during startup (waiting for pool connection)

Diagnosis

# Check if PostgreSQL container is running
docker compose ps postgres

# Test connectivity directly
psql -h localhost -U aegis -d aegis -c "SELECT 1;"

# Check container logs
docker compose logs postgres

Solutions

Container not running:

docker compose up -d postgres

# Wait for health check to pass
docker compose ps   # Look for "(healthy)" status

Container running but connection refused:

Verify that the DATABASE_URL environment variable matches the Docker Compose config:

DATABASE_URL=postgresql://aegis:aegis_local@localhost:5432/aegis

Check your .env file or environment. The default credentials are aegis / aegis_local.
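A mismatch is easier to spot when the URL is split into its parts. The following sketch compares a DATABASE_URL against the local defaults quoted above. The helper names and the EXPECTED dictionary are illustrative assumptions, and the password is deliberately left out of the comparison so it is never printed.

```python
from urllib.parse import urlsplit

# Local-development defaults taken from the DATABASE_URL shown above.
EXPECTED = {"user": "aegis", "host": "localhost", "port": 5432, "database": "aegis"}


def database_url_parts(url: str) -> dict:
    """Split a postgresql:// URL into the fields worth comparing."""
    parts = urlsplit(url)
    return {
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port or 5432,  # 5432 is PostgreSQL's default port
        "database": parts.path.lstrip("/"),
    }


def database_url_mismatches(url: str, expected: dict = EXPECTED) -> dict:
    """Map each differing field to an (actual, expected) pair."""
    actual = database_url_parts(url)
    return {key: (actual[key], want) for key, want in expected.items() if actual[key] != want}
```

An empty result means the URL agrees with the defaults. Anything else names the field to fix in .env.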

Schema not initialized:

If the database exists but tables are missing, the init scripts may not have run. The init SQL files are mounted as Docker volumes and run on first container creation:

# Force re-initialization by removing the volume
docker compose down
rm -rf docker-volumes/postgres
docker compose up -d postgres

Removing the postgres volume deletes all data. Only do this in local development.

Apache AGE extension issues:

AGE requires special initialization. If you see errors about ag_catalog or graph functions, check that the AGE extension loaded:

psql -h localhost -U aegis -d aegis -c "LOAD 'age'; SET search_path = ag_catalog, \"\$user\", public; SELECT * FROM ag_catalog.ag_graph;"

AGE Graph oid does not exist (catalog OID drift)

Symptoms

  • GET /context/assemble/{well_api} and other knowledge-graph reads return 500 for every entity
  • Root error: asyncpg.exceptions.UndefinedObjectError: graph with oid <N> does not exist
  • POST /seed fails with the same oid <N> does not exist
  • A service restart does not fix it (it’s catalog state, not connection cache)
  • The KG service refuses to start with a message about the oilgas graph being corrupt

Cause

create_graph('oilgas') sets ag_catalog.ag_graph.graphid equal to the oid of the graph’s backing namespace (schema), and AGE relies on that invariant. A logical pg_dump restore reattaches the namespace by name (so reads keep working) but freezes graphid as the old literal oid, while CREATE SCHEMA oilgas is assigned a fresh oid on restore. The two drift apart, and any path that resolves the graph by graphid then dereferences a dead oid → graph with oid <N> does not exist. See issue #18.

The relational tables restore fine — only the Apache AGE graph is affected. Logical backups are restorable; they just need the graph rebuilt afterwards (handled by restore.sh).
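The drift can be reproduced in a toy model. The sketch below is not AGE or PostgreSQL code. ToyCatalog and logical_restore are invented stand-ins that mimic only the two behaviours described above: a graph's graphid is set to its schema's oid at creation, and a logical restore copies graphid literally while the schema gets a fresh oid.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class GraphRow:
    name: str
    graphid: int    # copied as a literal by a logical dump
    namespace: str  # reattached by name on restore


class ToyCatalog:
    """Toy stand-in for the catalog: schema oids plus ag_graph rows."""

    def __init__(self, first_oid: int):
        self._oids = count(first_oid)
        self.schemas = {}  # schema name -> oid
        self.graphs = {}   # graph name -> GraphRow

    def create_schema(self, name: str) -> int:
        self.schemas[name] = next(self._oids)
        return self.schemas[name]

    def create_graph(self, name: str) -> None:
        oid = self.create_schema(name)
        self.graphs[name] = GraphRow(name, graphid=oid, namespace=name)

    def drop_graph(self, name: str) -> None:
        del self.graphs[name]
        del self.schemas[name]

    def aligned(self, name: str) -> bool:
        row = self.graphs[name]
        return row.graphid == self.schemas[row.namespace]


def logical_restore(source: ToyCatalog, first_oid: int) -> ToyCatalog:
    """Replay a dump into a new catalog that hands out different oids."""
    target = ToyCatalog(first_oid)
    for schema in source.schemas:
        target.create_schema(schema)  # fresh oid
    for name, row in source.graphs.items():
        target.graphs[name] = GraphRow(row.name, row.graphid, row.namespace)  # stale oid
    return target
```

In this model a restored catalog reports aligned("oilgas") as False until the graph is dropped and created again, which is why the repair goes through drop_graph and create_graph rather than patching the row.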

Diagnosis

# Is graphid still aligned with the live namespace oid? Expect 't'.
psql -h localhost -U aegis -d aegis -c \
  "SELECT graphid, namespace::oid AS schema_oid, (graphid = namespace::oid) AS aligned \
   FROM ag_catalog.ag_graph WHERE name='oilgas';"

The KG service runs this same check at startup (_verify_graph_integrity).

Solutions

Manual repair (rebuild the graph through AGE, then reseed):

psql -h localhost -U aegis -d aegis -c \
  "LOAD 'age'; SET search_path=ag_catalog,public; \
   SELECT drop_graph('oilgas', true); SELECT create_graph('oilgas');"
curl -X POST http://localhost:8003/seed

On restore: infrastructure/deploy/scripts/restore.sh detects the drift after the replay and rebuilds + reseeds the graph automatically — no manual step needed.

Self-healing on boot (dev/demo only): set KG_GRAPH_AUTO_REPAIR=true so the KG service repairs the drift (drop + create + reseed) on startup instead of failing loud. This is destructive to graph data, so leave it false anywhere the graph holds non-seed data.
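The boot-time behaviour amounts to a three-way decision. The sketch below shows that decision only; it is not the KG service's code. The function names are hypothetical, and the way the flag value is parsed (case-insensitive "true") is an assumption.

```python
def auto_repair_enabled(env: dict) -> bool:
    """Assumed parsing: only the literal string "true" (any case) enables repair."""
    return env.get("KG_GRAPH_AUTO_REPAIR", "false").strip().lower() == "true"


def startup_action(aligned: bool, env: dict) -> str:
    """Return what the service would do at boot: start, repair, or fail."""
    if aligned:
        return "start"
    # Repair is destructive (drop + create + reseed), so it must be opted into.
    return "repair" if auto_repair_enabled(env) else "fail"
```

With the variable unset, a drifted graph yields "fail", which matches the default fail-loud behaviour described above.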


Redis Connection Issues

Symptoms

  • redis.exceptions.ConnectionError: Error connecting to localhost:6379
  • Service starts but memory operations fail
  • RedisManager: Redis not connected. Call connect() first.

Diagnosis

# Check if Redis container is running
docker compose ps redis

# Test connectivity
redis-cli ping
# Expected: PONG

# Check container logs
docker compose logs redis
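When redis-cli is not installed on the host, a raw socket can send the same PING. This is a minimal stdlib sketch with a made-up function name. It relies on Redis accepting inline commands and assumes the local instance has no password; a server that requires AUTH answers with an error, which the function reports as False.

```python
import socket


def redis_ping(host: str = "localhost", port: int = 6379, timeout: float = 2.0) -> bool:
    """Return True if the server answers an inline PING with +PONG."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"PING\r\n")  # inline command, no client library needed
            reply = conn.recv(64)
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False
    return reply.startswith(b"+PONG")
```

A False result with the container reported as running usually points at the host or port in REDIS_URL rather than at Redis itself.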

Solutions

Container not running:

docker compose up -d redis

Connection URL mismatch:

Verify the REDIS_URL environment variable:

REDIS_URL=redis://localhost:6379

Data corruption or persistence issues:

# Clear Redis data and restart
docker compose stop redis
rm -rf docker-volumes/redis
docker compose up -d redis

Kafka Consumer Lag

Symptoms

  • Events published by the ingestion service are not received by consumers
  • Kafka consumer shows no new messages
  • confluent_kafka.KafkaException: KafkaError{code=_TRANSPORT}

Diagnosis

# Check if Kafka container is running
docker compose ps kafka

# Check container logs for errors
docker compose logs kafka

# List topics (requires kafka tools in the container)
docker compose exec kafka kafka-topics --list --bootstrap-server localhost:9092

Solutions

Container not running or unhealthy:

docker compose up -d kafka

Kafka won’t start (cluster ID mismatch):

This happens when the Kafka data volume has stale cluster metadata:

docker compose stop kafka
rm -rf docker-volumes/kafka
docker compose up -d kafka

Consumer not receiving messages:

Verify the KAFKA_BOOTSTRAP_SERVERS environment variable:

KAFKA_BOOTSTRAP_SERVERS=localhost:9092
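A malformed value (missing port, stray scheme, trailing comma) fails in ways that look like a broker outage. The sketch below validates the comma-separated host:port format before a client is ever created. The function name is illustrative, and the parsing is an assumption about how strict you want to be, not the client library's own logic.

```python
def parse_bootstrap_servers(value: str) -> list[tuple[str, int]]:
    """Split "host:port,host:port" into (host, port) pairs, rejecting bad entries."""
    servers = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing comma
        host, sep, port = entry.rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError(f"expected host:port, got {entry!r}")
        servers.append((host, int(port)))
    return servers
```

For the single-broker local setup the result should be one pair, localhost and 9092.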

Kafka in AEGIS uses KRaft mode (no ZooKeeper) with a single broker. The cluster ID is hardcoded in docker-compose.yml.


LLM API Key Issues

Symptoms

  • openai.AuthenticationError: Incorrect API key provided
  • litellm.exceptions.AuthenticationError
  • Agent execution returns an error about missing API key
  • Empty or garbled LLM responses

Diagnosis

# Check if the key is set
echo $OPENAI_API_KEY

# Test the key directly
curl https://api.openai.com/v1/models \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -s | head -20

Solutions

Key not set:

Copy .env.example to .env and fill in your OpenAI API key:

cp .env.example .env
# Edit .env and set OPENAI_API_KEY=sk-...

Then export it in your shell or source the env file:

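# Load .env and export every variable it defines
# (a plain "source .env" sets the variables without exporting them to child processes)
set -a; source .env; set +a

# Or export directly
export OPENAI_API_KEY=sk-your-key-here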

Key is set but wrong:

Verify the key starts with sk- and is a valid OpenAI key. If using Anthropic models via LiteLLM, also set ANTHROPIC_API_KEY.
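Keys copied into .env often pick up quotes or whitespace along the way. The sketch below checks only the shape of the value, using the sk- prefix mentioned above. It cannot tell whether the key is valid, and the function name is made up for illustration.

```python
from typing import Optional


def api_key_problems(value: Optional[str]) -> list:
    """List formatting problems with an OpenAI-style key; an empty list means none found."""
    if not value or not value.strip():
        return ["not set or empty"]
    problems = []
    if value != value.strip():
        problems.append("leading or trailing whitespace")
    stripped = value.strip()
    if stripped[0] in "\"'" or stripped[-1] in "\"'":
        problems.append("wrapped in quotes")
    if not stripped.strip("\"'").startswith("sk-"):
        problems.append("does not start with sk-")
    return problems
```

Running it as api_key_problems(os.environ.get("OPENAI_API_KEY")) after importing os checks the value the service will actually see, without echoing the key to the terminal.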

Rate limiting:

If you see 429 Too Many Requests, you have hit the OpenAI rate limit. The orchestration engine uses gpt-4o as the primary model with gpt-4o-mini as fallback. Budget constraints are enforced per agent (see agent YAML configs in agents/).
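The primary/fallback arrangement can be pictured as a loop over models. The sketch below is a generic illustration of that pattern, not the orchestration engine's implementation: RateLimited is a stand-in for the provider's 429 error, and call is a hypothetical callable taking (model, prompt).

```python
class RateLimited(Exception):
    """Stand-in for a provider's HTTP 429 error (hypothetical)."""


def complete_with_fallback(call, prompt, models=("gpt-4o", "gpt-4o-mini")):
    """Try each model in order; return (model, result) from the first that succeeds."""
    if not models:
        raise ValueError("at least one model is required")
    last_error = None
    for model in models:
        try:
            return model, call(model, prompt)
        except RateLimited as exc:
            last_error = exc  # remember the failure and try the next model
    raise last_error
```

If every model is rate limited, the last error propagates to the caller, which is the case where waiting or raising the account limit is the only fix.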


Poetry Dependency Conflicts

Symptoms

  • poetry install fails with resolver errors
  • Version conflict between service dependencies and shared library
  • SolverProblemError mentioning incompatible versions

Diagnosis

cd services/{service-name}
poetry lock --check   # Verify lock file is consistent (on newer Poetry releases: poetry check --lock)
poetry show --tree    # Show dependency tree

Solutions

Lock file out of date:

cd services/{service-name}
poetry lock      # Regenerate lock file
poetry install   # Install from fresh lock

Shared library version conflict:

The shared library is installed as a path dependency:

aegis-shared = {path = "../../shared", develop = true}

If the shared library’s dependencies conflict with the service’s dependencies, update the shared library’s pyproject.toml first, then re-lock both:

cd shared
poetry lock
cd ../services/{service-name}
poetry lock
poetry install

Virtual environment issues:

If Poetry is using the wrong Python version or a stale virtual environment:

cd services/{service-name}
poetry env remove python
poetry install   # Creates a fresh virtualenv

Python Version Mismatches

Symptoms

  • SyntaxError on Python 3.12 syntax (e.g., the type statement), or ImportError on newer standard-library features (e.g., enum.StrEnum, added in 3.11)
  • Poetry refuses to install, citing Python version constraint
  • pyenv: version '3.12.x' not installed

Diagnosis

# Check active Python version
python --version
pyenv version

# Check what the repo expects
cat .python-version

Solutions

The repository requires Python 3.12, pinned in .python-version at the repo root.
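The comparison between the pin and the running interpreter can be scripted. The sketch below assumes .python-version holds a purely numeric version such as 3.12 or 3.12.4 and treats the pin as a prefix. The function name is illustrative.

```python
import sys


def satisfies_pin(pin: str, actual=None) -> bool:
    """Return True if the interpreter version starts with the pinned components."""
    current = tuple(sys.version_info[:3]) if actual is None else tuple(actual)
    wanted = tuple(int(part) for part in pin.strip().split("."))
    # A pin of "3.12" matches any 3.12.x; "3.12.4" matches only that patch release.
    return current[: len(wanted)] == wanted
```

From the repo root, satisfies_pin(open(".python-version").read()) returns False when the active interpreter is not the pinned one.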

Install the correct Python version:

pyenv install 3.12
pyenv rehash

Ensure pyenv is active:

# Verify pyenv is managing the Python version
which python
# Should show something like: /Users/you/.pyenv/shims/python

# If not, ensure pyenv is in your shell config
eval "$(pyenv init -)"

Poetry using the wrong Python:

Poetry builds the virtualenv from the active interpreter (for example the pyenv shim) when virtualenvs.prefer-active-python is true. Enable it:

poetry config virtualenvs.prefer-active-python true

Then re-create the virtual environment:

cd services/{service-name}
poetry env remove python
poetry install

API Gateway Won’t Compile or Start

Symptoms

  • go: command not found
  • Go compilation errors
  • Gateway starts but returns 502 on all requests

Diagnosis

# Check Go installation
go version
# Expected: go1.21 or higher

# Try building manually
cd services/api-gateway
go build ./cmd/gateway/

Solutions

Go not installed:

Install Go 1.21+ from golang.org or via your package manager.

Compilation errors:

cd services/api-gateway
go mod tidy   # Clean up module dependencies
go build ./cmd/gateway/

502 errors on all routes:

The gateway proxies to backend services. If backends are not running, all requests return 502. Start all services first:

./infrastructure/scripts/start-all.sh

JWT Authentication Failures

Symptoms

  • 401 Unauthorized on all gateway requests
  • invalid token or token expired errors
  • Login succeeds but subsequent requests fail

Diagnosis

# Get a fresh token
curl -s -X POST http://localhost:8000/api/v1/auth/token \
  -H "Content-Type: application/json" \
  -d '{"email": "admin@aegis.local", "password": "aegis-dev-admin"}' | python3 -m json.tool

# Validate the token
TOKEN="..."
curl -s -X POST http://localhost:8009/auth/validate \
  -G --data-urlencode "authorization=Bearer $TOKEN" | python3 -m json.tool
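To tell an expired token from an otherwise invalid one, it helps to read the claims locally. The sketch below decodes the payload segment of a JWT without verifying the signature, so it is a debugging aid only. The function names are made up, and it assumes the token carries the standard exp claim.

```python
import base64
import json
import time


def jwt_claims(token: str) -> dict:
    """Decode the payload of a JWT WITHOUT verifying its signature."""
    try:
        _header, payload, _signature = token.split(".")
    except ValueError:
        raise ValueError("expected three dot-separated segments") from None
    padded = payload + "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(padded))


def is_expired(claims: dict, now=None) -> bool:
    """Return True if the exp claim is present and not in the future."""
    current = time.time() if now is None else now
    return "exp" in claims and claims["exp"] <= current
```

If is_expired(jwt_claims(TOKEN)) is True, request a fresh token. If it is False and requests still fail, look at the auth service and JWT_SECRET instead.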

Solutions

Auth service not running:

The gateway validates every request through the auth service on port 8009. If it is down, all authenticated requests fail:

cd services/auth-service
poetry run uvicorn auth_service.main:app --port 8009

JWT_SECRET not set or mismatched:

The auth service and gateway must share the same JWT_SECRET. Check your .env file.
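Why a mismatched secret rejects every token can be shown with a few lines of HMAC. The sketch below assumes the tokens are signed with HS256, which this page does not state, and its helper names are hypothetical. It demonstrates the mechanism and is not the auth service's code.

```python
import base64
import hashlib
import hmac


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def sign_hs256(signing_input: str, secret: str) -> str:
    """Compute the HS256 signature segment for "header.payload"."""
    digest = hmac.new(secret.encode(), signing_input.encode(), hashlib.sha256).digest()
    return _b64url(digest)


def signature_matches(token: str, secret: str) -> bool:
    """Return True if the token's signature was produced with this secret."""
    signing_input, _, signature = token.rpartition(".")
    return hmac.compare_digest(sign_hs256(signing_input, secret), signature)
```

A token signed with one secret never verifies under another, so a login can succeed at the auth service while a component holding a different JWT_SECRET rejects the same token.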

Invalid email or password at login:

The auth service returns 401 Invalid email or password if the credentials are wrong or the user is inactive. For local development, log in with the seeded bootstrap admin (admin@aegis.local / aegis-dev-admin). To create or rotate a user:

cd services/auth-service
poetry run python -m auth_service.create_user --email alice@example.com --roles operator

SSE/EventSource requests fail with 401:

Browser EventSource cannot set the Authorization header, so the gateway falls back to the aegis_token cookie. Make sure you logged in through the frontend (which sets the cookie) rather than only holding a token in memory.
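The fallback order can be sketched as follows. The gateway itself is written in Go, so this Python version only illustrates the lookup described above: Authorization header first, then the aegis_token cookie. The function name and the plain-dict headers (with exact-case keys) are assumptions.

```python
from http.cookies import SimpleCookie
from typing import Optional


def extract_token(headers: dict) -> Optional[str]:
    """Return the bearer token from the header, else from the aegis_token cookie."""
    scheme, _, value = headers.get("Authorization", "").partition(" ")
    if scheme.lower() == "bearer" and value.strip():
        return value.strip()
    # EventSource requests arrive without an Authorization header,
    # so the cookie is the only place the token can come from.
    cookies = SimpleCookie()
    cookies.load(headers.get("Cookie", ""))
    morsel = cookies.get("aegis_token")
    return morsel.value if morsel else None
```

A request with neither source yields None, which corresponds to the 401 seen on SSE connections opened before logging in through the frontend.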
