Common Issues
This page covers the most frequently encountered problems when developing with the AEGIS platform, along with diagnosis steps and solutions.
Service Won’t Start
Symptoms
- uvicorn exits immediately with an import error
- Service starts but crashes during lifespan initialization
- Port already in use error
Diagnosis
# Check if the port is already occupied
lsof -i :8001
# Try starting the service manually with verbose output
cd services/orchestration-engine
poetry run uvicorn orchestration.main:app --port 8001 --host 0.0.0.0
Solutions
Import error / ModuleNotFoundError:
# Reinstall dependencies including the shared library
cd services/{service-name}
poetry install
The shared library (aegis-shared) is a path dependency at ../../shared. If you see ModuleNotFoundError: No module named 'aegis_shared', the shared package needs to be installed:
cd services/{service-name}
poetry install # This installs aegis-shared as a path dependency
Port already in use:
# Kill the process on the port
lsof -ti:8001 | xargs kill -9
# Or use the stop script
./infrastructure/scripts/start-all.sh stop
Lifespan crash (database/Redis not reachable):
Make sure infrastructure is running:
docker compose up -d
docker compose ps # Verify all containers are healthy
Database Connection Errors
Symptoms
- asyncpg.exceptions.ConnectionDoesNotExistError
- connection refused on port 5432
- password authentication failed for user "aegis"
- Service hangs during startup (waiting for pool connection)
Diagnosis
# Check if PostgreSQL container is running
docker compose ps postgres
# Test connectivity directly
psql -h localhost -U aegis -d aegis -c "SELECT 1;"
# Check container logs
docker compose logs postgres
Solutions
Container not running:
docker compose up -d postgres
# Wait for health check to pass
docker compose ps # Look for "(healthy)" status
Container running but connection refused:
Verify the DATABASE_URL environment variable matches Docker Compose config:
DATABASE_URL=postgresql://aegis:aegis_local@localhost:5432/aegis
Check your .env file or environment. The default credentials are aegis / aegis_local.
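To compare a DATABASE_URL against the Compose defaults field by field, a small script can help. The following is a minimal sketch, assuming the defaults quoted on this page; the function name database_url_mismatches and the EXPECTED table are illustrative and not part of the AEGIS codebase.

```python
from urllib.parse import urlsplit

# Defaults quoted in this guide for the local Docker Compose setup.
EXPECTED = {
    "scheme": "postgresql",
    "username": "aegis",
    "password": "aegis_local",
    "hostname": "localhost",
    "port": 5432,
    "database": "aegis",
}


def database_url_mismatches(url: str) -> dict[str, tuple[object, object]]:
    """Return {field: (expected, actual)} for every part that differs."""
    parts = urlsplit(url)
    actual = {
        "scheme": parts.scheme,
        "username": parts.username,
        "password": parts.password,
        "hostname": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }
    return {
        field: (expected, actual[field])
        for field, expected in EXPECTED.items()
        if actual[field] != expected
    }
```

An empty result means the URL matches the local defaults; any other result names the part to fix in .env.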
Schema not initialized:
If the database exists but tables are missing, the init scripts may not have run. The init SQL files are mounted as Docker volumes and run on first container creation:
# Force re-initialization by removing the volume
docker compose down
rm -rf docker-volumes/postgres
docker compose up -d postgres
Removing the postgres volume deletes all data. Only do this in local development.
Apache AGE extension issues:
AGE requires special initialization. If you see errors about ag_catalog or graph functions, check that the AGE extension loaded:
psql -h localhost -U aegis -d aegis -c "LOAD 'age'; SET search_path = ag_catalog, \"\$user\", public; SELECT * FROM ag_catalog.ag_graph;"
AGE Graph oid does not exist (catalog OID drift)
Symptoms
- GET /context/assemble/{well_api} and other knowledge-graph reads return 500 for every entity
- Root error: asyncpg.exceptions.UndefinedObjectError: graph with oid <N> does not exist
- POST /seed fails with the same oid <N> does not exist
- A service restart does not fix it (it’s catalog state, not connection cache)
- The KG service refuses to start with a message about the oilgas graph being corrupt
Cause
create_graph('oilgas') sets ag_catalog.ag_graph.graphid equal to the oid of the graph’s backing namespace (schema), and AGE relies on that invariant. A logical pg_dump restore reattaches the namespace by name (so reads keep working) but freezes graphid as the old literal oid, while CREATE SCHEMA oilgas is assigned a fresh oid on restore. The two drift apart, and any path that resolves the graph by graphid then dereferences a dead oid and fails with graph with oid <N> does not exist. See issue #18.
The relational tables restore fine — only the Apache AGE graph is affected. Logical backups are restorable; they just need the graph rebuilt afterwards (handled by restore.sh).
Diagnosis
# Is graphid still aligned with the live namespace oid? Expect 't'.
psql -h localhost -U aegis -d aegis -c \
"SELECT graphid, namespace::oid AS schema_oid, (graphid = namespace::oid) AS aligned \
FROM ag_catalog.ag_graph WHERE name='oilgas';"
The KG service runs this same check at startup (_verify_graph_integrity).
Solutions
Manual repair (rebuild the graph through AGE, then reseed):
psql -h localhost -U aegis -d aegis -c \
"LOAD 'age'; SET search_path=ag_catalog,public; \
SELECT drop_graph('oilgas', true); SELECT create_graph('oilgas');"
curl -X POST http://localhost:8003/seed
On restore: infrastructure/deploy/scripts/restore.sh detects the drift after the replay and rebuilds + reseeds the graph automatically, so no manual step is needed.
Self-healing on boot (dev/demo only): set KG_GRAPH_AUTO_REPAIR=true so the KG service repairs the drift (drop + create + reseed) on startup instead of failing loud. This is destructive to graph data, so leave it false anywhere the graph holds non-seed data.
Redis Connection Issues
Symptoms
- redis.exceptions.ConnectionError: Error connecting to localhost:6379
- Service starts but memory operations fail
- RedisManager: Redis not connected. Call connect() first.
Diagnosis
# Check if Redis container is running
docker compose ps redis
# Test connectivity
redis-cli ping
# Expected: PONG
# Check container logs
docker compose logs redis
Solutions
Container not running:
docker compose up -d redis
Connection URL mismatch:
Verify the REDIS_URL environment variable:
REDIS_URL=redis://localhost:6379
Data corruption or persistence issues:
# Clear Redis data and restart
docker compose stop redis
rm -rf docker-volumes/redis
docker compose up -d redis
Kafka Consumer Lag
Symptoms
- Events published by the ingestion service are not received by consumers
- Kafka consumer shows no new messages
- confluent_kafka.KafkaException: KafkaError{code=_TRANSPORT}
Diagnosis
# Check if Kafka container is running
docker compose ps kafka
# Check container logs for errors
docker compose logs kafka
# List topics (requires kafka tools in the container)
docker compose exec kafka kafka-topics --list --bootstrap-server localhost:9092
Solutions
Container not running or unhealthy:
docker compose up -d kafka
Kafka won’t start (cluster ID mismatch):
This happens when the Kafka data volume has stale cluster metadata:
docker compose stop kafka
rm -rf docker-volumes/kafka
docker compose up -d kafka
Consumer not receiving messages:
Verify the KAFKA_BOOTSTRAP_SERVERS environment variable:
KAFKA_BOOTSTRAP_SERVERS=localhost:9092
Kafka in AEGIS uses KRaft mode (no ZooKeeper) with a single broker. The cluster ID is hardcoded in docker-compose.yml.
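A _TRANSPORT error often means the broker address is simply unreachable, which can be checked before involving any Kafka client. The sketch below assumes KAFKA_BOOTSTRAP_SERVERS is a comma-separated list of host:port entries with explicit ports and no bracketed IPv6 addresses; parse_bootstrap_servers and is_reachable are illustrative helpers, not AEGIS functions.

```python
import socket


def parse_bootstrap_servers(value: str) -> list[tuple[str, int]]:
    """Split "host1:9092,host2:9092" into (host, port) pairs."""
    servers = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate trailing commas
        host, _, port = entry.rpartition(":")
        servers.append((host, int(port)))
    return servers


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A successful TCP connection only shows that something is listening on the port. It does not confirm that the broker's advertised listeners match the address the client uses.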
LLM API Key Issues
Symptoms
- openai.AuthenticationError: Incorrect API key provided
- litellm.exceptions.AuthenticationError
- Agent execution returns an error about missing API key
- Empty or garbled LLM responses
Diagnosis
# Check if the key is set
echo $OPENAI_API_KEY
# Test the key directly
curl https://api.openai.com/v1/models \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-s | head -20
Solutions
Key not set:
Copy .env.example to .env and fill in your OpenAI API key:
cp .env.example .env
# Edit .env and set OPENAI_API_KEY=sk-...
Then export it in your shell or source the env file:
set -a; source .env; set +a # set -a exports the sourced variables so child processes see them
# Or export directly
export OPENAI_API_KEY=sk-your-key-here
Key is set but wrong:
Verify the key starts with sk- and is a valid OpenAI key. If using Anthropic models via LiteLLM, also set ANTHROPIC_API_KEY.
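Printing a key with echo leaves it in terminal scrollback, so it can be useful to check the key's shape without revealing it. A minimal sketch follows; describe_key is a hypothetical helper, and the only format rule it applies is the sk- prefix mentioned above.

```python
import os


def describe_key(name: str, expected_prefix: str = "") -> str:
    """Summarise an API key from the environment without revealing it."""
    value = os.environ.get(name, "")
    if not value:
        return f"{name}: not set"
    if value != value.strip():
        return f"{name}: set, but has leading/trailing whitespace"
    if expected_prefix and not value.startswith(expected_prefix):
        return f"{name}: set, but does not start with {expected_prefix!r}"
    return f"{name}: set ({len(value)} characters)"
```

For example, describe_key("OPENAI_API_KEY", "sk-") covers the OpenAI case, and describe_key("ANTHROPIC_API_KEY") only reports whether the variable is present.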
Rate limiting:
If you see 429 Too Many Requests, you have hit the OpenAI rate limit. The orchestration engine uses gpt-4o as the primary model with gpt-4o-mini as fallback. Budget constraints are enforced per agent (see agent YAML configs in agents/).
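When reproducing a 429 from your own scripts, retrying with exponential backoff and then falling back to a second model is a common pattern. The sketch below is generic and is not taken from the orchestration engine: RateLimited, backoff_delays, and call_with_fallback are hypothetical names, and the retry counts and delays are arbitrary.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for a provider's HTTP 429 error."""


def backoff_delays(retries: int, base: float = 1.0, cap: float = 30.0) -> list[float]:
    """Exponential delays in seconds: base, 2*base, 4*base, ... capped at cap."""
    return [min(cap, base * 2**attempt) for attempt in range(retries)]


def call_with_fallback(
    call: Callable[[str], T],
    models: Sequence[str],
    retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try each model in order, sleeping after every rate-limited attempt."""
    for model in models:
        for delay in backoff_delays(retries):
            try:
                return call(model)
            except RateLimited:
                sleep(delay)
    raise RateLimited(f"all models rate limited: {', '.join(models)}")
```

Passing models=["gpt-4o", "gpt-4o-mini"] mirrors the primary and fallback order named above. The sleep parameter exists so the function can be exercised without real waiting.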
Poetry Dependency Conflicts
Symptoms
- poetry install fails with resolver errors
- Version conflict between service dependencies and shared library
- SolverProblemError mentioning incompatible versions
Diagnosis
cd services/{service-name}
poetry check --lock # Verify lock file is consistent (older Poetry releases: poetry lock --check)
poetry show --tree # Show dependency tree
Solutions
Lock file out of date:
cd services/{service-name}
poetry lock # Regenerate lock file
poetry install # Install from fresh lock
Shared library version conflict:
The shared library is installed as a path dependency:
aegis-shared = {path = "../../shared", develop = true}
If the shared library’s dependencies conflict with the service’s dependencies, update the shared library’s pyproject.toml first, then re-lock both:
cd shared
poetry lock
cd ../services/{service-name}
poetry lock
poetry install
Virtual environment issues:
If Poetry is using the wrong Python version or a stale virtual environment:
cd services/{service-name}
poetry env remove python
poetry install # Creates a fresh virtualenv
Python Version Mismatches
Symptoms
- SyntaxError on Python 3.12 syntax (e.g., the type statement), or ImportError for newer standard-library names such as StrEnum (added in Python 3.11)
- Poetry refuses to install, citing Python version constraint
- pyenv: version '3.12.x' not installed
Diagnosis
# Check active Python version
python --version
pyenv version
# Check what the repo expects
cat .python-version
Solutions
The repository requires Python 3.12, pinned in .python-version at the repo root.
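The comparison between the running interpreter and the pinned version can be automated. This is a minimal sketch, assuming .python-version holds a plain numeric version such as 3.12 or 3.12.4 on its first line; version_matches and check_repo_python are illustrative names.

```python
import sys
from pathlib import Path


def version_matches(pinned: str, running: tuple[int, ...]) -> bool:
    """True if `running` starts with every numeric component of `pinned`.

    "3.12" matches (3, 12, 4); "3.12.1" does not match (3, 12, 4).
    """
    wanted = tuple(int(part) for part in pinned.strip().split("."))
    return running[: len(wanted)] == wanted


def check_repo_python(repo_root: str = ".") -> bool:
    """Compare the interpreter running this script with .python-version."""
    pinned = Path(repo_root, ".python-version").read_text().splitlines()[0]
    return version_matches(pinned, tuple(sys.version_info[:3]))
```

Running check_repo_python() with the same python that Poetry would pick up shows whether the shell's active interpreter satisfies the pin.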
Install the correct Python version:
pyenv install 3.12
pyenv rehash
Ensure pyenv is active:
# Verify pyenv is managing the Python version
which python
# Should show something like: /Users/you/.pyenv/shims/python
# If not, ensure pyenv is in your shell config
eval "$(pyenv init -)"
Poetry using the wrong Python:
When virtualenvs.prefer-active-python is true, Poetry creates virtual environments with the Python that is currently active in your shell. Enable it:
poetry config virtualenvs.prefer-active-python true
Then re-create the virtual environment:
cd services/{service-name}
poetry env remove python
poetry install
API Gateway Won’t Compile or Start
Symptoms
- go: command not found
- Go compilation errors
- Gateway starts but returns 502 on all requests
Diagnosis
# Check Go installation
go version
# Expected: go1.21 or higher
# Try building manually
cd services/api-gateway
go build ./cmd/gateway/
Solutions
Go not installed:
Install Go 1.21+ from golang.org or via your package manager.
Compilation errors:
cd services/api-gateway
go mod tidy # Clean up module dependencies
go build ./cmd/gateway/
502 errors on all routes:
The gateway proxies to backend services. If backends are not running, all requests return 502. Start all services first:
./infrastructure/scripts/start-all.sh
JWT Authentication Failures
Symptoms
- 401 Unauthorized on all gateway requests
- invalid token or token expired errors
- Login succeeds but subsequent requests fail
Diagnosis
# Get a fresh token
curl -s -X POST http://localhost:8000/api/v1/auth/token \
-H "Content-Type: application/json" \
-d '{"email": "admin@aegis.local", "password": "aegis-dev-admin"}' | python3 -m json.tool
# Validate the token
TOKEN="..."
curl -s -X POST http://localhost:8009/auth/validate \
-G --data-urlencode "authorization=Bearer $TOKEN" | python3 -m json.tool
Solutions
Auth service not running:
The gateway validates every request through the auth service on port 8009. If it is down, all authenticated requests fail:
cd services/auth-service
poetry run uvicorn auth_service.main:app --port 8009
JWT_SECRET not set or mismatched:
The auth service and gateway must share the same JWT_SECRET. Check your .env file.
Invalid email or password at login:
The auth service returns 401 Invalid email or password if the credentials are wrong or the user is inactive. For local development, log in with the seeded bootstrap admin (admin@aegis.local / aegis-dev-admin). To create or rotate a user:
cd services/auth-service
poetry run python -m auth_service.create_user --email alice@example.com --roles operator
SSE/EventSource requests fail with 401:
Browser EventSource cannot set the Authorization header, so the gateway falls back to the aegis_token cookie. Make sure you logged in through the frontend (which sets the cookie) rather than only holding a token in memory.
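The header-then-cookie fallback described here can be illustrated in a few lines. This is a sketch of the lookup order only, written in Python for readability; it is not the gateway's implementation, extract_token is a hypothetical name, and the headers mapping is assumed to use the exact keys Authorization and Cookie.

```python
from http.cookies import SimpleCookie
from typing import Mapping, Optional


def extract_token(headers: Mapping[str, str]) -> Optional[str]:
    """Prefer the Authorization header, then fall back to the aegis_token cookie."""
    auth = headers.get("Authorization", "")
    scheme, _, credentials = auth.partition(" ")
    if scheme.lower() == "bearer" and credentials.strip():
        return credentials.strip()

    # EventSource requests carry no Authorization header, only cookies.
    cookies = SimpleCookie()
    cookies.load(headers.get("Cookie", ""))
    morsel = cookies.get("aegis_token")
    return morsel.value if morsel is not None else None
```

A request that yields None under this logic has neither credential, which corresponds to the 401 seen on SSE endpoints when the cookie was never set.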