

Page history last edited by Mike 3 weeks, 5 days ago

Idea Stock Exchange: Engineering Specification

Audience: AI assistants and human contributors working on this repository.
Purpose: Give you enough context to contribute code without duplicating or contradicting work that already exists.
Status: Living document. Edit it when you learn something new or disagree with something in it.


What this system is, in one paragraph

The Idea Stock Exchange is a scoring engine for arguments. Given a belief ("carbon taxes reduce emissions"), users post pro and con arguments, each backed by evidence and linked to sub-arguments. The system computes a score for the belief by recursively evaluating the arguments, weighted by evidence quality, logical linkage, importance, and uniqueness. The score is not a vote tally. The number of humans who supported a position does not appear in the formula anywhere. The score is a function of the argument structure and the evidence attached to it.

The system is not a social network, a debate forum, or a prediction market. It is a computation over a graph. A separate prediction-market layer exists but does not feed into the scoring engine — the two subsystems are deliberately independent.


What's already built

Before writing new code, check these. You will probably find that your planned feature overlaps with existing work.

Core scoring engine (TypeScript):

  • src/core/scoring/scoring-engine.ts — ReasonRank, a PageRank-inspired algorithm with two score channels (pro/con). Damping factor d = 0.85 matches PageRank.
  • src/lib/propagate-belief-scores.ts — Upward propagation through the belief dependency graph with cycle detection via a visited-set.
  • tests/unit/core/scoring/scoring-engine.test.ts — Tests with expected values. Read these to understand the math before changing it.

Argument tree and Python scorer:

  • pipeline/scoring/reason_rank.py — Python implementation of the same scoring logic.
  • pipeline/models/belief_node.py — Data model for nodes in the argument tree.

Objective criteria scoring (four-dimension scoring for evaluating metrics):

  • backend/algorithms/scoring.py — Geometric mean for argument weight, logistic function for dimension scores.
  • ObjectiveCriteriaScoringAlgorithm.md — Full spec.

Duplication detection:

  • backend/algorithms/duplication_scoring.py — Three-layer similarity: mechanical equivalence, semantic embedding overlap, community verification.

Semantic overlap / similarity:

  • src/core/ai/overlap_engine.py — Navigation-based overlap signal.
  • src/core/ai/demo.py — Working demo of auto-clustering similar statements.

Frontend:

  • src/app/faq/page.tsx — FAQ with the user-facing explanation of the system.
  • src/features/epistemology/components/ProtocolDashboard.tsx — Live adversarial attack simulation.
  • src/app/algorithms/belief-equivalency/ — Layer 1 equivalency calculator.

If your planned work touches any of these, read the existing file first. Duplicating it is wasted effort. Contradicting it without explanation is worse.


Core invariants (do not break these)

These are the load-bearing properties of the system. Code that violates them is wrong, even if it compiles.

  1. The score is not a vote. No formula may include "number of users who agreed" or any proxy for it. Score is a function of argument quality and evidence, period.

  2. Author identity is orthogonal to final score. Reputation may affect speed of propagation or spam friction, but it must not appear in the final score formula. A novice posting a well-evidenced argument with tight linkage must produce the same score as an expert posting the same argument.

  3. Arguments and evidence are distinct objects that fail differently. An argument is a logical claim. Evidence is an empirical data point. They are scored separately and combined in a defined way. Do not conflate them.

  4. Fallacy accusations are arguments, not tags. There is no "click to apply fallacy penalty" mechanism. A fallacy accusation is a structured argument node with required fields (which fallacy, what was misrepresented, evidence of the misrepresentation). It must win its own sub-debate before affecting the target argument's score. See src/features/epistemology/ and FAQ Q5.

  5. Linkage is independent of truth. An argument can be true but irrelevant. The linkage score multiplies the argument's contribution. If linkage is zero, the argument contributes zero regardless of how true it is.

  6. Redundancy is compressed, not counted. Ten thousand semantically identical arguments produce one argument node's worth of score, not ten thousand. See backend/algorithms/duplication_scoring.py.

  7. The ReasonRank engine and the Market Price layer are independent. If you find yourself wiring market prices into the argument-scoring formula, stop and re-read this spec. The FAQ explicitly commits to this separation and the architecture must respect it.

  8. Updates propagate automatically. When a piece of evidence changes, every argument that uses it recalculates. When an argument's score changes, every belief that depends on it recalculates. This is the core promise — see src/lib/propagate-belief-scores.ts.
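Invariant 8 can be illustrated with a minimal sketch of upward propagation guarded by a visited-set. This is not the code in src/lib/propagate-belief-scores.ts; the function name, the `parents_of` dictionary shape, and the `recompute` callback are assumptions made for illustration.

```python
from collections import deque


def propagate_upward(changed_id, parents_of, recompute):
    """Recompute every ancestor of changed_id exactly once, breadth-first.

    parents_of: dict mapping a node id to the ids of nodes that depend on it
                (hypothetical shape, not the repository's data model).
    recompute:  callback invoked once per affected ancestor.
    Returns the ancestors in the order they were recomputed.
    """
    visited = {changed_id}
    queue = deque([changed_id])
    order = []
    while queue:
        node = queue.popleft()
        for parent in parents_of.get(node, []):
            if parent in visited:
                continue  # already scheduled, or an edge that closes a cycle
            visited.add(parent)
            recompute(parent)
            order.append(parent)
            queue.append(parent)
    return order


# Evidence E1 backs arguments A1 and A2, both feed belief B,
# and B loops back to A1 to form a cycle.
parents_of = {"E1": ["A1", "A2"], "A1": ["B"], "A2": ["B"], "B": ["A1"]}
recomputed = []
order = propagate_upward("E1", parents_of, recomputed.append)
print(order)  # ['A1', 'A2', 'B']
```

The visited-set only guarantees termination. A single breadth-first pass may visit a parent before all of its children have been refreshed, so a real implementation would likely need an ordering step or repeated passes until scores stop changing.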


The ReasonRank formula

This is the math. It is implemented. If you want to change it, open an issue first — don't silently replace it.

For a single argument node:

RR(arg) = (1 − d) × baseTruth(arg) + d × f(proSubRank − conSubRank)

Where:

  • d = 0.85 (damping factor, identical to PageRank)
  • baseTruth(arg) = argument's own truth score after fallacy penalties
  • proSubRank = Σ RR(sub) × linkage × importance × uniqueness for each pro sub-argument
  • conSubRank = Σ RR(sub) × linkage × importance × uniqueness for each con sub-argument
  • f(net) = normalizes net sub-argument support to [0, 1]

For a leaf argument (no sub-arguments), the formula collapses to RR = baseTruth.
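As a rough illustration of the node formula, here is a small recursive sketch for acyclic trees. The spec does not define f, so the clamped linear map in `normalize` is an assumption, as are the `Arg` and `Sub` class names; the leaf rule is applied as an explicit special case. The implemented version lives in src/core/scoring/scoring-engine.ts and may differ.

```python
from dataclasses import dataclass, field

D = 0.85  # damping factor from the spec


@dataclass
class Sub:
    """A sub-argument together with the three edge scores (hypothetical type)."""
    node: "Arg"
    linkage: float
    importance: float
    uniqueness: float


@dataclass
class Arg:
    base_truth: float
    pro: list = field(default_factory=list)
    con: list = field(default_factory=list)


def normalize(net):
    # ASSUMPTION: the spec only says f maps net support into [0, 1].
    # A clamped linear map is used here purely for illustration.
    return min(1.0, max(0.0, 0.5 + net / 2.0))


def sub_rank(subs):
    # Sum of RR(sub) x linkage x importance x uniqueness.
    return sum(reason_rank(s.node) * s.linkage * s.importance * s.uniqueness
               for s in subs)


def reason_rank(arg):
    if not arg.pro and not arg.con:
        return arg.base_truth  # leaf rule: RR = baseTruth
    net = sub_rank(arg.pro) - sub_rank(arg.con)
    return (1 - D) * arg.base_truth + D * normalize(net)


parent = Arg(
    base_truth=0.7,
    pro=[Sub(Arg(0.8), linkage=1.0, importance=0.5, uniqueness=1.0)],  # 0.4
    con=[Sub(Arg(0.6), linkage=0.5, importance=1.0, uniqueness=1.0)],  # 0.3
)
score = reason_rank(parent)  # 0.15 * 0.7 + 0.85 * 0.55, about 0.5725
```

Plain recursion like this does not terminate on cyclic graphs; it is only meant to make the arithmetic concrete.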

For a belief (the root of an argument tree):

ProRank  = Σ RR(arg) × linkage × importance × uniqueness   (over pro arguments)
ConRank  = Σ RR(arg) × linkage × importance × uniqueness   (over con arguments)
TruthScore = ProRank / (ProRank + ConRank)

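TruthScore is in [0, 1]. It can be read as the probability that a random walker following the argument graph lands on a pro-conclusion. When a belief has no arguments, ProRank + ConRank is zero and the ratio is undefined; the current code defaults to 0.5 (see open question 2).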
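A worked example of the belief-level formula, as a minimal sketch. The tuple layout and function names are assumptions for illustration; the 0.5 default for a belief with no arguments follows the behaviour described under open question 2.

```python
def weighted_rank(args):
    """Sum RR x linkage x importance x uniqueness.

    args: iterable of (rr, linkage, importance, uniqueness) tuples
          (hypothetical layout chosen for this example).
    """
    return sum(rr * linkage * importance * uniqueness
               for rr, linkage, importance, uniqueness in args)


def truth_score(pro_args, con_args, default=0.5):
    pro_rank = weighted_rank(pro_args)
    con_rank = weighted_rank(con_args)
    total = pro_rank + con_rank
    if total == 0:
        return default  # no scored arguments on either side
    return pro_rank / total


pro = [(0.8, 0.9, 0.5, 1.0), (0.6, 0.5, 0.5, 1.0)]  # 0.36 + 0.15 = 0.51
con = [(0.7, 1.0, 0.7, 1.0)]                        # 0.49
score = truth_score(pro, con)                       # about 0.51
```

Note that the number of arguments on each side never enters the calculation directly; only their weighted ranks do.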

What this design gives you that naive alternatives don't

  • No depth penalty. A well-reasoned argument three levels deep is not arbitrarily down-weighted. It propagates up through its parent naturally, just like PageRank.
  • Cycles handled correctly. Iterative convergence instead of hard recursion limits. If A supports B and B supports A, the scores converge rather than crashing.
  • Bounded scores without tanh saturation. The ProRank / (ProRank + ConRank) form keeps scores in [0, 1] without the hard ceiling problem that tanh introduces at extreme values.
  • Evidence is additive, not multiplicative at the top level. One piece of evidence does not dominate the entire belief; it contributes proportionally.
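The cycle-handling claim above can be sketched with a fixed-point iteration on two arguments that support each other. This is an illustration under assumptions, not the repository's algorithm: the clamped linear `normalize` stands in for the unspecified f, only pro edges are modelled, and the `converge` helper and its tolerance are hypothetical.

```python
D = 0.85  # damping factor from the spec


def normalize(net):
    # ASSUMPTION: stand-in for the spec's unspecified f(net).
    return min(1.0, max(0.0, 0.5 + net / 2.0))


def converge(base_truth, pro_edges, tol=1e-9, max_iter=1000):
    """Iterate RR = (1 - d) * baseTruth + d * f(proSubRank) until stable.

    base_truth: dict node -> baseTruth
    pro_edges:  dict parent -> list of (child, weight), where weight is
                linkage x importance x uniqueness already multiplied out.
    Returns (scores, iterations_used).
    """
    rr = dict(base_truth)  # start every node at its own base truth
    for iteration in range(1, max_iter + 1):
        updated = {}
        for node, base in base_truth.items():
            subs = pro_edges.get(node, [])
            if not subs:
                updated[node] = base  # leaf rule
                continue
            net = sum(rr[child] * weight for child, weight in subs)
            updated[node] = (1 - D) * base + D * normalize(net)
        delta = max(abs(updated[n] - rr[n]) for n in rr)
        rr = updated
        if delta < tol:
            return rr, iteration
    raise RuntimeError("scores did not converge")


# A supports B and B supports A: a two-node cycle.
base = {"A": 0.8, "B": 0.6}
edges = {"A": [("B", 0.9)], "B": [("A", 0.9)]}
scores, iterations = converge(base, edges)
```

With this choice of f each update shrinks the difference between successive iterates, so the loop settles on a stable pair of scores instead of recursing forever.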

Argument weight (for objective criteria scoring)

Where three independent quality metrics must combine (evidence quality, logical validity, importance), use the geometric mean:

weight = (evidence_quality × logical_validity × importance)^(1/3)

Rationale: if any one factor is near zero, the overall weight drops sharply. This prevents a well-evidenced but logically fallacious argument from dominating, and it prevents a logically tight but evidence-free argument from dominating. All three have to be reasonable for the weight to be meaningful.

See backend/algorithms/scoring.py for the implementation.
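To make the rationale concrete, here is a minimal sketch of the geometric-mean weight next to an arithmetic mean for comparison. The function name and the range check are assumptions; backend/algorithms/scoring.py remains the reference implementation.

```python
def argument_weight(evidence_quality, logical_validity, importance):
    """Geometric mean of three factors, each expected in [0, 1]."""
    factors = (evidence_quality, logical_validity, importance)
    for value in factors:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"factor out of range: {value}")
    return (evidence_quality * logical_validity * importance) ** (1 / 3)


balanced = argument_weight(0.7, 0.7, 0.7)    # about 0.70
lopsided = argument_weight(0.9, 0.9, 0.01)   # about 0.20
arithmetic = (0.9 + 0.9 + 0.01) / 3          # about 0.60, for comparison
```

An argument with strong evidence and logic but negligible importance keeps most of its weight under an arithmetic mean, while the geometric mean pulls it down to roughly a fifth.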


Data model

The database is PostgreSQL. Semantic-similarity queries use vector embeddings stored alongside the text.

Nodes

Every argument, belief, assumption, and piece of evidence is a node.

Node:
  id                 UUID, primary key
  content            TEXT
  content_hash       SHA-256 of normalized text  (catches exact duplicates only)
  vector_embedding   FLOAT[] (for semantic similarity)
  node_type          ENUM: BELIEF | ARGUMENT | ASSUMPTION | EVIDENCE
  author_id          UUID, foreign key to users
  created_at         TIMESTAMP
  knowability_tag    ENUM: SETTLED_FACT | CONSENSUS_SCIENCE | EXPERT_JUDGMENT |
                          SPECULATION | INHERENTLY_UNCERTAIN | UNKNOWABLE

content_hash catches only exact text duplicates. It is not sufficient for duplication detection. Real dedup uses the three-layer system in backend/algorithms/duplication_scoring.py: hash first (cheap), then embedding similarity (medium), then community verification for edge cases (expensive).
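The cheap-to-expensive ordering of the three layers might look like the sketch below. The normalization rule (lowercase, collapsed whitespace), both thresholds, and the returned labels are assumptions for illustration; the real logic is in backend/algorithms/duplication_scoring.py.

```python
import hashlib
import math
import re


def normalize_text(text):
    # ASSUMPTION: the spec does not define "normalized text".
    return re.sub(r"\s+", " ", text.strip().lower())


def content_hash(text):
    return hashlib.sha256(normalize_text(text).encode("utf-8")).hexdigest()


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify_duplicate(new_text, new_vec, old_text, old_vec,
                       semantic_threshold=0.9, review_threshold=0.75):
    """Return 'exact', 'semantic', 'needs_review', or 'distinct'.

    Both thresholds are hypothetical placeholders, not values from the spec.
    """
    # Layer 1 (cheap): exact match on the normalized hash.
    if content_hash(new_text) == content_hash(old_text):
        return "exact"
    # Layer 2 (medium): embedding similarity.
    similarity = cosine_similarity(new_vec, old_vec)
    if similarity >= semantic_threshold:
        return "semantic"
    # Layer 3 (expensive): hand borderline cases to community verification.
    if similarity >= review_threshold:
        return "needs_review"
    return "distinct"
```

For example, "Carbon  taxes reduce emissions" and "carbon taxes reduce emissions " hash identically under this normalization and are caught at layer 1 without touching the embeddings.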

Edges

Edges are directed and typed. They carry the linkage score and the relationship.

Edge:
  id                 UUID, primary key
  parent_id          UUID, foreign key to Node
  child_id           UUID, foreign key to Node
  relationship       ENUM: SUPPORTS | REFUTES | ASSUMES | EVIDENCES
  linkage_score      FLOAT in [0, 1]
  importance_score   FLOAT in [0, 1]
  uniqueness_score   FLOAT in [0, 1] (1 = fully unique, 0 = identical to another sibling)
  created_at         TIMESTAMP

The product of the three scalar scores on each edge is the "edge weight", analogous to PageRank's 1/C(T). The three scores are computed separately:

  • linkage_score: how strongly the child applies to the parent. Set by LinkageDebate sub-markets: for each edge, a mini pro/con debate on whether the argument actually applies.
  • importance_score: how much weight this consideration carries in the parent's overall evaluation.
  • uniqueness_score: 1 minus the semantic similarity to other sibling arguments under the same parent. Prevents redundancy inflation.
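A minimal sketch of the edge record and the uniqueness calculation follows. Plain strings stand in for UUIDs, and taking the maximum similarity over siblings (rather than, say, the mean) is an assumption, since the spec does not say how several siblings combine.

```python
import math
from dataclasses import dataclass


@dataclass
class Edge:
    parent_id: str
    child_id: str
    relationship: str  # SUPPORTS | REFUTES | ASSUMES | EVIDENCES
    linkage_score: float
    importance_score: float
    uniqueness_score: float

    def __post_init__(self):
        for name in ("linkage_score", "importance_score", "uniqueness_score"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")

    @property
    def weight(self):
        # Edge weight used in the rank sums: the product of the three scores.
        return self.linkage_score * self.importance_score * self.uniqueness_score


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def uniqueness(embedding, sibling_embeddings):
    """1 minus similarity to the closest sibling (ASSUMPTION: max over siblings)."""
    if not sibling_embeddings:
        return 1.0  # nothing to be redundant with
    closest = max(cosine_similarity(embedding, s) for s in sibling_embeddings)
    return 1.0 - min(1.0, max(0.0, closest))  # clamp negative similarities to 0


edge = Edge("belief-1", "arg-1", "SUPPORTS",
            linkage_score=0.0, importance_score=1.0, uniqueness_score=1.0)
print(edge.weight)  # 0.0: zero linkage removes the contribution entirely
```

An exact copy of an existing sibling gets uniqueness 0 and therefore an edge weight of 0, which is how redundancy stops inflating a side's rank.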

Score snapshots (for prediction-market settlement)

Prediction markets need a defined settlement value. At the end of each epoch (month), the system snapshots scores for every node so contracts can settle.

EpochSnapshot:
  id                 UUID, primary key
  node_id            UUID, foreign key to Node
  epoch_date         DATE
  truth_score        FLOAT in [0, 1]
  confidence_band    FLOAT (width of confidence interval)

Snapshots are immutable. Contracts written against a node at epoch N settle against the EpochSnapshot for epoch N, even if the live score moves afterward.
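Immutability and settlement lookup could be sketched as below. The `SnapshotStore` class and its in-memory dictionary are hypothetical stand-ins for the PostgreSQL table, and string ids replace UUIDs for brevity.

```python
from dataclasses import dataclass, FrozenInstanceError
from datetime import date


@dataclass(frozen=True)
class EpochSnapshot:
    node_id: str
    epoch_date: date
    truth_score: float
    confidence_band: float


class SnapshotStore:
    """In-memory stand-in for the snapshot table (hypothetical)."""

    def __init__(self):
        self._by_key = {}

    def record(self, snapshot):
        key = (snapshot.node_id, snapshot.epoch_date)
        if key in self._by_key:
            # A node gets exactly one snapshot per epoch; never overwrite it.
            raise ValueError("snapshot already recorded for this epoch")
        self._by_key[key] = snapshot

    def settlement_value(self, node_id, epoch_date):
        # Contracts settle against the stored value, not the live score.
        return self._by_key[(node_id, epoch_date)].truth_score


store = SnapshotStore()
snap = EpochSnapshot("belief-1", date(2025, 1, 31), 0.62, 0.08)
store.record(snap)
settled = store.settlement_value("belief-1", date(2025, 1, 31))  # 0.62
```

In the database the same guarantee would more likely come from a unique constraint on (node_id, epoch_date) plus a policy of never issuing updates against the table.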


Reputation (preliminary — see open question below)

Reputation affects propagation speed and spam friction, not score. A reputation page needs its own dedicated spec; this section is a placeholder.

What we know so far:

  • Each user accumulates a track record of argument quality, fallacy-accusation accuracy, and prediction accuracy.
  • Fallacy-accusation accuracy is tracked separately by target ideology (left / right / neutral) to catch the tribal pattern where a user only finds fallacies in arguments they disagree with.
  • A user's average fallacy-accusation score over time feeds into how quickly their future accusations propagate through the graph, not whether they are ultimately applied.
  • Reputation must never appear in the final argument-score formula. (See invariant 2 above.)

Open question for the reputation spec: what exactly happens when a user's fallacy accusation fails? No automatic credit-slashing — that creates the attack vector where a coordinated group financially destroys opponents by winning fallacy accusations against them. A failed accusation should lower the user's fallacy-accusation-accuracy score over time, which lowers how fast their future accusations propagate, but not penalize them with a binary score hit. The exact math needs its own document.


Anti-manipulation properties

Every attack vector collapses into one of five categories, and each has a structural defense. See FAQ Q4 for the user-facing version; here are the technical implementations.

| Attack | Defense | Implementation |
|---|---|---|
| Volume brigading | Duplicate compression | backend/algorithms/duplication_scoring.py (three-layer similarity) |
| Appeal to authority | Author identity orthogonal to score | Invariant 2 above; enforced by formula never referencing author_id |
| False fallacy tagging | Accusations are themselves scored arguments | Fallacy node type with required fields; goes through same scoring pipeline |
| Spam flooding | Weak arguments drag down parent's average | Geometric mean weight ensures low-quality factors pull weight down |
| Bot farms | Indifferent to authorship | Same defense as spam; no bot-detection arms race needed |

The key design decision: we do not try to detect bots or coordinated attackers. That's an arms race we can't win. We make authorship irrelevant to scoring, so detection isn't necessary.
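A toy illustration of why volume and authorship do not move the score is sketched below. It covers only exact-text compression (the first of the three layers); the dictionary layout, the placeholder argument texts, and the `contribution` values are made up for the example.

```python
import hashlib
import re


def content_key(text):
    # ASSUMPTION: lowercase and collapse whitespace before hashing.
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def compress(submissions):
    """Keep one submission per distinct content key (first one wins)."""
    unique = {}
    for submission in submissions:
        unique.setdefault(content_key(submission["content"]), submission)
    return list(unique.values())


def side_rank(submissions):
    # author_id is never read: authorship cannot influence the result.
    return sum(s["contribution"] for s in compress(submissions))


single = [{"author_id": "user-1", "content": "Placeholder con argument.",
           "contribution": 0.05}]
brigade = [{"author_id": f"bot-{i}", "content": "Placeholder con argument.",
            "contribution": 0.05} for i in range(10_000)]

print(side_rank(single) == side_rank(brigade))  # True
```

Ten thousand copies from ten thousand accounts produce the same rank as one copy from one account, so there is nothing to gain from detecting who posted them.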


What the system does NOT need

This section exists because other specs have proposed features that conflict with the system's design. Avoid these.

  • No "downvote" button, no "disagree" click, no reaction emoji tied to scoring. If a user wants to express disagreement, they post a con argument.
  • No tanh bounding in the scoring formula. It saturates hard at extreme values and caps the influence of overwhelming evidence. Use the ProRank / (ProRank + ConRank) form instead.
  • No hardcoded depth limit. Use iterative convergence with a visited-set for cycle detection (see src/lib/propagate-belief-scores.ts).
  • No market-price input into the scoring engine. The prediction market is a separate subsystem. Market data never feeds argument scores.
  • No automatic credit-slashing on losing a fallacy accusation. See the reputation section.
  • No free-text-only UI, and no JSON-only forms either. Users should write in natural language, and the system extracts structured content from it with LLM assistance. Free text that is never structured cannot be scored, and pure JSON input forms are not viable for user adoption.
  • No "trust the expert" override. An expert's argument goes through the same pipeline as anyone else's. Credentials can affect propagation speed, never final score.

MVP scope

Shipping the whole system in 90 days is not realistic. Here's an honest scope that one or two engineers can actually complete.

Phase 1: Scoring engine, working end-to-end (weeks 1–6)

  • [x] ReasonRank implemented in TypeScript and Python (already done)
  • [x] Belief-score propagation with cycle detection (already done)
  • [x] Objective criteria scoring (already done)
  • [ ] Integration tests covering the invariants above
  • [ ] API endpoints for creating nodes, edges, and retrieving scores
  • [ ] Basic UI for viewing a belief page with pro/con columns sorted by score

Phase 2: Duplication and linkage (weeks 7–12)

  • [x] Three-layer duplication detection (already done at module level)
  • [ ] UI integration: when a user submits an argument, the system flags likely duplicates and asks whether to append to existing node
  • [ ] Linkage debate sub-markets: for each edge, a mini pro/con on "does this argument actually apply here"
  • [ ] Automated propagation when evidence or linkage changes

Phase 3: Epoch snapshots and basic market layer (weeks 13–18)

  • [ ] Monthly cron job that snapshots the graph
  • [ ] Simple prediction-market UI (play money) for betting on snapshot values
  • [ ] Market Price and ReasonRank displayed side-by-side with the gap visible
  • [ ] No real money, no LMSR complexity yet — a simple order book is fine

Phase 4: Reputation (weeks 19–24)

  • [ ] Fallacy-accusation accuracy tracking
  • [ ] Cross-partisan calibration
  • [ ] Propagation-speed modulation based on track record
  • [ ] User-facing reputation page (read-only at first)

Each phase produces a shippable increment. Don't skip ahead. Phase 4 reputation depends on Phase 1's scoring being correct; Phase 3's market depends on Phase 2's propagation being reliable.


Contributing

If you're an AI assistant or a human contributor reading this to help build the system:

  1. Read before writing. Check the existing code in src/core/scoring/, pipeline/scoring/, and backend/algorithms/. The algorithm is already implemented. Your job is almost never to rewrite it.

  2. Match the existing style. TypeScript for the web app and scoring engine. Python for the pipeline and ML-adjacent code. PostgreSQL for storage. Don't introduce a new language or database without a good reason.

  3. Write tests before merging. The scoring engine has tests in tests/unit/core/scoring/. New scoring behavior needs new tests with expected values computed by hand.

  4. Don't break the invariants. If your PR violates one of the eight invariants in the "Core invariants" section above, it will be rejected. If you think an invariant is wrong, open an issue to discuss; don't silently violate it.

  5. Don't invent new math. The formulas in this doc are the formulas. If you believe a better one exists, write up the proposal and open an issue. Silently swapping algorithms creates debugging nightmares.

  6. Flag what you're uncertain about. If your implementation of a feature makes assumptions that could be wrong, add a TODO comment in the code or open a GitHub issue explaining the assumption. The worst PR is one that hides its trade-offs.


Open questions

These are unresolved. If you have thoughts, open an issue.

  1. Reputation math details. What exactly is the function from fallacy-accusation track record to propagation-speed multiplier? Needs its own page and its own math.

  2. Cold-start problem. How does a new belief with no arguments get an initial score? Current code defaults to 0.5 (fully uncertain). Is that right, or should it be "undefined / insufficient data"?

  3. Evidence retraction cascades. When a Tier 1 study gets retracted, how aggressively do downstream scores update? All at once? Damped over time? Instantaneous recalculation is simplest but may cause whiplash in the UI.

  4. LLM-generated argument handling. The system is indifferent to authorship, but we probably want to display whether an argument was AI-generated, even if it doesn't affect scoring. UI question, not scoring question.

  5. Natural-language input pipeline. What's the best way to let users write prose and have the system extract structured arguments, evidence claims, and linkage proposals? This is the single biggest UX question and it's mostly unspecified.

  6. Multi-language support. The semantic similarity engine is English-only right now. Non-English arguments should work eventually.

