SAIFE

Risk Categories

SAIFE’s working risk taxonomy: these categories anchor detection, dashboards, and simulators.

Misinformation & Hallucination

False or fabricated outputs that can mislead users or the public.

Examples: Fake citations · Fabricated facts · Synthetic news

Privacy & Data Leakage

Exposing personal or sensitive data in prompts, outputs, or logs.

Examples: PII leakage · Training data exposure · Prompt injection exfiltration

Bias & Discrimination

Unfair treatment or stereotyping across protected attributes.

Examples: Toxic outputs · Unequal performance · Stereotyped recommendations

Safety & Harmful Content

Outputs that enable or encourage harm to self or others.

Examples: Violence instructions · Self-harm content · Illegal activity guidance

Security & Abuse

Model/system exploits, jailbreaking, or unauthorized access.

Examples: Prompt injection · RCE chains · Account takeover patterns

IP & Copyright

Infringing content or non-transformative reproduction.

Examples: Verbatim outputs · Style cloning · Trademark misuse

Fraud & Impersonation

Deceptive identity or financial abuse via AI systems.

Examples: Deepfake scams · Phishing text · Synthetic voice spoofing

Child Safety

Any material or behavior that endangers minors.

Examples: Sexualized content · Grooming patterns · Exploitative imagery

Medical & Legal Advice

Unqualified guidance in high-stakes domains.

Examples: Unsafe dosing advice · Unauthorized legal counsel

Political Manipulation

Coordinated influence or civic process interference.

Examples: Astroturfing · Targeted persuasion · Voter suppression content
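Because the taxonomy anchors detection and dashboards, it helps to have it in machine-readable form. The sketch below encodes the categories and examples above as a plain Python mapping, plus a small lookup helper; the dictionary shape and the `categories_for` function are illustrative assumptions, not SAIFE's actual API.

```python
# SAIFE risk taxonomy, encoded as a simple mapping of category name
# to its listed examples. Names and examples are taken verbatim from
# the taxonomy above; the structure itself is an illustrative sketch.
TAXONOMY: dict[str, tuple[str, ...]] = {
    "Misinformation & Hallucination": (
        "Fake citations", "Fabricated facts", "Synthetic news"),
    "Privacy & Data Leakage": (
        "PII leakage", "Training data exposure",
        "Prompt injection exfiltration"),
    "Bias & Discrimination": (
        "Toxic outputs", "Unequal performance",
        "Stereotyped recommendations"),
    "Safety & Harmful Content": (
        "Violence instructions", "Self-harm content",
        "Illegal activity guidance"),
    "Security & Abuse": (
        "Prompt injection", "RCE chains", "Account takeover patterns"),
    "IP & Copyright": (
        "Verbatim outputs", "Style cloning", "Trademark misuse"),
    "Fraud & Impersonation": (
        "Deepfake scams", "Phishing text", "Synthetic voice spoofing"),
    "Child Safety": (
        "Sexualized content", "Grooming patterns", "Exploitative imagery"),
    "Medical & Legal Advice": (
        "Unsafe dosing advice", "Unauthorized legal counsel"),
    "Political Manipulation": (
        "Astroturfing", "Targeted persuasion",
        "Voter suppression content"),
}

def categories_for(example: str) -> list[str]:
    """Return every category whose example list contains `example`
    (case-insensitive exact match). Hypothetical helper for tagging."""
    needle = example.lower()
    return [cat for cat, examples in TAXONOMY.items()
            if any(needle == e.lower() for e in examples)]
```

Note that an exact-match lookup keeps overlapping terms distinct: "Prompt injection" maps only to Security & Abuse, while "Prompt injection exfiltration" maps only to Privacy & Data Leakage.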