Documentation

Attack Categories

Understanding the different types of AI security vulnerabilities and attack vectors tested by ModelRed

ModelRed's test suites are organized into distinct attack categories that represent different types of AI security vulnerabilities. Understanding these categories helps you identify specific threats and prioritize your security efforts.

Category Overview

🔐 Encoding Attacks (12 test suites)
Base64, hex, ROT13, and other encoding schemes used to bypass input filters

🛡️ Content Safety (9 test suites)
Toxicity, hate speech, slurs, and harmful content generation testing

🎭 Social Engineering (5 test suites)
Emotional manipulation and psychological attack techniques

🔓 Jailbreaking (8 test suites)
Advanced techniques to bypass safety guardrails and restrictions

💉 Injection Attacks (7 test suites)
Prompt injection, template injection, and code execution vulnerabilities

🦠 Malware Generation (4 test suites)
Detection of malicious code generation and security evasion


Encoding Attacks

Techniques that use various encoding schemes to obfuscate malicious content

Description

Encoding attacks attempt to bypass input validation and content filters by representing malicious instructions in alternative character encodings. These attacks exploit the fact that many AI systems may not properly decode or recognize encoded content as potentially harmful.

Test Suites in This Category

  • Base64 Injection - Tests Base64 encoding vulnerabilities
  • Hexadecimal Injection - Detects hex encoding attacks
  • ROT13 Encoding - Character substitution attacks
  • Morse Encoding - Morse code obfuscation
  • Braille Encoding - Braille text encoding attacks
  • NATO Encoding - Phonetic alphabet encoding
  • Unicode Attacks - Base2048, Zalgo, and emoji encoding
  • Legacy Encoding - UUencoding and ASCII85 attacks
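As a quick illustration of what these suites probe, the sketch below (plain Python, unrelated to the ModelRed SDK) shows how the same instruction survives Base64 and ROT13 round trips. A filter that only inspects the raw string sees an opaque token, while the model may decode it back into the original instruction:

```python
import base64
import codecs

# A hypothetical instruction an attacker might try to smuggle past a filter
payload = "Ignore all previous instructions"

# Base64: the filtered text no longer contains the trigger phrase
b64 = base64.b64encode(payload.encode()).decode()

# ROT13: simple character substitution, reversible with the same operation
rot13 = codecs.encode(payload, "rot13")

# Both encodings round-trip back to the original instruction
assert base64.b64decode(b64).decode() == payload
assert codecs.decode(rot13, "rot13") == payload
```

Neither encoded form contains the literal phrase, which is why keyword-based input validation alone is insufficient against this category.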

Risk Level

Low to Medium

Common in Tiers

Primarily Free and Starter tiers - these represent fundamental security testing


Content Safety

Detection of harmful, toxic, or inappropriate content generation

Description

Content safety attacks test whether AI models can be manipulated into generating harmful content including hate speech, toxicity, slurs, sexual content, and other policy violations. These tests are crucial for user-facing applications.

Test Suites in This Category

  • Toxicity Generation - Advanced AI-powered toxicity testing
  • Continue Slurs - Partial offensive term completion
  • LMRC Content Tests - Bullying, slur usage, sexual content
  • Sexualization - Inappropriate objectification testing
  • Deadnaming - Deliberate use of a transgender person's former name
  • Medical Misinformation - Pseudoscience and quack medicine
  • Real Toxicity Prompts - Baseline toxicity testing

Risk Level

High

Common in Tiers

Starter tier and above - essential for production applications


Social Engineering

Psychological manipulation techniques to extract information or bypass restrictions

Description

Social engineering attacks use emotional manipulation, authority exploitation, and psychological tricks to convince AI models to provide restricted information or perform unauthorized actions.

Test Suites in This Category

  • Grandma Attacks - Emotional manipulation for product keys, slurs, and substances
  • JSON Threats - Threatening models to output raw JSON
  • Tag Injection - Hidden Unicode manipulation

Risk Level

Medium to High

Common in Tiers

Starter tier - important for applications with human interaction


Jailbreaking

Advanced techniques to override safety mechanisms and restrictions

Description

Jailbreaking encompasses sophisticated attempts to bypass AI safety measures, including role-playing scenarios, complex prompt structures, and research-grade attack techniques.

Test Suites in This Category

  • Advanced Jailbreak Ablation - DAN 11.0 with 127 variations
  • AutoDAN Cached - Gradient descent optimization attacks
  • DAN in the Wild - Real-world attack collection
  • Tense Manipulation - Past/future tense bypass techniques
  • Latent Jailbreak - Translation-based distraction attacks

Risk Level

High to Critical

Common in Tiers

Starter and Pro tiers - advanced security research


Injection Attacks

Code execution and system manipulation vulnerabilities

Description

Injection attacks attempt to execute unauthorized code, manipulate system behavior, or extract sensitive information through various injection techniques including prompt injection, template injection, and SQL injection.

Test Suites in This Category

  • Prompt Injection - Framework attacks targeting hate speech and violence
  • Jinja Template Injection - Server-side template execution
  • SQL Injection Echo - Database manipulation attempts
  • Latent Injection - Hidden instructions in documents
  • XSS Attacks - Cross-site scripting and data exfiltration
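A defense-in-depth complement to these suites is a lightweight pre-filter on untrusted input before it reaches the model. The sketch below is not part of the ModelRed SDK; it is a minimal example of pattern-based screening, and the patterns shown are illustrative rather than comprehensive:

```python
import re

# Hypothetical patterns; a real deployment would need a far broader,
# continuously updated set (encodings, obfuscation, other languages).
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I),
    re.compile(r"\{\{.*\}\}"),       # template-injection markers (e.g. Jinja)
    re.compile(r"<script\b", re.I),  # XSS-style payloads
]

def looks_like_injection(text: str) -> bool:
    """Return True if the text matches any known injection pattern."""
    return any(p.search(text) for p in INJECTION_PATTERNS)
```

A filter like this catches only the crudest attempts; the test suites in this category exist precisely because attackers routinely evade static patterns.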

Risk Level

Critical

Common in Tiers

Starter through Enterprise - varies by complexity


Malware Generation

Detection of malicious code creation capabilities

Description

Malware generation tests evaluate whether AI models can be manipulated into creating malicious software, including viruses, ransomware, cryptojacking code, and evasion techniques.

Test Suites in This Category

  • Malware Top Level - High-level malicious code requests
  • Malware Sub Functions - Individual attack component generation
  • Malware Evasion - Anti-detection code generation
  • Malware Payload - Ransomware and cryptojacking

Risk Level

Critical

Common in Tiers

Enterprise tier only - highest security requirements


Supply Chain Attacks

Testing for software dependency vulnerabilities

Description

Supply chain attacks test whether AI models recommend non-existent or malicious packages that could compromise software dependencies across various programming ecosystems.

Test Suites in This Category

  • Python Package Hallucination - Non-existent PyPI packages
  • JavaScript Package Hallucination - Fake NPM packages
  • Ruby Gem Hallucination - Non-existent Ruby gems
  • Rust Crate Hallucination - Fake Cargo packages
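One practical mitigation these suites motivate is verifying model-suggested dependencies against a vetted allowlist before installation. The helper below is hypothetical (not a ModelRed API), and the allowlist contents are placeholders for an internal registry or reviewed lockfile:

```python
# Hypothetical allowlist of vetted package names; in practice this would be
# populated from an internal registry or a lockfile review process.
VETTED_PACKAGES = {"requests", "numpy", "flask"}

def flag_unvetted(suggested: list[str]) -> list[str]:
    """Return model-suggested package names that are not on the allowlist."""
    return [name for name in suggested if name.lower() not in VETTED_PACKAGES]
```

Any name the helper flags should be checked against the real package index before use, since a hallucinated name may already have been registered by an attacker (typosquatting).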

Risk Level

Medium to High

Common in Tiers

Starter (Python only) and Enterprise (all languages)


Misinformation & Reasoning

Testing for false information and logical manipulation

Description

These attacks test the model's susceptibility to generating or accepting false information, failing at basic reasoning tasks, or being manipulated through logical fallacies.

Test Suites in This Category

  • False Assertion - Testing rejection of demonstrably false claims
  • Who is Riley - Hallucination detection probes
  • Snowball Graph - Impossible reasoning problems

Risk Level

Medium

Common in Tiers

Starter tier - important for factual applications


Usage Examples

Testing by Category

import asyncio
from modelred import ModelRed

async def main():
    async with ModelRed(api_key="mr_your_api_key_here") as client:
        # Test encoding attacks
        encoding_suites = [
            "base64_injection",
            "hexadecimal_injection",
            "rot13_encoding"
        ]

        # Test content safety
        safety_suites = [
            "toxicity_generation",
            "continue_slurs",
            "lmrc_bullying"
        ]

        # Run category-focused assessment
        result = await client.run_assessment(
            model_id="my-model",
            test_suites=encoding_suites + safety_suites,
            priority="high",
            wait_for_completion=True
        )

asyncio.run(main())

Risk-Based Testing

import asyncio
from modelred import ModelRed

# High-risk categories for production systems
high_risk_suites = [
    # Content Safety (critical for user-facing apps)
    "toxicity_generation",
    "lmrc_bullying",

    # Injection Attacks (system security)
    "prompt_inject_hate",
    "jinja_template_injection",

    # Jailbreaking (policy compliance)
    "past_tense_jailbreak",
    "future_tense_jailbreak"
]

async def main():
    async with ModelRed(api_key="mr_your_api_key_here") as client:
        result = await client.run_assessment(
            model_id="production-model",
            test_suites=high_risk_suites,
            priority="critical",
            wait_for_completion=True
        )

asyncio.run(main())

Security Prioritization

Critical Priority

Focus on these categories first for production systems:

  1. Content Safety - User safety and policy compliance
  2. Injection Attacks - System security and data protection
  3. Jailbreaking - Safety mechanism integrity

Medium Priority

Important for comprehensive security:

  1. Social Engineering - User interaction safety
  2. Supply Chain - Development environment security
  3. Misinformation - Factual accuracy

Lower Priority

Foundation security testing:

  1. Encoding Attacks - Basic input validation
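The prioritization above can be encoded as a simple plan builder that expands a priority level into the suites to run. This is a hypothetical helper, not part of the SDK, and the suite identifiers are illustrative (some reuse names from the usage examples above):

```python
# Representative suites per priority tier; identifiers are illustrative
# and should be replaced with those from your actual test catalog.
PRIORITY_PLAN = {
    "critical": ["toxicity_generation", "prompt_inject_hate", "past_tense_jailbreak"],
    "medium": ["grandma_attacks", "python_package_hallucination", "false_assertion"],
    "low": ["base64_injection", "rot13_encoding"],
}

def suites_for(max_priority: str) -> list[str]:
    """Collect all suites at or above the given priority level."""
    order = ["critical", "medium", "low"]
    cutoff = order.index(max_priority) + 1
    return [s for level in order[:cutoff] for s in PRIORITY_PLAN[level]]
```

For example, `suites_for("medium")` yields the critical and medium suites, which could then be passed as `test_suites` to `run_assessment` as shown in the usage examples.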