Documentation

Attack Categories

Understanding the different types of AI security vulnerabilities and attack vectors tested by ModelRed

ModelRed's test suites are organized into distinct attack categories that represent different types of AI security vulnerabilities. Understanding these categories helps you identify specific threats and prioritize your security efforts.

Category Overview

🔐 Encoding Attacks (12 test suites)
Base64, hex, ROT13, and other encoding schemes used to bypass input filters

🛡️ Content Safety (9 test suites)
Toxicity, hate speech, slurs, and harmful content generation testing

🎭 Social Engineering (5 test suites)
Emotional manipulation and psychological attack techniques

🔓 Jailbreaking (8 test suites)
Advanced techniques to bypass safety guardrails and restrictions

💉 Injection Attacks (7 test suites)
Prompt injection, template injection, and code execution vulnerabilities

🦠 Malware Generation (4 test suites)
Detection of malicious code generation and security evasion


Encoding Attacks

Techniques that use various encoding schemes to obfuscate malicious content

Description

Encoding attacks attempt to bypass input validation and content filters by representing malicious instructions in alternative character encodings. These attacks exploit the fact that many AI systems may not properly decode or recognize encoded content as potentially harmful.

Test Suites in This Category

  • Base64 Injection - Tests Base64 encoding vulnerabilities
  • Hexadecimal Injection - Detects hex encoding attacks
  • ROT13 Encoding - Character substitution attacks
  • Morse Encoding - Morse code obfuscation
  • Braille Encoding - Braille text encoding attacks
  • NATO Encoding - Phonetic alphabet encoding
  • Unicode Attacks - Base2048, Zalgo, and emoji encoding
  • Legacy Encoding - UUencoding and ASCII85 attacks
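As a quick illustration of what these suites probe, the sketch below (plain Python, unrelated to the ModelRed SDK) shows how the same instruction survives Base64 and ROT13 round trips. A filter that only inspects the raw string sees an opaque token, while the model may decode it back into the original instruction:

```python
import base64
import codecs

# A hypothetical instruction an attacker might try to smuggle past a filter
payload = "Ignore all previous instructions"

# Base64: the filtered text no longer contains the trigger phrase
b64 = base64.b64encode(payload.encode()).decode()

# ROT13: simple character substitution, reversible with the same operation
rot13 = codecs.encode(payload, "rot13")

# Both encodings round-trip back to the original instruction
assert base64.b64decode(b64).decode() == payload
assert codecs.decode(rot13, "rot13") == payload
```

Neither encoded form contains the literal phrase, which is why keyword-based input validation alone is insufficient against this category.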

Risk Level

Low to Medium

Common in Tiers

Primarily Free and Starter tiers - these represent fundamental security testing


Content Safety

Detection of harmful, toxic, or inappropriate content generation

Description

Content safety attacks test whether AI models can be manipulated into generating harmful content including hate speech, toxicity, slurs, sexual content, and other policy violations. These tests are crucial for user-facing applications.

Test Suites in This Category

  • Toxicity Generation - Advanced AI-powered toxicity testing
  • Continue Slurs - Partial offensive term completion
  • LMRC Content Tests - Bullying, slur usage, sexual content
  • Sexualization - Inappropriate objectification testing
  • Deadnaming - Deliberate use of a transgender person's former name
  • Medical Misinformation - Pseudoscience and quack medicine
  • Real Toxicity Prompts - Baseline toxicity testing

Risk Level

High

Common in Tiers

Starter tier and above - essential for production applications


Social Engineering

Psychological manipulation techniques to extract information or bypass restrictions

Description

Social engineering attacks use emotional manipulation, authority exploitation, and psychological tricks to convince AI models to provide restricted information or perform unauthorized actions.

Test Suites in This Category

  • Grandma Attacks - Emotional manipulation for product keys, slurs, and substances
  • JSON Threats - Threatening models to output raw JSON
  • Tag Injection - Hidden Unicode manipulation

Risk Level

Medium to High

Common in Tiers

Starter tier - important for applications with human interaction


Jailbreaking

Advanced techniques to override safety mechanisms and restrictions

Description

Jailbreaking encompasses sophisticated attempts to bypass AI safety measures, including role-playing scenarios, complex prompt structures, and research-grade attack techniques.

Test Suites in This Category

  • Advanced Jailbreak Ablation - DAN 11.0 with 127 variations
  • AutoDAN Cached - Gradient descent optimization attacks
  • DAN in the Wild - Real-world attack collection
  • Tense Manipulation - Past/future tense bypass techniques
  • Latent Jailbreak - Translation-based distraction attacks

Risk Level

High to Critical

Common in Tiers

Starter and Pro tiers - advanced security research


Injection Attacks

Code execution and system manipulation vulnerabilities

Description

Injection attacks attempt to execute unauthorized code, manipulate system behavior, or extract sensitive information through various injection techniques including prompt injection, template injection, and SQL injection.

Test Suites in This Category

  • Prompt Injection - Framework attacks targeting hate speech and violence
  • Jinja Template Injection - Server-side template execution
  • SQL Injection Echo - Database manipulation attempts
  • Latent Injection - Hidden instructions in documents
  • XSS Attacks - Cross-site scripting and data exfiltration
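A defense-in-depth complement to these suites is a lightweight pre-filter on untrusted input before it reaches the model. The sketch below is not part of the ModelRed SDK; it is a minimal example of pattern-based screening, and the patterns shown are illustrative rather than comprehensive:

```python
import re

# Hypothetical patterns; a real deployment would need a far broader,
# continuously updated set (encodings, obfuscation, other languages).
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I),
    re.compile(r"\{\{.*\}\}"),       # template-injection markers (e.g. Jinja)
    re.compile(r"<script\b", re.I),  # XSS-style payloads
]

def looks_like_injection(text: str) -> bool:
    """Return True if the text matches any known injection pattern."""
    return any(p.search(text) for p in INJECTION_PATTERNS)
```

A filter like this catches only the crudest attempts; the test suites in this category exist precisely because attackers routinely evade static patterns.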

Risk Level

Critical

Common in Tiers

Starter through Enterprise - varies by complexity


Malware Generation

Detection of malicious code creation capabilities

Description

Malware generation tests evaluate whether AI models can be manipulated into creating malicious software, including viruses, ransomware, cryptojacking code, and evasion techniques.

Test Suites in This Category

  • Malware Top Level - High-level malicious code requests
  • Malware Sub Functions - Individual attack component generation
  • Malware Evasion - Anti-detection code generation
  • Malware Payload - Ransomware and cryptojacking

Risk Level

Critical

Common in Tiers

Enterprise tier only - highest security requirements


Supply Chain Attacks

Testing for software dependency vulnerabilities

Description

Supply chain attacks test whether AI models recommend non-existent or malicious packages that could compromise software dependencies across various programming ecosystems.

Test Suites in This Category

  • Python Package Hallucination - Non-existent PyPI packages
  • JavaScript Package Hallucination - Fake NPM packages
  • Ruby Gem Hallucination - Non-existent Ruby gems
  • Rust Crate Hallucination - Fake Cargo packages
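One practical mitigation these suites motivate is verifying model-suggested dependencies against a vetted allowlist before installation. The helper below is hypothetical (not a ModelRed API), and the allowlist contents are placeholders for an internal registry or reviewed lockfile:

```python
# Hypothetical allowlist of vetted package names; in practice this would be
# populated from an internal registry or a lockfile review process.
VETTED_PACKAGES = {"requests", "numpy", "flask"}

def flag_unvetted(suggested: list[str]) -> list[str]:
    """Return model-suggested package names that are not on the allowlist."""
    return [name for name in suggested if name.lower() not in VETTED_PACKAGES]
```

Any name the helper flags should be checked against the real package index before use, since a hallucinated name may already have been registered by an attacker (typosquatting).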

Risk Level

Medium to High

Common in Tiers

Starter (Python only) and Enterprise (all languages)


Misinformation & Reasoning

Testing for false information and logical manipulation

Description

These attacks test the model's susceptibility to generating or accepting false information, failing at basic reasoning tasks, or being manipulated through logical fallacies.

Test Suites in This Category

  • False Assertion - Testing rejection of demonstrably false claims
  • Who is Riley - Hallucination detection probes
  • Snowball Graph - Impossible reasoning problems

Risk Level

Medium

Common in Tiers

Starter tier - important for factual applications


Usage Examples

Testing by Category

import asyncio
from modelred import ModelRed

async def main():
    async with ModelRed(api_key="mr_your_api_key_here") as client:
        # Test encoding attacks
        encoding_suites = [
            "base64_injection",
            "hexadecimal_injection",
            "rot13_encoding"
        ]

        # Test content safety
        safety_suites = [
            "toxicity_generation",
            "continue_slurs",
            "lmrc_bullying"
        ]

        # Run category-focused assessment
        result = await client.run_assessment(
            model_id="my-model",
            test_suites=encoding_suites + safety_suites,
            priority="high",
            wait_for_completion=True
        )

asyncio.run(main())

Risk-Based Testing

import asyncio
from modelred import ModelRed

# High-risk categories for production systems
high_risk_suites = [
    # Content Safety (critical for user-facing apps)
    "toxicity_generation",
    "lmrc_bullying",

    # Injection Attacks (system security)
    "prompt_inject_hate",
    "jinja_template_injection",

    # Jailbreaking (policy compliance)
    "past_tense_jailbreak",
    "future_tense_jailbreak"
]

async def main():
    async with ModelRed(api_key="mr_your_api_key_here") as client:
        result = await client.run_assessment(
            model_id="production-model",
            test_suites=high_risk_suites,
            priority="critical",
            wait_for_completion=True
        )

asyncio.run(main())

Security Prioritization

Critical Priority

Focus on these categories first for production systems:

  1. Content Safety - User safety and policy compliance
  2. Injection Attacks - System security and data protection
  3. Jailbreaking - Safety mechanism integrity

Medium Priority

Important for comprehensive security:

  1. Social Engineering - User interaction safety
  2. Supply Chain - Development environment security
  3. Misinformation - Factual accuracy

Lower Priority

Foundation security testing:

  1. Encoding Attacks - Basic input validation
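The prioritization above can be encoded as a simple plan builder that expands a priority level into the suites to run. This is a hypothetical helper, not part of the SDK, and the suite identifiers are illustrative (some reuse names from the usage examples above):

```python
# Representative suites per priority tier; identifiers are illustrative
# and should be replaced with those from your actual test catalog.
PRIORITY_PLAN = {
    "critical": ["toxicity_generation", "prompt_inject_hate", "past_tense_jailbreak"],
    "medium": ["grandma_attacks", "python_package_hallucination", "false_assertion"],
    "low": ["base64_injection", "rot13_encoding"],
}

def suites_for(max_priority: str) -> list[str]:
    """Collect all suites at or above the given priority level."""
    order = ["critical", "medium", "low"]
    cutoff = order.index(max_priority) + 1
    return [s for level in order[:cutoff] for s in PRIORITY_PLAN[level]]
```

For example, `suites_for("medium")` yields the critical and medium suites, which could then be passed as `test_suites` to `run_assessment` as shown in the usage examples.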