Streamlit Security Implementation Guide
Overview
This document outlines the comprehensive security measures implemented for file uploads in the AIAgents4Pharma Streamlit applications. The security implementation addresses file validation, content scanning, and protection against various attack vectors.
Security Architecture
Multi-Layer Validation
The security implementation uses a multi-layer approach:
- File Extension Validation - Whitelist approach for allowed file types
- MIME Type Verification - Detects file masquerading attacks
- File Size Limits - Prevents DoS attacks and resource exhaustion
- Content Pattern Scanning - Detects malicious content patterns
- Filename Sanitization - Prevents directory traversal attacks
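The layering above can be pictured as an ordered list of independent checks, where the first failure rejects the file. The following is a minimal sketch, not the project's actual code: the names `run_checks`, `check_extension`, and `check_size` and the 1 MB limit are assumptions made for illustration.

```python
from typing import Callable, Optional

# Each check returns an error message, or None when the file passes.
Check = Callable[[str, bytes], Optional[str]]


def check_extension(name: str, data: bytes) -> Optional[str]:
    ext = name.rsplit(".", 1)[-1].lower() if "." in name else ""
    allowed = {"pdf", "xml", "sbml"}  # illustrative whitelist
    return None if ext in allowed else f"File extension '{ext}' not allowed"


def check_size(name: str, data: bytes) -> Optional[str]:
    limit = 1024 * 1024  # 1 MB, kept small for the illustration
    return None if len(data) <= limit else "File too large"


def run_checks(name: str, data: bytes, checks: list[Check]) -> dict:
    """Run the checks in order and stop at the first failure."""
    for check in checks:
        error = check(name, data)
        if error is not None:
            return {"valid": False, "error": error}
    return {"valid": True, "error": None}


ok = run_checks("model.xml", b"<?xml version='1.0'?>", [check_extension, check_size])
bad = run_checks("tool.exe", b"MZ", [check_extension, check_size])
```

Keeping each layer as a separate function means a new check can be added to the list without touching the others.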
Implementation Details
Core Security Functions
secure_file_upload()
The main security wrapper that replaces st.file_uploader():
from app.frontend.utils.streamlit_utils import secure_file_upload

# Basic usage
uploaded_file = secure_file_upload(
    "Upload Document",
    allowed_types=["pdf"],
    help_text="Upload a PDF document",
    max_size_mb=50,
    accept_multiple_files=False,
)
validate_uploaded_file()
Comprehensive validation engine that performs:
- File extension checks against whitelist
- MIME type detection using python-magic
- File size validation
- Content pattern scanning
- Security threat detection
sanitize_filename()
Filename sanitization to prevent:
- Directory traversal attacks (../../../etc/passwd)
- Dangerous characters in filenames
- Overly long filenames
- Reserved system names
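To make the four goals above concrete, here is a minimal sketch of a sanitizer. It is an assumption about how such a function could work, not the project's `sanitize_filename()`: the name `sanitize_filename_sketch`, the allowed character set, and the `unnamed_file` fallback are all hypothetical, and the real function's replacement rules may differ.

```python
import re

# Windows device names that must not be used as a file stem.
RESERVED_NAMES = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)


def sanitize_filename_sketch(filename: str, max_length: int = 255) -> str:
    # Drop directory components, treating both separator styles alike.
    name = filename.replace("\\", "/").rsplit("/", 1)[-1]
    # Replace anything outside a conservative character set.
    name = re.sub(r"[^A-Za-z0-9._-]", "_", name)
    # Strip leading dots so the result is never "." or ".." or a hidden file.
    name = name.lstrip(".")
    if not name:
        name = "unnamed_file"
    # Prefix reserved device names so they no longer match.
    if name.split(".", 1)[0].upper() in RESERVED_NAMES:
        name = f"_{name}"
    # Truncate long names while trying to keep the extension.
    if len(name) > max_length:
        stem, dot, ext = name.rpartition(".")
        if dot and len(ext) < max_length - 1:
            name = stem[: max_length - len(ext) - 1] + "." + ext
        else:
            name = name[:max_length]
    return name
```

For example, `sanitize_filename_sketch("..\\windows\\system32\\cmd.exe")` returns `"cmd.exe"`, because only the final path component survives.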
Supported File Types
PDF Files
pdf_file = secure_file_upload(
    "Upload Research Paper",
    allowed_types=["pdf"],
    max_size_mb=50,
)
- Extensions: .pdf
- MIME Types: application/pdf
- Content Validation: Checks for PDF header (%PDF-)
- Max Size: 50MB (configurable)
XML/SBML Files
model_file = secure_file_upload(
    "Upload Model File",
    allowed_types=["xml"],
    max_size_mb=25,
)
- Extensions: .xml, .sbml
- MIME Types: application/xml, text/xml
- Content Validation: Checks for XML header (<?xml)
- Max Size: 25MB (configurable)
Spreadsheet Files
data_file = secure_file_upload(
    "Upload Data File",
    allowed_types=["spreadsheet"],
    max_size_mb=25,
)
- Extensions: .xlsx, .xls, .csv
- MIME Types: Excel/CSV MIME types
- Max Size: 25MB (configurable)
Text Files
text_file = secure_file_upload(
    "Upload Text Data",
    allowed_types=["text"],
    max_size_mb=10,
)
- Extensions: .txt, .md
- MIME Types: text/plain, text/markdown
- Max Size: 10MB (configurable)
Security Configuration
Upload Limits
UPLOAD_SECURITY_CONFIG = {
    "max_file_size_mb": 50,  # Global default
    "max_filename_length": 255,
    "allowed_extensions": {
        "pdf": ["pdf"],
        "xml": ["xml", "sbml"],
        "spreadsheet": ["xlsx", "xls", "csv"],
        "text": ["txt", "md"],
    },
    "dangerous_extensions": [
        "exe", "bat", "cmd", "com", "pif", "scr", "vbs",
        "js", "jar", "app", "deb", "pkg", "dmg", "rpm",
        "msi", "dll", "sys", "drv", "sh", "bash", "ps1",
        "py", "pl", "rb", "php", "asp", "jsp",
    ],
}
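A configuration of this shape maps each `allowed_types` category to a set of extensions. The sketch below shows one plausible way to resolve categories and check a filename against them. The function name `check_extension`, the trimmed `CONFIG`, and the second error message are assumptions for illustration; only the first message format is taken from this guide.

```python
from typing import Optional

# Trimmed copy of the configuration, so the example runs on its own.
CONFIG = {
    "allowed_extensions": {
        "pdf": ["pdf"],
        "xml": ["xml", "sbml"],
        "spreadsheet": ["xlsx", "xls", "csv"],
        "text": ["txt", "md"],
    },
    "dangerous_extensions": ["exe", "bat", "sh", "py", "js"],
}


def check_extension(
    filename: str, allowed_types: list[str], config: dict = CONFIG
) -> Optional[str]:
    """Return an error message, or None if the extension is acceptable."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    # Expand category names ("xml") into concrete extensions ("xml", "sbml").
    allowed = sorted(
        e for t in allowed_types for e in config["allowed_extensions"].get(t, [])
    )
    if ext not in allowed:
        return f"File extension '{ext}' not allowed. Allowed: {allowed}"
    # Defense in depth: reject even if a dangerous type slipped into the whitelist.
    if ext in config["dangerous_extensions"]:
        return f"File extension '{ext}' is blocked"
    return None
```

Only the last extension is examined, so a name such as `paper.pdf.exe` is judged as `exe`.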
Blocked Content Patterns
The system automatically blocks files containing:
- Script tags: <script>, javascript:, vbscript:
- Server-side code: <?php, #!/bin/
- Dangerous functions: eval(), exec(), system()
- Shell commands: #!/usr/bin/, shell_exec()
Note: The pattern <% is blocked only in non-PDF files, because it can occur in legitimate PDF content. For PDFs, only clearly suspicious patterns such as <% eval or <% system are blocked.
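A simple way to picture this scan is a case-insensitive substring search over the raw bytes, with a different pattern set for PDFs. This is a sketch under assumptions: `find_suspicious_pattern` and the two pattern lists are hypothetical names, and the real scanner may match differently (for example with regular expressions or only on part of the file).

```python
from typing import Optional

# Patterns taken from the list above, stored as lowercase byte prefixes.
BLOCKED_PATTERNS = [
    b"<script", b"javascript:", b"vbscript:",
    b"<?php", b"#!/bin/", b"#!/usr/bin/",
    b"eval(", b"exec(", b"system(", b"shell_exec(",
]
# Narrower "<%" variants that stay blocked inside PDFs.
PDF_ONLY_PATTERNS = [b"<% eval", b"<% system"]


def find_suspicious_pattern(data: bytes, is_pdf: bool) -> Optional[str]:
    """Return the first blocked pattern found in data, or None."""
    lowered = data.lower()
    patterns = list(BLOCKED_PATTERNS)
    # Plain "<%" is only treated as suspicious outside PDFs.
    patterns += PDF_ONLY_PATTERNS if is_pdf else [b"<%"]
    for pattern in patterns:
        if pattern in lowered:
            return pattern.decode("ascii")
    return None
```

Substring matching is deliberately blunt: a harmless text file that merely mentions `eval(` would also be flagged, which is a trade-off of this style of check.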
Application Integration
Talk2AIAgents4Pharma (Combined) — T2AA4P
T2AA4P integrates both T2B and T2KG secure uploads via shared utilities, so no additional per-app code is needed beyond calling the helpers.
# In app/frontend/streamlit_app_talk2aiagents4pharma.py
# SBML/PDF (T2B side)
uploaded_sbml_file = streamlit_utils.get_t2b_uploaded_files(app)
# Data packages and multimodal files (T2KG side)
streamlit_utils.get_uploaded_files(cfg)
# Both helpers use secure_file_upload() under the hood.
Talk2BioModels - XML/SBML Upload
def get_t2b_uploaded_files(app):
    uploaded_sbml_file = secure_file_upload(
        "Upload an XML/SBML file",
        allowed_types=["xml"],
        help_text="Upload a QSP as an XML/SBML file",
        max_size_mb=25,
        accept_multiple_files=False,
        key="secure_sbml_upload",
    )
    article = secure_file_upload(
        "Upload an article",
        allowed_types=["pdf"],
        help_text="Upload a PDF article to ask questions.",
        max_size_mb=50,
        accept_multiple_files=False,
        key="secure_article_upload",
    )
Talk2KnowledgeGraphs - Data Upload
def get_uploaded_files(cfg):
    data_package_files = secure_file_upload(
        "💊 Upload pre-clinical drug data",
        allowed_types=["text", "spreadsheet", "pdf"],
        help_text="Drug targets and kinetic parameters",
        max_size_mb=25,
        accept_multiple_files=True,
        key="secure_data_upload",
    )
    multimodal_files = secure_file_upload(
        "📦 Upload multimodal data package",
        allowed_types=["spreadsheet"],
        help_text="Multimodal endotype/phenotype data",
        max_size_mb=50,
        accept_multiple_files=True,
        key="secure_multimodal_upload",
    )
Security Validation Flow
1. Pre-Upload Validation
- File extension whitelist check
- File type restriction by Streamlit
2. Post-Upload Validation
validation_result = validate_uploaded_file(uploaded_file, allowed_types, max_size_mb)
if not validation_result["valid"]:
    st.error(f"❌ {uploaded_file.name}: {validation_result['error']}")
    return None
# Show warnings for suspicious but not critical issues
if validation_result["warnings"]:
    for warning in validation_result["warnings"]:
        st.warning(f"⚠️ {uploaded_file.name}: {warning}")
st.success(f"✅ {uploaded_file.name} validated successfully")
3. Secure Processing
import tempfile

# Sanitize filename
safe_filename = sanitize_filename(uploaded_file.name)
# Create secure temporary file
with tempfile.NamedTemporaryFile(delete=False, suffix=f"_{safe_filename}") as f:
    f.write(uploaded_file.read())
    secure_file_path = f.name
# Process using sanitized path
process_file(secure_file_path, safe_filename)
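Because the temporary file is created with `delete=False`, it stays on disk until something removes it. One possible way to guarantee cleanup is a `try`/`finally` around the processing step, sketched below. The helper `process_upload` and the `handler` callback are hypothetical names, not part of the project's utilities.

```python
import os
import tempfile
from typing import Callable, TypeVar

T = TypeVar("T")


def process_upload(data: bytes, safe_filename: str, handler: Callable[[str], T]) -> T:
    """Write data to a temp file, call handler(path), and always delete the file."""
    with tempfile.NamedTemporaryFile(delete=False, suffix=f"_{safe_filename}") as f:
        f.write(data)
        path = f.name
    try:
        return handler(path)
    finally:
        # Runs on success and on exceptions, so no upload is left behind.
        os.unlink(path)


seen = {}


def record_size(path: str) -> int:
    seen["path"] = path
    return os.path.getsize(path)


size = process_upload(b"%PDF-1.7 demo", "paper.pdf", record_size)
```

The file is closed before the handler runs, which also keeps the pattern usable on Windows, where an open temporary file cannot be reopened by name.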
Threat Protection
File Masquerading Protection
The system detects files that have been renamed to bypass security:
malware.exe → research_paper.pdf ❌ BLOCKED
- File extension: .pdf (appears safe)
- MIME type: application/x-executable (actual content)
- Result: BLOCKED with warning about MIME type mismatch
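The idea behind this check can be shown without libmagic by comparing a file's leading bytes against a few well-known signatures. This is a simplified stand-in for the python-magic detection the guide describes, not a replacement for it: the signature table, the MIME labels, and the names `sniff_mime` and `mime_mismatch_warning` are assumptions for illustration.

```python
from typing import Optional

# A few well-known leading byte signatures ("magic numbers").
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"MZ", "application/x-dosexec"),          # Windows PE executables
    (b"\x7fELF", "application/x-executable"),  # Linux ELF binaries
    (b"PK\x03\x04", "application/zip"),        # also the container for .xlsx
]
EXPECTED_BY_EXTENSION = {"pdf": {"application/pdf"}, "xlsx": {"application/zip"}}


def sniff_mime(data: bytes) -> Optional[str]:
    """Return a MIME label for recognised content, or None if unknown."""
    for prefix, mime in SIGNATURES:
        if data.startswith(prefix):
            return mime
    return None


def mime_mismatch_warning(filename: str, data: bytes) -> Optional[str]:
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    expected = EXPECTED_BY_EXTENSION.get(ext)
    detected = sniff_mime(data)
    # Unknown extension or unrecognised content: this sketch stays silent.
    if expected is None or detected is None or detected in expected:
        return None
    return f"MIME type mismatch: detected '{detected}', expected '{sorted(expected)[0]}'"
```

A renamed ELF binary such as `research_paper.pdf` produces a mismatch message here, while content the table does not recognise passes unnoticed, which is why a full detector like libmagic is preferable in practice.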
Directory Traversal Protection
# Dangerous filename examples (all blocked/sanitized):
"../../../etc/passwd" → "etc_passwd"
"..\\windows\\system32\\cmd.exe" → "cmd.exe"
"/var/log/sensitive.log" → "sensitive.log"
Content Injection Protection
Files containing these patterns are automatically rejected:
- HTML/JavaScript: <script>alert('xss')</script>
- PHP code: <?php system($_GET['cmd']); ?>
- Shell commands: #!/bin/bash rm -rf /
- Python code injection: eval(malicious_code)
Size-based DoS Protection
# File size limits by type
PDF_MAX_SIZE = 50 * 1024 * 1024 # 50MB
XML_MAX_SIZE = 25 * 1024 * 1024 # 25MB
DATA_MAX_SIZE = 25 * 1024 * 1024 # 25MB
TEXT_MAX_SIZE = 10 * 1024 * 1024 # 10MB
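A size check against these limits reduces to one comparison in megabytes. The sketch below assumes a helper named `check_file_size` (hypothetical) and reuses the error message format shown under Validation Errors.

```python
from typing import Optional


def check_file_size(size_bytes: int, max_size_mb: float) -> Optional[str]:
    """Return an error message if the file exceeds the limit, else None."""
    size_mb = size_bytes / (1024 * 1024)
    if size_mb > max_size_mb:
        return f"File too large ({size_mb:.1f}MB). Max: {max_size_mb}MB"
    return None
```

A file exactly at the limit is accepted in this sketch; whether the boundary is inclusive in the real validator is not stated in this guide.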
Error Handling
Validation Errors
# File too large
"File too large (75.2MB). Max: 50MB"
# Wrong file type
"File extension 'exe' not allowed. Allowed: ['pdf']"
# Dangerous content
"File contains suspicious content pattern: <script"
# MIME type mismatch (warning, not error)
"MIME type mismatch: detected 'application/x-executable', expected 'application/pdf'"
User Feedback
The system provides clear visual feedback:
- ✅ Success: File validated and accepted
- ⚠️ Warning: File accepted but with concerns
- ❌ Error: File rejected with specific reason
Best Practices
For Developers
- Always use secure_file_upload() instead of st.file_uploader()
- Sanitize filenames before storing or processing
- Use appropriate file type restrictions - only allow what's needed
- Set reasonable size limits based on expected use cases
- Handle validation errors gracefully with user-friendly messages
For Users
- Upload only necessary file types as specified
- Keep file sizes reasonable (under the specified limits)
- Use descriptive, clean filenames without special characters
- Verify file content matches the extension before uploading
Monitoring and Logging
Security Events
The system logs security-related events:
- File validation failures
- MIME type mismatches
- Suspicious content detection
- Size limit violations
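This guide does not show how these events are written, so the following is only a sketch of one possibility using the standard `logging` module. The logger name `upload_security`, the function `log_security_event`, and the message layout are assumptions.

```python
import logging

# Hypothetical logger name; the application may use a different one.
security_logger = logging.getLogger("upload_security")


def log_security_event(event: str, filename: str, detail: str) -> None:
    """Record one security event at WARNING level."""
    # %r keeps unusual characters in a user-supplied filename escaped and visible.
    security_logger.warning("event=%s file=%r detail=%s", event, filename, detail)
```

A call such as `log_security_event("size_limit", "big.pdf", "75.2MB > 50MB")` yields a single key=value line that is straightforward to search or parse later.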
Metrics
Key security metrics to monitor:
- Upload rejection rate
- Common rejection reasons
- File type distribution
- Size distribution
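The first two metrics can be tracked with a simple in-memory counter. The class below is a hypothetical sketch, not an existing component of the applications, and a real deployment would likely export these values to a monitoring system instead.

```python
from collections import Counter
from typing import Optional


class UploadMetrics:
    """Count uploads and the reasons they were rejected."""

    def __init__(self) -> None:
        self.total = 0
        self.rejections: Counter = Counter()

    def record(self, rejected_reason: Optional[str] = None) -> None:
        # Every upload counts; only rejected ones carry a reason.
        self.total += 1
        if rejected_reason is not None:
            self.rejections[rejected_reason] += 1

    @property
    def rejection_rate(self) -> float:
        rejected = sum(self.rejections.values())
        return rejected / self.total if self.total else 0.0
```

`rejections.most_common()` then lists the rejection reasons from most to least frequent.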
Dependencies
Required Packages
# pyproject.toml
dependencies = [
"streamlit>=1.41.1",
"python-magic>=0.4.27", # MIME type detection
# ... other dependencies
]
System Requirements
- python-magic: Requires the libmagic system library
  - Linux: sudo apt-get install libmagic1
  - macOS: brew install libmagic
  - Windows: Bundled with python-magic-bin
Important: Install libmagic before running Streamlit apps, or you'll get:
ImportError: failed to find libmagic. Check your installation
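One way to surface this problem without crashing at import time is to guard the import and expose whether detection is available. This is a sketch of that pattern only: `detect_mime` and `MIME_DETECTION_AVAILABLE` are hypothetical names, and silently skipping MIME detection weakens the masquerading check, so failing closed may be the better policy for a given deployment.

```python
try:
    import magic  # python-magic; raises ImportError when libmagic is missing
except ImportError:
    magic = None

MIME_DETECTION_AVAILABLE = magic is not None


def detect_mime(data: bytes):
    """Return the MIME type reported by libmagic, or None if it is unavailable."""
    if magic is None or not hasattr(magic, "from_buffer"):
        return None
    return magic.from_buffer(data, mime=True)
```

Callers can check `MIME_DETECTION_AVAILABLE` at startup and show a clear installation hint instead of a traceback.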
Testing
Security Test Cases
- Valid Files: Ensure proper files are accepted
- File Masquerading: Test renamed malicious files
- Size Limits: Test files exceeding size limits
- Content Injection: Test files with malicious patterns
- Directory Traversal: Test dangerous filenames
Example Tests
def test_file_validation():
    # Test valid PDF
    valid_pdf = create_test_pdf()
    result = validate_uploaded_file(valid_pdf, ["pdf"])
    assert result["valid"]

    # Test file masquerading
    fake_pdf = create_executable_named_pdf()
    result = validate_uploaded_file(fake_pdf, ["pdf"])
    assert len(result["warnings"]) > 0  # Should warn about MIME mismatch

    # Test size limit
    large_file = create_large_file(100 * 1024 * 1024)  # 100MB
    result = validate_uploaded_file(large_file, ["pdf"], max_size_mb=50)
    assert not result["valid"]
    assert "File too large" in result["error"]
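The helper functions used in the test above are not defined in this guide. A minimal sketch of how they could be built is shown here, using an in-memory stand-in for Streamlit's uploaded file object. `FakeUploadedFile` is a hypothetical class, and it assumes the validator only needs the file's bytes plus its `name` and `size` attributes.

```python
import io


class FakeUploadedFile(io.BytesIO):
    """In-memory stand-in for an uploaded file: bytes plus name and size."""

    def __init__(self, data: bytes, name: str) -> None:
        super().__init__(data)
        self.name = name
        self.size = len(data)


def create_test_pdf(name: str = "paper.pdf") -> FakeUploadedFile:
    # Starts with the PDF header the content check looks for.
    return FakeUploadedFile(b"%PDF-1.4\n%%EOF\n", name)


def create_executable_named_pdf(name: str = "research_paper.pdf") -> FakeUploadedFile:
    # ELF magic bytes hidden behind a .pdf extension.
    return FakeUploadedFile(b"\x7fELF" + b"\x00" * 12, name)


def create_large_file(size_bytes: int, name: str = "large.pdf") -> FakeUploadedFile:
    header = b"%PDF-1.4\n"
    # Pad with filler so the total length equals size_bytes.
    return FakeUploadedFile(header + b"0" * max(0, size_bytes - len(header)), name)
```

Building the files in memory keeps the tests fast and avoids committing binary fixtures to the repository.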
Future Enhancements
Planned Improvements
- Virus Scanning Integration - Add ClamAV or similar
- Advanced Content Analysis - Deep content inspection
- User-based Quotas - Per-user upload limits
- Audit Logging - Enhanced security event logging
- Rate Limiting - Prevent upload spam
- File Quarantine - Temporary isolation of suspicious files
Configuration Improvements
- Dynamic Configuration - Runtime security policy updates
- Per-Agent Policies - Different security rules per agent
- Content-based Rules - Smarter content analysis
- Integration APIs - External security service integration
Conclusion
The Streamlit security implementation protects against common file upload attacks while maintaining usability. Because the checks are layered, a file that slips past one of them can still be caught by another.
For questions or security concerns, please contact the development team or create an issue in the repository.
Security is everyone's responsibility - always validate, never trust user input! 🛡️