
File Analyzer Agent

The File Analyzer Agent provides comprehensive analysis of file contents, structure, and metadata. It automatically detects data types, assesses data quality, and extracts meaningful insights from a wide range of file formats.

Quick Start

from erdo.actions import bot

# Analyze a CSV file
result = bot.invoke(
    bot_name="file analyzer",
    parameters={
        "resource": "sales_data.csv"
    }
)

Features

  • Multi-Format Support: CSV, Excel, JSON, Parquet, and more
  • Data Profiling: Comprehensive statistical analysis and profiling
  • Schema Detection: Automatic schema inference and validation
  • Quality Assessment: Data quality metrics and issue detection

Supported File Types

  • Delimited Files: CSV, TSV, pipe-delimited
  • Spreadsheets: Excel (.xlsx, .xls), Google Sheets
  • Structured Data: JSON, JSONL, XML, YAML
  • Columnar Formats: Parquet, ORC, Avro
  • Databases: SQLite, database dumps
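
The invocation pattern is the same for every format listed above; only the resource parameter changes. A minimal sketch for a Parquet file (the filename events.parquet is illustrative):

from erdo.actions import bot

# Analyze a Parquet file; only the resource parameter differs from the CSV example
result = bot.invoke(
    bot_name="file analyzer",
    parameters={
        "resource": "events.parquet"
    }
)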

Analysis Output

A typical analysis result looks like this:

{
  "filename": "sales_data.csv",
  "file_size_mb": 2.4,
  "file_type": "csv",
  "encoding": "utf-8",
  "row_count": 10000,
  "column_count": 12,
  "last_modified": "2024-01-15T10:30:00Z"
}
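
Downstream steps can branch on these fields. A minimal sketch, assuming the analyzer's output has already been parsed into a Python dictionary (how the result is exposed depends on your workflow):

# Assumption: the analysis output above is available as a plain dict
analysis = {
    "filename": "sales_data.csv",
    "encoding": "utf-8",
    "row_count": 10000,
    "column_count": 12,
}

# Skip empty files and flag unexpected encodings before further processing
if analysis["row_count"] == 0:
    print(f"{analysis['filename']}: no rows to process")
elif analysis["encoding"].lower() != "utf-8":
    print(f"{analysis['filename']}: unexpected encoding {analysis['encoding']}")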

Configuration Options

# Simple file analysis
result = bot.invoke(
    bot_name="file analyzer",
    parameters={
        "resource": "data.csv"
    }
)

# Analyze specific Excel sheet
result = bot.invoke(
    bot_name="file analyzer",
    parameters={
        "resource": "workbook.xlsx",
        "sheet_name": "Q4_Sales"
    }
)

# Advanced analysis with options
result = bot.invoke(
    bot_name="file analyzer",
    parameters={
        "resource": "large_dataset.csv",
        "sample_size": 10000,
        "profile_level": "detailed",
        "detect_outliers": True
    }
)

Use Cases

Data Discovery

  • Understand new datasets quickly
  • Identify data types and patterns
  • Assess data quality before processing

Migration Planning

  • Analyze source data structure
  • Identify potential migration issues
  • Plan data transformation strategies

Quality Monitoring

  • Regular data quality assessments
  • Track data drift over time
  • Automated quality reporting (a recurring check is sketched below)
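
A minimal sketch of a recurring quality check, assuming one analyzer invocation per monitored file; the file list is illustrative:

from erdo.actions import bot

# Illustrative list of files to monitor; replace with your own sources
monitored_files = ["sales_data.csv", "orders.parquet"]

for resource in monitored_files:
    # Re-run the analyzer to capture current quality metrics for each file
    result = bot.invoke(
        bot_name="file analyzer",
        parameters={
            "resource": resource,
            "profile_level": "detailed"
        }
    )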

Performance Features

  • Streaming Analysis: Handles large files efficiently
  • Incremental Processing: Only analyzes changed portions
  • Memory Optimization: Smart sampling for large datasets (see the sketch below)
  • Parallel Processing: Concurrent analysis of multiple files
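
One way to apply sampling selectively is to check the file size first and only pass sample_size for large inputs. A minimal sketch; the 500 MB threshold and filename are illustrative, and the agent's own sampling behavior may make this unnecessary:

import os

from erdo.actions import bot

path = "large_dataset.csv"

# For very large files, request sampling to keep memory use bounded
parameters = {"resource": path}
if os.path.getsize(path) > 500 * 1024 * 1024:
    parameters["sample_size"] = 10000

result = bot.invoke(bot_name="file analyzer", parameters=parameters)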

Best Practices

  • Use consistent file naming conventions
  • Ensure proper encoding (UTF-8 recommended; see the sketch below)
  • Include headers in structured files
  • Document file sources and update schedules
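
A quick way to confirm a file is UTF-8 before analysis (plain Python, independent of the agent itself):

# Returns True if the file decodes cleanly as UTF-8
def is_utf8(path):
    try:
        with open(path, encoding="utf-8", errors="strict") as f:
            for _ in f:
                pass
        return True
    except UnicodeDecodeError:
        return False

print(is_utf8("sales_data.csv"))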