Dataset Search Command

The dataset-search command finds papers that use a given dataset and can optionally show the benchmark results and performance metrics those papers report.

Basic Usage

scoutml dataset-search DATASET [OPTIONS]

Examples

# Find papers using ImageNet
scoutml dataset-search "ImageNet"

# Find papers using COCO
scoutml dataset-search "COCO"

# Find papers using an NLP benchmark suite
scoutml dataset-search "GLUE"

With Benchmark Results

# Include benchmark scores
scoutml dataset-search "ImageNet" --include-benchmarks

# Exclude benchmark tables
scoutml dataset-search "CIFAR-10" --no-benchmarks

Options

Option                Type     Default  Description
--limit               INTEGER  20       Number of results
--include-benchmarks  FLAG     False    Include benchmark results
--no-benchmarks       FLAG     False    Exclude benchmark results
--year-min            INTEGER  None     Minimum publication year
--year-max            INTEGER  None     Maximum publication year
--output              CHOICE   table    Output format: table/json/csv
--export              PATH     None     Export results to file

Computer Vision

# Classification
scoutml dataset-search "ImageNet" --include-benchmarks
scoutml dataset-search "CIFAR-10"
scoutml dataset-search "CIFAR-100"

# Object Detection
scoutml dataset-search "COCO"
scoutml dataset-search "Pascal VOC"
scoutml dataset-search "Open Images"

# Segmentation
scoutml dataset-search "ADE20K"
scoutml dataset-search "Cityscapes"

Natural Language Processing

# Benchmarks
scoutml dataset-search "GLUE"
scoutml dataset-search "SuperGLUE"
scoutml dataset-search "SQuAD"

# Language Modeling
scoutml dataset-search "WikiText-103"
scoutml dataset-search "BookCorpus"
scoutml dataset-search "Common Crawl"

Multimodal

# Vision-Language
scoutml dataset-search "MS-COCO Captions"
scoutml dataset-search "Conceptual Captions"
scoutml dataset-search "LAION-400M"

Benchmark Analysis

Finding SOTA Results

# Get current SOTA on ImageNet
scoutml dataset-search "ImageNet" \
  --include-benchmarks \
  --year-min 2022 \
  --limit 10

Tracking Progress

# See improvement over time
scoutml dataset-search "GLUE" \
  --include-benchmarks \
  --year-min 2018 \
  --output json \
  --export glue_progress.json
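
To compare results year by year rather than in one combined file, the same flags can be wrapped in a loop. The sketch below is illustrative only: `export_by_year` is a hypothetical helper name, and it assumes that `--year-min` and `--year-max` are both inclusive, which this page does not state.

```shell
# Export one JSON file per publication year so scores can be compared over time.
# Usage: export_by_year DATASET FIRST_YEAR LAST_YEAR
# Assumption: --year-min and --year-max are both inclusive bounds.
export_by_year() {
  dataset=$1
  year=$2
  last=$3
  while [ "$year" -le "$last" ]; do
    scoutml dataset-search "$dataset" \
      --include-benchmarks \
      --year-min "$year" \
      --year-max "$year" \
      --output json \
      --export "${dataset}_${year}.json"
    year=$((year + 1))
  done
}
```

For example, `export_by_year GLUE 2018 2020` would run three searches and write `GLUE_2018.json`, `GLUE_2019.json`, and `GLUE_2020.json`.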

Advanced Usage

Comparing Methods on Same Dataset

# Find diverse approaches
scoutml dataset-search "CIFAR-10" \
  --include-benchmarks \
  --limit 30 \
  --year-min 2021

Dataset Combinations

Some papers use multiple datasets, for example one for pre-training and another for fine-tuning. Search for each dataset separately:

# Pre-training datasets
scoutml dataset-search "ImageNet-21K"

# Fine-tuning datasets
scoutml dataset-search "iNaturalist"

Domain Transfer Studies

# Cross-dataset evaluation
scoutml dataset-search "Office-31"
scoutml dataset-search "DomainNet"

Output Formats

With Benchmarks

When using --include-benchmarks:

scoutml dataset-search "ImageNet" --include-benchmarks --limit 5

Shows:

  - Paper details
  - Model/method used
  - Top-1 accuracy
  - Top-5 accuracy
  - Other metrics (FLOPs, parameters, etc.)

JSON Format

scoutml dataset-search "COCO" --include-benchmarks --output json

Returns:

[
  {
    "arxiv_id": "2201.03545",
    "title": "DETReg: Unsupervised Pretraining with...",
    "dataset_usage": "COCO object detection",
    "benchmark_results": {
      "mAP": 45.5,
      "AP50": 64.3,
      "AP75": 49.2
    }
  }
]
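
Given JSON of the shape shown above, results can be ranked by a single metric with jq. This is a minimal sketch, not part of scoutml: `rank_by_metric` is a hypothetical helper, it assumes jq is installed, and it assumes every result carries `title` and `benchmark_results` fields as in the sample. Results that lack the requested metric are skipped.

```shell
# Print "<metric value><TAB><title>" for each result, highest value first.
# Reads the JSON array on stdin; $1 is a key inside benchmark_results (e.g. mAP).
# Assumption: the JSON has the shape shown in the sample output above.
rank_by_metric() {
  jq -r --arg m "$1" '
    map(select(.benchmark_results[$m] != null))
    | sort_by(.benchmark_results[$m]) | reverse
    | .[] | "\(.benchmark_results[$m])\t\(.title)"'
}
```

It could be used as `scoutml dataset-search "COCO" --include-benchmarks --output json | rank_by_metric mAP`.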

Finding Datasets by Task

Classification Datasets

# Image Classification
scoutml dataset-search "ImageNet"
scoutml dataset-search "Places365"
scoutml dataset-search "iNaturalist"

# Fine-grained Classification
scoutml dataset-search "CUB-200"
scoutml dataset-search "Stanford Cars"
scoutml dataset-search "FGVC Aircraft"

Detection Datasets

# General Object Detection
scoutml dataset-search "COCO"
scoutml dataset-search "Objects365"

# Specific Domains
scoutml dataset-search "KITTI"  # Autonomous driving
scoutml dataset-search "WiderFace"  # Face detection

Segmentation Datasets

# Semantic Segmentation
scoutml dataset-search "ADE20K"
scoutml dataset-search "Cityscapes"

# Instance Segmentation
scoutml dataset-search "COCO" --include-benchmarks

Dataset-Specific Insights

Low-Resource Datasets

# Few-shot learning datasets
scoutml dataset-search "miniImageNet"
scoutml dataset-search "Omniglot"

Synthetic Datasets

# Synthetic/rendered data
scoutml dataset-search "SYNTHIA"
scoutml dataset-search "SceneFlow"

Video Datasets

# Action recognition
scoutml dataset-search "Kinetics-400"
scoutml dataset-search "UCF-101"

Best Practices

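  1. Use exact names: "MS-COCO" or "COCO", not "coco dataset"
  2. Check variants: Some datasets have multiple versions (e.g., CIFAR-10 vs. CIFAR-100, ImageNet vs. ImageNet-21K)
  3. Include benchmarks: Use --include-benchmarks to compare performance across papers
  4. Filter by year: Recent papers usually report the strongest results, so set --year-min when looking for SOTA
  5. Export results: Use --export to keep a record for tracking SOTA progression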

Benchmark Tracking

Creating SOTA Tables

# Export benchmark results
scoutml dataset-search "ImageNet" \
  --include-benchmarks \
  --year-min 2020 \
  --output csv \
  --export imagenet_sota.csv

# JSON export for analysis
scoutml dataset-search "GLUE" \
  --include-benchmarks \
  --output json | \
  jq '.[] | {paper: .title, score: .benchmark_results.average}'

Common Issues

Dataset Name Variations

Common alternatives:

  - "MS-COCO" vs "COCO"
  - "ImageNet" vs "ILSVRC"
  - "Pascal VOC" vs "VOC2012"

If a search returns no results, try a different variation of the dataset name.
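
Trying several spellings by hand gets tedious, so the retry can be scripted. The sketch below is illustrative: `first_matching_name` is a hypothetical helper, and it assumes that `--output json` prints an empty array (`[]`) when nothing matches, which this page does not confirm.

```shell
# Try each spelling in turn and print the first one that returns results.
# Usage: first_matching_name "MS-COCO" "COCO"
# Assumption: with --output json, "no results" is printed as an empty array.
first_matching_name() {
  for name in "$@"; do
    out=$(scoutml dataset-search "$name" --output json 2>/dev/null) || continue
    # Strip whitespace so "[ ]" or a trailing newline still counts as empty.
    compact=$(printf '%s' "$out" | tr -d '[:space:]')
    if [ -n "$compact" ] && [ "$compact" != "[]" ]; then
      printf '%s\n' "$name"
      return 0
    fi
  done
  return 1
}
```

The function returns a non-zero status when none of the spellings produce results, so it can be used in an `if` condition.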

Missing Benchmarks

If --include-benchmarks returns no results:

  1. The dataset might not have standardized metrics
  2. Papers might not report comparable numbers
  3. Run the search without --include-benchmarks first to confirm that papers are found at all
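
The fallback to a plain search can be sketched as a small wrapper. This is a hedged illustration rather than a scoutml feature: `search_with_fallback` is a hypothetical name, and the check assumes that an empty result set is printed as `[]` in JSON mode.

```shell
# Search with benchmarks first; if that yields nothing, retry without them.
# Usage: search_with_fallback "Omniglot"
# Assumption: with --output json, "no results" is printed as an empty array.
search_with_fallback() {
  out=$(scoutml dataset-search "$1" --include-benchmarks --output json) || out=""
  compact=$(printf '%s' "$out" | tr -d '[:space:]')
  if [ -z "$compact" ] || [ "$compact" = "[]" ]; then
    echo "No benchmark results for '$1'; retrying without --include-benchmarks" >&2
    out=$(scoutml dataset-search "$1" --output json)
  fi
  printf '%s\n' "$out"
}
```

The notice goes to standard error, so the JSON on standard output can still be piped into other tools.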