Advanced Examples¶

This guide demonstrates advanced workflows and usage patterns for ScoutML, covering both the CLI and the Python library.

Complex Search Workflows¶

Multi-Stage Filtering¶

Find the most promising papers through progressive filtering:

# Stage 1: Broad search
scoutml search "transformer efficiency" \
  --year-min 2021 \
  --limit 100 \
  --output json > candidates.json

# Stage 2: Filter by citations
cat candidates.json | \
  jq '.[] | select(.citations > 50)' > high_impact.json

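# Stage 3: Check reproducibility of each high-impact paper
# ({} must appear in the command, otherwise xargs never passes the paper ID)
cat high_impact.json | \
  jq -r '.arxiv_id' | \
  xargs -I {} scoutml agent critique {} --aspects reproducibility --output json \
  > reproducibility.json
# Filter reproducibility.json with jq once you have confirmed its field names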

# Stage 4: Get implementation guides
cat high_impact.json | \
  jq -r '.arxiv_id' | head -3 | \
  xargs -I {} scoutml agent implement {} --export "impl_{}.md"
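The citation filter in Stage 2 can also be applied in Python once the candidates are exported. The following is a minimal sketch, not part of ScoutML: `filter_candidates` is a hypothetical helper, the `citations` and `arxiv_id` field names are taken from the jq filters above, and the inline sample stands in for `candidates.json` with made-up values.

```python
import json


def filter_candidates(papers, min_citations=50, top_n=3):
    """Keep papers above a citation threshold and return the top IDs."""
    def citations(paper):
        # Treat a missing or null citation count as zero
        return paper.get("citations") or 0

    high_impact = [p for p in papers if citations(p) > min_citations]
    high_impact.sort(key=citations, reverse=True)
    return [p["arxiv_id"] for p in high_impact[:top_n]]


# Inline sample standing in for candidates.json (illustrative values only)
sample = json.loads("""[
  {"arxiv_id": "0000.00001", "citations": 120},
  {"arxiv_id": "0000.00002", "citations": 12},
  {"arxiv_id": "0000.00003", "citations": 75}
]""")

print(filter_candidates(sample))
```

Sorting before truncating means the first `top_n` IDs are the most cited ones, which the `head -3` in Stage 4 does not guarantee.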

Research Genealogy Mapping¶

Trace the evolution of ideas:

#!/bin/bash
# research_genealogy.sh

explore_paper() {
    local paper_id=$1
    local depth=$2

    if [ $depth -eq 0 ]; then
        return
    fi

    echo "Exploring: $paper_id (depth: $depth)"

    # Get similar papers
    scoutml similar --paper-id "$paper_id" \
      --threshold 0.8 \
      --limit 3 \
      --output json | \
    jq -r '.[] | .arxiv_id' | \
    while read -r similar_id; do
        explore_paper "$similar_id" $((depth - 1))
    done
}

# Start from seminal paper
explore_paper "1706.03762" 3  # Attention is All You Need
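The recursive shell function above can visit the same paper more than once when two papers list each other as similar. A possible Python variant tracks visited IDs during a breadth-first walk. This is a sketch under assumptions: `explore_similar` and its `find_similar` callback are hypothetical names, and the toy graph replaces real ScoutML lookups.

```python
from collections import deque


def explore_similar(root_id, find_similar, max_depth=3):
    """Breadth-first walk over similar papers, visiting each ID once.

    find_similar is a caller-supplied function: paper ID -> list of IDs.
    Returns (paper_id, depth) pairs in visit order.
    """
    visited = {root_id}
    queue = deque([(root_id, 0)])
    order = []
    while queue:
        paper_id, depth = queue.popleft()
        order.append((paper_id, depth))
        if depth == max_depth:
            continue  # do not expand papers at the depth limit
        for similar_id in find_similar(paper_id):
            if similar_id not in visited:
                visited.add(similar_id)
                queue.append((similar_id, depth + 1))
    return order


# Toy similarity graph with a cycle (A -> B -> A), for illustration only
toy = {"A": ["B", "C"], "B": ["A", "D"], "C": [], "D": []}
for paper_id, depth in explore_similar("A", lambda pid: toy.get(pid, []), max_depth=2):
    print(f"{'  ' * depth}{paper_id}")
```

In practice the callback could wrap `scoutml.find_similar_papers(paper_id=..., limit=3, threshold=0.8)`, as used in the citation network example later in this guide, and return the `arxiv_id` of each result.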

Research Pipeline Automation¶

Complete Literature Review Pipeline¶

#!/bin/bash
# literature_review_pipeline.sh

TOPIC="$1"
OUTPUT_DIR="review_${TOPIC// /_}"

mkdir -p "$OUTPUT_DIR"

# 1. Generate initial review
echo "Generating literature review..."
scoutml review "$TOPIC" \
  --year-min 2020 \
  --limit 100 \
  --output markdown \
  --export "$OUTPUT_DIR/review.md"

# 2. Find key papers
echo "Identifying key papers..."
scoutml search "$TOPIC" \
  --year-min 2020 \
  --sota-only \
  --limit 20 \
  --output json > "$OUTPUT_DIR/key_papers.json"

# 3. Compare top approaches
echo "Comparing approaches..."
cat "$OUTPUT_DIR/key_papers.json" | \
  jq -r '.[:5] | .[].arxiv_id' | \
  xargs scoutml compare \
  --output markdown \
  --export "$OUTPUT_DIR/comparison.md"

# 4. Analyze reproducibility
echo "Checking reproducibility..."
cat "$OUTPUT_DIR/key_papers.json" | \
  jq -r '.[].arxiv_id' | \
  xargs -I {} sh -c 'scoutml agent critique {} --aspects reproducibility --output json' > \
  "$OUTPUT_DIR/reproducibility.json"

# 5. Generate implementation guides
echo "Creating implementation guides..."
mkdir -p "$OUTPUT_DIR/implementations"
cat "$OUTPUT_DIR/key_papers.json" | \
  jq -r '.[:3] | .[].arxiv_id' | \
  while read -r paper_id; do
    scoutml agent implement "$paper_id" \
      --framework pytorch \
      --export "$OUTPUT_DIR/implementations/$paper_id.md"
  done

echo "Review pipeline complete! Results in $OUTPUT_DIR/"

Paper Analysis Dashboard¶

Create a comprehensive analysis dashboard:

#!/bin/bash
# paper_dashboard.sh

PAPER_ID="$1"
OUTPUT="dashboard_${PAPER_ID}.md"

cat > "$OUTPUT" << EOF
# Paper Analysis Dashboard: $PAPER_ID

Generated: $(date)

EOF

# Basic information
echo "## Paper Details" >> "$OUTPUT"
scoutml paper "$PAPER_ID" --output json | \
  jq -r '"Title: \(.title)\nAuthors: \(.authors | join(", "))\nYear: \(.year)\nCitations: \(.citations)"' >> "$OUTPUT"

# Critique
echo -e "\n## Research Critique" >> "$OUTPUT"
scoutml agent critique "$PAPER_ID" --output json | \
  jq -r '.critique.overall_assessment | to_entries | .[] | "- \(.key): \(.value)"' >> "$OUTPUT"

# Similar papers
echo -e "\n## Related Work" >> "$OUTPUT"
scoutml similar --paper-id "$PAPER_ID" --limit 5 --output json | \
  jq -r '.[] | "- [\(.title)](\(.arxiv_url)) (similarity: \(.similarity))"' >> "$OUTPUT"

# Implementation feasibility
echo -e "\n## Implementation Guide" >> "$OUTPUT"
echo '```bash' >> "$OUTPUT"
echo "scoutml agent implement $PAPER_ID --framework pytorch" >> "$OUTPUT"
echo '```' >> "$OUTPUT"

# Limitations and solutions
echo -e "\n## Addressing Limitations" >> "$OUTPUT"
scoutml agent solve-limitations "$PAPER_ID" \
  --tradeoffs speed \
  --tradeoffs memory \
  --output json | \
  jq -r '.solutions[:3] | .[] | "### \(.name)\n\(.description)\n"' >> "$OUTPUT"

echo "Dashboard generated: $OUTPUT"

Data Processing Pipelines¶

Export to BibTeX¶

Convert search results to BibTeX format:

#!/bin/bash
# to_bibtex.sh

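# Double braces around the title preserve its capitalization. The other
# fields use single braces; a double-braced author list would be read by
# BibTeX as one single name instead of being split on "and".
scoutml search "$1" --limit 20 --output json | \
jq -r '.[] |
"@article{\(.arxiv_id),
  title={{\(.title)}},
  author={\(.authors | join(" and "))},
  year={\(.year)},
  journal={arXiv preprint arXiv:\(.arxiv_id)},
  url={https://arxiv.org/abs/\(.arxiv_id)}
}\n"' > references.bib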
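The same conversion can be written in Python, where the brace handling is easier to see. This is a minimal sketch: `to_bibtex` is a hypothetical helper, the field names follow the jq template above, and the sample record lists only two of the paper's authors for brevity.

```python
def to_bibtex(paper):
    """Format one search result as a BibTeX @article entry."""
    arxiv_id = paper["arxiv_id"]
    # BibTeX splits the author field on " and ", so it gets one brace pair
    authors = " and ".join(paper["authors"])
    fields = [
        # The extra inner braces keep the title's capitalization
        ("title", "{" + paper["title"] + "}"),
        ("author", authors),
        ("year", str(paper["year"])),
        ("journal", f"arXiv preprint arXiv:{arxiv_id}"),
        ("url", f"https://arxiv.org/abs/{arxiv_id}"),
    ]
    body = ",\n".join(f"  {name}={{{value}}}" for name, value in fields)
    return f"@article{{{arxiv_id},\n{body}\n}}\n"


# Sample record (author list truncated for illustration)
example = {
    "arxiv_id": "1706.03762",
    "title": "Attention Is All You Need",
    "authors": ["Ashish Vaswani", "Noam Shazeer"],
    "year": 2017,
}
print(to_bibtex(example))
```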

Create Reading List¶

Generate organized reading lists:

#!/bin/bash
# reading_list.sh

TOPIC="$1"
OUTPUT="reading_list_${TOPIC// /_}.md"

cat > "$OUTPUT" << EOF
# Reading List: $TOPIC

## Foundational Papers (High Impact)
EOF

scoutml search "$TOPIC" \
  --min-citations 500 \
  --limit 5 \
  --output json | \
jq -r '.[] | "- [\(.title)](https://arxiv.org/abs/\(.arxiv_id)) - \(.citations) citations"' >> "$OUTPUT"

cat >> "$OUTPUT" << EOF

## Recent Advances (Last 2 Years)
EOF

scoutml search "$TOPIC" \
  --year-min "$(( $(date +%Y) - 2 ))" \
  --sota-only \
  --limit 10 \
  --output json | \
jq -r '.[] | "- [\(.title)](https://arxiv.org/abs/\(.arxiv_id)) (\(.year)) - \(.citations) citations"' >> "$OUTPUT"

cat >> "$OUTPUT" << EOF

## Highly Reproducible
EOF

scoutml insights reproducibility \
  --domain "$TOPIC" \
  --limit 5 \
  --output json | \
jq -r '.[] | "- [\(.title)](https://arxiv.org/abs/\(.arxiv_id)) - Score: \(.reproducibility_score)"' >> "$OUTPUT"

Research Trend Analysis¶

Track Method Evolution¶

#!/bin/bash
# method_evolution.sh

METHOD="$1"
OUTPUT_DIR="evolution_${METHOD// /_}"
mkdir -p "$OUTPUT_DIR"

# Track over years
for year in {2018..2023}; do
    echo "=== Year $year ===" >> "$OUTPUT_DIR/timeline.txt"

    scoutml method-search "$METHOD" \
      --year-min $year \
      --year-max $year \
      --sort-by citations \
      --limit 3 \
      --output json > "$OUTPUT_DIR/year_$year.json"

    # Extract key innovations
    cat "$OUTPUT_DIR/year_$year.json" | \
      jq -r '.[] | "- \(.title): \(.method_usage)"' >> "$OUTPUT_DIR/timeline.txt"
done

# Export the citation trend as CSV
echo "Year,Total_Citations,Avg_Citations" > "$OUTPUT_DIR/trend.csv"
for year in {2018..2023}; do
    # Guard against empty result sets: add / length fails on an empty array,
    # which would leave $stats empty and the CSV row without values
    stats=$(jq '[.[] | .citations // 0] |
      if length > 0 then {total: add, avg: (add / length)}
      else {total: 0, avg: 0} end' "$OUTPUT_DIR/year_$year.json")

    total=$(echo "$stats" | jq '.total')
    avg=$(echo "$stats" | jq '.avg')
    echo "$year,$total,$avg" >> "$OUTPUT_DIR/trend.csv"
done
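The per-year aggregation can also be done in Python after loading the yearly JSON files. The sketch below assumes each paper record carries a `citations` field, as in the jq filter above; `citation_stats` and `trend_rows` are hypothetical helper names, and the sample data is made up.

```python
def citation_stats(papers):
    """Return (total, average) citations; an empty list yields zeros."""
    counts = [p.get("citations") or 0 for p in papers]
    total = sum(counts)
    average = total / len(counts) if counts else 0.0
    return total, average


def trend_rows(papers_by_year):
    """Build CSV lines with one row per year, in ascending year order."""
    rows = ["Year,Total_Citations,Avg_Citations"]
    for year in sorted(papers_by_year):
        total, average = citation_stats(papers_by_year[year])
        rows.append(f"{year},{total},{average:.1f}")
    return rows


# Illustrative data: 2019 has no results and must not break the average
sample_years = {
    2018: [{"citations": 10}, {"citations": 30}],
    2019: [],
}
print("\n".join(trend_rows(sample_years)))
```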

Domain Comparison Matrix¶

#!/bin/bash
# domain_comparison.sh

DOMAINS=("computer vision" "nlp" "reinforcement learning")
METRICS=("avg_citations" "reproducibility" "compute_requirements")

echo "Domain Comparison Matrix" > comparison_matrix.txt
echo "=======================" >> comparison_matrix.txt

for domain in "${DOMAINS[@]}"; do
    echo -e "\n## $domain" >> comparison_matrix.txt

    # Get top papers
    top_papers=$(scoutml search "$domain" \
      --year-min 2022 \
      --limit 20 \
      --output json)

    # Calculate metrics (an empty result set yields 0 instead of a jq error)
    avg_citations=$(echo "$top_papers" | \
      jq '[.[] | .citations] | if length > 0 then add / length else 0 end')

    # Get reproducibility
    avg_reproducibility=$(scoutml insights reproducibility \
      --domain "$domain" \
      --limit 20 \
      --output json | \
      jq '[.[] | .reproducibility_score] | if length > 0 then add / length else 0 end')

    echo "- Average Citations: $avg_citations" >> comparison_matrix.txt
    echo "- Average Reproducibility: $avg_reproducibility" >> comparison_matrix.txt
done

Python Library Examples¶

Research Dashboard Application¶

import scoutml
import streamlit as st
import pandas as pd
import plotly.express as px

st.title("ScoutML Research Dashboard")

# Search interface
query = st.text_input("Search Query", "transformer models")
col1, col2 = st.columns(2)
with col1:
    year_min = st.number_input("Min Year", 2018, 2024, 2020)
with col2:
    limit = st.slider("Number of Results", 10, 100, 20)

# st.button returns True only on the run right after the click, so the
# results are kept in session state to survive the rerun that the
# selectbox below triggers.
if st.button("Search"):
    with st.spinner("Searching..."):
        results = scoutml.search(query, limit=limit, year_min=year_min)
    st.session_state["papers"] = results['papers']

papers = st.session_state.get("papers")
if papers:
    # Convert to DataFrame
    df = pd.DataFrame(papers)

    # Display metrics
    st.metric("Total Papers", len(df))
    st.metric("Average Citations", f"{df['citations'].mean():.0f}")

    # Visualization
    fig = px.scatter(df, x='year', y='citations',
                     hover_data=['title'],
                     title="Papers by Year and Citations")
    st.plotly_chart(fig)

    # Results table
    st.dataframe(df[['title', 'year', 'citations', 'arxiv_id']])

    # Paper details (a single selectbox; two identical widgets would raise
    # a duplicate-widget error)
    paper_id = st.selectbox("Select a paper for details", df['arxiv_id'].tolist())
    if paper_id:
        details = scoutml.get_paper(paper_id)
        st.write(f"**Abstract:** {details['paper']['abstract']}")

Automated Research Assistant¶

import scoutml
import schedule
import time
from datetime import datetime
import smtplib
from email.mime.text import MIMEText

class ResearchAssistant:
    def __init__(self, topics, email):
        self.topics = topics
        self.email = email
        self.seen_papers = set()

    def check_new_papers(self):
        """Check for new papers in topics of interest."""
        new_papers = []

        for topic in self.topics:
            results = scoutml.search(
                topic, 
                limit=10, 
                year_min=datetime.now().year,
                min_citations=0
            )

            for paper in results['papers']:
                if paper['arxiv_id'] not in self.seen_papers:
                    self.seen_papers.add(paper['arxiv_id'])
                    new_papers.append({
                        'topic': topic,
                        'paper': paper
                    })

        if new_papers:
            self.send_notification(new_papers)

    def send_notification(self, papers):
        """Send email notification about new papers."""
        body = "New papers found:\n\n"

        for item in papers:
            paper = item['paper']
            body += f"Topic: {item['topic']}\n"
            body += f"Title: {paper['title']}\n"
            body += f"Authors: {', '.join(paper['authors'][:3])}...\n"
            body += f"Link: https://arxiv.org/abs/{paper['arxiv_id']}\n\n"

        msg = MIMEText(body)
        msg['Subject'] = f"ScoutML: {len(papers)} new papers found"
        msg['From'] = 'assistant@example.com'
        msg['To'] = self.email

        # Send email (configure SMTP settings)
        # smtp.send_message(msg)
        print(f"Would send email:\n{body}")

# Usage
assistant = ResearchAssistant(
    topics=["federated learning", "vision transformer", "diffusion models"],
    email="researcher@example.com"
)

# Schedule daily checks
schedule.every().day.at("09:00").do(assistant.check_new_papers)

# Run scheduler
while True:
    schedule.run_pending()
    time.sleep(60)

ML Pipeline Integration¶

import scoutml
import mlflow
import mlflow.sklearn  # explicit import for mlflow.sklearn.log_model
from sklearn.metrics import accuracy_score

class ResearchInformedML:
    """Use ScoutML to inform ML experiments."""

    def __init__(self, research_topic):
        self.topic = research_topic
        self.papers = []
        self.techniques = []

    def research_phase(self):
        """Research state-of-the-art techniques."""
        # Find top papers
        results = scoutml.search(
            self.topic,
            limit=20,
            min_citations=50,
            year_min=2020,
            sota_only=True
        )
        self.papers = results['papers']

        # Analyze top techniques
        for paper in self.papers[:5]:
            critique = scoutml.critique_paper(paper['arxiv_id'])
            guide = scoutml.get_implementation_guide(paper['arxiv_id'])

            self.techniques.append({
                'paper_id': paper['arxiv_id'],
                'title': paper['title'],
                'key_ideas': critique['critique']['strengths'],
                'implementation': guide['implementation']['key_components']
            })

    def experiment_phase(self, X_train, y_train, X_test, y_test):
        """Run experiments based on research findings."""
        mlflow.set_experiment(f"scoutml_{self.topic}")

        for technique in self.techniques:
            with mlflow.start_run(run_name=technique['title'][:50]):
                # Log paper information
                mlflow.log_param("paper_id", technique['paper_id'])
                mlflow.log_param("paper_title", technique['title'])

                # Implement technique (simplified)
                model = self.implement_technique(technique)
                model.fit(X_train, y_train)

                # Evaluate
                predictions = model.predict(X_test)
                accuracy = accuracy_score(y_test, predictions)

                mlflow.log_metric("accuracy", accuracy)
                mlflow.sklearn.log_model(model, "model")

                print(f"Technique: {technique['title'][:50]}")
                print(f"Accuracy: {accuracy:.4f}\n")

    def implement_technique(self, technique):
        """Implement technique based on research."""
        # This would implement the actual technique
        # based on the implementation guide
        from sklearn.ensemble import RandomForestClassifier
        return RandomForestClassifier(n_estimators=100)

# Usage
researcher = ResearchInformedML("tabular classification sota")
researcher.research_phase()
# researcher.experiment_phase(X_train, y_train, X_test, y_test)

Citation Network Analysis¶

import scoutml
import networkx as nx
import matplotlib.pyplot as plt
from collections import defaultdict

def build_citation_network(root_paper_id, depth=2):
    """Build a citation network starting from a paper."""
    G = nx.DiGraph()
    visited = set()

    def explore(paper_id, current_depth):
        if current_depth > depth or paper_id in visited:
            return

        visited.add(paper_id)

        # Get paper details
        try:
            paper = scoutml.get_paper(paper_id)
            G.add_node(paper_id, 
                      title=paper['paper']['title'],
                      year=paper['paper']['year'],
                      citations=paper['paper']['citations'])

            # Find similar papers (as proxy for citations)
            similar = scoutml.find_similar_papers(
                paper_id=paper_id, 
                limit=5, 
                threshold=0.8
            )

            for sim_paper in similar['papers']:
                sim_id = sim_paper['arxiv_id']
                G.add_edge(paper_id, sim_id, weight=sim_paper['similarity'])
                explore(sim_id, current_depth + 1)

        except Exception as e:
            print(f"Error processing {paper_id}: {e}")

    explore(root_paper_id, 0)
    return G

# Build network
G = build_citation_network("1706.03762", depth=2)  # Attention is All You Need

# Analyze network
print(f"Nodes: {G.number_of_nodes()}")
print(f"Edges: {G.number_of_edges()}")

# Find most influential papers
pagerank = nx.pagerank(G)
top_papers = sorted(pagerank.items(), key=lambda x: x[1], reverse=True)[:5]

print("\nMost influential papers in network:")
for paper_id, score in top_papers:
    if paper_id in G.nodes:
        title = G.nodes[paper_id].get('title', 'Unknown')[:50]
        print(f"{paper_id}: {title}... (score: {score:.4f})")

# Visualize
plt.figure(figsize=(12, 8))
pos = nx.spring_layout(G, k=2, iterations=50)
nx.draw(G, pos, 
        node_size=1000,
        node_color='lightblue',
        with_labels=True,
        font_size=8,
        edge_color='gray',
        arrows=True)
plt.title("Paper Citation Network")
plt.tight_layout()
plt.show()

Integration Examples¶

Slack Notification Bot¶

#!/usr/bin/env python3
# scout_bot.py

import json
import subprocess
import requests
from datetime import datetime

SLACK_WEBHOOK = "YOUR_WEBHOOK_URL"
TOPICS = ["federated learning", "vision transformer", "diffusion models"]

def get_new_papers(topic, min_citations=10):
    cmd = [
        "scoutml", "search", topic,
        "--year-min", str(datetime.now().year),
        "--min-citations", str(min_citations),
        "--limit", "5",
        "--output", "json"
    ]

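    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Skip this topic rather than crash on empty or non-JSON output
        print(f"scoutml search failed for {topic!r}: {result.stderr.strip()}")
        return []
    return json.loads(result.stdout)

def send_to_slack(message):
    # A timeout keeps the script from hanging if Slack is unreachable
    response = requests.post(SLACK_WEBHOOK, json={"text": message}, timeout=10)
    response.raise_for_status()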

# Check each topic
for topic in TOPICS:
    papers = get_new_papers(topic)
    if papers:
        message = f"*New papers in {topic}:*\n"
        for paper in papers[:3]:
            message += f"• <https://arxiv.org/abs/{paper['arxiv_id']}|{paper['title']}> - {paper['citations']} citations\n"
        send_to_slack(message)

Research Portfolio Analysis¶

#!/bin/bash
# portfolio_analysis.sh

# Your paper IDs
MY_PAPERS=(
    "2301.12345"
    "2302.23456"
    "2303.34567"
)

echo "# Research Portfolio Analysis"
echo "Generated: $(date)"
echo

# Citation analysis
total_citations=0
for paper in "${MY_PAPERS[@]}"; do
    citations=$(scoutml paper "$paper" --output json | jq '.citations // 0')
    total_citations=$((total_citations + citations))
    echo "- $paper: $citations citations"
done
echo "Total citations: $total_citations"
echo

# Find similar work
echo "## Related Research"
for paper in "${MY_PAPERS[@]}"; do
    echo "### Similar to $paper:"
    scoutml similar --paper-id "$paper" --limit 3 --output json | \
      jq -r '.[] | "- \(.title) (similarity: \(.similarity))"'
done

# Identify gaps
echo "## Research Opportunities"
for paper in "${MY_PAPERS[@]}"; do
    echo "### Extensions for $paper:"
    scoutml agent solve-limitations "$paper" --output json | \
      jq -r '.limitations[:2] | .[] | "- \(.description)"'
done

Best Practices¶

Performance Optimization¶

Batch Operations: Process multiple items in a single command
JSON Processing: Use jq for efficient data manipulation
Caching: Save intermediate results to avoid repeated API calls
Parallel Processing: Use xargs -P for parallel execution
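The caching and parallel-processing tips above can be combined in Python. The following sketch is not ScoutML functionality: `cached_fetch` and `fetch_many` are hypothetical helpers, the `fetch` callback stands in for whatever retrieves a paper (for example `scoutml.get_paper`), and the fake fetcher exists only to show that the second pass is served from disk.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def cached_fetch(paper_id, fetch, cache_dir):
    """Return cached JSON for paper_id, calling fetch only on a miss."""
    cache_file = Path(cache_dir) / f"{paper_id}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    data = fetch(paper_id)
    cache_file.write_text(json.dumps(data))
    return data


def fetch_many(paper_ids, fetch, cache_dir, workers=4):
    """Fetch several papers concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda pid: cached_fetch(pid, fetch, cache_dir), paper_ids))


# Fake fetcher that records its calls (stands in for a real API request)
calls = []

def fake_fetch(paper_id):
    calls.append(paper_id)
    return {"arxiv_id": paper_id}


cache_dir = tempfile.mkdtemp()
first = fetch_many(["a", "b", "c"], fake_fetch, cache_dir)
second = fetch_many(["a", "b", "c"], fake_fetch, cache_dir)
print(f"fetch calls: {len(calls)}, identical results: {first == second}")
```

Cache files are named after the paper ID, so IDs containing a slash would need to be sanitized before use as file names.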

Error Handling¶

# Robust error handling example
process_paper() {
    local paper_id=$1
    local max_retries=3
    local retry_count=0

    while [ $retry_count -lt $max_retries ]; do
        if scoutml paper "$paper_id" --output json > "/tmp/${paper_id}.json" 2>/dev/null; then
            return 0
        fi

        retry_count=$((retry_count + 1))
        echo "Retry $retry_count for $paper_id..." >&2
        sleep 2
    done

    echo "Failed to process $paper_id after $max_retries attempts" >&2
    return 1
}
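For the Python library, a comparable retry wrapper might look like the sketch below. Note that it swaps the fixed two-second sleep of the shell version for exponential backoff. `with_retries` is a hypothetical helper, and the `sleep` parameter is injectable so the demonstration records delays instead of waiting.

```python
import time


def with_retries(func, max_retries=3, base_delay=2.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt, then retry.

    The last failure is re-raised once max_retries attempts are used up.
    """
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as exc:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * 2 ** attempt
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)


# Demonstration with a function that fails twice before succeeding
attempts = []

def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("temporary failure")
    return "ok"


delays = []  # recorded instead of actually sleeping
print(with_retries(flaky, sleep=delays.append))
print(delays)
```

A real call could be wrapped as `with_retries(lambda: scoutml.get_paper(paper_id))`; catching a narrower exception type than `Exception` is preferable once the library's error classes are known.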

Data Validation¶

# Validate search results
validate_results() {
    local json_file=$1

    # Check if file exists and is valid JSON
    if [ ! -f "$json_file" ]; then
        echo "Error: File not found" >&2
        return 1
    fi

    if ! jq empty "$json_file" 2>/dev/null; then
        echo "Error: Invalid JSON" >&2
        return 1
    fi

    # Check if results exist
    local count=$(jq 'length' "$json_file")
    if [ "$count" -eq 0 ]; then
        echo "Warning: No results found" >&2
        return 2
    fi

    echo "Valid results: $count items"
    return 0
}
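A Python counterpart to the validation function could additionally check that each record has the fields later steps rely on. This is a sketch under assumptions: `validate_results` is a hypothetical helper, and the default required fields (`arxiv_id`, `title`) are taken from the jq filters used throughout this guide.

```python
import json


def validate_results(text, required=("arxiv_id", "title")):
    """Parse a JSON search export; return (valid_papers, problems)."""
    try:
        papers = json.loads(text)
    except json.JSONDecodeError as exc:
        return [], [f"invalid JSON: {exc.msg}"]
    if not isinstance(papers, list):
        return [], ["expected a JSON array of papers"]

    valid, problems = [], []
    for index, paper in enumerate(papers):
        if not isinstance(paper, dict):
            problems.append(f"item {index}: not an object")
            continue
        missing = [key for key in required if key not in paper]
        if missing:
            problems.append(f"item {index}: missing {', '.join(missing)}")
        else:
            valid.append(paper)
    if not papers:
        problems.append("no results found")
    return valid, problems


valid, problems = validate_results('[{"arxiv_id": "x", "title": "t"}, {"title": "u"}]')
print(len(valid), problems)
```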

Next Steps¶

Explore best practices for optimal usage
Check output formats for data processing
Review individual command documentation
Join our community for more examples and workflows