Advanced Examples¶
This guide walks through advanced ScoutML workflows and usage patterns, covering both the CLI and the Python library.
Complex Search Workflows¶
Multi-Stage Filtering¶
Find the most promising papers through progressive filtering:
# Stage 1: Broad search
scoutml search "transformer efficiency" \
--year-min 2021 \
--limit 100 \
--output json > candidates.json
# Stage 2: Filter by citations
cat candidates.json | \
jq '.[] | select(.citations > 50)' > high_impact.json
# Stage 3: Check reproducibility
cat high_impact.json | \
jq -r '.arxiv_id' | \
xargs -I {} scoutml insights reproducibility --output json | \
jq '.[] | select(.reproducibility_score > 80)'
# Stage 4: Get implementation guides
cat high_impact.json | \
jq -r '.arxiv_id' | head -3 | \
xargs -I {} scoutml agent implement {} --export "impl_{}.md"
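The Stage 2 citation filter can also be applied in Python once the search results are loaded. The sketch below assumes each record carries the `arxiv_id` and `citations` fields used by the jq filters above; the sample records and the `filter_by_citations` helper are hypothetical and are not part of ScoutML.

```python
import json


def filter_by_citations(papers, min_citations=50):
    """Keep papers with strictly more than min_citations citations."""
    # Missing or null citation counts are treated as zero.
    return [p for p in papers if (p.get("citations") or 0) > min_citations]


# Hypothetical records standing in for the contents of candidates.json
sample = json.loads("""[
  {"arxiv_id": "0000.00001", "citations": 120},
  {"arxiv_id": "0000.00002", "citations": 12},
  {"arxiv_id": "0000.00003", "citations": null}
]""")

high_impact = filter_by_citations(sample)
print([p["arxiv_id"] for p in high_impact])
```

Treating a missing citation count as zero keeps the filter from failing on incomplete records, which is a design choice of this sketch rather than documented ScoutML behavior.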
Research Genealogy Mapping¶
Trace the evolution of ideas:
#!/bin/bash
# research_genealogy.sh
explore_paper() {
local paper_id=$1
local depth=$2
if [ "$depth" -eq 0 ]; then
return
fi
echo "Exploring: $paper_id (depth: $depth)"
# Get similar papers
scoutml similar --paper-id "$paper_id" \
--threshold 0.8 \
--limit 3 \
--output json | \
jq -r '.[] | .arxiv_id' | \
while read -r similar_id; do
explore_paper "$similar_id" $((depth - 1))
done
}
# Start from seminal paper
explore_paper "1706.03762" 3 # Attention is All You Need
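The recursive script above can revisit the same paper when two papers list each other as similar. One way to avoid that is a breadth-first walk with a visited set, sketched here in Python. The `explore` function, the `get_similar` callback, and the toy graph are all hypothetical; in practice `get_similar` would wrap whatever returns similar paper IDs. Note that `max_depth` here counts hops from the root, which differs slightly from the countdown used in the shell script.

```python
from collections import deque


def explore(root_id, get_similar, max_depth=3):
    """Breadth-first walk that visits each paper at most once."""
    visited = {root_id}
    order = [(root_id, 0)]
    queue = deque([(root_id, 0)])
    while queue:
        paper_id, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand papers at the depth limit
        for neighbor in get_similar(paper_id):
            if neighbor not in visited:
                visited.add(neighbor)
                order.append((neighbor, depth + 1))
                queue.append((neighbor, depth + 1))
    return order


# Hypothetical similarity graph used in place of real similarity results
toy_graph = {"A": ["B", "C"], "B": ["A", "C"], "C": ["D"], "D": ["A"]}
print(explore("A", lambda pid: toy_graph.get(pid, []), max_depth=2))
```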
Research Pipeline Automation¶
Complete Literature Review Pipeline¶
#!/bin/bash
# literature_review_pipeline.sh
TOPIC="$1"
OUTPUT_DIR="review_${TOPIC// /_}"
mkdir -p "$OUTPUT_DIR"
# 1. Generate initial review
echo "Generating literature review..."
scoutml review "$TOPIC" \
--year-min 2020 \
--limit 100 \
--output markdown \
--export "$OUTPUT_DIR/review.md"
# 2. Find key papers
echo "Identifying key papers..."
scoutml search "$TOPIC" \
--year-min 2020 \
--sota-only \
--limit 20 \
--output json > "$OUTPUT_DIR/key_papers.json"
# 3. Compare top approaches
echo "Comparing approaches..."
cat "$OUTPUT_DIR/key_papers.json" | \
jq -r '.[:5] | .[].arxiv_id' | \
xargs scoutml compare \
--output markdown \
--export "$OUTPUT_DIR/comparison.md"
# 4. Analyze reproducibility
echo "Checking reproducibility..."
cat "$OUTPUT_DIR/key_papers.json" | \
jq -r '.[].arxiv_id' | \
xargs -I {} sh -c 'scoutml agent critique {} --aspects reproducibility --output json' > \
"$OUTPUT_DIR/reproducibility.json"
# 5. Generate implementation guides
echo "Creating implementation guides..."
mkdir -p "$OUTPUT_DIR/implementations"
cat "$OUTPUT_DIR/key_papers.json" | \
jq -r '.[:3] | .[].arxiv_id' | \
while read -r paper_id; do
scoutml agent implement "$paper_id" \
--framework pytorch \
--export "$OUTPUT_DIR/implementations/$paper_id.md"
done
echo "Review pipeline complete! Results in $OUTPUT_DIR/"
Paper Analysis Dashboard¶
Create a comprehensive analysis dashboard:
#!/bin/bash
# paper_dashboard.sh
PAPER_ID="$1"
OUTPUT="dashboard_${PAPER_ID}.md"
cat > "$OUTPUT" << EOF
# Paper Analysis Dashboard: $PAPER_ID
Generated: $(date)
EOF
# Basic information
echo "## Paper Details" >> "$OUTPUT"
scoutml paper "$PAPER_ID" --output json | \
jq -r '"Title: \(.title)\nAuthors: \(.authors | join(", "))\nYear: \(.year)\nCitations: \(.citations)"' >> "$OUTPUT"
# Critique
echo -e "\n## Research Critique" >> "$OUTPUT"
scoutml agent critique "$PAPER_ID" --output json | \
jq -r '.critique.overall_assessment | to_entries | .[] | "- \(.key): \(.value)"' >> "$OUTPUT"
# Similar papers
echo -e "\n## Related Work" >> "$OUTPUT"
scoutml similar --paper-id "$PAPER_ID" --limit 5 --output json | \
jq -r '.[] | "- [\(.title)](\(.arxiv_url)) (similarity: \(.similarity))"' >> "$OUTPUT"
# Implementation feasibility
echo -e "\n## Implementation Guide" >> "$OUTPUT"
echo '```bash' >> "$OUTPUT"
echo "scoutml agent implement $PAPER_ID --framework pytorch" >> "$OUTPUT"
echo '```' >> "$OUTPUT"
# Limitations and solutions
echo -e "\n## Addressing Limitations" >> "$OUTPUT"
scoutml agent solve-limitations "$PAPER_ID" \
--tradeoffs speed \
--tradeoffs memory \
--output json | \
jq -r '.solutions[:3] | .[] | "### \(.name)\n\(.description)\n"' >> "$OUTPUT"
echo "Dashboard generated: $OUTPUT"
Data Processing Pipelines¶
Export to BibTeX¶
Convert search results to BibTeX format:
#!/bin/bash
# to_bibtex.sh
scoutml search "$1" --limit 20 --output json | \
jq -r '.[] |
"@article{\(.arxiv_id),
title={{\(.title)}},
author={\(.authors | join(" and "))},
year={{\(.year)}},
journal={{arXiv preprint arXiv:\(.arxiv_id)}},
url={{https://arxiv.org/abs/\(.arxiv_id)}}
}\n"' > references.bib
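The same conversion can be sketched in Python, which makes the brace handling easier to see: the title gets an extra pair of braces to preserve capitalization, while the author list uses a single pair so that BibTeX still splits names on "and". The `to_bibtex` helper and the sample record are hypothetical; the field names mirror those used in the jq template above.

```python
def to_bibtex(paper):
    """Format one paper record as a BibTeX @article entry."""
    arxiv_id = paper["arxiv_id"]
    fields = [
        ("title", "{" + paper["title"] + "}"),  # extra braces keep capitalization
        ("author", " and ".join(paper["authors"])),  # single braces: names stay separate
        ("year", str(paper["year"])),
        ("journal", "arXiv preprint arXiv:" + arxiv_id),
        ("url", "https://arxiv.org/abs/" + arxiv_id),
    ]
    body = ",\n".join("  %s={%s}" % (key, value) for key, value in fields)
    return "@article{%s,\n%s\n}\n" % (arxiv_id, body)


# Hypothetical record for illustration only
sample = {
    "arxiv_id": "0000.00001",
    "title": "A Sample Title",
    "authors": ["A. Author", "B. Writer"],
    "year": 2023,
}
entry = to_bibtex(sample)
print(entry)
```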
Create Reading List¶
Generate organized reading lists:
#!/bin/bash
# reading_list.sh
TOPIC="$1"
OUTPUT="reading_list_${TOPIC// /_}.md"
cat > "$OUTPUT" << EOF
# Reading List: $TOPIC
## Foundational Papers (High Impact)
EOF
scoutml search "$TOPIC" \
--min-citations 500 \
--limit 5 \
--output json | \
jq -r '.[] | "- [\(.title)](https://arxiv.org/abs/\(.arxiv_id)) - \(.citations) citations"' >> "$OUTPUT"
cat >> "$OUTPUT" << EOF
## Recent Advances (Last 2 Years)
EOF
scoutml search "$TOPIC" \
--year-min 2022 \
--sota-only \
--limit 10 \
--output json | \
jq -r '.[] | "- [\(.title)](https://arxiv.org/abs/\(.arxiv_id)) (\(.year)) - \(.citations) citations"' >> "$OUTPUT"
cat >> "$OUTPUT" << EOF
## Highly Reproducible
EOF
scoutml insights reproducibility \
--domain "$TOPIC" \
--limit 5 \
--output json | \
jq -r '.[] | "- [\(.title)](https://arxiv.org/abs/\(.arxiv_id)) - Score: \(.reproducibility_score)"' >> "$OUTPUT"
Research Trend Analysis¶
Track Method Evolution¶
#!/bin/bash
# method_evolution.sh
METHOD="$1"
OUTPUT_DIR="evolution_${METHOD// /_}"
mkdir -p "$OUTPUT_DIR"
# Track over years
for year in {2018..2023}; do
echo "=== Year $year ===" >> "$OUTPUT_DIR/timeline.txt"
scoutml method-search "$METHOD" \
--year-min $year \
--year-max $year \
--sort-by citations \
--limit 3 \
--output json > "$OUTPUT_DIR/year_$year.json"
# Extract key innovations
cat "$OUTPUT_DIR/year_$year.json" | \
jq -r '.[] | "- \(.title): \(.method_usage)"' >> "$OUTPUT_DIR/timeline.txt"
done
# Summarize the citation trend as CSV
echo "Year,Total_Citations,Avg_Citations" > "$OUTPUT_DIR/trend.csv"
for year in {2018..2023}; do
# Guard against empty result sets: add / length fails on an empty array
stats=$(jq '[.[] | .citations] | if length > 0 then {total: add, avg: (add / length)} else {total: 0, avg: 0} end' \
"$OUTPUT_DIR/year_$year.json")
total=$(echo "$stats" | jq '.total // 0')
avg=$(echo "$stats" | jq '.avg // 0')
echo "$year,$total,$avg" >> "$OUTPUT_DIR/trend.csv"
done
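The per-year totals and averages can also be computed in Python, where the empty-year case is easy to handle explicitly. This is a minimal sketch: `citation_stats` and the `by_year` sample data are hypothetical, and the only assumption about the records is a numeric `citations` field.

```python
def citation_stats(papers):
    """Return total and average citations, using 0 for an empty list."""
    counts = [p["citations"] for p in papers if p.get("citations") is not None]
    total = sum(counts)
    avg = total / len(counts) if counts else 0
    return {"total": total, "avg": avg}


# Hypothetical per-year results; 2023 is deliberately empty
by_year = {2022: [{"citations": 10}, {"citations": 30}], 2023: []}

print("Year,Total_Citations,Avg_Citations")
for year, papers in sorted(by_year.items()):
    stats = citation_stats(papers)
    print(f"{year},{stats['total']},{stats['avg']}")
```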
Domain Comparison Matrix¶
#!/bin/bash
# domain_comparison.sh
DOMAINS=("computer vision" "nlp" "reinforcement learning")
METRICS=("avg_citations" "reproducibility" "compute_requirements")
echo "Domain Comparison Matrix" > comparison_matrix.txt
echo "=======================" >> comparison_matrix.txt
for domain in "${DOMAINS[@]}"; do
echo -e "\n## $domain" >> comparison_matrix.txt
# Get top papers
top_papers=$(scoutml search "$domain" \
--year-min 2022 \
--limit 20 \
--output json)
# Calculate metrics (guarding against empty result sets)
avg_citations=$(echo "$top_papers" | \
jq '[.[] | .citations] | if length > 0 then add / length else 0 end')
# Get reproducibility
avg_reproducibility=$(scoutml insights reproducibility \
--domain "$domain" \
--limit 20 \
--output json | \
jq '[.[] | .reproducibility_score] | if length > 0 then add / length else 0 end')
echo "- Average Citations: $avg_citations" >> comparison_matrix.txt
echo "- Average Reproducibility: $avg_reproducibility" >> comparison_matrix.txt
done
Python Library Examples¶
Research Dashboard Application¶
import scoutml
import streamlit as st
import pandas as pd
import plotly.express as px

st.title("ScoutML Research Dashboard")

# Search interface
query = st.text_input("Search Query", "transformer models")
col1, col2 = st.columns(2)
with col1:
    year_min = st.number_input("Min Year", 2018, 2024, 2020)
with col2:
    limit = st.slider("Number of Results", 10, 100, 20)

# Keep results in session state so they survive the rerun that Streamlit
# triggers whenever a widget (such as the selectbox below) changes.
if st.button("Search"):
    with st.spinner("Searching..."):
        results = scoutml.search(query, limit=limit, year_min=year_min)
    st.session_state["papers"] = results['papers']

papers = st.session_state.get("papers", [])
if papers:
    # Convert to DataFrame
    df = pd.DataFrame(papers)

    # Display metrics
    st.metric("Total Papers", len(df))
    st.metric("Average Citations", f"{df['citations'].mean():.0f}")

    # Visualization
    fig = px.scatter(df, x='year', y='citations',
                     hover_data=['title'],
                     title="Papers by Year and Citations")
    st.plotly_chart(fig)

    # Results table
    st.dataframe(df[['title', 'year', 'citations', 'arxiv_id']])

    # Paper details (one selectbox only: two widgets with the same label
    # and options raise a duplicate-widget error)
    paper_id = st.selectbox("Select a paper for details", df['arxiv_id'].tolist())
    if paper_id:
        details = scoutml.get_paper(paper_id)
        st.write(f"**Abstract:** {details['paper']['abstract']}")
Automated Research Assistant¶
import scoutml
import schedule
import time
from datetime import datetime
import smtplib
from email.mime.text import MIMEText


class ResearchAssistant:
    def __init__(self, topics, email):
        self.topics = topics
        self.email = email
        self.seen_papers = set()

    def check_new_papers(self):
        """Check for new papers in topics of interest."""
        new_papers = []
        for topic in self.topics:
            results = scoutml.search(
                topic,
                limit=10,
                year_min=datetime.now().year,
                min_citations=0
            )
            for paper in results['papers']:
                if paper['arxiv_id'] not in self.seen_papers:
                    self.seen_papers.add(paper['arxiv_id'])
                    new_papers.append({
                        'topic': topic,
                        'paper': paper
                    })
        if new_papers:
            self.send_notification(new_papers)

    def send_notification(self, papers):
        """Send email notification about new papers."""
        body = "New papers found:\n\n"
        for item in papers:
            paper = item['paper']
            body += f"Topic: {item['topic']}\n"
            body += f"Title: {paper['title']}\n"
            body += f"Authors: {', '.join(paper['authors'][:3])}...\n"
            body += f"Link: https://arxiv.org/abs/{paper['arxiv_id']}\n\n"
        msg = MIMEText(body)
        msg['Subject'] = f"ScoutML: {len(papers)} new papers found"
        msg['From'] = 'assistant@example.com'
        msg['To'] = self.email
        # Send email (configure SMTP settings)
        # smtp.send_message(msg)
        print(f"Would send email:\n{body}")


# Usage
assistant = ResearchAssistant(
    topics=["federated learning", "vision transformer", "diffusion models"],
    email="researcher@example.com"
)

# Schedule daily checks
schedule.every().day.at("09:00").do(assistant.check_new_papers)

# Run scheduler
while True:
    schedule.run_pending()
    time.sleep(60)
ML Pipeline Integration¶
import scoutml
import mlflow
import optuna
from sklearn.metrics import accuracy_score


class ResearchInformedML:
    """Use ScoutML to inform ML experiments."""

    def __init__(self, research_topic):
        self.topic = research_topic
        self.papers = []
        self.techniques = []

    def research_phase(self):
        """Research state-of-the-art techniques."""
        # Find top papers
        results = scoutml.search(
            self.topic,
            limit=20,
            min_citations=50,
            year_min=2020,
            sota_only=True
        )
        self.papers = results['papers']
        # Analyze top techniques
        for paper in self.papers[:5]:
            critique = scoutml.critique_paper(paper['arxiv_id'])
            guide = scoutml.get_implementation_guide(paper['arxiv_id'])
            self.techniques.append({
                'paper_id': paper['arxiv_id'],
                'title': paper['title'],
                'key_ideas': critique['critique']['strengths'],
                'implementation': guide['implementation']['key_components']
            })

    def experiment_phase(self, X_train, y_train, X_test, y_test):
        """Run experiments based on research findings."""
        mlflow.set_experiment(f"scoutml_{self.topic}")
        for technique in self.techniques:
            with mlflow.start_run(run_name=technique['title'][:50]):
                # Log paper information
                mlflow.log_param("paper_id", technique['paper_id'])
                mlflow.log_param("paper_title", technique['title'])
                # Implement technique (simplified)
                model = self.implement_technique(technique)
                model.fit(X_train, y_train)
                # Evaluate
                predictions = model.predict(X_test)
                accuracy = accuracy_score(y_test, predictions)
                mlflow.log_metric("accuracy", accuracy)
                mlflow.sklearn.log_model(model, "model")
                print(f"Technique: {technique['title'][:50]}")
                print(f"Accuracy: {accuracy:.4f}\n")

    def implement_technique(self, technique):
        """Implement technique based on research."""
        # This would implement the actual technique
        # based on the implementation guide
        from sklearn.ensemble import RandomForestClassifier
        return RandomForestClassifier(n_estimators=100)


# Usage
researcher = ResearchInformedML("tabular classification sota")
researcher.research_phase()
# researcher.experiment_phase(X_train, y_train, X_test, y_test)
Citation Network Analysis¶
import scoutml
import networkx as nx
import matplotlib.pyplot as plt
from collections import defaultdict


def build_citation_network(root_paper_id, depth=2):
    """Build a citation network starting from a paper."""
    G = nx.DiGraph()
    visited = set()

    def explore(paper_id, current_depth):
        if current_depth > depth or paper_id in visited:
            return
        visited.add(paper_id)
        # Get paper details
        try:
            paper = scoutml.get_paper(paper_id)
            G.add_node(paper_id,
                       title=paper['paper']['title'],
                       year=paper['paper']['year'],
                       citations=paper['paper']['citations'])
            # Find similar papers (as proxy for citations)
            similar = scoutml.find_similar_papers(
                paper_id=paper_id,
                limit=5,
                threshold=0.8
            )
            for sim_paper in similar['papers']:
                sim_id = sim_paper['arxiv_id']
                G.add_edge(paper_id, sim_id, weight=sim_paper['similarity'])
                explore(sim_id, current_depth + 1)
        except Exception as e:
            print(f"Error processing {paper_id}: {e}")

    explore(root_paper_id, 0)
    return G


# Build network
G = build_citation_network("1706.03762", depth=2)  # Attention is All You Need

# Analyze network
print(f"Nodes: {G.number_of_nodes()}")
print(f"Edges: {G.number_of_edges()}")

# Find most influential papers
pagerank = nx.pagerank(G)
top_papers = sorted(pagerank.items(), key=lambda x: x[1], reverse=True)[:5]
print("\nMost influential papers in network:")
for paper_id, score in top_papers:
    if paper_id in G.nodes:
        title = G.nodes[paper_id].get('title', 'Unknown')[:50]
        print(f"{paper_id}: {title}... (score: {score:.4f})")

# Visualize
plt.figure(figsize=(12, 8))
pos = nx.spring_layout(G, k=2, iterations=50)
nx.draw(G, pos,
        node_size=1000,
        node_color='lightblue',
        with_labels=True,
        font_size=8,
        edge_color='gray',
        arrows=True)
plt.title("Paper Citation Network")
plt.tight_layout()
plt.show()
Integration Examples¶
Slack Notification Bot¶
#!/usr/bin/env python3
# scout_bot.py
import json
import subprocess
import requests
from datetime import datetime

SLACK_WEBHOOK = "YOUR_WEBHOOK_URL"
TOPICS = ["federated learning", "vision transformer", "diffusion models"]


def get_new_papers(topic, min_citations=10):
    cmd = [
        "scoutml", "search", topic,
        "--year-min", str(datetime.now().year),
        "--min-citations", str(min_citations),
        "--limit", "5",
        "--output", "json"
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    # Treat a failed command or empty output as "no papers" instead of crashing
    if result.returncode != 0 or not result.stdout.strip():
        return []
    return json.loads(result.stdout)


def send_to_slack(message):
    requests.post(SLACK_WEBHOOK, json={"text": message})


# Check each topic
for topic in TOPICS:
    papers = get_new_papers(topic)
    if papers:
        message = f"*New papers in {topic}:*\n"
        for paper in papers[:3]:
            message += f"• <https://arxiv.org/abs/{paper['arxiv_id']}|{paper['title']}> - {paper['citations']} citations\n"
        send_to_slack(message)
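Paper titles can contain `&`, `<`, or `>`, which Slack expects to be escaped as HTML entities in message text. The sketch below separates message construction into a pure function so it can be checked without posting anything. `build_message` and the sample record are hypothetical; the record fields match those used in the bot above.

```python
def escape_slack(text):
    """Escape the three characters Slack treats as control characters."""
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def build_message(topic, papers, max_items=3):
    """Build the Slack message text for one topic."""
    lines = [f"*New papers in {topic}:*"]
    for paper in papers[:max_items]:
        url = f"https://arxiv.org/abs/{paper['arxiv_id']}"
        title = escape_slack(paper["title"])
        lines.append(f"• <{url}|{title}> - {paper.get('citations', 0)} citations")
    return "\n".join(lines)


# Hypothetical record for illustration only
sample = [{"arxiv_id": "0000.00001", "title": "Speed & Accuracy <Revisited>", "citations": 7}]
message = build_message("example topic", sample)
print(message)
```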
Research Portfolio Analysis¶
#!/bin/bash
# portfolio_analysis.sh
# Your paper IDs
MY_PAPERS=(
"2301.12345"
"2302.23456"
"2303.34567"
)
echo "# Research Portfolio Analysis"
echo "Generated: $(date)"
echo
# Citation analysis
total_citations=0
for paper in "${MY_PAPERS[@]}"; do
citations=$(scoutml paper "$paper" --output json | jq '.citations')
total_citations=$((total_citations + citations))
echo "- $paper: $citations citations"
done
echo "Total citations: $total_citations"
echo
# Find similar work
echo "## Related Research"
for paper in "${MY_PAPERS[@]}"; do
echo "### Similar to $paper:"
scoutml similar --paper-id "$paper" --limit 3 --output json | \
jq -r '.[] | "- \(.title) (similarity: \(.similarity))"'
done
# Identify gaps
echo "## Research Opportunities"
for paper in "${MY_PAPERS[@]}"; do
echo "### Extensions for $paper:"
scoutml agent solve-limitations "$paper" --output json | \
jq -r '.limitations[:2] | .[] | "- \(.description)"'
done
Best Practices¶
Performance Optimization¶
- Batch Operations: Process multiple items in single commands
- JSON Processing: Use jq for efficient data manipulation
- Caching: Save intermediate results to avoid repeated API calls
- Parallel Processing: Use xargs -P for parallel execution
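The caching and parallel-processing points above can be combined in Python. The following sketch deduplicates IDs, fetches them with a thread pool, and memoizes results so a repeated request does not trigger a second call. `fetch_paper` is a hypothetical stand-in for a real lookup and only records that it was called; it is not a ScoutML function.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

calls = []  # records every real fetch so the effect of caching is visible


@lru_cache(maxsize=None)
def fetch_paper(paper_id):
    """Hypothetical stand-in for an API call that looks up one paper."""
    calls.append(paper_id)
    return {"arxiv_id": paper_id}


def fetch_many(paper_ids, workers=4):
    """Fetch unique IDs in parallel, preserving first-seen order."""
    unique_ids = list(dict.fromkeys(paper_ids))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_paper, unique_ids))


first = fetch_many(["1706.03762", "1706.03762", "0000.00001"])
second = fetch_many(["1706.03762"])  # served from the cache
print(len(first), len(second), len(calls))
```

Deduplicating before submitting work matters here because two threads asking for the same uncached ID at the same moment could otherwise both perform the fetch.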
Error Handling¶
# Robust error handling example
process_paper() {
local paper_id=$1
local max_retries=3
local retry_count=0
while [ $retry_count -lt $max_retries ]; do
if scoutml paper "$paper_id" --output json > "/tmp/${paper_id}.json" 2>/dev/null; then
return 0
fi
retry_count=$((retry_count + 1))
echo "Retry $retry_count for $paper_id..." >&2
sleep 2
done
echo "Failed to process $paper_id after $max_retries attempts" >&2
return 1
}
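A similar retry loop can be written in Python. Unlike the shell function above, which sleeps a fixed 2 seconds, this sketch uses exponential backoff (the delay doubles after each failure); that is a substitution made for illustration, not something the original script does. `with_retries` and `flaky` are hypothetical names, and the `sleep` parameter exists only so the waiting can be observed without actually pausing.

```python
import time


def with_retries(func, max_retries=3, delay=2.0, sleep=time.sleep):
    """Call func until it succeeds or max_retries attempts have failed."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except Exception as exc:
            last_error = exc
            if attempt < max_retries:
                sleep(delay * 2 ** (attempt - 1))  # 2s, 4s, 8s, ...
    raise RuntimeError(f"failed after {max_retries} attempts") from last_error


# Hypothetical flaky operation: fails twice, then succeeds
attempts = {"n": 0}


def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


waits = []
result = with_retries(flaky, sleep=waits.append)
print(result, waits)
```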
Data Validation¶
# Validate search results
validate_results() {
local json_file=$1
# Check if file exists and is valid JSON
if [ ! -f "$json_file" ]; then
echo "Error: File not found" >&2
return 1
fi
if ! jq empty "$json_file" 2>/dev/null; then
echo "Error: Invalid JSON" >&2
return 1
fi
# Check if results exist
local count=$(jq 'length' "$json_file")
if [ "$count" -eq 0 ]; then
echo "Warning: No results found" >&2
return 2
fi
echo "Valid results: $count items"
return 0
}
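The same three checks (parseable JSON, expected shape, non-empty) can be sketched in Python for use inside library-based workflows. `validate_results` is a hypothetical helper. It assumes search output is a JSON array, as the jq filters in this guide imply, so unlike `jq 'length'` it rejects a top-level object.

```python
import json


def validate_results(text):
    """Return (status, count) where status is 'invalid', 'empty', or 'ok'."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return ("invalid", 0)
    if not isinstance(data, list):
        return ("invalid", 0)  # assumption: results are a JSON array
    if not data:
        return ("empty", 0)
    return ("ok", len(data))


print(validate_results('[{"arxiv_id": "0000.00001"}]'))
print(validate_results("[]"))
print(validate_results("not json"))
```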
Next Steps¶
- Explore best practices for optimal usage
- Check output formats for data processing
- Review individual command documentation
- Join our community for more examples and workflows