My research focuses on advancing visual perception systems by integrating edge-cloud processing, ensuring robust privacy safeguards, and leveraging dynamic temporal modeling to enable scalable, real-time scene understanding. The work is still in progress, so this page is a preview of what I have completed so far.
I have released my fine-tuned FastVLM-0.5b model and the custom COCO dataset used for validation. Explore the model and dataset on Hugging Face, and check out the complete codebase on GitHub. I plan to formally publish this research in about a year through a professor-led collaboration.
Current Vision Language Models (VLMs) often struggle with real-time performance and maintaining continuous contextual understanding of dynamic visual scenes, particularly within privacy-sensitive or resource-constrained environments.
Addressing these limitations, I introduce Orion, a novel real-time, local visual perception system built on a hybrid on-device and server architecture. At its core, Orion integrates YOLOv11n for efficient real-time object detection, a custom fine-tuned FastVLM-0.5b for structured, on-device image captioning, a custom fine-tuned Gemma-3B LLM (inferenced on a local server or on-device), a novel high-dimensional vector database embedding system, and a dynamic knowledge graph of structured relations that enables causal reasoning, sophisticated temporal reasoning, contextual analysis, and answering of user queries. This design prioritizes privacy and low-latency inference by performing visual encoding and captioning directly on the mobile device, leveraging FastVLM's proven efficiency, including RAM usage below 1 GB and Time-to-First-Token (TTFT) as low as 600 ms on higher-end iPhones.
Building upon this robust foundation, I propose to extend Orion’s architectural and methodological capabilities to enable advanced analysis and tracking of visual events within a persistent contextual memory using a vector database. This expanded framework will allow Orion to automatically detect and articulate subtle or significant changes across successive video frames, provide deep visual insights into the evolution of scenes, and facilitate complex user queries regarding temporal visual events and their contextual implications. Orion represents a significant step towards creating intelligent, interactive, and deployable visual perception agents that can truly "remember" and "understand" the world as it unfolds.
Current AI vision systems take 3-5 seconds to process visual data, making them unsuitable for real-time applications.
Existing solutions require sending raw video to cloud servers, creating significant privacy and security risks.
Vision-language models frequently generate inaccurate descriptions, failing to ground responses in actual visual evidence.
Current systems lack memory between interactions and cannot build understanding or context over time.
Split-computation design that processes visual data locally while performing complex reasoning on a local server.
Transforms raw object detection outputs into structured, grounded scene descriptions using vision-language models.
Ensures raw video never leaves the device, transmitting only anonymized textual descriptions for processing.
Implements persistent contextual memory using vector databases like Milvus for queryable event analysis over time.
Orion's novelty is in its system architecture and data-flow paradigm. By splitting tasks between an edge device and a local server, Orion balances low-latency perception with sophisticated, stateful reasoning.
Edge device: e.g., a smartphone
Local server: e.g., an Apple Silicon Mac
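To make the split concrete, the sketch below shows the kind of packet the edge device could send to the local server after on-device detection and captioning. The field names and the send_scene_packet helper are illustrative assumptions, not Orion's actual wire format; the key point is that only anonymized text, never raw frames, crosses the network.

import json
import time
import urllib.request

def send_scene_packet(caption, detections, server_url):
    """Upload one frame's anonymized output (caption + detection labels) to the local server."""
    packet = {
        "timestamp": time.time(),     # capture time of the frame
        "caption": caption,           # FastVLM-0.5b structured caption
        "detections": detections,     # e.g. [{"label": "keys", "conf": 0.91}]
        "device_id": "edge-01",       # opaque identifier, no user data
    }
    req = urllib.request.Request(
        server_url,
        data=json.dumps(packet).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=2.0)   # short timeout: same local network

# Example call for a single processed frame (hypothetical endpoint):
# send_scene_packet("a set of keys on a wooden kitchen counter",
#                   [{"label": "keys", "conf": 0.91}],
#                   "http://192.168.1.20:8000/ingest")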
How Orion's vector database search, packet-level communication, and real-time processing pipeline work with live simulations and code examples.
Orion's memory system uses high-dimensional vector embeddings to enable semantic search across temporal observations. Unlike traditional keyword matching, this approach captures semantic similarity and contextual relationships. I am currently using Milvus for vector database storage.
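As one possible concrete setup, the snippet below sketches how scene embeddings and their metadata could be stored in a Milvus collection via the pymilvus client. The collection name, field names, and index parameters are assumptions for illustration rather than Orion's exact schema.

from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection

# Connect to a locally running Milvus instance (default port assumed).
connections.connect(host="localhost", port="19530")

# Hypothetical schema: one row per observed scene.
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),   # all-mpnet-base-v2
    FieldSchema(name="caption", dtype=DataType.VARCHAR, max_length=2048),  # scene description
    FieldSchema(name="timestamp", dtype=DataType.INT64),                   # unix seconds
]
schema = CollectionSchema(fields, description="Orion scene memory (illustrative)")
scenes = Collection(name="scene_memory", schema=schema)

# Approximate-nearest-neighbor index over the embeddings; inner product on
# normalized vectors is equivalent to cosine similarity.
scenes.create_index(
    field_name="embedding",
    index_params={"index_type": "IVF_PQ", "metric_type": "IP",
                  "params": {"nlist": 1024, "m": 64, "nbits": 8}},
)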
User query 'Where are my keys?' is processed through sentence-transformer
all-mpnet-base-v2 generates 768-dimensional dense vector
query_vector = model.encode("Where are my keys?")
# Output: [0.1234, -0.5678, 0.9012, ...]
Cosine similarity computed against stored scene embeddings
FAISS index enables sub-linear search through millions of vectors
similarities = np.dot(query_vector, scene_vectors.T)
top_k_indices = np.argsort(similarities)[-5:][::-1]  # top-5 indices, most similar first
Top-K most similar scenes retrieved with metadata
Temporal weighting applied: recent scenes get 1.2x boost
results = [
{"scene": "keys on kitchen counter", "similarity": 0.89, "timestamp": "2024-01-15T14:30:22Z"},
{"scene": "keys in jacket pocket", "similarity": 0.76, "timestamp": "2024-01-15T09:15:11Z"}
]
Gemma-3B synthesizes retrieved contexts into coherent answer
Context window: 8192 tokens, retrieval-augmented generation
prompt = f"Based on these observations: {retrieved_contexts}\nAnswer: {user_query}"
response = gemma_3b.generate(prompt)
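Putting the four steps together, here is a hedged end-to-end sketch of the retrieval-augmented answer path using sentence-transformers and NumPy as above. The 1.2x recency boost window (24 hours), the top-k value, and the generate_fn wrapper for Gemma-3B are illustrative assumptions rather than Orion's exact implementation.

import time
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")   # 768-dimensional embeddings

def answer_query(user_query, scene_vectors, scene_meta, generate_fn, k=5):
    """Retrieve the most relevant remembered scenes and synthesize an answer.

    scene_vectors : (N, 768) L2-normalized scene embeddings
    scene_meta    : list of {"scene": str, "timestamp": float} aligned with the rows
    generate_fn   : callable LLM wrapper (e.g., a Gemma-3B client), assumed interface
    """
    query_vec = model.encode(user_query, normalize_embeddings=True)
    sims = scene_vectors @ query_vec                    # cosine similarity on normalized vectors

    # Illustrative temporal weighting: scenes from the last 24 h get a 1.2x boost.
    now = time.time()
    recent = np.array([now - m["timestamp"] < 24 * 3600 for m in scene_meta])
    sims = np.where(recent, sims * 1.2, sims)

    top = np.argsort(sims)[-k:][::-1]                   # top-k indices, most similar first
    contexts = "\n".join(
        f"- {scene_meta[i]['scene']} (similarity {sims[i]:.2f})" for i in top
    )
    prompt = f"Based on these observations:\n{contexts}\nAnswer: {user_query}"
    return generate_fn(prompt)                          # e.g., gemma_3b.generate(prompt)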
A key goal for Orion is to move beyond stateless perception to stateful understanding. This requires an explicit memory system. I propose a phased implementation, starting with semantic search and evolving towards structured knowledge graphs.
The 'What and Where'
Creates a searchable log of all observations, enabling content-based retrieval of past events.
A person is sitting on a red chair reading a book in a well-lit living room
→ Tokenized text ready for embedding
Tokenized scene description
→ 768-dimensional dense vector [0.1234, -0.5678, ...]
Dense vector + metadata
→ FAISS IVF-PQ index entry with clustering
Indexed vector
→ Searchable memory entry with temporal weighting
import faiss

dimension = 768   # all-mpnet-base-v2 output size
nlist = 1024      # number of IVF clusters
m = 64            # PQ sub-quantizers (assumed value; must evenly divide dimension)
nbits = 8         # bits per sub-quantizer code (assumed value)
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFPQ(quantizer, dimension, nlist, m, nbits)
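Continuing from the index built above, a brief usage sketch: an IVF-PQ index must be trained on representative vectors before anything can be added or searched. The sample size and the nprobe setting are assumed values for illustration.

import numpy as np

# Placeholder embeddings standing in for real scene vectors (float32, shape (n, 768)).
scene_embeddings = np.random.rand(20000, dimension).astype("float32")

# IVF-PQ indexes require training on representative data before use.
index.train(scene_embeddings)
index.add(scene_embeddings)                  # row position maps back to scene metadata

# Search: probe a subset of the 1024 clusters to trade recall for speed.
index.nprobe = 16                            # assumed setting, tuned empirically
query = np.random.rand(1, dimension).astype("float32")
distances, ids = index.search(query, 10)     # IDs of the 10 nearest stored scenes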
Key metrics tracked for this index: average search time across 1M+ vectors, compression ratio vs. a flat index, and Recall@10 on semantic queries.
By maintaining a persistent memory of structured observations, Orion can perform sophisticated temporal reasoning. It compares textual snapshots over time to detect changes, infer complex events, and answer questions about causality and sequence.
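As a minimal illustration of the change-detection idea (not Orion's actual algorithm), consecutive scene captions can be compared in embedding space and a change event logged whenever similarity drops below a threshold; the 0.8 threshold here is an assumed value.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")

def detect_change(prev_caption, curr_caption, threshold=0.8):
    """Return True when two consecutive scene snapshots differ enough to log an event."""
    prev_vec, curr_vec = model.encode([prev_caption, curr_caption], normalize_embeddings=True)
    similarity = float(util.cos_sim(prev_vec, curr_vec))
    return similarity < threshold

# Example: the keys disappeared between frames, so a change event would be recorded.
changed = detect_change(
    "keys on the kitchen counter next to a coffee mug",
    "an empty kitchen counter with a coffee mug",
)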
| Component | Evaluation Focus | Task | Metrics | Datasets |
|---|---|---|---|---|
| YOLOv11n | Object detection accuracy and real-time performance | Real-time Object Detection | mAP (mean Average Precision), FPS (Frames Per Second) | MS-COCO, Open Images Dataset |
| FastVLM-0.5b | Quality of machine-generated captions against human references | Image/Frame Captioning | BLEU, METEOR, CIDEr (see sketch below) | MS-COCO Captions, Flickr30k |
| Semantic Search (Milvus) | Semantic similarity and retrieval accuracy for video content | Video Question Answering | Accuracy, WUPS (Wu-Palmer Similarity) | VideoQA, Custom Video Benchmark |
| Knowledge Extraction | Correctness of (subject, predicate, object) triple extraction | Entity & Relation Extraction | Precision, Recall, F1-score | Custom-annotated Dataset, ConceptNet |
| Temporal Reasoning | Reasoning for questions like "Why did event X happen?" | Temporal & Causal QA | Accuracy on Custom Benchmark, Causal Reasoning Score | Custom Benchmark (REXTIME-inspired), SOK-Bench |
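To ground the captioning row above, here is a minimal sketch of how BLEU and CIDEr scores could be computed for FastVLM-0.5b captions against MS-COCO-style references, assuming the pycocoevalcap package; the image ID and captions below are made up for illustration.

from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

# References (ground truth) and model outputs, keyed by image ID.
# Both must be dicts of {image_id: [caption, ...]}; res values hold exactly one caption.
gts = {"img_001": ["a person sits on a red chair reading a book in a living room"]}
res = {"img_001": ["a person reading a book on a red chair"]}

bleu_scores, _ = Bleu(4).compute_score(gts, res)   # [BLEU-1, BLEU-2, BLEU-3, BLEU-4]
cider_score, _ = Cider().compute_score(gts, res)   # corpus-level CIDEr score

print(f"BLEU-4: {bleu_scores[3]:.3f}  CIDEr: {cider_score:.3f}")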
Evaluate semantic retrieval performance on video content (In Progress)
Benchmark system performance on VideoQA datasets (Planned)
Test and validate scene description generation quality (Completed)
Develop custom temporal reasoning benchmark dataset (In Progress)
Test causal reasoning capabilities and accuracy (Planned)
Evaluate entity resolution and disambiguation accuracy (In Progress)
June 2025 - July 2025
Core infrastructure, authentication, and initial prototypes. During this phase I have built a complete API, an auth system (with email), a Mac app, and an iOS app, all of which are fully functional and passing all unit tests.
August 2025 - December 2025
Vector database system and server architecture optimization. This phase also includes framework validation and evaluating and tuning vector database search.
January 2026 - April 2026
Advanced temporal reasoning and dynamic knowledge systems. I will also adapt the Gemma-3B model to work with the vector database and the dynamic knowledge graph system, and build a model that can understand scene context from this new information.
May 2026 - June 2026
Real-world deployment and academic research publication. This means releasing a public beta that people can actually use and test, and publishing a research paper on the work completed so far.
My goal is for Orion Live to serve as a deployable visual perception agent across many different domains, offering an open-source, real-time, and privacy-first visual intelligence solution.
Real-time scene understanding for visually impaired users
Mobile apps, smart glasses, wearable devices
Visual perception for drones, robots, and autonomous vehicles
Edge computing modules, embedded systems
General-purpose visual intelligence for various applications
Cloud services, mobile applications, web platforms
Ambient intelligence for homes, offices, and public spaces
IoT devices, smart cameras, edge gateways
Unlike proprietary solutions, Orion processes data locally with optional cloud enhancement
Fully transparent, customizable, and community-driven development
Edge-optimized processing for instant visual understanding