My research focuses on advancing visual perception systems by integrating edge-cloud processing, ensuring robust privacy safeguards, and leveraging dynamic temporal modeling to enable scalable, real-time scene understanding. The work is still in progress, so this page is a preview of what I have completed so far.
I have released my fine-tuned FastVLM-0.5b model and the custom COCO dataset used for validation. Explore the model and dataset on Hugging Face, and check out the complete codebase on GitHub. I plan to formally publish this research in about a year through a professor-led collaboration.
Current Vision Language Models (VLMs) often struggle with real-time performance and maintaining continuous contextual understanding of dynamic visual scenes, particularly within privacy-sensitive or resource-constrained environments.
Addressing these limitations, I introduce Orion, a novel real-time, local visual perception system built on a hybrid on-device and server architecture. At its core, Orion integrates YOLOv11n for efficient real-time object detection, a custom fine-tuned FastVLM-0.5b for structured, on-device image captioning, a custom fine-tuned Gemma-3B LLM (inferenced on a local server or on-device), a novel high-dimensional vector database embedding system, and a dynamic knowledge graph of structured relations that enables causal reasoning, sophisticated temporal reasoning, contextual analysis, and answering of user queries. This design prioritizes privacy and low-latency inference by performing visual encoding and captioning directly on the mobile device, leveraging FastVLM's proven efficiency, including RAM usage below 1 GB and Time-to-First-Token (TTFT) as low as 600 ms on higher-end iPhones.
Building upon this robust foundation, I propose to extend Orion’s architectural and methodological capabilities to enable advanced analysis and tracking of visual events within a persistent contextual memory using a vector database. This expanded framework will allow Orion to automatically detect and articulate subtle or significant changes across successive video frames, provide deep visual insights into the evolution of scenes, and facilitate complex user queries regarding temporal visual events and their contextual implications. Orion represents a significant step towards creating intelligent, interactive, and deployable visual perception agents that can truly "remember" and "understand" the world as it unfolds.
Current AI vision systems take 3-5 seconds to process visual data, making them unsuitable for real-time applications.
Existing solutions require sending raw video to cloud servers, creating significant privacy and security risks.
Vision-language models frequently generate inaccurate descriptions, failing to ground responses in actual visual evidence.
Current systems lack memory between interactions and cannot build understanding or context over time.
Split-computation design that processes visual data locally while performing complex reasoning on a local server.
Transforms raw object detection outputs into structured, grounded scene descriptions using vision-language models.
Ensures raw video never leaves the device, transmitting only anonymized textual descriptions for processing.
Implements persistent contextual memory using vector databases like Milvus for queryable event analysis over time.
Orion's novelty is in its system architecture and data-flow paradigm. By splitting tasks between an edge device and a local server, Orion balances low-latency perception with sophisticated, stateful reasoning.
Edge device: e.g., a smartphone
Local server: e.g., an Apple Silicon Mac
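To make the split concrete, the sketch below shows the kind of packet the edge device could send to the local server after on-device detection and captioning. The field names and the send_scene_packet helper are illustrative assumptions, not Orion's actual wire format; the key point is that only anonymized text, never raw frames, crosses the network.

import json
import time
import urllib.request

def send_scene_packet(caption, detections, server_url):
    """Upload one frame's anonymized output (caption + detection labels) to the local server."""
    packet = {
        "timestamp": time.time(),     # capture time of the frame
        "caption": caption,           # FastVLM-0.5b structured caption
        "detections": detections,     # e.g. [{"label": "keys", "conf": 0.91}]
        "device_id": "edge-01",       # opaque identifier, no user data
    }
    req = urllib.request.Request(
        server_url,
        data=json.dumps(packet).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=2.0)   # short timeout: same local network

# Example call for a single processed frame (hypothetical endpoint):
# send_scene_packet("a set of keys on a wooden kitchen counter",
#                   [{"label": "keys", "conf": 0.91}],
#                   "http://192.168.1.20:8000/ingest")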
How Orion's vector database search, packet-level communication, and real-time processing pipeline work with live simulations and code examples.
Orion's memory system uses high-dimensional vector embeddings to enable semantic search across temporal observations. Unlike traditional keyword matching, this approach captures semantic similarity and contextual relationships. I am currently using Milvus for vector database storage.
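As one possible concrete setup, the snippet below sketches how scene embeddings and their metadata could be stored in a Milvus collection via the pymilvus client. The collection name, field names, and index parameters are assumptions for illustration rather than Orion's exact schema.

from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection

# Connect to a locally running Milvus instance (default port assumed).
connections.connect(host="localhost", port="19530")

# Hypothetical schema: one row per observed scene.
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),   # all-mpnet-base-v2
    FieldSchema(name="caption", dtype=DataType.VARCHAR, max_length=2048),  # scene description
    FieldSchema(name="timestamp", dtype=DataType.INT64),                   # unix seconds
]
schema = CollectionSchema(fields, description="Orion scene memory (illustrative)")
scenes = Collection(name="scene_memory", schema=schema)

# Approximate-nearest-neighbor index over the embeddings; inner product on
# normalized vectors is equivalent to cosine similarity.
scenes.create_index(
    field_name="embedding",
    index_params={"index_type": "IVF_PQ", "metric_type": "IP",
                  "params": {"nlist": 1024, "m": 64, "nbits": 8}},
)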
User query 'Where are my keys?' is processed through sentence-transformer
all-mpnet-base-v2 generates 768-dimensional dense vector
query_vector = model.encode("Where are my keys?")
# Output: [0.1234, -0.5678, 0.9012, ...]
Cosine similarity computed against stored scene embeddings
FAISS index enables sub-linear search through millions of vectors
similarities = np.dot(query_vector, scene_vectors.T)
top_k_indices = np.argsort(similarities)[-5:][::-1]  # top-5 indices, most similar first
Top-K most similar scenes retrieved with metadata
Temporal weighting applied: recent scenes get 1.2x boost
results = [
{"scene": "keys on kitchen counter", "similarity": 0.89, "timestamp": "2024-01-15T14:30:22Z"},
{"scene": "keys in jacket pocket", "similarity": 0.76, "timestamp": "2024-01-15T09:15:11Z"}
]
Gemma-3B synthesizes retrieved contexts into coherent answer
Context window: 8192 tokens, retrieval-augmented generation
prompt = f"Based on these observations: {retrieved_contexts}\nAnswer: {user_query}"
response = gemma_3b.generate(prompt)
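Putting the four steps together, here is a hedged end-to-end sketch of the retrieval-augmented answer path using sentence-transformers and NumPy as above. The 1.2x recency boost window (24 hours), the top-k value, and the generate_fn wrapper for Gemma-3B are illustrative assumptions rather than Orion's exact implementation.

import time
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")   # 768-dimensional embeddings

def answer_query(user_query, scene_vectors, scene_meta, generate_fn, k=5):
    """Retrieve the most relevant remembered scenes and synthesize an answer.

    scene_vectors : (N, 768) L2-normalized scene embeddings
    scene_meta    : list of {"scene": str, "timestamp": float} aligned with the rows
    generate_fn   : callable LLM wrapper (e.g., a Gemma-3B client), assumed interface
    """
    query_vec = model.encode(user_query, normalize_embeddings=True)
    sims = scene_vectors @ query_vec                    # cosine similarity on normalized vectors

    # Illustrative temporal weighting: scenes from the last 24 h get a 1.2x boost.
    now = time.time()
    recent = np.array([now - m["timestamp"] < 24 * 3600 for m in scene_meta])
    sims = np.where(recent, sims * 1.2, sims)

    top = np.argsort(sims)[-k:][::-1]                   # top-k indices, most similar first
    contexts = "\n".join(
        f"- {scene_meta[i]['scene']} (similarity {sims[i]:.2f})" for i in top
    )
    prompt = f"Based on these observations:\n{contexts}\nAnswer: {user_query}"
    return generate_fn(prompt)                          # e.g., gemma_3b.generate(prompt)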
A key goal for Orion is to move beyond stateless perception to stateful understanding. This requires an explicit memory system. I propose a phased implementation, starting with semantic search and evolving towards structured knowledge graphs.
The 'What and Where'
Creates a searchable log of all observations, enabling content-based retrieval of past events.
A person is sitting on a red chair reading a book in a well-lit living room
→ Tokenized text ready for embedding
Tokenized scene description
→ 768-dimensional dense vector [0.1234, -0.5678, ...]
Dense vector + metadata
→ FAISS IVF-PQ index entry with clustering
Indexed vector
→ Searchable memory entry with temporal weighting
import faiss

dimension = 768   # all-mpnet-base-v2 output size
nlist = 1024      # number of IVF clusters
m = 64            # PQ sub-quantizers (assumed value; must evenly divide dimension)
nbits = 8         # bits per sub-quantizer code (assumed value)
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFPQ(quantizer, dimension, nlist, m, nbits)
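Continuing from the index built above, a brief usage sketch: an IVF-PQ index must be trained on representative vectors before anything can be added or searched. The sample size and the nprobe setting are assumed values for illustration.

import numpy as np

# Placeholder embeddings standing in for real scene vectors (float32, shape (n, 768)).
scene_embeddings = np.random.rand(20000, dimension).astype("float32")

# IVF-PQ indexes require training on representative data before use.
index.train(scene_embeddings)
index.add(scene_embeddings)                  # row position maps back to scene metadata

# Search: probe a subset of the 1024 clusters to trade recall for speed.
index.nprobe = 16                            # assumed setting, tuned empirically
query = np.random.rand(1, dimension).astype("float32")
distances, ids = index.search(query, 10)     # IDs of the 10 nearest stored scenes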
Key metrics tracked for this index: average search time across 1M+ vectors, compression ratio vs. a flat index, and Recall@10 on semantic queries.
By maintaining a persistent memory of structured observations, Orion can perform sophisticated temporal reasoning. It compares textual snapshots over time to detect changes, infer complex events, and answer questions about causality and sequence.
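As a minimal illustration of the change-detection idea (not Orion's actual algorithm), consecutive scene captions can be compared in embedding space and a change event logged whenever similarity drops below a threshold; the 0.8 threshold here is an assumed value.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")

def detect_change(prev_caption, curr_caption, threshold=0.8):
    """Return True when two consecutive scene snapshots differ enough to log an event."""
    prev_vec, curr_vec = model.encode([prev_caption, curr_caption], normalize_embeddings=True)
    similarity = float(util.cos_sim(prev_vec, curr_vec))
    return similarity < threshold

# Example: the keys disappeared between frames, so a change event would be recorded.
changed = detect_change(
    "keys on the kitchen counter next to a coffee mug",
    "an empty kitchen counter with a coffee mug",
)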
| Component | Evaluation Focus | Task | Metrics | Datasets |
|---|---|---|---|---|
| YOLOv11n | Object detection accuracy and real-time performance | Real-time Object Detection | mAP (mean Average Precision), FPS (Frames Per Second) | MS-COCO, Open Images Dataset |
| FastVLM-0.5b | Quality of machine-generated captions against human references | Image/Frame Captioning | BLEU, METEOR, CIDEr (see sketch below) | MS-COCO Captions, Flickr30k |
| Semantic Search (Milvus) | Semantic similarity and retrieval accuracy for video content | Video Question Answering | Accuracy, WUPS (Wu-Palmer Similarity) | VideoQA, Custom Video Benchmark |
| Knowledge Extraction | Correctness of (subject, predicate, object) triple extraction | Entity & Relation Extraction | Precision, Recall, F1-score | Custom-annotated Dataset, ConceptNet |
| Temporal Reasoning | Reasoning for questions like "Why did event X happen?" | Temporal & Causal QA | Accuracy on Custom Benchmark, Causal Reasoning Score | Custom Benchmark (REXTIME-inspired), SOK-Bench |
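To ground the captioning row above, here is a minimal sketch of how BLEU and CIDEr scores could be computed for FastVLM-0.5b captions against MS-COCO-style references, assuming the pycocoevalcap package; the image ID and captions below are made up for illustration.

from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

# References (ground truth) and model outputs, keyed by image ID.
# Both must be dicts of {image_id: [caption, ...]}; res values hold exactly one caption.
gts = {"img_001": ["a person sits on a red chair reading a book in a living room"]}
res = {"img_001": ["a person reading a book on a red chair"]}

bleu_scores, _ = Bleu(4).compute_score(gts, res)   # [BLEU-1, BLEU-2, BLEU-3, BLEU-4]
cider_score, _ = Cider().compute_score(gts, res)   # corpus-level CIDEr score

print(f"BLEU-4: {bleu_scores[3]:.3f}  CIDEr: {cider_score:.3f}")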
Evaluate semantic retrieval performance on video content (In Progress)
Benchmark system performance on VideoQA datasets (Planned)
Test and validate scene description generation quality (Completed)
Develop custom temporal reasoning benchmark dataset (In Progress)
Test causal reasoning capabilities and accuracy (Planned)
Evaluate entity resolution and disambiguation accuracy (In Progress)
June 2025 - July 2025
Core infrastructure, authentication, and initial prototypes. During this phase I have built a complete API, an auth system (with email), a Mac app, and an iOS app, all of which are fully functional and passing all unit tests.
August 2025 - December 2025
Vector database system and server architecture optimization. This phase also includes framework validation and evaluating and tuning vector database search.
January 2026 - April 2026
Advanced temporal reasoning and dynamic knowledge systems. I will also adapt the Gemma-3B model to work with the vector database and the dynamic knowledge graph system, and build a model that can understand scene context from this new information.
May 2026 - June 2026
Real-world deployment and academic research publication. This means releasing a public beta that people can actually use and test, and publishing a research paper on the work completed so far.
My goal is for Orion Live to serve as a deployable visual perception agent across many different domains, offering an open-source, real-time, and privacy-first visual intelligence solution.
Real-time scene understanding for visually impaired users
Mobile apps, smart glasses, wearable devices
Visual perception for drones, robots, and autonomous vehicles
Edge computing modules, embedded systems
General-purpose visual intelligence for various applications
Cloud services, mobile applications, web platforms
Ambient intelligence for homes, offices, and public spaces
IoT devices, smart cameras, edge gateways
Unlike proprietary solutions, Orion processes data locally with optional cloud enhancement
Fully transparent, customizable, and community-driven development
Edge-optimized processing for instant visual understanding