Temporal Video Analysis for Identifying Traditional Malay Buildings Using Residual Network and Vision Transformer
Keywords:
Traditional Malay Architecture, Video Classification, Residual Network, Vision Transformer, Deep Learning
Abstract
The preservation of traditional Malay architecture faces serious challenges from a lack of digital documentation, especially as modernization gradually obscures the form and authenticity of the buildings. Essential elements such as roof shapes, raised (stilt) structures, and characteristic ornamental carvings are difficult to identify manually without specialist skills and considerable time. Malay architecture is an integral part of Indonesia's cultural heritage and needs to be documented systematically and digitally. With advances in Artificial Intelligence (AI), the identification and classification of traditional buildings can now be automated with deep learning applied to video-based visual data. This study uses a video-based deep learning approach to develop and evaluate a classification system for traditional Malay buildings. Two families of architecture are used: Residual Network (ResNet) and Vision Transformer (ViT). The dataset, consisting of videos of traditional buildings, was collected in Pekanbaru, Riau Province, and then processed through frame extraction, spatial-temporal augmentation, and visual annotation, yielding a total of 1,500 frames as training data. The novel aspect of this study is a comparison of five deep learning models: four convolutional neural networks (ResNet18, ResNet34, ResNet50, and ResNet101) and a ViT based on self-attention. ViT, which has rarely been applied to video-based classification of traditional architecture, shows competitive accuracy and proves effective at capturing global visual relationships. The models are trained with supervised learning and evaluated on classification accuracy. The test results show that all models can accurately identify visual features of Malay architecture. ResNet50 recorded the highest accuracy (100%), followed by ResNet18 (96.0%), ResNet101 (94.9%), ResNet34 (93.9%), and ViT (93.9%).
These findings strengthen the potential for utilizing deep learning in cultural preservation through a video-based automatic documentation system.
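As an illustration of the frame extraction and accuracy evaluation steps mentioned in the abstract, the following is a minimal sketch only. The abstract does not state how frames were sampled, so evenly spaced sampling is an assumption, and the function names `sample_frame_indices` and `accuracy` as well as the class labels in the usage lines are hypothetical.

```python
def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick evenly spaced frame indices from a video (assumed sampling strategy)."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    # Never ask for more frames than the video contains.
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    # Take the middle of each equal-width segment so the samples span the clip.
    return [int(step * i + step / 2) for i in range(num_samples)]


def accuracy(predicted: list[str], actual: list[str]) -> float:
    """Fraction of frames whose predicted label matches the annotated label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical usage: choose 4 frames from a 100-frame clip, then score predictions.
indices = sample_frame_indices(100, 4)
score = accuracy(["roof_a", "roof_b", "roof_a", "roof_a"],
                 ["roof_a", "roof_b", "roof_b", "roof_a"])
```

In a full pipeline, the selected indices would be used to read frames from each video before augmentation and annotation, and the accuracy function would be applied to each model's predictions on the held-out frames.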
License
Copyright (c) 2026 Sri Winiarti, Sunardi, Abdul Fadlil

This work is licensed under a Creative Commons Attribution 4.0 International License.








