Temporal Video Analysis for Identifying Traditional Malay Buildings Using Residual Network and Vision Transformer

Sri Winiarti; Sunardi; Abdul Fadlil

Authors

Sri Winiarti Universitas Ahmad Dahlan
Indonesia
Sunardi Universitas Ahmad Dahlan
Indonesia
Abdul Fadlil Universitas Ahmad Dahlan
Indonesia

Keywords:

Traditional Malay Architecture, Video Classification, Residual Network, Vision Transformer, Deep Learning

Abstract

Traditional Malay architecture is poorly documented in digital form, and its preservation faces serious challenges as modernization gradually obscures the shape and authenticity of the buildings. Essential elements such as roof shapes, stage structures, and typical ornamental carvings are difficult to identify manually without special skills and considerable time. Malay architecture is an integral part of Indonesia's cultural heritage and needs to be documented systematically and digitally. With advances in Artificial Intelligence (AI), and deep learning in particular, traditional buildings can now be identified and classified automatically from video-based visual data. This study uses a video-based deep learning approach to develop and evaluate a classification system for traditional Malay buildings. Two types of architecture are used: Residual Network (ResNet) and Vision Transformer (ViT). The dataset, consisting of videos of traditional buildings, was collected in Pekanbaru, Riau Province, and then processed through frame extraction, spatial-temporal augmentation, and visual annotation, resulting in a total of 1,500 frames as training data. The novel aspect of this study is its comparison of five deep learning models: four convolutional neural networks (ResNet18, ResNet34, ResNet50, and ResNet101) and the self-attention-based ViT. ViT, which has rarely been applied to video-based analysis of traditional architecture, achieves competitive accuracy and shows its effectiveness in capturing global visual relationships. The models are trained with supervised learning and evaluated on classification accuracy. The test results show that all models can accurately identify visual features of Malay architecture. ResNet50 recorded the highest accuracy (100%), followed by ResNet18 (96.0%), ResNet101 (94.9%), ResNet34 (93.9%), and ViT (93.9%). These findings strengthen the potential for utilizing deep learning in cultural preservation through a video-based automatic documentation system.
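As a rough illustration of two steps named in the abstract, frame extraction and accuracy-based evaluation, the following minimal Python sketch shows one common way to choose evenly spaced frames from a video and to compute classification accuracy. It is not the authors' implementation: the function names, the mid-segment sampling rule, and the class labels are hypothetical and were chosen only for illustration.

```python
def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Return evenly spaced frame indices for a video with total_frames frames.

    Assumption: frames are taken from the middle of equal-width segments;
    the study does not state which sampling rule was actually used.
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    # Never request more frames than the video contains.
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def accuracy(predicted: list[str], actual: list[str]) -> float:
    """Fraction of frames whose predicted label matches the annotated label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        return 0.0
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical labels; the abstract does not list the building classes.
indices = sample_frame_indices(total_frames=100, num_samples=4)
score = accuracy(
    ["class_a", "class_b", "class_a", "class_b"],
    ["class_a", "class_b", "class_b", "class_b"],
)
```

In a real pipeline the selected indices would be passed to a video reader to decode the frames, and the predicted labels would come from a trained ResNet or ViT model rather than from hand-written lists.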
