• Advanced Photonics Nexus
  • Vol. 4, Issue 5, 056014 (2025)
Jiahui Liao1,2,3, Xucong Bian1,2,3, Xiang’ai Cheng1,2,3, Quanjiang Li4, Yuting Jiang5, Shaozhen Lou5, Haoqian Wang1,2,3, Zixiao Hua1, Teng Li1, Jiangbin Zhang1,2,3, Zhongjie Xu1,2,3, Yueqiang Hu5,*, and Zhongyang Xing1,2,3,*
Author Affiliations
  • 1National University of Defense Technology, College of Advanced Interdisciplinary Studies, Changsha, China
  • 2National University of Defense Technology, Nanhu Laser Laboratory, Changsha, China
  • 3Hunan Provincial Key Laboratory of High Energy Laser Technology, Changsha, China
  • 4National University of Defense Technology, College of Science, Changsha, China
  • 5Hunan University, College of Mechanical and Vehicle Engineering, National Research Center for High-Efficiency Grinding, Changsha, China
  • *Corresponding authors: huyq@hnu.edu.cn, xingzhongyang@nudt.edu.cn
    DOI: 10.1117/1.APN.4.5.056014
    Jiahui Liao, Xucong Bian, Xiang’ai Cheng, Quanjiang Li, Yuting Jiang, Shaozhen Lou, Haoqian Wang, Zixiao Hua, Teng Li, Jiangbin Zhang, Zhongjie Xu, Yueqiang Hu, Zhongyang Xing, "GLSaT: a spectral-aware transformer-based network enabling highly efficient and precise inverse design in metasurface optical filters," Adv. Photon. Nexus 4, 056014 (2025)

    Abstract

    The traditional forward design process of metasurface optical filters is computationally costly and time-consuming; inverse design based on deep learning (DL) can therefore help accelerate the process. We propose the global- and local-spectrum-aware transformer (GLSaT), a DL model that accounts for the intrinsic correlations within spectral sequences, compensating for the drawbacks of current networks that focus only on structure-to-spectrum mappings. With both inter- and intra-fragment attention mechanisms implemented, the GLSaT achieves 32.9% higher accuracy than fully connected networks in our reflection tests. It also demonstrates an inherent balance between predictive precision and computational efficiency, outperforming alternative architectures. Furthermore, our extensive experimental validations demonstrate its generalization capability across diverse metasurface functionalities. The GLSaT architecture shows great potential for enhancing the efficiency of data-driven metasurface inverse design in the future.

    1 Introduction

    Metasurfaces offer precise control of light through intricately designed subwavelength elements, allowing for versatile manipulation of optical properties on a flat platform.1–3 With carefully engineered two-dimensional arrays, metasurfaces can achieve full control over the phase, amplitude, and polarization of light,4–6 thereby supporting diverse applications.7–12 One important application is optical filters, which manipulate spectra with diverse functions, including reflection, transmission, and scattering.13–17 To ensure the performance of the devices, accurate control over amplitudes is required.13,18–20 However, it is computationally expensive and time-consuming to optimize the configurations by conducting numerical full-wave simulations, e.g., with the finite-difference time-domain (FDTD) method.21–23 Therefore, an efficient and effective on-demand design method for metasurface configurations is needed.

    Artificial intelligence (AI), which serves as a powerful computational tool, has been applied to accelerate the design process.14,15,24–28 For example, a deep neural network (DNN) that has multiple hidden layers with sufficient hidden units24,29–32 can be used to discover complex relationships between structural parameters and their electromagnetic (EM) responses, including amplitude15,33,34 and phase.24,35 The datasets are generated by full-wave simulations.16,24,30,35–37 Once trained with massive pre-simulated data, the DNN can reduce the time for forward prediction to milliseconds and thus facilitate inverse design from desired spectra to structures.33,38,39 However, one high-dimensional spectrum of an ideal filter can be generated by several sets of low-dimensional structural parameters, so the inverse prediction networks may yield non-unique solutions, making the training difficult to converge. This means that the input spectrum of the inverse network is non-physical and unsuitable to be directly used for predicting the structures. Thus, the spectral data should be pre-processed into more physical spectral curves, such as through Gaussian curve shaping24 or using a probability model.16 In addition, a highly accurate forward network is often concatenated right after the inverse network, which helps accelerate training convergence by rapidly acquiring EM responses while training.16,24

    In recent years, several other deep learning (DL) models have been adopted for metasurface design. The generative adversarial network (GAN) has been used to seek the structures from the intended spectra,33 whereas networks such as the fully connected network (FCN),15,40–42 the bidirectional neural network (BNN),38 and the neural tensor network (NTN)24,35–37 are used for predicting the on-demand spectra, assisting the inverse design of structures. However, these networks mainly focus on structure-to-spectrum mapping, disregarding the intrinsic correlations embedded within spectral sequences. This can limit their ability to reproduce complex spectral features (e.g., the relative intensities or shapes of multiple peaks) even if overall accuracy appears high. Moreover, some previous studies have difficulty processing high-dimensional spectra and therefore down-sample the output (i.e., use fewer spectral points). This removes sharp resonance peaks or other fine details, leading to a loss of important spectral information.24,33,40 Transformer models based on self-attention mechanisms, which excel at capturing contextual sequence dependencies,43 can be adapted to capture spectral features. Transformer-based architectures have shown favorable performance in such applications, e.g., high-Q resonances in metasurface sensors,25 broadband solar metamaterial absorbers,44 and subtle molecular fingerprints in Raman spectra.45 However, current transformer architectures still have two main problems. First, they often require extensive layer stacking to achieve the desired performance, leading to increased model complexity and computational costs.25,45 Second, most of the methods only reuse existing attention modules without tailoring the architecture, neglecting the interdependencies among spectral fragments.25,44,45 These issues highlight the need for a highly efficient and accurate transformer architecture that can predict EM responses with fewer layers.

    Here, we propose the global- and local-spectrum-aware transformer (GLSaT), a forward prediction network, to assist the inverse design of metasurfaces. The GLSaT integrates both intra-fragment spectral information and inter-fragment correlations to address the challenge of insufficient information from low to high dimensions. By modifying the attention mechanism, the double-transformer network achieves spectral response predictions with high accuracy while reducing the number of network parameters compared to previous transformer architectures. To inverse design meta-atoms, we input an ideal high-dimensional reflection spectrum into a DNN to determine the corresponding low-dimensional structural parameters. To overcome the convergence issues, the GLSaT is concatenated after the DNN. The reconstructed spectra of GLSaT are then compared with the input spectrum of DNN to calculate the loss, which guides the gradient descent optimization. Finally, leveraging the forward network, we demonstrated single- and dual-band highly reflective spectra inverse design tasks on our dataset, as well as the design of an achromatic metalens. We further explore explainability, generalization capabilities, and the accuracy-efficiency trade-offs of our model. These contributions underscore the potential of our model as a generalizable framework for spectral analysis, offering broad applicability across a range of spectral prediction tasks.

    2 Methods

    2.1 Motivation of Transformer Modules

    Figure 1(a) illustrates an optical metasurface constructed from the simplest meta-atoms, cylinders with pitch P, diameter D, and height H, as shown in Fig. 1(b). The combination of diamond cylinders and a silica substrate can generate two-band high-reflectivity spectra, as shown in the first subplot of Fig. 1(c). More complex meta-atoms can generate broadband high reflection, high transmission, and even more complicated spectra, such as those in the second and third subplots of Fig. 1(c). Based on the application requirements, the optimal metasurface parameters are selected by analyzing these spectral features. As illustrated in Fig. 1(d), a simulation dataset is used to pre-train a data-driven forward network that maps the structure parameters to the spectrum, thereby accelerating the inverse design process. Prediction models for EM responses from low-dimensional structures to high-dimensional spectra often consider structural parameters and discrete spectral points separately, with little consideration of the relationships between them. The size of the spectral data is denoted as T×1, where T represents the number of spectral points, and we separate the spectral data into h fragments to better focus on the inter- and intra-fragment spectral information [see Fig. 1(e)]. We regard this fragment-based representation as spectral-aware data, allowing us to extract richer features from structural inputs. The ability of transformer-based networks to analyze semantic associations between tokens can be effectively transferred to spectral data, enabling them to capture correlations within spectral information.43 This enhances the accuracy of EM spectral predictions for metasurfaces. Further details on the motivation for using the transformer are provided in Fig. S1 in the Supplementary Material.

    Figure 1. (a) Schematic of optical metasurface. (b) Structural parameters of meta-atoms, such as period P, diameter D, and height H. (c) Spectra corresponding to different meta-atom configurations. The first spectrum shows high reflectivity with two bands. The second and third spectra represent more complicated spectral responses. (d) Inverse design network for metasurface optical filters. The inverse network predicts the structural parameters of the metasurface, whereas the forward network generates the predicted spectra. (e) Transformation of 1D spectral data (size T×1) into 2D by dividing it into h fragments, enabling better extraction of spectral information for transformer modules in the forward network.

    2.2 Architecture of GLSaT

    The forward network architecture, named GLSaT, is illustrated in Fig. 2 and is used for forward prediction from structure parameters to spectra. It combines inter-fragment interaction attention with intra-fragment self-attention, maximizing the use of the intrinsic spectral information implied by the structure and enabling effective feature extraction. As shown in Fig. 2(a), the model consists of two main components: a fully connected layer (FCL) module for dimensionality transformation and a dual transformer module for spectrum-relevant representation extraction.

    Figure 2. (a) GLSaT network architecture for forward prediction. (b) The projection matrices shared by the GPT and LPT modules, which generate the query (Q), key (K), and value (V) vectors. (c) The details of the cross-attention mechanism in GPT for inter-fragment multi-head attention. (d) The details of the self-attention mechanism in LPT for intra-fragment multi-head attention. (e) The norm and feed forward components shared by the GPT and LPT modules.

    The FCL module employs three residual blocks, each consisting of two FCLs, to achieve a three-stage dimensional expansion. This enlarges the initial D×1-dimensional structural parameters to T×1-dimensional spectra. More details can be found in Sec. 2.1 of the Supplementary Material.46 As the outputs of the FCL module enter the transformer, they first pass through a linear layer to standardize their dimensions to d×1, resulting in an input tensor of shape N×d×1. Before the data enter the transformer network, we apply a data representation and processing framework tailored for spectral data. By treating the h heads of the multi-head attention mechanism as the counterparts of the tokens in sentence data, the input tensor is restructured as N×h×(d/h). Building upon the strategy of employing distinct heads to focus on different spectral fragments, a two-stage global and local spectrum-aware approach is proposed, which includes cross-spectral multi-head attention for interaction among spectral fragments and intra-spectral multi-head attention for self-interaction within each spectral fragment. Therefore, we design a double-transformer architecture consisting of the global perception transformer (GPT) and the local perception transformer (LPT).
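The restructuring from N×d to N×h×(d/h) described above can be illustrated with a minimal NumPy sketch. The function name `split_into_fragments` and the toy sizes are assumptions made for illustration; the article does not give its implementation, and the trailing singleton dimension of the N×d×1 tensor is dropped here for simplicity.

```python
import numpy as np

def split_into_fragments(x, h):
    """Reshape (N, d) features into (N, h, d // h) spectral fragments.

    Hypothetical helper: each of the h fragments holds d // h
    consecutive feature values.
    """
    n, d = x.shape
    if d % h != 0:
        raise ValueError("d must be divisible by the number of fragments h")
    return x.reshape(n, h, d // h)

# Toy sizes (assumed): N = 2 samples, d = 12 features, h = 4 fragments.
features = np.arange(24, dtype=float).reshape(2, 12)
fragments = split_into_fragments(features, h=4)
print(fragments.shape)
```

Because the reshape keeps consecutive values together, each fragment corresponds to a contiguous slice of the feature vector, which is one plausible reading of "spectral fragment" in this context.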

    The data are first processed by a projection matrix to obtain the queries $Q_i$, keys $K_i$, and values $V_i$ for the multi-head attention mechanism [see Fig. 2(b)]. As illustrated in Fig. 2(c), in the first stage, the transformer captures information among spectral fragments by processing tensors of shape $N\times h_g\times(d/h_g)$, where $h_g$ is both the number of fragments and the number of heads. This setup yields attention score matrices of size $h_g\times h_g$, allowing for efficient extraction of global spectral features. The cross-attention mechanism is adopted to capture the attention among different spectral fragments, and the attention scores $S^G(Q,K)$ and attention $H^G(Q,K,V)$ are computed by Eqs. (1) and (2), with $Q$, $K$, and $V$ given in Eq. (3):

    $$S^G(Q,K)=\mathrm{Softmax}\!\left(\frac{QK^{T}}{\sqrt{h_g}}\right),\tag{1}$$

    $$H^G(Q,K,V)=S^G V,\tag{2}$$

    $$\begin{cases}Q=(Q_1,Q_2,\ldots,Q_i,\ldots,Q_{h_g})^{T}\\ K=(K_1,K_2,\ldots,K_i,\ldots,K_{h_g})^{T}\\ V=(V_1,V_2,\ldots,V_i,\ldots,V_{h_g})^{T}\end{cases},\tag{3}$$

    where $S^G(Q,K)$ is derived by calculating the dot products of $Q$ and the transpose of $K$, scaled by the square root of $h_g$, followed by the softmax operation. Thus, when the matrix multiplication is performed, every $Q_i$ interacts with every $K_j$. The attentions $H^G(Q,K,V)$ from all heads are concatenated to obtain the result of the multi-head attention mechanism $M^G(Q,K,V)$, as in Eq. (4):

    $$M^G(Q,K,V)=\mathrm{Concat}(H_1^G,\ldots,H_i^G,\ldots,H_{h_g}^G)\,W^G,\tag{4}$$

    where $H_i^G$ is the attention result of the $i$th head and $W^G$ denotes the weight parameter matrix related to all the heads. This enables the model to capture global dependencies across spectral fragments, facilitating coarse tuning for global spectral perception.
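As a rough illustration of Eqs. (1)–(4), the following NumPy sketch computes inter-fragment attention for a batch of fragmented features. It is a sketch under stated assumptions, not the authors' implementation: the projection matrices `w_q`, `w_k`, `w_v` are assumed to be shared across fragments, all weights are random placeholders, and the toy sizes are arbitrary.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def global_attention(x, w_q, w_k, w_v, w_g):
    """Inter-fragment attention following Eqs. (1)-(4).

    x: (N, h_g, d_f) fragmented features, d_f = d / h_g.
    Returns the mixed output (N, h_g * d_f) and the scores (N, h_g, h_g).
    """
    n, h_g, d_f = x.shape
    # Project every fragment to its query, key, and value (assumed shared weights).
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Eq. (1): each fragment's query meets every fragment's key.
    scores = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(h_g), axis=-1)
    # Eq. (2): weighted sum of the values.
    heads = scores @ v
    # Eq. (4): concatenate the heads and apply the output weight matrix.
    out = heads.reshape(n, h_g * d_f) @ w_g
    return out, scores

rng = np.random.default_rng(0)
N, h_g, d_f = 2, 4, 3  # toy sizes (assumed)
x = rng.normal(size=(N, h_g, d_f))
w_q, w_k, w_v = (rng.normal(size=(d_f, d_f)) for _ in range(3))
w_g = rng.normal(size=(h_g * d_f, h_g * d_f))
out, scores = global_attention(x, w_q, w_k, w_v, w_g)
print(out.shape, scores.shape)
```

Each row of `scores` is a probability distribution over the $h_g$ fragments, which is what lets one fragment draw information from all the others.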

    As illustrated in Fig. 2(d), in the second stage, the refined spectral representation is processed further to focus on fine-grained local details using tensors of shape $N\times h_l\times[1\times(d/h_l)]$, resulting in $h_l$ attention score matrices of size $1\times 1$. The data are processed by the LPT, which uses $h_l$ attention heads for intra-fragment multi-head attention. This mechanism captures the relationships within each spectral fragment separately. The attention scores $S^L(Q,K)$ are computed similarly to those of the GPT, but with queries, keys, and values all derived from the same spectral fragment, and with the sigmoid function applied to the scaled products [see Eqs. (5) and (6)]:

    $$S^L=(S_1^L,S_2^L,\ldots,S_i^L,\ldots,S_{h_l}^L)^{T},\tag{5}$$

    $$S_i^L(Q_i,K_i)=\mathrm{Sigmoid}\!\left(\frac{Q_iK_i^{T}}{\sqrt{h_l}}\right),\quad \mathrm{Sigmoid}(x)=\frac{1}{1+e^{-x}},\tag{6}$$

    $$H^L(Q,K,V)=S^L\circ V,\quad H_i^L(Q_i,K_i,V_i)=S_i^L V_i.\tag{7}$$

    The attention $H^L(Q,K,V)$ is given by the Hadamard product of $S^L$ and $V$ [see Eq. (7)]. In the calculation process, $S_i^L$ contains only the information from $Q_i$ and $K_i$, and $H_i^L$ contains only the information from $Q_i$, $K_i$, and $V_i$, all of which come from the same spectral fragment. In addition, the attentions from all heads are concatenated to obtain the result of the multi-head self-attention mechanism $M^L(Q,K,V)$, as in Eq. (8):

    $$M^L(Q,K,V)=\mathrm{Concat}(H_1^L,\ldots,H_i^L,\ldots,H_{h_l}^L)\,W^L,\tag{8}$$

    where $H_i^L$ is the attention result of the $i$th head and $W^L$ denotes the weight parameter matrix related to all the heads. This mechanism captures the internal spectral relationships within each fragment, enabling fine-tuning for local spectral perception.
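The intra-fragment mechanism of Eqs. (5)–(8) can be sketched in the same way. This is again a hedged illustration rather than the published implementation: shared projection matrices, random placeholder weights, and the toy sizes are all assumptions. The key difference from the global stage is that each fragment produces a single 1×1 gate from its own query and key, so no information crosses fragment boundaries before the final concatenation.

```python
import numpy as np

def sigmoid(z):
    """Logistic function of Eq. (6)."""
    return 1.0 / (1.0 + np.exp(-z))

def local_attention(x, w_q, w_k, w_v, w_l):
    """Intra-fragment attention following Eqs. (5)-(8).

    x: (N, h_l, d_f) fragmented features, d_f = d / h_l.
    Returns the mixed output (N, h_l * d_f) and the gates (N, h_l, 1).
    """
    n, h_l, d_f = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Eq. (6): Q_i K_i^T is a scalar per fragment, squashed by the sigmoid.
    gates = sigmoid(np.sum(q * k, axis=-1, keepdims=True) / np.sqrt(h_l))
    # Eq. (7): each fragment's value is scaled by its own gate only.
    heads = gates * v
    # Eq. (8): concatenate the heads and apply the output weight matrix.
    out = heads.reshape(n, h_l * d_f) @ w_l
    return out, gates

rng = np.random.default_rng(1)
N, h_l, d_f = 2, 4, 3  # toy sizes (assumed)
x = rng.normal(size=(N, h_l, d_f))
w_q, w_k, w_v = (rng.normal(size=(d_f, d_f)) for _ in range(3))
w_l = rng.normal(size=(h_l * d_f, h_l * d_f))
out, gates = local_attention(x, w_q, w_k, w_v, w_l)
print(out.shape, gates.shape)
```

Perturbing one fragment of `x` leaves the gates of all other fragments unchanged, which is the property the text describes as $S_i^L$ depending only on $Q_i$ and $K_i$.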

    The attention output is passed to the norm and feed forward (NFF) layer [see Fig. 2(e) and also Sec. 2.2 of the Supplementary Material (Refs. 47–49)]. The loss function of GLSaT is the mean squared error (MSE) [$L_{\mathrm{GLSAT}}$ in Eq. (9)], which measures the discrepancy between the predicted spectra $S_{\mathrm{prediction}}$ and the actual spectra $S_{\mathrm{simulation}}$. For model evaluation, we computed the mean absolute error (MAE) on the validation set [$E_{\mathrm{GLSAT}}$ in Eq. (10)], providing a robust metric of model performance. Implementation of the GLSaT can be found in Sec. 2.3 of the Supplementary Material.50,51

    $$L_{\mathrm{GLSAT}}=\frac{1}{N}\sum_{i=1}^{N}\left(S_{\mathrm{prediction}}-S_{\mathrm{simulation}}\right)^{2},\tag{9}$$

    $$E_{\mathrm{GLSAT}}=\frac{1}{N}\sum_{i=1}^{N}\left|S_{\mathrm{prediction}}-S_{\mathrm{simulation}}\right|.\tag{10}$$
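For concreteness, Eqs. (9) and (10) can be evaluated as in the short NumPy sketch below. The function names and the toy spectra are hypothetical, and averaging over spectral points as well as over samples is an assumption, since the equations only show the sum over the N samples.

```python
import numpy as np

def mse_loss(s_prediction, s_simulation):
    """Mean squared error in the spirit of Eq. (9), averaged over all entries."""
    return np.mean((s_prediction - s_simulation) ** 2)

def mae_metric(s_prediction, s_simulation):
    """Mean absolute error in the spirit of Eq. (10), averaged over all entries."""
    return np.mean(np.abs(s_prediction - s_simulation))

# Toy data (assumed): N = 2 spectra with 2 spectral points each.
s_prediction = np.array([[0.1, 0.9], [0.5, 0.5]])
s_simulation = np.array([[0.0, 1.0], [0.5, 0.7]])
print(mse_loss(s_prediction, s_simulation))    # approximately 0.015
print(mae_metric(s_prediction, s_simulation))  # approximately 0.1
```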

    2.3 Dataset Generation

    Our metasurface filter is specifically designed as a narrow-band reflector for alkali lasers with a linewidth of 10 nm at 795 nm. To generate the dataset, we start from the simplest cylindrical meta-atoms arranged in square lattices of diamond metasurfaces.52 The design is parameterized by the cylinder diameter (D), the height (H), and the distance between adjacent cylinders (P). Considering practical fabrication constraints and the requirement that the diameter D be smaller than the period P, we performed a systematic parameter sweep across these three independent variables, with uniform sampling within defined ranges [see Fig. S21(a) in the Supplementary Material]. Notably, the sampling density for H is lower than for D and P, as the spectra are observed to be relatively insensitive to changes in H. For each unique parameter combination, the reflection spectrum is computed using FDTD simulations (Lumerical Solutions, Canada) over a wavelength range of 700 to 1100 nm, comprising 300 equidistant points. As shown in Fig. S22 in the Supplementary Material, our dataset covers peak wavelengths spanning the entire spectral range of 700 to 1100 nm. Ultimately, we acquired only 2120 sample sets. Figures S21(b)-S21(c) in the Supplementary Material display the spectra samples, with their corresponding D, H, and P.

    2.4 On-Demand Inverse Design

    Figure 3 illustrates the architecture of the inverse network, which consists of a Gaussian-shape spectrum generator (GSSG) and a DNN module. The input spectrum is square-shaped, with the center wavelength and bandwidth corresponding to the parameters of a given laser in practical applications. However, the spectrum of a single-mode laser is generally approximately Gaussian, so we use the GSSG to establish a direct correspondence between the ideal spectral responses and the spectra input into the DNN. For a single-peak spectrum, the Gaussian-shaped reflectivity spectrum is set as Eq. (11), and for spectra with dual peaks, the shape is set as Eq. (12):

    $$R(\lambda)=\exp\!\left[-\frac{(\lambda-\lambda_0)^{2}}{2\sigma^{2}}\right],\tag{11}$$

    $$R(\lambda)=\exp\!\left[-\frac{(\lambda-\lambda_1)^{2}}{2\sigma_1^{2}}\right]+\exp\!\left[-\frac{(\lambda-\lambda_2)^{2}}{2\sigma_2^{2}}\right],\tag{12}$$

    where $\sigma$, $\sigma_1$, and $\sigma_2$ are the standard deviations, which are related to the bandwidth of the laser, and $\lambda_0$, $\lambda_1$, and $\lambda_2$ are the center wavelengths of the lasers.
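A minimal sketch of a Gaussian-shape spectrum generator in the spirit of Eqs. (11) and (12) is given below. The function names are hypothetical. The wavelength grid (700 to 1100 nm, 300 points) and the 795 nm center with a 10 nm linewidth follow the text, whereas the second center at 1000 nm is an arbitrary illustrative value. Interpreting the linewidth as a full width at half maximum, with $\mathrm{FWHM}=2\sqrt{2\ln 2}\,\sigma$, is an assumption: the article only states that σ is related to the bandwidth.

```python
import numpy as np

def sigma_from_fwhm(fwhm):
    """Standard deviation of a Gaussian with the given FWHM (assumed convention)."""
    return fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))

def gaussian_spectrum(wavelengths, centers, sigmas):
    """Sum of unit-height Gaussians: Eq. (11) for one center, Eq. (12) for two."""
    wl = np.asarray(wavelengths, dtype=float)[:, None]
    c = np.asarray(centers, dtype=float)[None, :]
    s = np.asarray(sigmas, dtype=float)[None, :]
    return np.exp(-((wl - c) ** 2) / (2.0 * s ** 2)).sum(axis=1)

wavelengths = np.linspace(700.0, 1100.0, 300)  # nm, grid stated in the text
sigma = sigma_from_fwhm(10.0)                  # 10 nm linewidth, assumed FWHM

single_peak = gaussian_spectrum(wavelengths, [795.0], [sigma])
dual_peak = gaussian_spectrum(wavelengths, [795.0, 1000.0], [sigma, sigma])
print(wavelengths[np.argmax(single_peak)])
```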

    Figure 3. Architecture of the meta-filter design network.

    The DNN consists of three consecutive fully connected hidden layers containing T, 0.6T, and 0.3T neurons, respectively. The input layer takes the target spectrum consisting of T points, and the output layer produces a D×1 vector representing the geometric parameters of the metasurfaces. These predicted parameters are then fed into the cascaded GLSaT, where the EM response of the design is assessed. During training, the weights and biases in the DNN layers are iteratively updated, whereas those of the GLSaT remain fixed. As training progresses, the DNN progressively refines its output, ultimately becoming capable of producing high-reflectivity metasurface designs with a single computation.

    The input spectra of DNN and predicted spectra from GLSaT are compared to calculate the loss function, which is minimized using gradient descent to optimize the DNN parameters and improve the accuracy of structural predictions. To use the simulation-generated dataset for supervised training of the inverse network, we perform a thresholding operation to binarize the original dataset, resulting in a dataset suitable for training the inverse network. We refine the dataset spectra based on a predefined demand threshold (Idemand), which we set at 0.9. This threshold filters the spectra to retain only binary spectra containing values of 0 or 1.
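The thresholding step can be illustrated with a one-line NumPy operation. The function name is hypothetical, and whether values exactly equal to the threshold map to 1 is an assumption, since the text only states that the spectra are binarized at a demand threshold of 0.9.

```python
import numpy as np

def binarize_spectra(spectra, i_demand=0.9):
    """Map reflectivity values to 1 where they reach i_demand, else 0 (assumed >=)."""
    return (np.asarray(spectra) >= i_demand).astype(float)

# Toy reflectivity values (assumed) for a single spectrum.
toy_spectrum = np.array([0.20, 0.90, 0.95, 0.89])
binary_spectrum = binarize_spectra(toy_spectrum)
print(binary_spectrum)
```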

    The DNN is trained using the loss function defined in Eq. (13), which combines the spectral loss with boundary and aspect-ratio penalties [Eqs. (14)–(16)] to constrain the structural parameters. Here, the structural parameters predicted by the DNN are $P_{\mathrm{DNN}}=[p_1,p_2,\ldots,p_D]$, and the aspect ratios are $R_{\mathrm{DNN}}=[r_1/h_1,r_2/h_2,\ldots,r_D/h_D]$. The spectra predicted by GLSaT are denoted as $S_{\mathrm{GLSaT}}$, and the spectra processed by the GSSG are represented as $S_{\mathrm{GSSG}}$. Considering the application of a narrowband metasurface optical filter to the filtering demands of one or two single-wavelength lasers, we introduce a weight factor $W_S$ ($W_S=S_{\mathrm{GSSG}}^{3}$) to amplify peak features, enabling the network to prioritize capturing key spectral characteristics. More importantly, for other applications, such as broadband filters and metalens design, we can modify the input parameters, the weights, or even the loss function to focus on the desired spectral features (see Sec. 2.4 of the Supplementary Material for a modified inverse design scheme tailored for metalens applications53,54). For validation, Eq. (17) assesses the peak wavelength shift ($|\lambda-\lambda_0|$), the peak deviation ($|\tau-\tau_0|$), and the deviation of the full width at 0.9 maximum (FW0.9M) ($|\chi-\chi_0|$), which rigorously evaluates the accuracy of reproducing the target spectrum.

    $$L_{\mathrm{DNN}}=\alpha\frac{1}{N}\sum_{i=1}^{N}W_S\left(S_{\mathrm{GLSaT}}-S_{\mathrm{GSSG}}\right)^{2}+\frac{1}{N}\sum_{i=1}^{N}\left(\beta L_{\mathrm{bound}}+\gamma L_{\mathrm{ratio}}\right),\tag{13}$$

    $$L_{\mathrm{bound}}=\mathrm{softplus}(P_{\min}-P_{\mathrm{DNN}})^{2}+\mathrm{softplus}(P_{\max}-P_{\mathrm{DNN}})^{2},\tag{14}$$

    $$L_{\mathrm{ratio}}=\mathrm{softplus}(R_{\min}-R_{\mathrm{DNN}})^{2}+\mathrm{softplus}(R_{\max}-R_{\mathrm{DNN}})^{2},\tag{15}$$

    $$\mathrm{softplus}(x)=\log(1+e^{x}),\tag{16}$$

    $$E_{\mathrm{GLSaT}}=\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{D}\left[\left(\frac{\lambda_k-\lambda_0}{\lambda_{\max}-\lambda_{\min}}\right)^{2}+(\tau_k-\tau_0)^{2}+\left(\frac{\chi_k-\chi_0}{\lambda_{\max}-\lambda_{\min}}\right)^{2}\right].\tag{17}$$
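To show how the weight factor $W_S=S_{\mathrm{GSSG}}^{3}$ emphasizes the peaks, the sketch below evaluates only the spectral term of Eq. (13), together with the softplus function of Eq. (16). The function names and toy spectra are hypothetical, α is set to 1 for illustration, and averaging over spectral points is an assumption. The boundary and aspect-ratio penalties are deliberately left out.

```python
import numpy as np

def softplus(x):
    """log(1 + e^x) as in Eq. (16), computed stably via logaddexp."""
    return np.logaddexp(0.0, x)

def weighted_spectral_loss(s_glsat, s_gssg, alpha=1.0):
    """Spectral term of Eq. (13) with the peak-amplifying weight W_S = S_GSSG**3."""
    w_s = s_gssg ** 3
    return alpha * np.mean(w_s * (s_glsat - s_gssg) ** 2)

# Toy target (assumed): off-peak, shoulder, and peak of a Gaussian-shaped target.
target = np.array([0.0, 0.5, 1.0])
error_at_peak = target + np.array([0.0, 0.0, -0.2])
error_off_peak = target + np.array([0.2, 0.0, 0.0])

loss_peak = weighted_spectral_loss(error_at_peak, target)
loss_off_peak = weighted_spectral_loss(error_off_peak, target)
print(loss_peak, loss_off_peak)
```

The same absolute error of 0.2 contributes to the loss when it falls on the peak but vanishes when it falls where the target is zero, which is the prioritization of peak features that the text describes.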

    In particular, when fabrication errors are considered, we can extend the evaluation of the feature performance metric $E_{\mathrm{GLSaT}}$ by adopting a sigma-point-based unscented transform (UT). In this approach, each structural parameter vector $p$ is perturbed by a small deviation $\Delta p\sim\mathcal{N}(0,\Sigma)$ that models Gaussian fabrication noise. Instead of exhaustive sampling, a set of $2d+1$ sigma points $\{p^{(k)}\}$ is generated from the covariance $\Sigma$, each associated with a predefined weight $w^{(k)}$. The robust feature error is then computed as a weighted combination:

    $$E_{\mathrm{GLSaT}}^{\mathrm{rob}}\approx\sum_{k=0}^{2d}w^{(k)}E_{\mathrm{GLSaT}}\!\left(p^{(k)}\right),\tag{18}$$

    which efficiently approximates the expectation of the wavelength-shift, peak, and bandwidth deviations under fabrication uncertainties. This formulation allows the framework to rigorously assess not only the nominal accuracy but also the robustness of the inverse-designed structures against realistic process variations, where $E_{\mathrm{GLSaT}}^{\mathrm{rob}}<0.05$ indicates excellent robustness, 0.05 to 0.1 is acceptable, and values above 0.1 suggest noticeable performance fluctuations requiring further optimization or higher fabrication precision.
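The sigma-point evaluation can be sketched as follows, assuming the basic symmetric unscented transform with a scaling parameter κ. The article does not state which sigma-point scheme or weights are used, so `kappa`, the function names, the toy parameter vector, and the quadratic stand-in for $E_{\mathrm{GLSaT}}$ are all assumptions made for illustration.

```python
import numpy as np

def sigma_points(p, cov, kappa=1.0):
    """Generate 2d + 1 sigma points and weights (basic symmetric UT, assumed)."""
    p = np.asarray(p, dtype=float)
    d = p.size
    # Columns of the Cholesky factor of (d + kappa) * cov give the offsets.
    chol = np.linalg.cholesky((d + kappa) * np.asarray(cov, dtype=float))
    points = np.vstack([p, p + chol.T, p - chol.T])  # shape (2d + 1, d)
    weights = np.full(2 * d + 1, 1.0 / (2.0 * (d + kappa)))
    weights[0] = kappa / (d + kappa)
    return points, weights

def robust_error(metric, p, cov, kappa=1.0):
    """Weighted combination of the metric over the sigma points."""
    points, weights = sigma_points(p, cov, kappa)
    return float(sum(w * metric(x) for w, x in zip(weights, points)))

def toy_metric(x):
    """Hypothetical quadratic stand-in for the feature metric."""
    return float(np.sum(x ** 2))

p_nominal = np.array([1.0, 2.0, 3.0])        # toy structural parameters (assumed)
cov_fab = np.diag([0.01, 0.04, 0.09])        # toy fabrication covariance (assumed)
points, weights = sigma_points(p_nominal, cov_fab)
e_rob = robust_error(toy_metric, p_nominal, cov_fab)
print(points.shape, e_rob)
```

For a quadratic metric the unscented transform reproduces the exact expectation, here the nominal value plus the trace of the covariance, which gives a simple sanity check before substituting a real spectral metric evaluated through the forward network.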

    3 Results and Discussion

    3.1 GLSaT Performance

    3.1.1 Predicted results of our dataset

    To check the accuracy of the fully trained GLSaT, we provide the results of 212 testing data samples for a comparison between the simulated and predicted reflective spectra, as shown in Figs. 4(a)-4(h). The difference is quantified using MAE, labeled above each comparison figure, with the absolute error at each spectral point (error) represented as a blue shaded region on the plot, which shows the accuracy of the forward-trained network. In Fig. 4(i), their difference is further measured by absolute error for every spectral point with an averaged value of only 0.0063. The agreement between the simulated and predicted results demonstrates the accuracy of our model, which shows the ability of our GLSaT model in capturing the spectral sequence information of metasurface elements.

    Figure 4. (a)–(h) Comparison of simulated and predicted reflection spectra for some testing samples. (i) Summary of the absolute error across all spectral points (212×300=63,600), with an average error of 0.63%.

    3.1.2 Ablation study and explainability

    To evaluate the contributions of each component in the GLSaT architecture and verify the necessity of its dual-transformer layers, namely GPT and LPT, we conducted ablation studies on the metasurface reflector dataset.

    Figure 5(a) presents the evolution of the validation MAE over 5000 epochs for each configuration: the complete GLSaT architecture, GLSaT without GPT, GLSaT without LPT, GLSaT without both transformer modules, and the baseline FCN. The zoomed-in subplot highlights the performance of GLSaT, which consistently achieves the lowest MAE. Removing either transformer layer increases the MAE, and removing both leads to further degradation. Even without both transformer modules, the network still benefits from other components, such as the residual block and 1D convolution, which maintain its performance advantage over the FCN. The results underscore the essential role of the transformer modules in capturing the complex relationships of spectral fragments and achieving better performance. As a complementary analysis, Fig. 5(b) shows the histogram of the absolute error (error) between the simulated and predicted spectra across the spectral points of the held-out testing dataset. GLSaT exhibits the smallest errors across spectral points, indicating its high predictive accuracy. Configurations with one transformer layer removed exhibit slightly worse error distributions, whereas the absence of both transformer modules produces a significant rightward shift, indicative of reduced accuracy. The baseline FCN configuration performs the worst, with the highest error values distributed broadly across spectral points; relative to this baseline, GLSaT achieves 32.9% higher accuracy. These results highlight the pivotal role of the transformer modules, especially the synergy between GPT and LPT, in jointly capturing global and local spectral features for robust and precise predictions.

    Figure 5. Evaluation of the GLSaT architecture through ablation experiments. (a) Validation MAE as a function of training epochs 1 to 5000 for the GLSaT architecture and its ablated variants. Configurations include the complete GLSaT model (red), GLSaT without the global perception transformer (GPT, orange), GLSaT without the local perception transformer (LPT, yellow), GLSaT without both transformer layers (green), and a baseline FCN (blue). The inset zooms in on the final 600 epochs, showing the superior performance and faster convergence of GLSaT compared with the ablated configurations and the baseline. (b) Distribution of absolute error (error) across the spectral points of the testing set for the same configurations. The x-axis represents the error values on a logarithmic scale, whereas the y-axis denotes the number of spectral points corresponding to each error value. (c) Case study of the inter-token multi-head attention score distribution over heads of GPT for epochs of 10, 20, 180, 350, 540, 1090, 1830, and 2000. The regions highlighted by the red bounding boxes represent attention patterns observed during spectral information enrichment. The heatmap values represent the logarithmic scale of the attention score. (d) Case study of the intra-token multi-head attention score distribution over heads of LPT for epochs from 1 to 2000. (e) Case study of the outputs of different layers at epoch 3000: dimension-enhancing residual module (FCL), start linear of GPT (linear), attention mechanism of GPT (GPT attention), norm and feed forward of GPT (NFF), attention mechanism of LPT (LPT attention), the NFF layer of LPT (NFF), and 1D convolutional layers (1Dconv). X is the input of the first transformer GPT, Zg is the output of GPT, Zl is the output of LPT, and Y is the final output of GLSaT. The blue box represents GPT, whereas the pink box represents LPT. Panels (c)–(e) come from the same sample.

    To probe the explainability of GLSaT further, we analyzed one randomly selected training sample. Figure 5(c) presents the inter-fragment multi-head attention distribution of GPT across training epochs. In the first 20 epochs, attention is dispersed because the model has not yet learned the structure-spectrum relations. As the training progresses, attention gradually concentrates, expands its coverage, and finally achieves global attention. At epochs 1830 and 2000, attention stabilizes as the validation MAE reaches 0.008, indicating that the model has learned the hidden spectral information in the structure. A similar trend appears in Fig. 5(d), which shows the intra-fragment attention distributions of LPT from epochs 1 to 2000. The local attention mechanism gradually refines its focus, ultimately converging on critical features within the spectral sequence, such as peaks and troughs. These key points act as reference anchors, enabling precise reconstruction of the entire spectral sequence.

    To further illustrate the role of the double-attention mechanism in spectral feature extraction, we present in Fig. 5(e) the outputs of different modules in the GLSaT model: the dimension-enhancing residual module (FCL), the start linear layer of GPT (linear), the attention mechanism of GPT (GPT attention), the NFF layer of GPT (NFF), the attention mechanism of LPT (LPT attention), the NFF layer of LPT (NFF), and the 1D convolutional layers (1Dconv). The blue box represents GPT, whereas the pink box represents LPT. In the prediction process from structural parameters to spectra, the FCL module first increases the dimensionality of the input data (seen in ①), generating complex high-frequency information. The start linear layers in GPT further enhance this high-frequency information (②). Although the model has by then extracted a large amount of useful information, the proportion of redundant information has also risen. To enhance the acquisition of relevant information, the global attention mechanism in GPT leverages the global interactions among spectral fragments (represented by the small red box and dashed lines in ②), thereby increasing the proportion of spectral-related information (③). After processing through NFF (④), the inputs to LPT contain richer high-frequency information than the output of the FCL. Next, the LPT performs local attention perception on the spectrum, carrying out local selection and fine-tuning within spectral fragments (⑤). After further processing by NFF, the relevant features are extracted and the redundancy is removed, so that the final outputs (⑥) contain key information such as spectral peak magnitude, position, rising/falling trends, and implied resonance features (represented by the small red box and dashed lines in ⑤ and ⑥). Through 1D convolutional smoothing, the model ultimately generates a predicted spectrum with high accuracy (⑦).

    We verified the effectiveness of the GLSaT model through ablation experiments and concluded that GLSaT adheres to the information bottleneck theory [55–57], as described in Eqs. (19) and (20):

    $$I(Z, X) \le \min\{I(Z_g, X),\ I(Z_l, X)\}, \tag{19}$$

    $$I(Z, Y) \ge \max\{I(Y, Z_g),\ I(Y, Z_l)\}, \tag{20}$$

    where $I(a, b)$ represents the mutual information between variables $a$ and $b$. The relationship among the input variable ($X$), latent variable ($Z$), and output variable ($Y$) plays a key role in spectral information extraction and prediction. We denote the latent variable of GPT alone as $Z_g$ and that of LPT alone as $Z_l$. Based on the information bottleneck theory, the results indicate that the redundancy between the compressed representation $Z$ and the input variable $X$ does not exceed the redundancy between $Z_g$ and $X$ or between $Z_l$ and $X$. In addition, the correlation between $Z$ and the output variable $Y$ is at least as large as the correlation between $Z_g$ and $Y$ or between $Z_l$ and $Y$. In other words, the GPT utilizes inter-fragment multi-head attention to capture associations among spectral fragments, thereby overcoming the information bottleneck caused by the low-to-high dimensionality transition. Meanwhile, the LPT applies intra-fragment self-attention to refine spectral details and extract target-relevant representations, aiming to reduce redundancy and accurately identify spectral features. This double-layer transformer configuration enables a comprehensive integration of inter- and intra-fragment spectral features, effectively addressing the limitations of traditional attention mechanisms.

    3.1.3 Generalization and contrast

    To evaluate the generalization capability of the GLSaT model, we conducted extensive experiments across seven diverse datasets: diamond reflectors, SiC reflectors, broadband sound absorbers [16], four-resonator transmission systems [42], nanophotonic scattering particles [40], cylindrical metasurfaces, and H-shaped metasurfaces [24]. The datasets vary in complexity to test the robustness of GLSaT, with sample sizes ranging from 2120 to 200,000, structural parameter counts between 3 and 18, and target points spanning 201 to 2002. All training procedures are conducted for fewer than 5000 epochs. The performance metrics for each dataset, summarized in Table 1, include training set MSE, validation set MAE, test set evaluations, and comparison with the source papers. Comparisons of simulated and predicted spectra for some testing samples of each dataset are shown in Fig. S2 (SiC reflector), Fig. S3 (sound absorber), Fig. S4 (four resonators), Fig. S5 (nanophotonic particle), Fig. S6 (cylindrical metasurface), and Figs. S7–S9 (H-shaped metasurface) in the Supplementary Material.

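    | Dataset | Source | Spectral type | Structural dimension | Spectral dimension | Sample number | Source model | Source metrics | GLSaT train MSE | GLSaT val MAE | GLSaT test metrics | Comparison |
    |---|---|---|---|---|---|---|---|---|---|---|---|
    | Diamond reflector | Ours | Reflectivity | 3 | 300 | 2120 | GLSaT | - | 1.04×10⁻⁵ | 5.3×10⁻³ | MSE: 3.28×10⁻⁴; MAE: 6.3×10⁻³ | - |
    | SiC reflector | Ours | Reflectivity | 3 | 300 | 2520 | GLSaT | - | 8.02×10⁻⁶ | 4.2×10⁻³ | MSE: 2.44×10⁻⁴; MAE: 4.9×10⁻³ | - |
    | Sound absorber [16] | 2024 MSSP | Absorption | 18 | 600 | 200,000 | FCN | MAE: 1.89×10⁻²; min MAE: 4.1×10⁻³ | 1.86×10⁻⁵ | 3.8×10⁻³ | MAE: 3.8×10⁻³; min MAE: 1.2×10⁻³ | +79.9%; +70.7% |
    | Four resonators [42] | 2019 Opt. Express | Transmission | 8 | 2002 | 20,998 | FCN fed by combined geometric inputs | MSE: 1.6×10⁻³; 99% MSE: 6.2×10⁻³; 95% MSE: 3.4×10⁻³ | 5.14×10⁻⁷ | 5.5×10⁻³ | MSE: 8.72×10⁻⁵; 99% MSE: 6.16×10⁻⁴; 95% MSE: 2.47×10⁻⁴ | +94.6%; +90.1%; +92.7% |
    | Nanophotonic particle [40] | 2018 Sci. Adv. | Scattering-section | 5 | 201 | 5000 | FCN | MRE: 0.46% | 3.26×10⁻⁴ | 7.2×10⁻³ | MRE: 0.32% | +30.4% |
    | Cylindrical metasurface [24] | 2019 ACS Photonics | Imag (S21) | 4 | 301 | 65,611 | NTN & down-sampled | MSE: 2.3×10⁻⁴ | 1.59×10⁻⁴ | 5.7×10⁻³ | MSE: 2.6×10⁻⁴ | −13% |
    | H-shaped metasurface [24] | 2019 ACS Photonics | Real (S21) | 6 | 301 | 24,800 | NTN & down-sampled | - | 2.78×10⁻⁴ | 1.25×10⁻² | MSE: 1.1×10⁻³; MAE: 1.26×10⁻² | - |
    | H-shaped metasurface [24] | 2019 ACS Photonics | Imag (S21) | 6 | 301 | 24,800 | NTN & down-sampled | - | 2.77×10⁻⁴ | 1.26×10⁻² | MSE: 1.2×10⁻³; MAE: 1.26×10⁻² | - |
    | H-shaped metasurface [24] | 2019 ACS Photonics | Transmission (S21) | 6 | 301 | 24,800 | NTN & down-sampled | Accuracy: 99.4% | - | - | Accuracy: 99.9% | +0.5% |
    | H-shaped metasurface [24] | 2019 ACS Photonics | Phase (S21) | 6 | 301 | 24,800 | NTN & down-sampled | Accuracy: 99.3% | - | - | Accuracy: 99.4% | +0.1% |
    | H-shaped metasurface [24] | 2019 ACS Photonics | Transmission | 6 | 301 | 24,800 | NTN & down-sampled | - | 2.56×10⁻⁴ | 1.33×10⁻² | MSE: 1.36×10⁻³; MAE: 1.13×10⁻² | - |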

    Table 1. Generalization and contrast for GLSaT.

    For the absorption spectrum of the broadband sound absorber acoustic metasurface, despite the complexity of having 18 structural parameters and 600 prediction points, GLSaT achieves an average MAE of 0.0038 on both the validation and test sets, which is lower than the minimum MAE of 0.0041 reported in the source paper. This highlights the high accuracy of our model and demonstrates its robust fitting capability when provided with abundant samples. In the case of a complex metasurface with four resonators and eight structural parameters, 99% of the data had an MSE below 6.16×10⁻⁴, and 95% had an MSE below 2.47×10⁻⁴, outperforming the metrics from the source paper.
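    The percentages in the "Comparison" row of Table 1 appear consistent with a relative error reduction, (source − GLSaT)/source. The formula is not stated in the text, so the sketch below treats it as an assumption and simply checks it against the values listed in Table 1; the function name `relative_improvement` is illustrative.

```python
def relative_improvement(source: float, ours: float) -> float:
    """Percentage reduction of an error metric relative to the source value.

    Assumed definition (not stated in the text): (source - ours) / source.
    Positive values mean the new model has the lower error.
    """
    return (source - ours) / source * 100.0


# (source value, GLSaT value) pairs as listed in Table 1
pairs = {
    "sound absorber, MAE": (1.89e-2, 3.8e-3),
    "sound absorber, min MAE": (4.1e-3, 1.2e-3),
    "four resonators, MSE": (1.6e-3, 8.72e-5),
    "four resonators, 99% MSE": (6.2e-3, 6.16e-4),
    "four resonators, 95% MSE": (3.4e-3, 2.47e-4),
    "nanophotonic particle, MRE": (0.46, 0.32),
    "cylindrical metasurface, MSE": (2.3e-4, 2.6e-4),
}

improvements = {
    name: relative_improvement(source, ours)
    for name, (source, ours) in pairs.items()
}

for name, value in improvements.items():
    print(f"{name}: {value:+.1f}%")
```

    Under this assumed definition, the cylindrical metasurface entry comes out negative because the GLSaT test MSE is slightly higher than the value reported for the down-sampled source model.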

    To further benchmark the performance of GLSaT against alternative network architectures, we conducted systematic comparisons with several representative models, including an FCN, a BNN, a temporal convolutional network (CNN), an NTN, a long short-term memory-based recurrent neural network (LSTM), and the OptoGPT model [58]. OptoGPT is a transformer variant under the Opto framework, designed for multilayer thin-film spectral prediction. All models were trained on the cylindrical metasurface dataset containing 65,611 samples, originally sampled at 301 spectral points. To evaluate robustness across different spectral resolutions, we additionally generated datasets with 151 and 61 sampling points through uniform down-sampling. Training was carried out under identical conditions, employing the OneCycleLR scheduler (maximum learning rate of 2×10⁻³, initial learning rate of 4×10⁻⁴, warm-up ratio of 20%, initial divisor of 5.0, and final divisor of 5×10³) and the AdamW optimizer with a weight decay of 5×10⁻⁵. The same training, validation, and test splits, batch sizes, number of epochs (2000), and mean squared error (MSE) loss function were used to ensure fairness. Detailed network architectures, training dynamics, and performance histories for each evaluated model are provided in Figs. S10–S16 in the Supplementary Material.
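    As a minimal sketch of two details in this setup, the code below (i) down-samples a 301-point spectrum to 151 and 61 points with a fixed stride, which is one way to realize "uniform down-sampling" (the exact procedure used by the authors is not given here), and (ii) derives the learning rates implied by the quoted divisors, assuming the PyTorch `OneCycleLR` convention in which the initial rate is `max_lr / div_factor` and the final rate is `initial_lr / final_div_factor`, with the final divisor read as 5×10³. The helper name `uniform_downsample` is hypothetical.

```python
def uniform_downsample(spectrum, n_points):
    """Keep n_points equally spaced samples, including both end points.

    Assumes a fixed integer stride, so (n_points - 1) must divide
    (len(spectrum) - 1).
    """
    stride, remainder = divmod(len(spectrum) - 1, n_points - 1)
    if remainder != 0:
        raise ValueError("n_points - 1 must divide len(spectrum) - 1")
    return spectrum[::stride]


# Stand-in for a 301-point spectrum: the values are just the point indices.
full = list(range(301))
sub151 = uniform_downsample(full, 151)  # stride 2
sub61 = uniform_downsample(full, 61)    # stride 5

# Learning-rate values implied by the quoted scheduler settings
# (parameter names follow PyTorch's OneCycleLR; this is an assumption).
max_lr = 2e-3
div_factor = 5.0          # "initial divisor"
final_div_factor = 5e3    # "final divisor"
pct_start = 0.2           # "warm-up ratio of 20%"

initial_lr = max_lr / div_factor            # 4e-4, matching the text
final_lr = initial_lr / final_div_factor    # 8e-8

print(len(sub151), len(sub61), initial_lr, final_lr)
```

    With 2000 epochs, a warm-up ratio of 20% corresponds to the learning rate rising for roughly the first 400 epochs before annealing.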

    In Table 2, we report the Overfitting Index, which is defined as

    $$\text{Overfitting Index} = \frac{L_{\mathrm{val,end}} - L_{\mathrm{val,min}}}{L_{\mathrm{val,min}}}, \tag{21}$$

    where $L_{\mathrm{val,end}}$ denotes the validation loss at the final epoch and $L_{\mathrm{val,min}}$ represents the minimum validation loss observed during training. This metric quantifies the degree of degradation relative to the optimal validation point, with larger values indicating more severe overfitting.
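    The definition above translates directly into a few lines of code. The sketch below assumes the validation loss is available as a per-epoch list; the loss values are made up for illustration and are not taken from the paper.

```python
def overfitting_index(val_losses):
    """(L_val,end - L_val,min) / L_val,min for per-epoch validation losses."""
    l_min = min(val_losses)   # best validation loss seen during training
    l_end = val_losses[-1]    # validation loss at the final epoch
    return (l_end - l_min) / l_min


# Hypothetical history: the loss bottoms out at 0.010 and ends at 0.011.
history = [0.050, 0.020, 0.010, 0.012, 0.011]
index = overfitting_index(history)

# A run whose final epoch is also its best epoch has an index of zero.
monotone_index = overfitting_index([0.3, 0.2, 0.1])

print(index, monotone_index)
```

    An index of about 0.1 means the final validation loss is roughly 10% above the best value reached during training.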

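    | Metric | GLSaT | FCN | BNN | CNN | NTN | LSTM | OptoGPT |
    |---|---|---|---|---|---|---|---|
    | Number of parameters | 15,527,678 | 2,748,076 | 16,371,774 | 6,043,118 | 6,866,390 | 4,690,174 | 33,910,573 |
    | Average inference time | 11.960 ms | 7.972 ms | 6.312 ms | 9.302 ms | 18.604 ms | 11.628 ms | 12.957 ms |
    | Per-epoch training time | 8.6395 s | 2.3784 s | 4.7049 s | 5.2655 s | 10.3487 s | 6.6998 s | 10.3565 s |
    | Test set MAE, 301 points | 0.0077 | 0.0228 | 0.0285 | 0.0171 | 0.0428 | 0.0241 | 0.0465 |
    | Test set MAE, 151 points | 0.0094 | 0.0240 | 0.0300 | 0.0142 | 0.0429 | 0.0284 | 0.0437 |
    | Test set MAE, 61 points | 0.0080 | 0.0235 | 0.0267 | 0.0164 | 0.0425 | 0.0274 | 0.0333 |
    | Average of test set MAE | 0.0084 | 0.0234 | 0.0284 | 0.0159 | 0.0427 | 0.0266 | 0.0412 |
    | Performance consistency | 0.1105 | 0.0284 | 0.1106 | 0.1356 | 0.0069 | 0.1252 | 0.1899 |
    | Overfitting index, 301 points | 0.0174 | 0.7612 | 0.0532 | 0.0031 | 0.0008 | 0.4868 | 0.1425 |
    | Overfitting index, 151 points | 0.0235 | 0.2593 | 0.0990 | 0.0084 | 0.0033 | 0.8909 | 0.5117 |
    | Overfitting index, 61 points | 0.0096 | 0.1842 | 0.0335 | 0.0015 | 0.0039 | 0.4521 | 1.2893 |
    | Average of overfitting index | 0.0168 | 0.4016 | 0.0619 | 0.0043 | 0.0027 | 0.6099 | 0.6478 |
    | Overfitting stability | 0.3388 | 0.6379 | 0.4428 | 0.6829 | 0.5151 | 0.3265 | 0.7377 |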

    Table 2. Comparison of different neural network models.

    The comparative results are summarized in Table 2. GLSaT achieves the lowest average test set MAE across the three spectral resolutions (0.0084), well below conventional architectures such as FCN (0.0234) and BNN (0.0284), with a coefficient of variation of 0.1105 across resolutions. In terms of generalization robustness, GLSaT maintains a low average overfitting index (0.0168) with a stability value of 0.3388, whereas baseline models such as FCN and LSTM exhibit severe overfitting with large index values (0.4016 and 0.6099, respectively). Notably, although CNN achieves competitive results with low MAE and overfitting index, a direct comparison of the training histories in Figs. S10 and S13 in the Supplementary Material reveals that GLSaT not only attains higher prediction accuracy but also converges more rapidly, underscoring the effectiveness of its dual-spectrum attention mechanism. Meanwhile, OptoGPT, despite being a large-scale transformer model with over 33 million parameters, exhibits inferior accuracy and stability on this task. This performance gap may arise from the fact that OptoGPT was originally tailored for multilayer thin-film spectral prediction, rendering its architecture less suited to the spectral characteristics of metasurface unit cells. Taken together, these results indicate that GLSaT offers a favorable balance among accuracy, convergence efficiency, and generalization stability, supporting its use as a forward prediction backbone for metasurface inverse design.
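    The averaged and stability rows of Table 2 can be spot-checked from the three per-resolution entries. The sketch below assumes the stability metric is a coefficient of variation computed with the population standard deviation; the estimator is not stated in the text, but this choice comes close to the "Overfitting stability" entries for GLSaT and FCN when applied to the rounded table values.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (assumed estimator)."""
    return pstdev(values) / mean(values)


# Overfitting indices at 301, 151, and 61 points, copied from Table 2
glsat_overfitting = [0.0174, 0.0235, 0.0096]
fcn_overfitting = [0.7612, 0.2593, 0.1842]

glsat_avg = mean(glsat_overfitting)                       # about 0.0168
glsat_cv = coefficient_of_variation(glsat_overfitting)    # about 0.338
fcn_cv = coefficient_of_variation(fcn_overfitting)        # about 0.638

print(round(glsat_avg, 4), round(glsat_cv, 4), round(fcn_cv, 4))
```

    Applied to the rounded MAE entries, the same helper need not reproduce the "Performance consistency" row exactly, since neither the estimator nor the unrounded values are given here.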

    In addition, we conducted experiments with different training data sizes to evaluate the data dependence of GLSaT (see Sec. 5 of the Supplementary Material for details), which further confirms the robustness of GLSaT across varying data volumes.

    To predict the transmission precisely, researchers tend to change the physical variables to be predicted, using either the combination of amplitude and phase or the real and imaginary parts of the complex transmission coefficient t [24,30]. For the cylindrical metasurface dataset, which the source paper modeled with an NTN, we evaluated the transmission coefficient (S21), with GLSaT achieving an MSE of 2.6×10⁻⁴ on the test set, comparable to the 2.3×10⁻⁴ reported in the source paper. However, the source paper employed down-sampling, reducing the spectral points from 301 to 31, which may have led to a loss of finer spectral details. By contrast, GLSaT predicts the spectrum at its full resolution. For the H-shaped metasurface configuration, we predicted both the real and imaginary parts of the S21 parameter, as well as the transmission spectrum. By utilizing the relationship between the S21 parameter and the transmission and phase, GLSaT is able to obtain the transmission spectrum and phase from the predicted real part of S21 [real(S21)] and imaginary part of S21 [imag(S21)]. Here, the transmission coefficient t is obtained from the scattering parameters, among which S21 represents t, and its intensity is the transmission T [Eq. (22)]. In addition, the phase Ph of t is the arctangent of the ratio of imag(S21) to real(S21) [Eq. (23)]:

    $$t = \mathrm{real}(S_{21}) + j\,\mathrm{imag}(S_{21}), \quad T = |t|^2, \tag{22}$$

    $$Ph = \tan^{-1}\left[\frac{\mathrm{imag}(S_{21})}{\mathrm{real}(S_{21})}\right]. \tag{23}$$
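    The conversion from predicted real and imaginary parts of S21 to transmission and phase can be sketched as follows. The function name and sample values are illustrative. Note one deliberate substitution: the code uses `math.atan2`, which returns the same angle as the arctangent of the ratio when real(S21) is positive but also resolves the quadrant and avoids division by zero otherwise; whether the authors unwrap or otherwise post-process the phase is not stated here.

```python
import math


def transmission_and_phase(real_s21, imag_s21):
    """Return (T, Ph) per spectral point from the real and imaginary parts of S21.

    T is |t|^2 with t = real + j*imag; Ph is the argument of t in radians.
    """
    transmission = [r * r + i * i for r, i in zip(real_s21, imag_s21)]
    phase = [math.atan2(i, r) for r, i in zip(real_s21, imag_s21)]
    return transmission, phase


# Three hypothetical spectral points (not data from the paper)
real_part = [0.6, 0.0, -0.5]
imag_part = [0.8, 1.0, 0.5]
T, Ph = transmission_and_phase(real_part, imag_part)

print(T)
print(Ph)
```

    For the first point, real(S21) is positive, so the result coincides with the plain arctangent of 0.8/0.6; the second and third points show cases where the plain ratio would be undefined or land in the wrong quadrant.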

    Figure S23 in the Supplementary Material illustrates the transmission and phase responses of the H-shaped metasurface obtained from the predicted real and imaginary parts of S21. This approach has broad applicability in fields such as metalenses and anomalous reflection metasurfaces, where both amplitude and phase need to be accurately matched [44,59,60]. These results demonstrate that the model can accurately predict the phase behavior of the metasurface, which is critical for applications involving phase manipulation.

    3.2 Application to Inverse Design

    The predicted structural parameters and the corresponding predicted spectra $S_{\mathrm{GLSaT}}$ are presented in Figs. 6(a)–6(d), alongside the spectra produced by the GSSG, $S_{\mathrm{GSSG}}$. The difference between the two is used as the loss for training the DNN, which drives the predicted spectra toward the GSSG-generated spectra. In addition, to further validate the application of GLSaT, we use the DNN-predicted structural parameters to simulate the corresponding spectra $S_{\mathrm{simulation}}$ through FDTD and compare them with the initial spectra $S_{\mathrm{initial}}$ from the dataset. The results are summarized in Table 3. For each case, the differences between the target spectra and the FDTD-simulated spectra of the predicted structures are quantified in terms of peak wavelength shift, peak deviation, and FW0.9M. In addition, the feature performance error $E_{\mathrm{FPN}}$ and its robust counterpart were evaluated under conservative fabrication tolerances ($\sigma_H = 10$ nm, $\sigma_D = 10$ nm, $\sigma_P = 5$ nm, $\rho_{HD} = 0.5$, $\rho_{DP} = 0.3$, $\rho_{HP} = 0.2$). All cases exhibit small deviations and maintain $E_{\mathrm{GLSaT}}^{\mathrm{rob}}$ values below 0.07, confirming that the designed structures not only reproduce the target spectral features with high accuracy but also remain robust against realistic fabrication variations. The results demonstrate that our inverse design framework effectively achieves the desired structural predictions, and that the predicted spectra closely align with the spectral requirements of metasurface filters. These findings underscore the capability of our method to help design metasurface optical filters accurately and efficiently.
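    One way to realize the quoted fabrication tolerances in a robustness study is to draw correlated Gaussian perturbations of (H, D, P) from the covariance matrix implied by the standard deviations and correlation coefficients. The sketch below shows only this sampling step, using a hand-written Cholesky factorization. The definition of the robust error itself is not given in this section, so it is not implemented, and the nominal geometry and all names are hypothetical.

```python
import math
import random

NAMES = ["H", "D", "P"]
SIGMA = {"H": 10.0, "D": 10.0, "P": 5.0}                    # nm, from the text
RHO = {("H", "D"): 0.5, ("D", "P"): 0.3, ("H", "P"): 0.2}   # from the text


def covariance_matrix():
    """Build cov[i][j] = rho_ij * sigma_i * sigma_j (rho_ii = 1)."""
    cov = [[0.0] * 3 for _ in range(3)]
    for i, a in enumerate(NAMES):
        for j, b in enumerate(NAMES):
            rho = 1.0 if i == j else RHO.get((a, b), RHO.get((b, a)))
            cov[i][j] = rho * SIGMA[a] * SIGMA[b]
    return cov


def cholesky(m):
    """Lower-triangular factor low with low * low^T = m (m positive definite)."""
    n = len(m)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(m[i][i] - s)
            else:
                low[i][j] = (m[i][j] - s) / low[j][j]
    return low


def sample_perturbation(low, rng):
    """Map independent standard normals to correlated offsets in nm."""
    z = [rng.gauss(0.0, 1.0) for _ in range(len(low))]
    return [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(len(low))]


COV = covariance_matrix()
LOW = cholesky(COV)
rng = random.Random(0)
nominal = [400.0, 300.0, 500.0]   # hypothetical H, D, P in nm
perturbed = [
    [n + d for n, d in zip(nominal, sample_perturbation(LOW, rng))]
    for _ in range(1000)
]
print(COV)
```

    Each perturbed geometry would then be passed through the forward model (or FDTD) and the resulting feature errors aggregated; how that aggregation is defined for the robust error reported in Table 3 is left to the source.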

    Figure 6. (a)–(d) Comparison of the spectra at different stages of the inverse design process. The spectra generated by the Gaussian-shaped spectrum generator ($S_{\mathrm{GSSG}}$, solid blue line) are compared with the predicted spectra obtained from GLSaT ($S_{\mathrm{GLSaT}}$, dashed blue line), the spectra from the initial guess of the structure parameters ($S_{\mathrm{initial}}$, dashed light pink line), and the spectra derived from FDTD simulations of the predicted structure parameters ($S_{\mathrm{simulation}}$, dashed dark pink line). The structural parameters corresponding to each case are provided at the top of each panel, denoted as H, D, and P, in units of nanometers. Each subplot illustrates the consistency between the predicted and simulated spectra in the targeted wavelength range of 700 to 1100 nm. The structural data are displayed in the last row beneath the corresponding figures.

    (a)(b)(c)(d)
    Peak wavelength shift01.33782.675600
    Peak deviation0.001700.00410.00020.0006
    FW0.9M4.276610.66323.998.498810.9156
    EGLSaT0.04250.02070.02150.0279
    EGLSaTrob0.06680.02140.02850.0314

    Table 3. Comparison between the target spectra $S_{\mathrm{GSSG}}$ and $S_{\mathrm{simulation}}$.
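    The feature metrics in Table 3 can be illustrated with a small sketch. It assumes that FW0.9M denotes the full width of the peak at 0.9 times its maximum (by analogy with FWHM) and that spectra are sampled on a 300-point grid from 700 to 1100 nm. The grid is an inference from the 300-point reflector datasets, and its spacing of 400/299 ≈ 1.3378 nm matches the granularity of the peak wavelength shifts in Table 3. The Gaussian test spectra and all function names are hypothetical.

```python
import math

# Assumed sampling grid: 300 points spanning 700 to 1100 nm
WAVELENGTHS = [700.0 + 400.0 * i / 299 for i in range(300)]


def gaussian(center, sigma, amplitude=1.0):
    """Gaussian-shaped spectrum on the assumed grid (illustrative only)."""
    return [
        amplitude * math.exp(-((w - center) ** 2) / (2 * sigma ** 2))
        for w in WAVELENGTHS
    ]


def peak_features(wavelengths, spectrum, level=0.9):
    """Return (peak wavelength, peak value, full width at level * maximum)."""
    i_peak = max(range(len(spectrum)), key=spectrum.__getitem__)
    threshold = level * spectrum[i_peak]
    left = i_peak
    while left > 0 and spectrum[left - 1] >= threshold:
        left -= 1
    right = i_peak
    while right < len(spectrum) - 1 and spectrum[right + 1] >= threshold:
        right += 1
    return wavelengths[i_peak], spectrum[i_peak], wavelengths[right] - wavelengths[left]


def compare_features(wavelengths, target, simulated):
    """Absolute differences in peak wavelength, peak value, and width."""
    wt, pt, ft = peak_features(wavelengths, target)
    ws, ps, fs = peak_features(wavelengths, simulated)
    return abs(ws - wt), abs(ps - pt), abs(fs - ft)


target = gaussian(WAVELENGTHS[150], 20.0)
simulated = gaussian(WAVELENGTHS[152], 20.0, amplitude=0.95)
shift, deviation, width_diff = compare_features(WAVELENGTHS, target, simulated)
print(shift, deviation, width_diff)
```

    Because the peak position is read off the sampling grid, the shift in this example is exactly two grid steps; a finer estimate would require interpolation around the maximum.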

    Moreover, to demonstrate the adaptability of the proposed framework beyond filtering tasks, we also extended the inverse design strategy to metalenses, applying it to the design of broadband metalenses (see Sec. 4 of the Supplementary Material for details).

    4 Conclusion

    In this work, we propose the GLSaT as an effective and scalable framework for spectral prediction in metasurface design. By incorporating global attention for capturing inter-fragment dependencies and local attention for fine-grained spectral features, GLSaT can map the relationship between structural parameters and spectral responses with high accuracy. Comprehensive explainability analyses through attention mechanisms further demonstrate its ability to effectively capture both global and local spectral dependencies, thereby validating its physical interpretability.

    Our extensive experimental validations demonstrate the generalization capability of GLSaT across diverse metasurface functionalities, including but not limited to reflection modulation, transmission control, phase manipulation, and scattering pattern prediction. The GLSaT architecture not only improves prediction accuracy compared to existing baselines but also enhances the efficiency of data-driven metasurface inverse design, for example of metasurface optical filters. Despite its efficiency and accuracy, we acknowledge three key constraints: empirical errors from uneven dataset distributions, overfitting risks from model complexity, and generalization limits imposed by dataset size. According to generalization error bound theory, these factors define the performance boundary, which can be pushed further with larger and more diverse datasets or refined model architectures. More details on the boundary analysis of GLSaT are provided in Sec. 5 of the Supplementary Material.

    GLSaT provides an efficient, precise, and interpretable approach for metasurface design, offering a scalable solution for spectral prediction and inverse design. Future work will focus on extending its application to multifunctional metasurfaces and structural shape optimization, further advancing high-precision and computationally efficient metasurface design.

    Acknowledgments

    This work was supported by the National Natural Science Foundation of China (Grant No. 12204541), the Science and Technology Innovation Program of Hunan Province (Grant No. 2021RC3083), and the High-level Talents Programs of the National University of Defense Technology. We also thank Zhenqian Xiao from Harbin Institute of Technology for the useful discussions.

    Biographies of the authors are not available.

    References

    [1] N. I. Zheludev, Y. S. Kivshar. From metamaterials to metadevices. Nat. Mater., 11, 917-924(2012).

    [2] W. T. Chen, A. Y. Zhu, F. Capasso. Flat optics with dispersion-engineered metasurfaces. Nat. Rev. Mater., 5, 604-620(2020).

    [3] Y. Liu, X. Zhang. Metamaterials: a new frontier of science and technology. Chem. Soc. Rev., 40, 2494(2011).

    [4] Q. Yuan et al. Recent advanced applications of metasurfaces in multi-dimensions. Nanophotonics, 12, 2295-2315(2023).

    [5] H.-T. Chen, A. J. Taylor, N. Yu. A review of metasurfaces: physics and applications. Rep. Prog. Phys., 79, 76401(2016).

    [6] I. Brener, A. Faraon et al. Dielectric Metamaterials, 175-194(2020).

    [7] L. Zhang et al. Real-time machine learning–enhanced hyperspectro-polarimetric imaging via an encoding metasurface. Sci. Adv., 10, eadp5192(2024).

    [8] H. Ren et al. Complex-amplitude metasurface-based orbital angular momentum holography in momentum space. Nat. Nanotechnol., 15, 948-955(2020).

    [9] R. C. Devlin et al. Arbitrary spin-to–orbital angular momentum conversion of light. Science, 358, 896-901(2017).

    [10] H. A. Atikian et al. Diamond mirrors for high-power continuous-wave lasers. Nat. Commun., 13, 2610(2022).

    [11] G. Kim et al. Metasurface-driven full-space structured light for three-dimensional imaging. Nat. Commun., 13, 5920(2022).

    [12] T. Li et al. Revolutionary meta-imaging: from superlens to metalens. Photonics Insights, 2, R01(2023).

    [13] I. Brener, S. Kruk, Y. Kivshar et al. Dielectric Metamaterials, 145-174(2020).

    [14] X. Jiang et al. Metasurface based on inverse design for maximizing solar spectral absorption. Adv. Opt. Mater., 9, 2100575(2021).

    [15] J. Peurifoy et al. Nanophotonic particle simulation and inverse design using artificial neural networks. Sci. Adv., 4, eaar4206(2018).

    [16] Z. Xiao et al. Accelerated design of low-frequency broadband sound absorber with deep learning approach. Mech. Syst. Signal Process., 211, 111228(2024).

    [17] E. Choi et al. 360° structured light with learned metasurfaces. Nat. Photonics, 18, 848-855(2024).

    [18] B. Slovick et al. Perfect dielectric-metamaterial reflector. Phys. Rev. B, 88, 165116(2013).

    [19] M. Decker et al. High‐efficiency dielectric Huygens’ surfaces. Adv. Opt. Mater., 3, 813-820(2015).

    [20] J. Ji et al. On-chip multifunctional metasurfaces with full-parametric multiplexed jones matrix. Nat. Commun., 15, 8271(2024).

    [21] N. Yu et al. Light propagation with phase discontinuities: generalized laws of reflection and refraction. Science, 334, 333-337(2011).

    [22] M. Khorasaninejad et al. Metalenses at visible wavelengths: diffraction-limited focusing and subwavelength resolution imaging. Science, 352, 1190-1194(2016).

    [23] H. Huang et al. Leaky-wave metasurfaces for integrated photonics. Nat. Nanotechnol., 18, 580-588(2023).

    [24] S. An et al. A deep learning approach for objective-driven all-dielectric metasurface design. ACS Photonics, 6, 3196-3207(2019).

    [25] Y. Gao et al. Meta‐attention deep learning for smart development of metasurface sensors. Adv. Sci., 11, 2405750(2024).

    [26] S. Lee, C. Park, J. Rho. Mapping information and light: trends of AI-enabled metaphotonics. Curr. Opin. Solid State Mater. Sci., 29, 101144(2024).

    [27] J. Kim et al. Dynamic hyperspectral holography enabled by inverse-designed metasurfaces with oblique helicoidal cholesterics. Adv. Mater., 36, e2311785(2024).

    [28] T. Badloe, S. Lee, J. Rho. Computation at the speed of light: metamaterials for all-optical calculations and neural networks. Adv. Photonics, 4, 64002(2022).

    [29] J. Lv et al. Polarization-controlled metasurface design based on deep ResNet. Opt. Laser Technol., 179, 111396(2024).

    [30] W. Ji et al. Recent advances in metasurface design and quantum optics applications with machine learning, physics-informed neural networks, and topology optimization methods. Light: Sci. Appl., 12, 169(2023).

    [31] B. Lusch, J. N. Kutz, S. L. Brunton. Deep learning for universal linear embeddings of nonlinear dynamics. Nat. Commun., 9, 4950(2018).

    [32] W. Ma et al. Deep learning for the design of photonic structures. Nat. Photonics, 15, 77-90(2021).

    [33] Z. Liu et al. Generative model for the inverse design of metasurfaces. Nano Lett., 18, 6570-6576(2018).

    [34] T. B. Kanmaz et al. Deep-learning-enabled electromagnetic near-field prediction and inverse design of metasurfaces. Optica, 10, 1373(2023).

    [35] S. An et al. Ultrawideband Schiffman phase shifter designed with deep neural networks. IEEE Trans. Microwave Theory Tech., 70, 4694-4705(2022).

    [36] S. An et al. Deep neural network enabled active metasurface embedded design. Nanophotonics, 11, 4149-4158(2022).

    [37] S. An et al. Deep learning modeling approach for metasurfaces with high degrees of freedom. Opt. Express, 28, 31932(2020).

    [38] W. Ma, F. Cheng, Y. Liu. Deep-learning-enabled on-demand design of chiral metamaterials. ACS Nano, 12, 6326-6334(2018).

    [39] S. So et al. Multicolor and 3D holography generated by inverse-designed single-cell metasurfaces. Adv. Mater., 35, e2208520(2023).

    [40] I. Malkiel et al. Plasmonic nanostructure design and characterization via deep learning. Light: Sci. Appl., 7, 60(2018).

    [41] L. Gao et al. A bidirectional deep neural network for accurate silicon color design. Adv. Mater., 31, 1905467(2019).

    [42] C. C. Nadell et al. Deep learning for accelerated all-dielectric metasurface design. Opt. Express, 27, 27523(2019).

    [43] A. Vaswani et al. Attention is all you need. Adv. Neural Inf. Process. Syst., 5998-6008(2017).

    [44] W. Chen et al. Broadband solar metamaterial absorbers empowered by transformer‐based deep learning. Adv. Sci., 10, 2206718(2023).

    [45] Q. Yu et al. Fragment-fusion transformer: deep learning-based discretization method for continuous single-cell Raman spectra analysis. ACS Sens., 9, 3907-3920(2024).

    [46] A. L. Maas, A. Y. Hannun, A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models(2013).

    [47] A. Dosovitskiy et al. An image is worth 16 × 16 words: transformers for image recognition at scale(2021).

    [48] K. He et al. Deep residual learning for image recognition, 770-778(2016).

    [49] S. Bai, J. Z. Kolter, V. Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling(2018).

    [50] I. Loshchilov, F. Hutter. Decoupled weight decay regularization(2019).

    [51] I. Loshchilov, F. Hutter. SGDR: stochastic gradient descent with warm restarts(2017).

    [52] Z. Xing et al. The design of highly reflective all-dielectric metasurfaces based on diamond resonators. Photonics, 11, 1015(2024).

    [53] S. Wang et al. Broadband achromatic optical metasurface devices. Nat. Commun., 8, 187(2017).

    [54] S. Shrestha et al. Broadband achromatic dielectric metalenses. Light: Sci. Appl., 7, 85(2018).

    [55] F. Cao et al. Justices for information bottleneck theory(2023).

    [56] K. Kawaguchi et al. How does information bottleneck help deep learning?, 16049-16096(2023).

    [57] Z. Yang et al. Exploring information processing in large language models: insights from information bottleneck theory(2025).

    [58] T. Ma et al. OptoGPT: a foundation model for inverse design in optical multilayer thin film structures. Opto-electron. Adv., 7, 240062(2024).

    [59] Y. Hu et al. Asymptotic dispersion engineering for ultra-broadband meta-optics. Nat. Commun., 14, 6649(2023).

    [60] T. He et al. Perfect anomalous reflectors at optical frequencies. Sci. Adv., 8, eabk3381(2022).
