- Advanced Photonics Nexus
- Vol. 4, Issue 5, 056014 (2025)
Abstract
1 Introduction
Metasurfaces offer precise control of light by intricately designing subwavelength elements, allowing for versatile operation of optical properties on a flat platform.1
Artificial intelligence (AI), which serves as a powerful computational tool, has been applied to accelerate the design process.14,15,24
In recent years, several other deep learning (DL) models have been adopted for metasurface design. The generative adversarial network (GAN) has been used to seek the structures from the intended spectra,33 whereas networks such as the fully connected network (FCN),15,40
Here, we propose the global- and local-spectrum-aware transformer (GLSaT), a forward prediction network, to assist the inverse design of metasurfaces. The GLSaT integrates both intra-fragment spectral information and inter-fragment correlations to address the challenge of insufficient information when mapping from low to high dimensions. By modifying the attention mechanism, the double-transformer network achieves spectral response predictions with high accuracy while reducing the number of network parameters compared to previous transformer architectures. For the inverse design of meta-atoms, we input an ideal high-dimensional reflection spectrum into a DNN to determine the corresponding low-dimensional structural parameters. To overcome convergence issues, the GLSaT is concatenated after the DNN. The spectra reconstructed by the GLSaT are then compared with the input spectrum of the DNN to calculate the loss, which guides the gradient descent optimization. Finally, leveraging the forward network, we demonstrate the inverse design of single- and dual-band high-reflectivity spectra on our dataset, as well as the design of an achromatic metalens. We further explore the explainability, generalization capabilities, and accuracy-efficiency trade-offs of our model. These contributions underscore the potential of our model as a generalizable framework for spectral analysis, offering broad applicability across a range of spectral prediction tasks.
2 Methods
2.1 Motivation of Transformer Modules
Figure 1(a) illustrates an optical metasurface constructed from the simplest meta-atoms, cylinders characterized by their pitch, diameter, and height, as shown in Fig. 1(b). Diamond cylinders on a silica substrate can generate two-band high-reflectivity spectra, as shown in the first subplot of Fig. 1(c). More complex meta-atoms can generate broadband high reflection, high transmission, and even more complicated spectra, such as those in the second and third subplots of Fig. 1(c). Based on the application requirements, the optimal metasurface parameters are selected by analyzing these spectral features. As illustrated in Fig. 1(d), a simulation dataset is used to pre-train a data-driven forward network that maps the structure parameters to the spectrum, thereby accelerating the inverse design process. Prediction models for electromagnetic (EM) responses from low-dimensional structures to high-dimensional spectra often consider structural parameters and discrete spectral points separately, with little consideration of the relationships between them. Each spectrum consists of a fixed number of spectral points, and we separate it into fragments to better focus on the inter- and intra-fragment spectral information [see Fig. 1(e)]. We regard this fragment-based representation as spectral-aware data, allowing us to extract richer features from structural inputs. The ability of transformer-based networks to analyze semantic associations between tokens can be effectively transferred to spectral data, enabling them to capture correlations within spectral information.43 This enhances the accuracy of EM spectral predictions for metasurfaces. Further details on the motivation for using the transformer are provided in Fig. S1 in the Supplementary Material.
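The fragment-based representation described above can be illustrated with a minimal Python sketch. The helper name `split_into_fragments` and the fragment count of 10 are hypothetical; the 300-point length matches the dataset described in Sec. 2.3, and equal-length contiguous fragments are an assumption.

```python
def split_into_fragments(spectrum, num_fragments):
    """Split a 1D spectrum into equal-length, contiguous fragments."""
    n = len(spectrum)
    if n % num_fragments != 0:
        raise ValueError("spectrum length must be divisible by num_fragments")
    size = n // num_fragments
    return [spectrum[i * size:(i + 1) * size] for i in range(num_fragments)]


# Placeholder spectrum with 300 points; the fragment count is an assumed value.
spectrum = [i / 299 for i in range(300)]
fragments = split_into_fragments(spectrum, num_fragments=10)
```

Each fragment then plays the role of a token, so that relations between fragments and relations within a fragment can be treated separately.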
![]()
Figure 1.(a) Schematic of optical metasurface. (b) Structural parameters of meta-atoms, such as period
2.2 Architecture of GLSaT
The forward network architecture, named GLSaT, is illustrated in Fig. 2 and is used for forward prediction from structure parameters to spectra. It combines inter-fragment interaction attention with intra-fragment self-attention, making full use of the spectral information implicit in the structure and enabling effective feature extraction. As shown in Fig. 2(a), the model consists of two main components: a fully connected layer (FCL) module for dimensionality transformation and a dual transformer module for spectrum-relevant representation extraction.
![]()
Figure 2. (a) GLSaT network architecture for forward prediction. (b) The projection matrices, shared by the GPT and LPT modules, which generate the query (Q), key (K), and value (V) vectors. (c) Details of the cross-attention mechanism in GPT for inter-fragment multi-head attention. (d) Details of the self-attention mechanism in LPT for intra-fragment multi-head attention. (e) The norm and feed forward components shared by the GPT and LPT modules.
The FCL module employs three residual blocks, each consisting of two FCLs, to achieve a three-stage dimensional expansion from the low-dimensional structural parameters to the dimension of the spectra. More details can be found in Sec. 2.1 of the Supplementary Material.46 As the outputs of the FCL module enter the transformer, they first pass through a linear layer that standardizes their dimension, resulting in an input tensor whose last dimension is 1. Before entering the following transformer network, we propose a data representation and processing framework tailored for spectral data. By treating spectral fragments like the tokens of sentence data and aligning the number of heads in the multi-head attention mechanism with the number of fragments, the input tensor is restructured so that each head corresponds to one fragment. Building upon the strategy of employing distinct heads to focus on different spectral fragments, a two-stage global and local spectrum-aware approach is proposed, which includes cross-spectral multi-head attention for inter-spectral interaction and intra-spectral multi-head attention for self-interaction within each spectral fragment. Therefore, we design a double-transformer architecture consisting of the global perception transformer (GPT) and the local perception transformer (LPT).
The data are first processed by a projection matrix to obtain the queries (Q), keys (K), and values (V) for the multi-head attention mechanism [see Fig. 2(b)]. As illustrated in Fig. 2(c), in the first stage, the transformer captures information among spectral fragments, with the number of heads equal to the number of fragments. The resulting attention score matrices allow for efficient extraction of global spectral features. The cross-attention mechanism is adopted to capture the attention among different spectral fragments, and the attention scores and attention are computed by Eqs. (1) and (2).
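Equations (1) and (2) are not reproduced in this text. As a hedged illustration of attention across fragments, the sketch below uses the standard scaled dot-product form of Ref. 43, treating each fragment vector as one token. The function name `inter_fragment_attention` is hypothetical, and the exact projections and head layout of the GPT may differ.

```python
import math


def softmax(row):
    """Numerically stable softmax of a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def inter_fragment_attention(Q, K, V):
    """Standard scaled dot-product attention (assumed form, not the paper's Eqs. 1-2).

    Q, K, V are lists of h fragment vectors, each of length d.
    Returns the h-by-h score matrix and the attended fragment vectors.
    """
    d = len(Q[0])
    scores = [
        softmax([sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d) for kj in K])
        for qi in Q
    ]
    out = [
        [sum(w * vj[c] for w, vj in zip(row, V)) for c in range(len(V[0]))]
        for row in scores
    ]
    return scores, out


# Toy input: three fragments of length two (placeholder numbers).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
scores, attended = inter_fragment_attention(Q, K, V)
```

Each row of `scores` weights every fragment against one query fragment, which is the sense in which this stage is global.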
As illustrated in Fig. 2(d), in the second stage, the refined spectral representation is processed further by the LPT, which uses intra-fragment multi-head attention to focus on fine-grained local details. This mechanism captures the relationships within each spectral fragment separately. The attention scores are computed similarly to those of the GPT, but with queries, keys, and values all derived from the same spectral fragment, and the products are passed through a sigmoid function [see Eqs. (5) and (6)].
The attention is described by a Hadamard product [see Eq. (7)]. Throughout this calculation, every quantity involved is derived from the same spectral fragment. In addition, the attentions from all heads are concatenated to obtain the result of the multi-head self-attention mechanism, as in Eq. (8).
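Equations (5) to (8) are not reproduced in this text, so the following is only one possible reading of the description above, not the authors' formulation. It assumes that the score of a fragment is the sigmoid of the element-wise product of its query and key, that the attention is the Hadamard product of this score with the value, and that the per-fragment results are concatenated. All function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def intra_fragment_attention(q, k, v):
    """Assumed form: score = sigmoid(q * k) element-wise, attention = score * v."""
    score = [sigmoid(qi * ki) for qi, ki in zip(q, k)]
    return [s * vi for s, vi in zip(score, v)]


def local_attention(Q, K, V):
    """Apply the gate to each fragment (head) separately, then concatenate."""
    out = []
    for q, k, v in zip(Q, K, V):
        out.extend(intra_fragment_attention(q, k, v))
    return out


# Toy input: two fragments of length two (placeholder numbers).
Q = [[0.0, 0.0], [1.0, -1.0]]
K = [[1.0, 2.0], [1.0, 1.0]]
V = [[4.0, 6.0], [2.0, 2.0]]
result = local_attention(Q, K, V)
```

Under this reading no information crosses fragment boundaries in the second stage, which matches the statement that every quantity comes from the same spectral fragment.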
The attention output is passed to the norm and feed forward (NFF) layer [see Fig. 2(e) and also Sec. 2.247
2.3 Dataset Generation
Our metasurface filter is specifically designed as a narrow-band reflector for alkali lasers at 795 nm. To generate the dataset, we start from the simplest cylindrical meta-atoms, arranged in square lattices of diamond metasurfaces.52 The design is parameterized by the cylinder diameter, the height, and the distance between adjacent cylinders. Considering practical fabrication constraints and the requirement that the diameter be smaller than the period, we performed a systematic parameter sweep across these three independent variables, with uniform sampling within defined ranges [see Fig. S21(a) in the Supplementary Material]. Notably, the three parameters are not sampled at the same density: one of them is sampled more coarsely than the other two because the spectra are relatively insensitive to it. For each unique parameter combination, the reflection spectrum is computed using finite-difference time-domain (FDTD) simulations (Lumerical Solutions, Canada) over a wavelength range of 700 to 1100 nm, comprising 300 equidistant points. As shown in Fig. S22 in the Supplementary Material, our dataset covers peak wavelengths spanning the entire spectral range of 700 to 1100 nm. Ultimately, we acquired only 2120 samples. Figures S21(b) and S21(c) in the Supplementary Material display sample spectra with their corresponding structural parameters.
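The sweep and the wavelength grid can be sketched as follows. The 700 to 1100 nm range, the 300 equidistant points, and the diameter-below-period constraint come from the text. The parameter ranges, step counts, and function names are hypothetical placeholders, since the actual ranges are given only in Fig. S21(a).

```python
from itertools import product


def linspace(start, stop, num):
    """Return num equidistant values from start to stop inclusive."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def sweep(diameters, heights, periods):
    """Enumerate parameter combinations, keeping only diameter < period."""
    return [(d, h, p) for d, h, p in product(diameters, heights, periods) if d < p]


# Wavelength grid stated in the text: 300 equidistant points from 700 to 1100 nm.
wavelengths_nm = linspace(700.0, 1100.0, 300)

# Hypothetical ranges in nm, for illustration only.
grid = sweep(
    diameters=linspace(200, 400, 5),
    heights=linspace(300, 600, 4),
    periods=linspace(300, 500, 3),
)
```

With this grid, adjacent spectral points are 400/299 nm apart, roughly 1.34 nm.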
2.4 On-Demand Inverse Design
Figure 3 illustrates the architecture of the inverse network, which consists of a Gaussian-shape spectrum generator (GSSG) and a DNN module. The input spectrum is square-shaped, with the center wavelength and bandwidth corresponding to the parameters of a given laser in practical applications. However, the spectrum of a single-mode laser is generally close to Gaussian, so we use the GSSG to establish a direct correspondence between the ideal spectral responses and the spectra fed into the DNN. For a single-peak spectrum, the Gaussian-shaped reflectivity spectrum is set as Eq. (11), and for spectra with dual peaks, the shape is set as Eq. (12).
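Equations (11) and (12) are not reproduced in this text. As a rough sketch of what a Gaussian-shaped target could look like, the code below assumes a unit-height Gaussian parameterized by its center wavelength and full width at half maximum, and builds the dual-peak case as the point-wise maximum of two such peaks. Both choices, and the function names, are assumptions rather than the authors' definitions.

```python
import math


def gaussian_spectrum(wavelengths, center, fwhm, peak=1.0):
    """Gaussian reflectivity profile with the given center and FWHM (assumed form)."""
    sigma = fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))
    return [peak * math.exp(-((w - center) ** 2) / (2.0 * sigma ** 2)) for w in wavelengths]


def dual_gaussian_spectrum(wavelengths, center_1, center_2, fwhm, peak=1.0):
    """Two Gaussian peaks combined by a point-wise maximum (an assumed choice)."""
    g1 = gaussian_spectrum(wavelengths, center_1, fwhm, peak)
    g2 = gaussian_spectrum(wavelengths, center_2, fwhm, peak)
    return [max(a, b) for a, b in zip(g1, g2)]


# 795 nm is the laser wavelength named in the text; the 20 nm FWHM and the
# second center at 852 nm are hypothetical values.
single = gaussian_spectrum([785.0, 795.0, 805.0], center=795.0, fwhm=20.0)
dual = dual_gaussian_spectrum([795.0, 852.0], 795.0, 852.0, fwhm=20.0)
```

Using a maximum rather than a sum keeps the reflectivity bounded by `peak` where the two bands overlap.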
![]()
Figure 3. Architecture of the meta-filter design network.
The DNN consists of three consecutive fully connected hidden layers. The input layer takes the target spectrum, and the output layer produces a vector representing the geometric parameters of the metasurface. These predicted parameters are then fed into the cascaded GLSaT, where the EM response of the design is assessed. During training, the weights and biases in the DNN layers are iteratively updated, whereas those of the GLSaT remain fixed. As training progresses, the DNN progressively refines its output, ultimately becoming capable of producing high-reflectivity metasurface designs with a single computation.
The input spectra of the DNN and the spectra predicted by GLSaT are compared to calculate the loss function, which is minimized using gradient descent to optimize the DNN parameters and improve the accuracy of the structural predictions. To use the simulation-generated dataset for supervised training of the inverse network, we binarize the original dataset with a thresholding operation. The spectra are refined based on a predefined demand threshold, which we set at 0.9, so that each resulting spectrum contains only the values 0 and 1.
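The thresholding step can be sketched in a few lines. The threshold of 0.9 is taken from the text; treating values exactly equal to the threshold as 1 is an assumption, and the function name is hypothetical.

```python
def binarize(spectrum, threshold=0.9):
    """Map reflectivity values to 1.0 at or above the threshold and 0.0 below it."""
    return [1.0 if r >= threshold else 0.0 for r in spectrum]


# Toy reflectivity values (placeholders).
binary = binarize([0.10, 0.90, 0.95, 0.89])
```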
The DNN is trained using the loss function defined in Eq. (13), which combines the spectral loss with boundary and aspect ratio penalties to constrain the structural parameters. The loss involves the structural parameters predicted by the DNN and their aspect ratio, the spectra predicted by GLSaT, and the spectra processed by the GSSG. Considering the application of a narrowband metasurface optical filter to the filtering demands of one or two single-wavelength lasers, we introduce a weight factor to amplify peak features, enabling the network to prioritize capturing key spectral characteristics. More importantly, for other applications, such as broadband filters and metalens design, we can modify the input parameters, weights, or even the loss function to focus on the desired spectral features (see Sec. 2.4 of the Supplementary Material for a modified inverse design scheme tailored for metalens applications53,54). For validation, Eq. (17) assesses the peak wavelength shift, the peak deviation, and the full width at 0.9 maximum (FW0.9M), which rigorously evaluates the accuracy of reproducing the target spectrum.
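Equation (17) is not reproduced in this text. The sketch below shows one plausible way to compute the three validation features on a sampled grid: the peak is located at the largest sample, the peak deviation is the difference in peak height, and FW0.9M is the width of the contiguous region around the peak that stays at or above 0.9 of the maximum. These definitions, the absence of interpolation, and all function names are assumptions.

```python
def peak_index(spectrum):
    """Index of the largest sample."""
    return max(range(len(spectrum)), key=lambda i: spectrum[i])


def width_at_fraction(wavelengths, spectrum, fraction=0.9):
    """Width of the contiguous region around the peak with values >= fraction * max."""
    i = peak_index(spectrum)
    level = fraction * spectrum[i]
    lo = hi = i
    while lo > 0 and spectrum[lo - 1] >= level:
        lo -= 1
    while hi < len(spectrum) - 1 and spectrum[hi + 1] >= level:
        hi += 1
    return wavelengths[hi] - wavelengths[lo]


def feature_metrics(wavelengths, target, predicted):
    """Peak wavelength shift, peak deviation, and FW0.9M difference (assumed definitions)."""
    it, ip = peak_index(target), peak_index(predicted)
    return {
        "peak_shift": abs(wavelengths[ip] - wavelengths[it]),
        "peak_deviation": abs(predicted[ip] - target[it]),
        "fw09m_difference": abs(
            width_at_fraction(wavelengths, predicted) - width_at_fraction(wavelengths, target)
        ),
    }


# Toy five-point spectra (placeholder numbers).
metrics = feature_metrics(
    [0.0, 1.0, 2.0, 3.0, 4.0],
    target=[0.0, 0.5, 1.0, 0.5, 0.0],
    predicted=[0.0, 0.0, 0.9, 0.95, 0.5],
)
```

On the 300-point grid from 700 to 1100 nm, a peak located this way can only move in steps of 400/299 nm, about 1.3378 nm, which is consistent with the nonzero peak wavelength shifts of 1.3378 and 2.6756 listed in Table 3.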
In particular, to account for fabrication errors, we can extend the evaluation of the feature performance metric by adopting a sigma-point-based unscented transform (UT). In this approach, each structural parameter vector is perturbed by a small deviation that models Gaussian fabrication noise. Instead of exhaustive sampling, a set of 2d+1 sigma points is generated from the covariance of the perturbation, each associated with a predefined weight. The robust feature error is then computed as a weighted combination of the feature errors evaluated at the sigma points.
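The sigma-point evaluation described above can be sketched in Python as follows. This uses the textbook unscented transform with a single scaling parameter `kappa`; the paper's exact weights are not reproduced in this text, so the weight formula, the default `kappa`, and the function names are assumptions. Here d is taken to be the number of structural parameters.

```python
import math


def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def sigma_points(x, cov, kappa=1.0):
    """Return the 2d+1 sigma points and weights of the basic unscented transform."""
    d = len(x)
    L = cholesky(cov)
    scale = math.sqrt(d + kappa)
    points = [list(x)]
    weights = [kappa / (d + kappa)]
    for j in range(d):
        col = [scale * L[i][j] for i in range(d)]
        points.append([xi + ci for xi, ci in zip(x, col)])
        points.append([xi - ci for xi, ci in zip(x, col)])
        weights += [1.0 / (2.0 * (d + kappa))] * 2
    return points, weights


def robust_error(feature_error, x, cov, kappa=1.0):
    """Weighted combination of the feature error over the sigma points."""
    points, weights = sigma_points(x, cov, kappa)
    return sum(w * feature_error(p) for p, w in zip(points, weights))


# Hypothetical nominal design (nm) and independent fabrication variances (nm^2).
x0 = [300.0, 500.0, 400.0]
cov0 = [[25.0, 0.0, 0.0], [0.0, 16.0, 0.0], [0.0, 0.0, 9.0]]
points, weights = sigma_points(x0, cov0)
```

In practice `feature_error` would run the forward model on each perturbed design and return its feature error, so a robust estimate costs 2d+1 forward evaluations instead of a large random sample.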
3 Results and Discussion
3.1 GLSaT Performance
3.1.1 Predicted results of our dataset
To check the accuracy of the fully trained GLSaT, we compare the simulated and predicted reflection spectra for the 212 testing samples, with examples shown in Figs. 4(a)-4(h). The difference is quantified by the mean absolute error (MAE), labeled above each panel, and the absolute error at each spectral point is shown as a blue shaded region. In Fig. 4(i), the difference is further measured by the absolute error at every spectral point, with an averaged value of only 0.0063. The agreement between the simulated and predicted results demonstrates the ability of our GLSaT model to capture the spectral sequence information of metasurface elements.
![]()
Figure 4.(a)–(h) Comparison of simulated and predicted reflection spectra for some testing samples. (i) Summary of the absolute error across all spectral points (
3.1.2 Ablation study and explainability
To evaluate the contributions of each component in the GLSaT architecture and verify the necessity of its dual-transformer layers, namely GPT and LPT, we conducted ablation studies on the metasurface reflector dataset.
Figure 5(a) presents the evolution of the validation MAE over 5000 epochs for each configuration: the complete GLSaT architecture, GLSaT without GPT, GLSaT without LPT, GLSaT without both transformer modules, and the baseline FCN. The zoomed-in subplot highlights the performance of GLSaT, which consistently achieves the lowest MAE. Removing either transformer layer increases the MAE, and removing both leads to further degradation. Even without both transformer modules, the network still benefits from other components, such as the residual block and 1D convolution, which maintain its performance advantage over the FCN. The results underscore the essential role of the transformer modules in capturing the complex relationships among spectral fragments and achieving better performance. As a complementary analysis, Fig. 5(b) shows the histogram of the absolute error between the simulated and predicted spectra across the spectral points of the held-out testing dataset. GLSaT exhibits the smallest errors across spectral points, indicating its high predictive accuracy. Configurations with one transformer layer removed exhibit slightly worse error distributions, whereas the absence of both transformer modules produces a significant rightward shift, indicative of reduced accuracy. The baseline FCN configuration performs the worst, with the highest error values distributed broadly across spectral points; compared with it, GLSaT achieves 32.9% higher accuracy. These results highlight the pivotal role of transformer modules, especially the synergy between GPT and LPT, in jointly capturing global and local spectral features for robust and precise predictions.
![]()
Figure 5.Evaluation of the GLSaT architecture through ablation experiments. (a) Validation MAE as a function of training epochs 1 to 5000 for the GLSaT architecture and its ablated variants. Configurations include the complete GLSaT model (red), GLSaT without the global perception transformer (GPT, orange), GLSaT without the local perception transformer (LPT, yellow), GLSaT without both transformer layers (green), and a baseline FCN (blue). The inset zooms in on the final 600 epochs, showing the superior performance and faster convergence of GLSaT compared with the ablated configurations and the baseline. (b) Distribution of absolute error (error) across the spectral points of the testing set for the same configurations. The
To probe the explainability of GLSaT further, we analyzed one randomly selected training sample. Figure 5(c) presents the inter-fragment multi-head attention distribution of GPT across training epochs. In the first 20 epochs, attention is dispersed because the model has not yet learned the structure-spectrum relations. As the training progresses, attention gradually concentrates, expands its coverage, and finally achieves global attention. At epochs 1830 and 2000, attention stabilizes as the validation MAE reaches 0.008, indicating that the model has learned the hidden spectral information in the structure. A similar trend appears in Fig. 5(d), which shows the intra-fragment attention distributions of LPT from epochs 1 to 2000. The local attention mechanism gradually refines its focus, ultimately converging on critical features within the spectral sequence, such as peaks and troughs. These key points act as reference anchors, enabling precise reconstruction of the entire spectral sequence.
To further illustrate the role of the double-attention mechanism in spectral features extraction, we present the outputs of different modules in the GLSaT model: the dimension-enhancing residual module (FCL), the start linear layer of GPT (linear), the attention mechanism of GPT (GPT attention), the NFF layer of GPT (NFF), the attention mechanism of LPT (LPT attention), the NFF layer of LPT (NFF), and the 1D convolutional layers (1Dconv) shown in Fig. 5(e). The blue box represents GPT, whereas the pink box represents LPT. In the prediction process from structural parameters to spectra, the FCL module first increases the dimensionality of the input data (seen in ①), generating complex high-frequency information. The start linear layers in GPT further enhance this high-frequency information (②). As the model has extracted a large amount of useful information, the proportion of redundant information has also risen. To enhance the acquisition of relevant information, the global attention mechanism in GPT leverages the global interactions among spectral fragments (represented by the small red box and dashed lines in ②), thereby increasing the proportion of spectral-related information (③). After processing through NFF (④), the inputs to LPT contain richer high-frequency information compared to the output of the FCL. Next, the LPT performs local attention perception on the spectrum, carrying out local selection and fine-tuning within spectral fragments (⑤). After further processing by NFF, the relevant features are extracted and the redundancy is removed, resulting in the final outputs (⑥) containing key information such as spectral peak magnitude, position, rising/falling trends, and implied resonance features (represented by the small red box and dashed lines in ⑤ and ⑥). Through 1D convolutional smoothing, the model ultimately generates a predicted spectrum with high accuracy (⑦).
We verified the effectiveness of the GLSaT model through ablation experiments and concluded that GLSaT adheres to the information bottleneck theory,55
3.1.3 Generalization and contrast
To evaluate the generalization capability of the GLSaT model, we conducted extensive experiments across seven diverse datasets: diamond reflectors, SiC reflectors, broadband sound absorbers,16 four-resonator transmission systems,42 nanophotonic scattering particles,40 cylindrical metasurfaces, and H-shaped metasurfaces.24 Each dataset represents a different level of complexity for evaluating the robustness of GLSaT, with sample sizes ranging from 2120 to 200,000, structural parameter counts between 3 and 18, and target points spanning 201 to 2002. All training procedures are conducted for fewer than 5000 epochs. The performance metrics for each dataset, summarized in Table 1, include training set MSE, validation set MAE, test set evaluations, and comparison with the source papers. Comparisons of simulated and predicted spectra for some testing samples of each dataset are shown in Fig. S2 (SiC reflector), Fig. S3 (sound absorber), Fig. S4 (four resonators), Fig. S5 (nanophotonic particle), Fig. S6 (cylindrical metasurface), and Figs. S7–S9 (H-shaped metasurface) in the Supplementary Material.
| Dataset | Diamond reflector | SiC reflector | Sound absorber | Four resonators | Nanophotonic particle | Cylindrical metasurface | H-shaped metasurface | ||||
| Source | Ours | Ours | 2024 MSSP | 2019 Opt. Express | 2018 Sci. Adv | 2019 ACS Photonics | 2019 ACS Photonics | ||||
| Spectral type | Reflectivity | Reflectivity | Absorption | Transmission | Scattering-section | Imag ( | Real ( | Imag ( | Transmission ( | Phase ( | Transmission |
| Structural dimension | 3 | 3 | 18 | 8 | 5 | 4 | 6 | ||||
| Spectral dimension | 300 | 300 | 600 | 2002 | 201 | 301 | 301 | ||||
| Sample number | 2120 | 2520 | 200,000 | 20,998 | 5000 | 65,611 | 24,800 | ||||
| Source model | GLSaT | GLSaT | FCN | FCN fed by combined geometric inputs | FCN | NTN & down-sampled | NTN & down-sampled | ||||
| Source metrics | — | — | MAE: | MSE: | MRE: 0.46% | MSE: | — | — | Accuracy: 99.4% | Accuracy: 99.3% | — |
| 99%MSE: | |||||||||||
| min MAE: | 95%MSE: | ||||||||||
| GLSaT Train MSE | — | — | |||||||||
| GLSaT Val MAE | — | — | |||||||||
| GLSaT test metrics | MSE: | MSE: | MAE: | MSE: | MRE: 0.32% | MSE: | MSE: | MSE: | Accuracy: 99.9% | Accuracy: 99.4% | MSE: |
| 99%MSE: | |||||||||||
| MAE: | MAE: | min MAE: | MAE: | MAE: | MAE: | ||||||
| 95%MSE: | |||||||||||
| — | — | — | — | — | |||||||
Table 1. Generalization and contrast for GLSaT.
For the absorption spectrum of the broadband sound absorber acoustic metasurface, despite the complexity of having 18 structural parameters and 600 prediction points, GLSaT achieves an average MAE of 0.0038 on both the validation and test sets, which is lower than the minimum MAE of 0.0041 reported in the source paper. This highlights the high accuracy of our model and demonstrates its robust fitting capability when provided with abundant samples. In the case of a complex metasurface with four resonators and eight structural parameters, 99% of the data had an error of less than , and 95% exhibited MSE less than , outperforming the metrics from the source paper.
To further benchmark the performance of GLSaT against alternative network architectures, we conducted systematic comparisons with several representative models, including an FCN, a BNN, a temporal convolutional network (CNN), an NTN, a long short-term memory-based recurrent neural network (LSTM), and the OptoGPT model.58 OptoGPT is a transformer variant under the Opto framework, designed for multilayer thin-film spectral prediction. All models were trained on the cylindrical metasurface dataset containing 65,611 samples, originally sampled at 301 spectral points. To evaluate robustness across different spectral resolutions, we additionally generated datasets with 151 and 61 sampling points through uniform down-sampling. Training was carried out under identical conditions, employing the OneCycleLR scheduler (with a warm-up ratio of 20% and an initial divisor of 5.0) and the AdamW optimizer with weight decay. The same training, validation, and test splits, batch sizes, number of epochs (2000), and mean squared error (MSE) loss function were used to ensure fairness. Detailed network architectures, training dynamics, and performance histories are provided in Figs. S10–S16 in the Supplementary Material, corresponding to each evaluated model.
In Table 2, we report the Overfitting Index, which is defined as
| Metric | Spectral points | GLSaT | FCN | BNN | CNN | NTN | LSTM | OptoGPT |
| Number of parameters | | 15,527,678 | 2,748,076 | 16,371,774 | 6,043,118 | 6,866,390 | 4,690,174 | 33,910,573 |
| Average inference time | | 11.960 ms | 7.972 ms | 6.312 ms | 9.302 ms | 18.604 ms | 11.628 ms | 12.957 ms |
| Per-epoch training time | | 8.6395 s | 2.3784 s | 4.7049 s | 5.2655 s | 10.3487 s | 6.6998 s | 10.3565 s |
| Test set MAE | 301 points | 0.0077 | 0.0228 | 0.0285 | 0.0171 | 0.0428 | 0.0241 | 0.0465 |
| | 151 points | 0.0094 | 0.0240 | 0.0300 | 0.0142 | 0.0429 | 0.0284 | 0.0437 |
| | 61 points | 0.0080 | 0.0235 | 0.0267 | 0.0164 | 0.0425 | 0.0274 | 0.0333 |
| Average of test set MAE | | 0.0084 | 0.0234 | 0.0284 | 0.0159 | 0.0427 | 0.0266 | 0.0412 |
| Performance consistency | | 0.1105 | 0.0284 | 0.1106 | 0.1356 | 0.0069 | 0.1252 | 0.1899 |
| Overfitting index | 301 points | 0.0174 | 0.7612 | 0.0532 | 0.0031 | 0.0008 | 0.4868 | 0.1425 |
| | 151 points | 0.0235 | 0.2593 | 0.0990 | 0.0084 | 0.0033 | 0.8909 | 0.5117 |
| | 61 points | 0.0096 | 0.1842 | 0.0335 | 0.0015 | 0.0039 | 0.4521 | 1.2893 |
| Average of overfitting index | | 0.0168 | 0.4016 | 0.0619 | 0.0043 | 0.0027 | 0.6099 | 0.6478 |
| Overfitting stability | | 0.3388 | 0.6379 | 0.4428 | 0.6829 | 0.5151 | 0.3265 | 0.7377 |
Table 2. Comparison of different neural network models.
The comparative results are summarized in Table 2. GLSaT achieves the lowest average test set MAE across different spectral resolutions (0.0084), significantly outperforming conventional architectures such as FCN (0.0234) and BNN (0.0284), while also providing higher performance consistency, reflected by a low coefficient of variation (0.1105). In terms of generalization robustness, GLSaT maintains a substantially lower average overfitting index (0.0168) and improved stability (0.3388), whereas baseline models such as FCN and LSTM exhibit severe overfitting with large index values (0.4016 and 0.6099, respectively). Notably, although CNN achieves competitive results with low MAE and overfitting index, a direct comparison of the training histories in Figs. S10 and S13 in the Supplementary Material reveals that GLSaT not only attains higher prediction accuracy but also converges more rapidly, underscoring the effectiveness of its dual-spectrum attention mechanism. Meanwhile, OptoGPT, despite being a large-scale Transformer model with over 33 million parameters, exhibits inferior accuracy and stability on this task. This performance gap may arise from the fact that OptoGPT was originally tailored for multilayer thin-film spectral prediction, rendering its architecture less suited for the spectral characteristics of metasurface unit cells. Taken together, these results demonstrate that GLSaT offers a superior balance between accuracy, convergence efficiency, and generalization stability, establishing its advantage as a forward prediction backbone for metasurface inverse design.
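The averaged and consistency-type entries of Table 2 can be recomputed from the per-resolution values. The sketch below assumes that the coefficient of variation is the standard deviation divided by the mean; the text does not state which standard deviation estimator was used, nor whether the overfitting stability is defined the same way, so both the `ddof` switch and that reading are assumptions.

```python
import statistics


def coefficient_of_variation(values, ddof=0):
    """Standard deviation divided by the mean (population form when ddof=0)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) if ddof == 0 else statistics.stdev(values)
    return std / mean


# GLSaT entries copied from Table 2 (301, 151, and 61 points).
glsat_mae = [0.0077, 0.0094, 0.0080]
glsat_overfit = [0.0174, 0.0235, 0.0096]

mean_mae = statistics.fmean(glsat_mae)
overfit_cv = coefficient_of_variation(glsat_overfit)
```

With the rounded table entries, `mean_mae` rounds to the tabulated 0.0084 and `overfit_cv` comes out near 0.338, close to the tabulated overfitting stability of 0.3388. Values recomputed from rounded entries need not match the tabulated ones exactly.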
In addition, we conducted experiments with different training data sizes to evaluate the data dependence of GLSaT (see Sec. 5 of the Supplementary Material for details), which further confirms the robustness of GLSaT across varying data volumes.
To precisely predict the transmission, researchers tend to change the physical variables to be predicted, including the combination of amplitude and phase or the real and imaginary parts of the complex reflective coefficient .24,30 For the NTN model, we evaluated the transmission coefficient () for a cylindrical metasurface, with the model achieving an MSE of on the test set, which is comparable to the reported in the source paper. However, the source paper employed down-sampling, reducing the spectral points from 301 to 31, which may have led to a loss of finer spectral details. By contrast, prediction by GLSaT maintained high dimension. For the H-shaped metasurface configuration, we predicted both the real and imaginary parts of the parameter, as well as the transmission spectrum. By utilizing the relationship between the parameters and transmission and phase, GLSaT is able to predict the transmission spectrum and phase using the trained real part of and imaginary part of . Here, transmission coefficient is obtained from the scattering parameters, among which represents , and its intensity is transmission [shown in Eq. (22)]. In addition, the phase Ph of is the arctangent of the ratio of to [shown in Eq. (23)].
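Equations (22) and (23) are not reproduced in this text. Following the verbal description, the sketch below recovers the transmission as the squared magnitude of the complex coefficient and the phase as the arctangent of the imaginary part over the real part. The function name is hypothetical, and using the quadrant-aware `atan2` instead of a plain arctangent is a choice made here, not something stated by the source.

```python
import math


def transmission_and_phase(real_part, imag_part):
    """Convert predicted real and imaginary spectra into transmission and phase."""
    transmission = [re * re + im * im for re, im in zip(real_part, imag_part)]
    phase = [math.atan2(im, re) for re, im in zip(real_part, imag_part)]
    return transmission, phase


# Toy three-point spectra (placeholder numbers).
T, Ph = transmission_and_phase([1.0, 0.0, 0.6], [0.0, 1.0, 0.8])
```

Predicting the real and imaginary parts and converting afterward avoids training directly on a phase that wraps at the ends of its range.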
Figure S23 in the Supplementary Material illustrates the predicted transmission and phase responses of the H-shaped metasurface through the real and imaginary parts of prediction. This approach has broad applicability in fields such as metalens and anomalous reflection metasurfaces, where both amplitude and phase need to be accurately matched.44,59,60 These results demonstrate that the model can accurately predict the phase behavior of the metasurface, which is critical for applications involving phase manipulation.
3.2 Application to Inverse Design
The predicted structural parameters and the corresponding predicted spectra are presented in Figs. 6(a)–6(d), alongside a comparison with the spectra processed by the GSSG. This comparison is used to train the DNN through a loss function, which drives the predicted spectra toward the GSSG-generated spectra. In addition, to further validate the application of GLSaT, we use the DNN-predicted structural parameters to simulate the corresponding spectra with FDTD and compare them with the initial spectra from the dataset. The results are summarized in Table 3. For each case, the differences between the target spectra and the FDTD-simulated spectra of the predicted structures are quantified in terms of peak wavelength shift, peak deviation, and FW0.9M. In addition, the feature performance error and its robust counterpart were evaluated under conservative fabrication tolerances. All cases exhibit small deviations and maintain error values below 0.07, confirming that the designed structures not only reproduce the target spectral features with high accuracy but also remain robust against realistic fabrication variations. The results demonstrate that our inverse design framework effectively achieves the desired structural predictions and that the predicted spectra closely align with the spectral requirements of metasurface filters. These findings underscore the capability of our method to design metasurface optical filters accurately and efficiently.
![]()
Figure 6.(a)–(d) Comparison of the spectra at different stages of the inverse design process. The spectra generated by the Gaussian-shaped spectrum generator (
| (a) | (b) | (c) | (d) | ||
| Peak wavelength shift | 0 | 1.3378 | 2.6756 | 0 | 0 |
| Peak deviation | 0.0017 | 0 | 0.0041 | 0.0002 | 0.0006 |
| FW0.9M | 4.2766 | 10.6632 | 3.99 | 8.4988 | 10.9156 |
| 0.0425 | 0.0207 | 0.0215 | 0.0279 | ||
| 0.0668 | 0.0214 | 0.0285 | 0.0314 |
Table 3. Comparison between the target spectra
Moreover, to demonstrate the adaptability of the proposed framework beyond filtering tasks, we also extended the inverse design strategy to the design of broadband metalenses (see Sec. 4 of the Supplementary Material for details).
4 Conclusion
In this work, we propose the GLSaT as an effective and scalable framework for spectral prediction in metasurface design. By incorporating global attention for capturing inter-fragment dependencies and local attention for fine-grained spectral features, GLSaT can map the relationship between structural parameters and spectral responses with high accuracy. Comprehensive explainability analyses through attention mechanisms further demonstrate its ability to effectively capture both global and local spectral dependencies, thereby validating its physical interpretability.
Our extensive experimental validations demonstrate the generalization capability of GLSaT across diverse metasurface functionalities, including but not limited to reflection modulation, transmission control, phase manipulation, and scattering pattern prediction. The GLSaT architecture not only improves prediction accuracy compared to existing baselines but also enhances the efficiency of data-driven metasurface inverse design, such as the design of metasurface optical filters. Despite its efficiency and accuracy, we acknowledge three key constraints: empirical errors from uneven dataset distributions, overfitting risks from model complexity, and generalization limits imposed by dataset size. Guided by generalization error bound theory, these factors define the performance boundary, which can be further improved with larger and more diverse datasets or refined model architectures. More details on the boundary analysis of GLSaT are provided in Sec. 5 of the Supplementary Material.
GLSaT provides an efficient, precise, and interpretable approach for metasurface design, offering a scalable solution for spectral prediction and inverse design. Future work will focus on extending its application to multifunctional metasurfaces and structural shape optimization, further advancing high-precision and computationally efficient metasurface design.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Grant No. 12204541), the Science and Technology Innovation Program of Hunan Province (Grant No. 2021RC3083), and the High-level Talents Programs of the National University of Defense Technology. We also thank Zhenqian Xiao from Harbin Institute of Technology for the useful discussions.
Biographies of the authors are not available.
References
[6] I. Brener, A. Faraon et al. Dielectric Metamaterials, 175-194(2020).
[13] I. Brener, S. Kruk, Y. Kivshar et al. Dielectric Metamaterials, 145-174(2020).
[18] B. Slovick et al. Perfect dielectric-metamaterial reflector. Phys. Rev. B, 88, 165116(2013).
[43] A. Vaswani et al. Attention is all you need. Adv. Neural Inf. Process. Syst., 30, 5998-6008(2017).
[46] A. L. Maas, A. Y. Hannun, A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models(2013).
[47] A. Dosovitskiy et al. An image is worth 16 × 16 words: transformers for image recognition at scale(2021).
[48] K. He et al. Deep residual learning for image recognition, 770-778(2016).
[49] S. Bai, J. Z. Kolter, V. Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling(2018).
[50] I. Loshchilov, F. Hutter. Decoupled weight decay regularization(2019).
[51] I. Loshchilov, F. Hutter. SGDR: stochastic gradient descent with warm restarts(2017).
[53] S. Wang et al. Broadband achromatic optical metasurface devices. Nat. Commun., 8, 187(2017).
[54] S. Shrestha et al. Broadband achromatic dielectric metalenses. Light: Sci. Appl., 7, 85(2018).
[55] F. Cao et al. Justices for information bottleneck theory(2023).
[56] K. Kawaguchi et al. How does information bottleneck help deep learning?, 16049-16096(2023).
[57] Z. Yang et al. Exploring information processing in large language models: insights from information bottleneck theory(2025).
[60] T. He et al. Perfect anomalous reflectors at optical frequencies. Sci. Adv., 8, eabk3381(2022).
