- [AISTATS] Score-based Quickest Change Detection for Unnormalized Models. Suya Wu, Enmao Diao, Taposh Banerjee, and 2 more authors. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2023
Classical change detection algorithms typically require modeling pre-change and post-change distributions. The calculations may not be feasible for various machine learning models because of the complexity of computing the partition functions and normalized distributions. Additionally, these methods may suffer from a lack of robustness to model mismatch and noise. In this paper, we develop a new variant of the classical Cumulative Sum (CUSUM) change detection algorithm, namely Score-based CUSUM (SCUSUM), based on Fisher divergence and the Hyvärinen score. Our method makes quickest change detection applicable to unnormalized distributions. We provide a theoretical analysis of the detection delay given the constraints on false alarms. We prove the asymptotic optimality of the proposed method in some particular cases. We also provide numerical experiments to demonstrate our method’s computation, performance, and robustness advantages.
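The score-based recursion described in this abstract can be illustrated with a minimal sketch. It assumes one-dimensional Gaussian pre- and post-change models so that the Hyvärinen score has a closed form; the function names, the `scale` multiplier, and the threshold are hypothetical choices for illustration, not the authors' implementation. Because the score depends only on derivatives of the log-density, any normalization constant drops out.

```python
def hyvarinen_score_gaussian(x, mean, std):
    """Hyvarinen score 0.5 * (d/dx log p)^2 + d^2/dx^2 log p for a 1-D Gaussian.

    Only derivatives of log p appear, so the normalizing constant is never needed.
    """
    grad_log_p = -(x - mean) / std**2
    laplacian_log_p = -1.0 / std**2
    return 0.5 * grad_log_p**2 + laplacian_log_p


def scusum_stopping_time(samples, pre, post, threshold, scale=1.0):
    """Return the 0-based index of the first alarm, or None if no alarm is raised.

    pre and post are (mean, std) pairs (an assumption made for this sketch).
    The statistic follows the CUSUM recursion, clipped at zero, with the
    log-likelihood ratio replaced by a scaled difference of Hyvarinen scores.
    """
    stat = 0.0
    for n, x in enumerate(samples):
        increment = scale * (
            hyvarinen_score_gaussian(x, *pre) - hyvarinen_score_gaussian(x, *post)
        )
        stat = max(stat + increment, 0.0)
        if stat >= threshold:
            return n
    return None


# Hypothetical stream: 20 samples consistent with N(0, 1), then a shift towards N(3, 1).
stream = [-1.0, 1.0] * 10 + [2.0, 4.0] * 5
alarm = scusum_stopping_time(stream, pre=(0.0, 1.0), post=(3.0, 1.0), threshold=10.0)
```

In this toy stream the increments are negative before the change, so the statistic stays at zero, and positive afterwards, so it crosses the threshold a few samples after the shift.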
- [arXiv] Quickest Change Detection for Unnormalized Statistical Models. Suya Wu, Enmao Diao, Taposh Banerjee, and 2 more authors. arXiv e-prints, 2023
Classical quickest change detection algorithms require modeling pre-change and post-change distributions. Such an approach may not be feasible for various machine learning models because of the complexity of computing the explicit distributions. Additionally, these methods may suffer from a lack of robustness to model mismatch and noise. This paper develops a new variant of the classical Cumulative Sum (CUSUM) algorithm for quickest change detection. This variant is based on Fisher divergence and the Hyvärinen score and is called the Score-based CUSUM (SCUSUM) algorithm. The SCUSUM algorithm makes change detection applicable to unnormalized statistical models, i.e., models for which the probability density function contains an unknown normalization constant. The asymptotic optimality of the proposed algorithm is investigated by deriving expressions for average detection delay and the mean running time to a false alarm. Numerical results are provided to demonstrate the performance of the proposed algorithm.
- [ICME] Semi-Supervised Federated Learning for Keyword Spotting. Enmao Diao, Eric W Tramel, Jie Ding, and 1 more author. In 2023 IEEE International Conference on Multimedia and Expo (ICME), 2023
Keyword Spotting (KWS) is a critical aspect of audio-based applications on mobile devices and virtual assistants. Recent developments in Federated Learning (FL) have significantly expanded the ability to train machine learning models by utilizing the computational and private data resources of numerous distributed devices. However, existing FL methods typically require that devices possess accurate ground-truth labels, which can be both expensive and impractical when dealing with local audio data. In this study, we first demonstrate the effectiveness of Semi-Supervised Learning (SSL) and FL for KWS. We then extend our investigation to Semi-Supervised Federated Learning (SSFL) for KWS, where devices possess completely unlabeled data, while the server has access to a small amount of labeled data. We perform numerical analyses using state-of-the-art SSL, FL, and SSFL techniques to demonstrate that the performance of KWS models can be significantly improved by leveraging the abundant unlabeled heterogeneous data available on devices.
- [KDD] Once-for-All Federated Learning: Learning From and Deploying to Heterogeneous Clients. Kamala Varma, Enmao Diao, Tanya Roosta, and 2 more authors. In KDD 2023 Workshop on Federated Learning for Distributed Data Mining, 2023
Federated learning (FL) enables multiple client devices to train a single machine learning model collaboratively. As FL often involves various smart devices, it is important to adapt the FL pipeline to accommodate device resource constraints. This work addresses the problem of training and storing memory-intensive deep neural network architectures on resource-constrained devices. Existing solutions often involve computationally expensive methods. We propose Once-for-All Federated Learning (OFA-FL) to overcome this limitation by learning a model that concurrently optimizes sub-networks of various sizes. Clients can therefore receive the sub-network best suited for their device resources without extra computation. Our experiments show that each component of OFA-FL contributes to well-performing FL-produced sub-networks while maintaining a global network design that supports the efficient deployment of device resource-specific sub-networks.
- [UAI] Robust Quickest Change Detection for Unnormalized Models. Suya Wu, Enmao Diao, Jie Ding, and 2 more authors. In Uncertainty in Artificial Intelligence (UAI), 2023
Detecting an abrupt and persistent change in the underlying distribution of online data streams is an important problem in many applications. This paper proposes a new robust score-based algorithm called RSCUSUM, which can be applied to unnormalized models and addresses the issue of unknown post-change distributions. RSCUSUM replaces the Kullback-Leibler divergence with the Fisher divergence between pre- and post-change distributions for computational efficiency in unnormalized statistical models and introduces a notion of the “least favorable” distribution for robust change detection. The algorithm and its theoretical analysis are demonstrated through simulation studies.
- [DCC] A Physics-Informed Vector Quantized Autoencoder for Data Compression of Turbulent Flow. Mohammadreza Momenifar, Enmao Diao, Vahid Tarokh, and 1 more author. In 2022 Data Compression Conference (DCC), 2022
Analyzing large-scale data from simulations of turbulent flows is memory intensive, requiring significant resources. This major challenge highlights the need for data compression techniques. In this study, we apply a physics-informed Deep Learning technique based on vector quantization to generate a discrete, low-dimensional representation of data from simulations of three-dimensional turbulent flows. The deep learning framework is composed of convolutional layers and incorporates physical constraints on the flow, such as preserving incompressibility and global statistical characteristics of the velocity gradients. The accuracy of the model is assessed using statistical, comparison-based similarity and physics-based metrics. The training data set is produced from Direct Numerical Simulation of an incompressible, statistically stationary, isotropic turbulent flow. The performance of this lossy data compression scheme is evaluated not only with unseen data from the stationary, isotropic turbulent flow, but also with data from decaying isotropic turbulence, and a Taylor-Green vortex flow. Defining the compression ratio (CR) as the ratio of original data size to the compressed one, the results show that our model based on vector quantization can offer CR = 85 with a mean square error (MSE) of O(10^-3), and predictions that faithfully reproduce the statistics of the flow, except at the very smallest scales where there is some loss. Compared to the recent study based on a conventional autoencoder where compression is performed in a continuous space, our model improves the CR by more than 30 percent, and reduces the MSE by an order of magnitude. Our compression model is an attractive solution for situations where fast, high quality and low-overhead encoding and decoding of large data are required.
- [JoT] Dimension Reduced Turbulent Flow Data from Deep Vector Quantizers. Mohammadreza Momenifar, Enmao Diao, Vahid Tarokh, and 1 more author. Journal of Turbulence (JoT), 2022
Analyzing large-scale data from simulations of turbulent flows is memory intensive, requiring significant resources. This major challenge highlights the need for data compression techniques. In this study, we apply a physics-informed Deep Learning technique based on vector quantization to generate a discrete, low-dimensional representation of data from simulations of three-dimensional turbulent flows. The deep learning framework is composed of convolutional layers and incorporates physical constraints on the flow, such as preserving incompressibility and global statistical characteristics of the velocity gradients. The accuracy of the model is assessed using statistical, comparison-based similarity and physics-based metrics. The training data set is produced from Direct Numerical Simulation of an incompressible, statistically stationary, isotropic turbulent flow. The performance of this lossy data compression scheme is evaluated not only with unseen data from the stationary, isotropic turbulent flow, but also with data from decaying isotropic turbulence, a Taylor-Green vortex flow, and a turbulent channel flow. Defining the compression ratio (CR) as the ratio of original data size to the compressed one, the results show that our model based on vector quantization can offer CR = 85 with a mean square error (MSE) of O(10^-3), and predictions that faithfully reproduce the statistics of the flow, except at the very smallest scales where there is some loss. Compared to the recent study of Glaws et al. (Physical Review Fluids, 5(11):114602, 2020), which was based on a conventional autoencoder (where compression is performed in a continuous space), our model improves the CR by more than 30 percent, and reduces the MSE by an order of magnitude. Our compression model is an attractive solution for situations where fast, high quality and low-overhead encoding and decoding of large data are required.
- [IEEE Access] Score-based Hypothesis Testing for Unnormalized Models. Suya Wu, Enmao Diao, Khalil Elkhalil, and 2 more authors. IEEE Access, 2022
Unnormalized statistical models play an important role in machine learning, statistics, and signal processing. In this paper, we derive a new hypothesis testing procedure for unnormalized models. Our approach is motivated by the success of score matching techniques that avoid the intensive computational costs of normalization constants in many high-dimensional settings. Our proposed test statistic is the difference between Hyvärinen scores corresponding to the null and alternative hypotheses. Under some reasonable conditions, we prove that the asymptotic distribution of this statistic is Chi-squared. We outline a bootstrap approach to learn the test critical values, particularly when the distribution under the null hypothesis cannot be expressed in a closed form, and provide consistency guarantees. Finally, we conduct extensive numerical experiments and demonstrate that our proposed approach outperforms goodness-of-fit benchmarks in various settings.
- [NeurIPS] GAL: Gradient Assisted Learning for Decentralized Multi-Organization Collaborations. Enmao Diao, Jie Ding, and Vahid Tarokh. Advances in Neural Information Processing Systems (NeurIPS), 2022
Collaborations among multiple organizations, such as financial institutions, medical centers, and retail markets in decentralized settings are crucial to providing improved service and performance. However, the underlying organizations may have little interest in sharing their local data, models, and objective functions. These requirements have created new challenges for multi-organization collaboration. In this work, we propose Gradient Assisted Learning (GAL), a new method for multiple organizations to assist each other in supervised learning tasks without sharing local data, models, and objective functions. In this framework, all participants collaboratively optimize the aggregate of local loss functions, and each participant autonomously builds its own model by iteratively fitting the gradients of the overarching objective function. We also provide asymptotic convergence analysis and practical case studies of GAL. Experimental studies demonstrate that GAL can achieve performance close to centralized learning when all data, models, and objective functions are fully disclosed.
- [NeurIPS] SemiFL: Semi-Supervised Federated Learning for Unlabeled Clients with Alternate Training. Enmao Diao, Jie Ding, and Vahid Tarokh. Advances in Neural Information Processing Systems (NeurIPS), 2022
Federated Learning allows the training of machine learning models by using the computation and private data resources of many distributed clients. Most existing results on Federated Learning (FL) assume the clients have ground-truth labels. However, in many practical scenarios, clients may be unable to label task-specific data due to a lack of expertise or resources. We propose SemiFL to address the problem of combining communication-efficient FL such as FedAvg with Semi-Supervised Learning (SSL). In SemiFL, clients have completely unlabeled data and can train multiple local epochs to reduce communication costs, while the server has a small amount of labeled data. We provide a theoretical understanding of the success of data augmentation-based SSL methods to illustrate the bottleneck of a vanilla combination of communication-efficient FL with SSL. To address this issue, we propose alternate training to ‘fine-tune global model with labeled data’ and ‘generate pseudo-labels with the global model.’ We conduct extensive experiments and demonstrate that our approach significantly improves the performance of a labeled server with unlabeled clients training with multiple local epochs. Moreover, our method outperforms many existing SSFL baselines and performs competitively with the state-of-the-art FL and SSL results.
- [NeurIPS] PerFedSI: A Framework for Personalized Federated Learning with Side Information. Liam Collins, Enmao Diao, Tanya Roosta, and 2 more authors. In NeurIPS 2022 Workshop on Federated Learning: Recent Advances and New Challenges, 2022
With an ever-increasing number of smart edge devices with computation and communication constraints, Federated Learning (FL) is a promising paradigm for learning from distributed devices and their data. Typical approaches to FL aim to learn a single model that simultaneously performs well for all clients. But such an approach may be ineffective when the clients’ data distributions are heterogeneous. In these cases, we aim to learn personalized models for each client’s data yet still leverage shared information across clients. A critical avenue that may allow for such personalization is the presence of client-specific side information available to each client, such as client embeddings obtained from domain-specific knowledge, pre-trained models, or simply one-hot encodings. In this work, we propose a new FL framework for utilizing a general form of client-specific side information for personalized federated learning. We prove that incorporating side information can improve model performance for simplified multi-task linear regression and matrix completion problems. Further, we validate these results with image classification experiments on Omniglot, CIFAR-10, and CIFAR-100, revealing that proper use of side information can be beneficial for personalization.
- [Asilomar] Personalized Federated Recommender Systems with Private and Partially Federated AutoEncoders. Qi Le, Enmao Diao, Xinran Wang, and 3 more authors. In 2022 56th Asilomar Conference on Signals, Systems, and Computers (Asilomar), 2022
Recommender Systems (RSs) have become increasingly important in many application domains, such as digital marketing. Conventional RSs often need to collect users’ data, centralize them on the server-side, and form a global model to generate reliable recommendations. However, they suffer from two critical limitations: the personalization problem that the RSs trained traditionally may not be customized for individual users, and the privacy problem that directly sharing user data is not encouraged. We propose Personalized Federated Recommender Systems (PersonalFR), which introduces a personalized autoencoder-based recommendation model with Federated Learning (FL) to address these challenges. PersonalFR guarantees that each user can learn a personal model from the local dataset and other participating users’ data without sharing local data, data embeddings, or models. PersonalFR consists of three main components, including AutoEncoder-based RSs (ARSs) that learn the user-item interactions, Partially Federated Learning (PFL) that updates the encoder locally and aggregates the decoder on the server-side, and Partial Compression (PC) that only computes and transmits active model parameters. Extensive experiments on two real-world datasets demonstrate that PersonalFR can achieve private and personalized performance comparable to that trained by centralizing all users’ data. Moreover, PersonalFR requires significantly less computation and communication overhead than standard FL baselines.
- [ICLR] Pruning Deep Neural Networks from a Sparsity Perspective. Enmao Diao, Ganghua Wang, Jiawei Zhang, and 3 more authors. In The Eleventh International Conference on Learning Representations (ICLR), 2022
In recent years, deep network pruning has attracted significant attention in order to enable the rapid deployment of AI into small devices with computation and memory constraints. Pruning is often achieved by dropping redundant weights, neurons, or layers of a deep network while attempting to retain a comparable test performance. Many deep pruning algorithms have been proposed with impressive empirical success. However, existing approaches lack a quantifiable measure to estimate the compressibility of a sub-network during each pruning iteration and thus may under-prune or over-prune the model. In this work, we propose PQ Index (PQI) to measure the potential compressibility of deep neural networks and use this to develop a Sparsity-informed Adaptive Pruning (SAP) algorithm. Our extensive experiments corroborate the hypothesis that for a generic pruning procedure, PQI decreases first when a large model is being effectively regularized and then increases when its compressibility reaches a limit that appears to correspond to the beginning of underfitting. Subsequently, PQI decreases again when model collapse and significant deterioration in the performance of the model start to occur. Additionally, our experiments demonstrate that the proposed adaptive pruning algorithm with proper choice of hyper-parameters is superior to iterative pruning algorithms such as the lottery ticket-based pruning methods, in terms of both compression efficiency and robustness.
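A norm-ratio sparsity measure of the kind this abstract refers to can be sketched as follows. The abstract does not state the formula, so the form used here, 1 - d^(1/q - 1/p) * ||w||_p / ||w||_q with 0 < p < q, is an assumption for illustration, as are the function name and the default values of `p` and `q`.

```python
def pq_index(weights, p=1.0, q=2.0):
    """Norm-ratio sparsity measure in [0, 1): larger values mean a sparser vector.

    Assumed form (not taken from the abstract above):
        1 - d**(1/q - 1/p) * ||w||_p / ||w||_q,  with 0 < p < q.
    """
    if not 0 < p < q:
        raise ValueError("requires 0 < p < q")
    d = len(weights)
    norm_p = sum(abs(w) ** p for w in weights) ** (1.0 / p)
    norm_q = sum(abs(w) ** q for w in weights) ** (1.0 / q)
    if norm_q == 0.0:
        raise ValueError("undefined for an all-zero vector")
    return 1.0 - d ** (1.0 / q - 1.0 / p) * norm_p / norm_q


dense = pq_index([1.0, 1.0, 1.0, 1.0])   # all weights equal: no sparsity
sparse = pq_index([1.0, 0.0, 0.0, 0.0])  # a single non-zero weight
```

Under this assumed definition, a vector with equal-magnitude entries scores zero and a one-hot vector of length d scores 1 - d^(-1/2), so tracking the value across pruning iterations gives a scale-invariant signal of how concentrated the remaining weights are.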
- [CVMI] Multimodal Controller for Generative Models. Enmao Diao, Jie Ding, and Vahid Tarokh. In Computer Vision and Machine Intelligence (CVMI), 2022
Class-conditional generative models are crucial tools for data generation from user-specified class labels. Existing approaches for class-conditional generative models require nontrivial modifications of backbone generative architectures to model conditional information fed into the model. This paper introduces a plug-and-play module named ‘multimodal controller’ to generate multimodal data without introducing additional learning parameters. In the absence of the controllers, our model reduces to non-conditional generative models. We test the efficacy of multimodal controllers on CIFAR10, COIL100, and Omniglot benchmark datasets. We demonstrate that multimodal controlled generative models (including VAE, PixelCNN, Glow, and GAN) can generate class-conditional images of significantly better quality when compared with conditional generative models. Moreover, we show that multimodal controlled models can also create novel modalities of images.
- [arXiv] Decentralized Multi-Target Cross-Domain Recommendation for Multi-Organization Collaborations. Enmao Diao, Vahid Tarokh, and Jie Ding. arXiv preprint arXiv:2110.13340, 2021
Recommender Systems (RSs) are operated locally by different organizations in many realistic scenarios. If various organizations can fully share their data and perform computation in a centralized manner, they may significantly improve the accuracy of recommendations. However, collaborations among multiple organizations in enhancing the performance of recommendations are primarily limited due to the difficulty of sharing data and models. To address this challenge, we propose Decentralized Multi-Target Cross-Domain Recommendation (DMTCDR) with Multi-Target Assisted Learning (MTAL) and Assisted AutoEncoder (AAE). Our method can help multiple organizations collaboratively improve their recommendation performance in a decentralized manner without sharing sensitive assets. Consequently, it allows decentralized organizations to collaborate and form a community of shared interest. We conduct extensive experiments to demonstrate that the new method can significantly outperform locally trained RSs and mitigate the cold start problem.
- [AAAI] Emulating Spatio-Temporal Realizations of Three-Dimensional Isotropic Turbulence via Deep Sequence Learning Models. Mohammadreza Momenifar, Enmao Diao, Vahid Tarokh, and 1 more author. AAAI 2022 Workshop on AI to Accelerate Science and Engineering, 2021
We use a data-driven approach to model a three-dimensional turbulent flow using cutting-edge Deep Learning techniques. The deep learning framework incorporates physical constraints on the flow, such as preserving incompressibility and global statistical invariants of the velocity gradient tensor. The accuracy of the model is assessed using statistical and physics-based metrics. The data set comes from Direct Numerical Simulation of an incompressible, statistically stationary, isotropic turbulent flow in a cubic box. Since the size of the dataset is memory intensive, we first generate a low-dimensional representation of the velocity data, and then pass it to a sequence prediction network that learns the spatial and temporal correlations of the underlying data. The dimensionality reduction is performed via extraction using a Vector-Quantized Autoencoder (VQ-AE), which learns the discrete latent variables. For the sequence forecasting, the idea of the Transformer architecture from natural language processing is used, and its performance is compared against more standard Recurrent Networks (such as Convolutional LSTM). These architectures are designed and trained to perform a sequence-to-sequence multi-class classification task in which they take an input sequence with a fixed length (k) and predict a sequence with a fixed length (p), representing the future time instants of the flow. Our results for the short-term predictions show that the accuracy of results for both models deteriorates across predicted snapshots due to the autoregressive nature of the predictions. Based on our diagnostic tests, the trained Conv-Transformer model outperforms the Conv-LSTM one and can accurately, both quantitatively and qualitatively, retain the large scales and capture well the inertial scales of the flow but fails at recovering the small and intermittent fluid motions.
- [DCC] Deep Clustering of Compressed Variational Embeddings. Suya Wu, Enmao Diao, Jie Ding, and 1 more author. In 2020 Data Compression Conference (DCC), 2020
Motivated by the ever-increasing demands for limited communication bandwidth and low-power consumption, we propose a new methodology, named joint Variational Autoencoders with Bernoulli mixture models (VAB), for performing clustering in the compressed data domain. The idea is to reduce the data dimension by Variational Autoencoders (VAEs) and group data representations by Bernoulli mixture models (BMMs). Once jointly trained for compression and clustering, the model can be decomposed into two parts: a data vendor that encodes the raw data into compressed data, and a data consumer that classifies the received (compressed) data. In this way, the data vendor benefits from data security and communication bandwidth, while the data consumer benefits from low computational complexity. To enable training using the gradient descent algorithm, we propose to use the Gumbel-Softmax distribution to resolve the infeasibility of the back-propagation algorithm when assessing categorical samples.
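The Gumbel-Softmax relaxation mentioned in this abstract replaces a hard categorical draw with a smooth, differentiable approximation. The following is a minimal standard-library sketch of drawing one relaxed sample; the function name, the clamping constant, and the example logits are illustrative assumptions, and a real training loop would use an automatic-differentiation framework instead of plain floats.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature, rng):
    """Draw one relaxed categorical sample: softmax((logits + Gumbel noise) / temperature).

    Lower temperatures push the sample towards a one-hot vector, while higher
    temperatures make it closer to uniform.
    """
    noisy = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax: subtract the maximum before exponentiating.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical cluster logits for a three-component mixture.
sample = gumbel_softmax_sample([1.0, 2.0, 0.5], temperature=0.5, rng=random.Random(0))
```

The output is a probability vector rather than a discrete index, which is what allows gradients to flow through the sampling step.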
- [DCC] DRASIC: Distributed Recurrent Autoencoder for Scalable Image Compression. Enmao Diao, Jie Ding, and Vahid Tarokh. In 2020 Data Compression Conference (DCC), 2020
We propose a new architecture for distributed image compression from a group of distributed data sources. The work is motivated by practical needs of data-driven codec design, low power consumption, robustness, and data privacy. The proposed architecture, which we refer to as Distributed Recurrent Autoencoder for Scalable Image Compression (DRASIC), is able to train distributed encoders and one joint decoder on correlated data sources. Its compression capability is much better than the method of training codecs separately. Meanwhile, the performance of our distributed system with 10 distributed sources is only within 2 dB peak signal-to-noise ratio (PSNR) of the performance of a single codec trained with all data sources. We experiment with distributed sources with different correlations and show how well our data-driven methodology matches the Slepian-Wolf Theorem in Distributed Source Coding (DSC). To the best of our knowledge, this is the first data-driven DSC framework for general distributed code design with deep learning.
- [ICASSP] Speech Emotion Recognition with Dual-Sequence LSTM Architecture. Jianyou Wang, Michael Xue, Ryan Culhane, and 3 more authors. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020
Speech Emotion Recognition (SER) has emerged as a critical component of the next generation human-machine interfacing technologies. In this work, we propose a new dual-level model that predicts emotions based on both MFCC features and mel-spectrograms produced from raw audio signals. Each utterance is preprocessed into MFCC features and two mel-spectrograms at different time-frequency resolutions. A standard LSTM processes the MFCC features, while a novel LSTM architecture, denoted as Dual-Sequence LSTM (DS-LSTM), processes the two mel-spectrograms simultaneously. The outputs are later averaged to produce a final classification of the utterance. Our proposed model achieves, on average, a weighted accuracy of 72.7% and an unweighted accuracy of 73.3%—a 6% improvement over current state-of-the-art unimodal models—and is comparable with multimodal models that leverage textual information as well as audio signals.
- [TIT] On Statistical Efficiency in Learning. Jie Ding, Enmao Diao, Jiawei Zhou, and 1 more author. IEEE Transactions on Information Theory (TIT), 2020
A central issue of many statistical learning problems is to select an appropriate model from a set of candidate models. Large models tend to inflate the variance (or overfitting), while small models tend to cause biases (or underfitting) for a given fixed dataset. In this work, we address the critical challenge of model selection to strike a balance between model fitting and model complexity, thus gaining reliable predictive power. We consider the task of approaching the theoretical limit of statistical learning, meaning that the selected model has the predictive performance that is as good as the best possible model given a class of potentially misspecified candidate models. We propose a generalized notion of Takeuchi’s information criterion and prove that the proposed method can asymptotically achieve the optimal out-sample prediction loss under reasonable assumptions. It is the first proof of the asymptotic property of Takeuchi’s information criterion to our best knowledge. Our proof applies to a wide variety of nonlinear models, loss functions, and high dimensionality (in the sense that the models’ complexity can grow with sample size). The proposed method can be used as a computationally efficient surrogate for leave-one-out cross-validation. Moreover, for modeling streaming data, we propose an online algorithm that sequentially expands the model complexity to enhance selection stability and reduce computation cost. Experimental studies show that the proposed method has desirable predictive power and significantly less computational cost than some popular methods.
- [ICLR] HeteroFL: Computation and Communication Efficient Federated Learning for Heterogeneous Clients. Enmao Diao, Jie Ding, and Vahid Tarokh. In International Conference on Learning Representations (ICLR), 2020
Federated Learning (FL) is a method of training machine learning models on private data distributed over a large number of possibly heterogeneous clients such as mobile phones and IoT devices. In this work, we propose a new federated learning framework named HeteroFL to address heterogeneous clients equipped with very different computation and communication capabilities. Our solution can enable the training of heterogeneous local models with varying computation complexities and still produce a single global inference model. For the first time, our method challenges the underlying assumption of existing work that local models have to share the same architecture as the global model. We demonstrate several strategies to enhance FL training and conduct extensive empirical evaluations, including five computation complexity levels of three model architectures on three datasets. We show that adaptively distributing subnetworks according to clients’ capabilities is both computation and communication efficient.
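One way to picture how sub-networks of different sizes can still yield a single global model is the sketch below. It assumes, purely for illustration, that each client trains the top-left slice of a global weight matrix and that the server averages every entry over the clients that hold it; the abstract does not spell out the aggregation rule, and the function name and data layout are hypothetical.

```python
def aggregate_nested(global_weights, client_weights):
    """Average each entry of a global weight matrix over the clients that cover it.

    Every client matrix is assumed to be the top-left slice of the global matrix
    (an assumption of this sketch). Entries that no client holds keep their
    previous global value.
    """
    rows, cols = len(global_weights), len(global_weights[0])
    updated = [row[:] for row in global_weights]
    for i in range(rows):
        for j in range(cols):
            values = [
                w[i][j] for w in client_weights if i < len(w) and j < len(w[0])
            ]
            if values:
                updated[i][j] = sum(values) / len(values)
    return updated


# Hypothetical round: one full-size client and one client with a 1x1 sub-network.
full_client = [[2.0, 2.0], [2.0, 2.0]]
small_client = [[4.0]]
new_global = aggregate_nested([[0.0, 0.0], [0.0, 0.0]], [full_client, small_client])
```

Only the shared top-left entry is averaged across both clients; the remaining entries are updated from the larger client alone, which is what lets low-capacity devices contribute without holding the full model.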
- [Big Data] Restricted Recurrent Neural Networks. Enmao Diao, Jie Ding, and Vahid Tarokh. In 2019 IEEE International Conference on Big Data (Big Data), 2019
Recurrent Neural Networks (RNNs) and their variations, such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), have become standard building blocks for learning online data of sequential nature in many research areas, including natural language processing and speech data analysis. In this paper, we present a new methodology to significantly reduce the number of parameters in RNNs while maintaining performance that is comparable to or even better than classical RNNs. The new proposal, referred to as Restricted Recurrent Neural Network (RRNN), restricts the weight matrices corresponding to the input data and hidden states at each time step to share a large proportion of parameters. The new architecture can be regarded as a compression of its classical counterpart, but it does not require pre-training or sophisticated parameter fine-tuning, both of which are major issues in most existing compression techniques. Experiments on natural language modeling show that compared with its classical counterpart, the restricted recurrent architecture generally produces comparable results at about 50% compression rate. In particular, the Restricted LSTM can outperform classical RNN with even fewer parameters.
- [ICASSP] A Penalized Method for the Predictive Limit of Learning. Jie Ding, Enmao Diao, Jiawei Zhou, and 1 more author. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018
Machine learning systems learn from and make predictions by building models from observed data. Because large models tend to overfit while small models tend to underfit for a given fixed dataset, a critical challenge is to select an appropriate model (e.g., a set of variables/features). Model selection aims to strike a balance between the goodness of fit and model complexity, and thus to gain reliable predictive power. In this paper, we study a penalized model selection technique that asymptotically achieves the optimal expected prediction loss (referred to as the limit of learning) offered by a set of candidate models. We prove that the proposed procedure is both statistically efficient in the sense that it asymptotically approaches the limit of learning, and computationally efficient in the sense that it can be much faster than cross-validation methods. Our theory applies to a wide variety of model classes, loss functions, and high dimensions (in the sense that the models’ complexity can grow with data size). We released a Python package with our proposed method for general usage such as logistic regression and neural networks.