# Projecting "Better Than Randomly": How to Reduce the Dimensionality of Very Large Datasets in a Way That Outperforms Random Projections

```bibtex
@article{Wojnowicz2016ProjectingT,
  title   = {Projecting "Better Than Randomly": How to Reduce the Dimensionality of Very Large Datasets in a Way That Outperforms Random Projections},
  author  = {Michael Thomas Wojnowicz and Di Zhang and Glenn Chisholm and Xuan Zhao and Matt Wolff},
  journal = {2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)},
  year    = {2016},
  pages   = {184-193}
}
```

For very large datasets, random projections (RP) have become the tool of choice for dimensionality reduction, owing to the computational complexity of principal component analysis (PCA). However, the recent development of randomized principal component analysis (RPCA) has opened up the possibility of obtaining approximate principal components on very large datasets. In this paper, we compare the performance of RPCA and RP in dimensionality reduction for supervised learning. In Experiment 1…
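The two techniques the abstract compares can be sketched side by side in NumPy. This is a minimal illustration on toy Gaussian data, not the paper's pipeline: the random projection uses a dense Gaussian matrix, and the RPCA step follows the standard Halko-style range-finder idea (sketch, orthonormalize, SVD of the small projected matrix).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 500))   # n samples, D features (toy data)
k = 20                             # target dimensionality

# Random projection: multiply by a Gaussian matrix scaled by 1/sqrt(k).
R = rng.normal(size=(500, k)) / np.sqrt(k)
X_rp = X @ R

# Randomized PCA sketch: sample the range of X with an oversampled test
# matrix, orthonormalize, then take an exact SVD of the small matrix to
# recover approximate top-k principal directions.
Omega = rng.normal(size=(500, k + 10))
Q, _ = np.linalg.qr(X @ Omega)              # orthonormal basis for range(X)
_, _, Vt = np.linalg.svd(Q.T @ X, full_matrices=False)
X_rpca = X @ Vt[:k].T                       # project onto approx. PCs

print(X_rp.shape, X_rpca.shape)             # both (1000, 20)
```

Both reductions cost roughly one pass over the data; the difference the paper studies is whether the RPCA directions give better downstream supervised performance than the purely random ones.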

#### 8 Citations

“Influence sketching”: Finding influential samples in large-scale regressions

- Computer Science, Mathematics
- 2016 IEEE International Conference on Big Data (Big Data)
- 2016

A new scalable version of Cook's distance, a classical statistical technique for identifying samples which unusually strongly impact the fit of a regression model (and its downstream predictions), is developed, and a new algorithm called "influence sketching" is introduced, which can reliably and successfully discover influential samples.

An Introduction to Johnson-Lindenstrauss Transforms

- Computer Science
- ArXiv
- 2021

Johnson–Lindenstrauss Transforms are powerful tools for reducing the dimensionality of data while preserving key characteristics of that data, and they have found use in many fields from machine…
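The key characteristic a Johnson–Lindenstrauss transform preserves is pairwise Euclidean distance, up to a small multiplicative distortion. A minimal sketch of the Gaussian variant on synthetic data (dimensions chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, D, k = 50, 2000, 400
X = rng.normal(size=(n, D))

# Gaussian JL transform: entries drawn N(0, 1/k), so squared norms
# (and hence pairwise distances) are preserved in expectation.
R = rng.normal(size=(D, k)) / np.sqrt(k)
Y = X @ R

# Compare one pairwise distance before and after projection.
d_orig = np.linalg.norm(X[0] - X[1])
d_proj = np.linalg.norm(Y[0] - Y[1])
print(d_proj / d_orig)   # close to 1
```

The JL lemma makes this quantitative: with k on the order of log(n)/ε², all pairwise distances are preserved within a factor of 1 ± ε with high probability.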

Wavelet decomposition of software entropy reveals symptoms of malicious code

- Computer Science, Mathematics
- J. Innov. Digit. Ecosyst.
- 2016

A method is developed for automatically quantifying the extent to which patterned variations in a file's entropy signal make it "suspicious"; such a measure can be useful in machine learning models that detect malware by extracting millions of features from executable files.

Spotlight: Malware Lead Generation at Scale

- Computer Science
- ACSAC
- 2020

Spotlight, a large-scale malware lead-generation framework, is presented and it is shown that it can produce top-priority clusters with over 99% purity (i.e., homogeneity), which is higher than simpler approaches and prior work.

Speeded Up Visual Tracker with Adaptive Template Updating Method

- Computer Science
- CSPS
- 2017

This paper uses dense SIFT features to describe object appearance and randomized principal component analysis (RPCA) to reduce the dimensionality of the original feature space in a speeded-up visual tracker that is capable not only of long-term tracking but also of online tasks.

SUSPEND: Determining software suspiciousness by non-stationary time series modeling of entropy signals

- Computer Science
- Expert Syst. Appl.
- 2017

SUSPEND (SUSPicious ENtropy signal Detector) is an expert system which evaluates the suspiciousness of an executable file's entropy signal in order to subserve malware classification; it boosts the predictive performance of traditional entropy analysis from 77.02% to 96.62%.

A Survey of Machine Learning Methods and Challenges for Windows Malware Classification

- Computer Science, Mathematics
- ArXiv
- 2020

This survey aims to be useful both to cybersecurity practitioners who wish to learn more about how machine learning can be applied to the malware problem, and to data scientists who need background on the challenges in this uniquely complicated space.

#### References

Showing 1–10 of 19 references

Reducing High-Dimensional Data by Principal Component Analysis vs. Random Projection for Nearest Neighbor Classification

- Computer Science
- 2006 5th International Conference on Machine Learning and Applications (ICMLA'06)
- 2006

Two different dimensionality reduction methods, principal component analysis (PCA) and random projection (RP), are investigated for this purpose and compared with respect to the performance of the resulting nearest neighbor classifier on five image data sets and five microarray data sets.

Experiments with random projections for machine learning

- Mathematics, Computer Science
- KDD '03
- 2003

It is found that the random projection approach predictively underperforms PCA, but its computational advantages may make it attractive for certain applications.

Very sparse random projections

- Mathematics, Computer Science
- KDD '06
- 2006

This paper proposes sparse random projections, an approximate algorithm for estimating distances between pairs of points in a high-dimensional vector space; it multiplies the data matrix A by a random matrix R ∈ ℝ^{D×k}, reducing the D dimensions down to just k to speed up the computation.
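The sparse projection matrix can be constructed directly. A sketch of the "very sparse" scheme from this line of work, where entries are ±√s or 0 and only about a 1/s fraction of the matrix is nonzero (the specific dimensions here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
D, k = 1000, 50
s = np.sqrt(D)   # the "very sparse" choice s = sqrt(D)

# Entries take values +sqrt(s), 0, -sqrt(s) with probabilities
# 1/(2s), 1 - 1/s, 1/(2s), then are scaled by 1/sqrt(k); only about
# a 1/s fraction of entries is nonzero, so R @ x is cheap.
probs = [1 / (2 * s), 1 - 1 / s, 1 / (2 * s)]
R = rng.choice([np.sqrt(s), 0.0, -np.sqrt(s)],
               size=(D, k), p=probs) / np.sqrt(k)

print(np.mean(R != 0))   # roughly 1/sqrt(1000), i.e. about 0.03
```

Because most entries are exactly zero, the projection reduces to a small number of additions and subtractions per output coordinate, which is the source of the speedup over dense Gaussian projections.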

Randomized Algorithms for Matrices and Data

- Computer Science
- Found. Trends Mach. Learn.
- 2011

This monograph will provide a detailed overview of recent work on the theory of randomized matrix algorithms as well as the application of those ideas to the solution of practical problems in large-scale data analysis.

Large-scale malware classification using random projections and neural networks

- Computer Science
- 2013 IEEE International Conference on Acoustics, Speech and Signal Processing
- 2013

This work uses random projections to further reduce the dimensionality of the original input space and trains several very large-scale neural network systems on over 2.6 million labeled samples, achieving a two-class error rate of 0.49% for a single neural network and 0.42% for an ensemble of neural networks.

An Algorithm for the Principal Component Analysis of Large Data Sets

- Computer Science, Mathematics
- SIAM J. Sci. Comput.
- 2011

This work adapts one of these randomized methods for principal component analysis (PCA) for use with data sets that are too large to be stored in random-access memory (RAM), and reports on the performance of the algorithm.
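What makes the randomized approach suitable for out-of-core data is that its main cost is a sketch Y = X Ω that can be accumulated one row block at a time, so the full matrix X never needs to be in RAM. A minimal sketch of that blocked pattern, with a generator of random blocks standing in for reads from disk:

```python
import numpy as np

rng = np.random.default_rng(3)
D, k = 300, 10
Omega = rng.normal(size=(D, k + 5))   # oversampled random test matrix

# Accumulate the sketch Y = X @ Omega block by block; each block would
# normally be streamed from disk rather than generated on the fly.
def row_blocks():
    for _ in range(20):
        yield rng.normal(size=(50, D))   # stand-in for a chunk of X

Y = np.vstack([block @ Omega for block in row_blocks()])
Q, _ = np.linalg.qr(Y)   # orthonormal basis for the sampled range of X
print(Q.shape)           # (1000, 15)
```

A second streamed pass over the same blocks then forms Qᵀ X, whose small SVD yields the approximate principal components; at no point is more than one block of X held in memory.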

Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions

- Mathematics, Computer Science
- SIAM Rev.
- 2011

This work surveys and extends recent research which demonstrates that randomization offers a powerful tool for performing low-rank matrix approximation, and presents a modular framework for constructing randomized algorithms that compute partial matrix decompositions.

Random Projection for High Dimensional Data Clustering: A Cluster Ensemble Approach

- Mathematics, Computer Science
- ICML
- 2003

Empirical results show that the proposed approach achieves better and more robust clustering performance compared to not only single runs of random projection/clustering but also clustering with PCA, a traditional data reduction method for high dimensional data.

An algorithmic theory of learning: Robust concepts and random projection

- Computer Science
- Machine Learning
- 2006

This work provides a novel algorithmic analysis via a model of robust concept learning (closely related to "margin classifiers"), and shows that a relatively small number of examples are sufficient to learn rich concept classes.

Alternating Maximization: Unifying Framework for 8 Sparse PCA Formulations and Efficient Parallel Codes

- Computer Science, Mathematics
- ArXiv
- 2012

This paper considers 8 different optimization formulations for computing a single sparse loading vector and shows that the AM method is nontrivially equivalent to GPower (Journée et al., JMLR 11:517–553, 2010) for all of the authors' formulations.