\begin{center}{\kaishu \zihao{2}{Dimensionality Reduction Methods in Machine Learning}}\end{center}

\begin{center}{\kaishu\zihao{4} Abstract}


\addcontentsline{toc}{chapter}{Abstract} \kaishu \ \




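SIR has inspired a great deal of follow-up work since it was proposed; this paper studies one such extension, the contour projection (CP) method. CP targets the case where the elliptical distribution assumption does not hold: by adding a contour projection step, it avoids the problems associated with the ellipsoid axes and thereby generalizes SIR.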

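Finally, each of the above algorithms is studied on an example. First, SIR is applied to dimensionality reduction for high-dimensional response variables (MP); the data are generated by computer simulation, and comparing the theoretical and empirical results shows that the reduction performs well. Second, network intrusions are identified using the high-dimensional data set released by KDD Cup in 1999. The training data for each intrusion type are high-dimensional, so important feature subsets are extracted from them to reduce the dimensionality, thereby improving the efficiency of intrusion identification.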

\vskip 1cm \noindent{\kaishu Keywords:\ dimensionality reduction methods,\ principal component analysis,\ high-dimensional data,\ sliced inverse regression }



\begin{center}{\rm Dimensionality Reduction Methods in Machine Learning}\end{center}

\begin{center}{\rm\zihao{4} Abstract}




We are in the era of big data, which is characterized by numerous data sets, large data volumes, and high dimensionality. Dimensionality reduction is a critical step in data analysis and a classic problem in multivariate statistical analysis. The main purpose of this paper is to summarize some important dimensionality reduction methods.

Principal component analysis (PCA) is a classical dimensionality reduction method. Its new features are combinations of the original features, and its generalizations include principal curves, principal surfaces, and kernel principal component analysis (KPCA). The classical methods often have limitations, and we try to improve them in two ways: (1) building the preservation of cluster structure into the algorithm, which gives a PCA based on locality preserving projection; (2) reducing the dimensionality of the data automatically during clustering.

Dimensionality reduction methods are also often used in regression analysis, where sliced inverse regression (SIR) is an important method.

SIR has inspired a great deal of follow-up work since it was proposed. This paper studies one such extension, the contour projection (CP) method. CP targets the case where the elliptical distribution assumption does not hold: by adding a contour projection step, it avoids the problems associated with the ellipsoid axes and thereby generalizes SIR.

Finally, each of these algorithms is applied to an example. First, SIR is applied to dimensionality reduction for high-dimensional response variables (MP); the data are simulated by computer, and a comparison of the theoretical and empirical results shows that the reduction performs well. Second, network intrusions are identified using the high-dimensional data set released by KDD Cup in 1999. The training data for each intrusion type are high-dimensional, so important feature subsets are extracted to reduce the dimensionality, thereby improving the efficiency of intrusion identification.

\vskip 0.8cm \noindent{\rm Key Words:\ Dimensionality reduction method,\ PCA,\ high dimensional data, \ SIR }







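In data mining we mostly deal with data that can be expressed as a matrix, also known as structured data. Each collected sample is represented by a row of the matrix and each feature (variable) by a column. Every feature forms one dimension of the space, so a data object with $K$ features is a sample point in $K$-dimensional space and can equally be viewed as a $K$-dimensional vector in that feature space.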
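
To make the row and column convention concrete, the data set can be written out as a matrix. The symbols $n$ (the number of samples), $X$, and $x_{ij}$ are notation assumed here purely for illustration; only $K$ comes from the text above.
$$
X=\left(\begin{array}{cccc}
x_{11} & x_{12} & \cdots & x_{1K}\\
x_{21} & x_{22} & \cdots & x_{2K}\\
\vdots & \vdots & \ddots & \vdots\\
x_{n1} & x_{n2} & \cdots & x_{nK}
\end{array}\right)
$$
Under this notation, row $i$, namely $x_i=(x_{i1},x_{i2},\ldots,x_{iK})$, is the $i$-th sample viewed as a $K$-dimensional vector, and column $j$ collects the values of the $j$-th feature across all $n$ samples.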


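(1) For example, suppose we have a 50-dimensional feature space and partition it by splitting every dimension at its midpoint. This takes 50 splits and yields $2^{50}$ unit cells of identical shape. Suppose there are $10^6$ sample points in total, which is already a large sample. Assuming the samples are uniformly distributed, the probability that a sample falls into a given unit cell can be computed as: $$
\frac{10^6}{2^{50}} < 10^{-6},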





