# Spatial statistical classification analysis

Multivariate statistical analysis is mainly used for data classification and comprehensive evaluation, and classification methods are an important part of GIS. Generally speaking, the data stored in a Geographic Information System (GIS) is raw data that users extract and analyze according to their practical purposes. This matters especially for observation and sampling data, where different classification and interpolation methods can yield very different results. Therefore, in most cases a large amount of unclassified data is first entered into the system database, and the user then defines a specific classification algorithm to obtain the required information.

A comprehensive evaluation model is the basis of regionalization and planning. From the perspective of human cognition, such models are of two types: precise and fuzzy. Because most geographical phenomena are difficult to classify and express with precise quantitative relations, fuzzy models are more practical and their results are often closer to reality. Comprehensive evaluation generally goes through four steps:

1. Selecting and simplifying the evaluation factors;
2. Determining the importance index (weight) of each factor;
3. Determining the membership degree of each category of a factor with respect to the evaluation target;
4. Choosing a method to synthesize the multiple factors.

Classification and evaluation problems usually involve a large number of interrelated geographical factors. Principal component analysis can statistically compress the information carried by the influencing factors into a few synthetic factors, which greatly simplifies the model.

Determining the factor weights is an important step in building the evaluation model, because the correctness of the weights strongly affects the correctness of the model. Factor weights are usually determined by fairly subjective judgment; the analytic hierarchy process is a simple and effective mathematical means of integrating people's opinions and determining the weight of each influencing factor in a principled way.

The membership degree reflects the different influence that each category of a factor has on the evaluation target. It is determined from the variation of the factor and is usually calculated with a piecewise linear function or another higher-order function.

Commonly used classification and synthesis methods fall into two categories: cluster analysis and discriminant analysis. Cluster analysis divides the evaluation area into several categories according to the similarity of the influencing factors between geographical entities, using distance measures related to the weights and membership degrees. Discriminant analysis is similar to the classification methods used in remote sensing image processing: according to the weight and membership degree of each factor, each geographic entity is assigned, under given evaluation criteria, to its most likely evaluation level or to a position in a rank sequence indicated by a data value.

Classification and grading is the last step of evaluation. The clustering results are combined according to the actual situation, and the evaluation grade of each category is determined. For the result sequence of a discriminant analysis, an equal-interval or unequal-interval criterion is used to divide the final evaluation grades.
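The membership, weighting, and grading steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the slope breakpoints, the soil membership value, and the weights are all hypothetical, and the synthesis rule used here (a weighted average followed by equal-interval grading) is one simple choice among several.

```python
def piecewise_linear_membership(value, breakpoints):
    """Interpolate a membership degree from sorted (value, membership) breakpoints."""
    if value <= breakpoints[0][0]:
        return breakpoints[0][1]
    if value >= breakpoints[-1][0]:
        return breakpoints[-1][1]
    for (x0, y0), (x1, y1) in zip(breakpoints, breakpoints[1:]):
        if value <= x1:
            return y0 + (y1 - y0) * (value - x0) / (x1 - x0)


def weighted_score(memberships, weights):
    """Synthesize several factors as a weighted average of their membership degrees."""
    return sum(m * w for m, w in zip(memberships, weights)) / sum(weights)


def equal_interval_grade(score, n_grades):
    """Map a score in [0, 1] to a grade 1..n_grades using equal intervals."""
    return min(int(score * n_grades), n_grades - 1) + 1


# Hypothetical example: suitability decreases as slope (in degrees) increases.
slope_breakpoints = [(0.0, 1.0), (15.0, 0.5), (25.0, 0.0)]
slope_membership = piecewise_linear_membership(7.5, slope_breakpoints)
soil_membership = 0.4  # assumed to come from a similar function for soil quality

score = weighted_score([slope_membership, soil_membership], [0.6, 0.4])
grade = equal_interval_grade(score, 5)
print(slope_membership, score, grade)
```

In this sketch a higher grade means a higher suitability score; an unequal-interval criterion would replace `equal_interval_grade` with a lookup against hand-chosen class boundaries.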

The following is a brief introduction to several mathematical methods commonly used in classification and evaluation.

## Principal component analysis (PCA)

Geographical problems often involve a large number of interrelated natural and social factors. Having many factors makes the model harder to build and the computation more complex. To make the results easier for users to understand, and to cope with limited storage capacity, it is necessary to reduce the amount of data while retaining the most essential information. Since many geographic variables are correlated with one another, these correlations can be processed mathematically to simplify the data. Principal component analysis (PCA) is a statistical method that finds a meaningful expression of the linear relationships among the factors. It compresses the information of many factors into a few representative synthetic variables, which removes the redundancy and correlation among the selected variables; the few synthetic variables that carry the most information are then used for cluster analysis and for building application models.

Suppose there are $n$ samples and $p$ variables. The raw data are transformed into a new set of feature variables, the principal components, which are linear combinations of the original variables and are mutually orthogonal. That is, $x_1, x_2, \dots, x_p$ are combined into $m$ ($m < p$) indicators $z_1, z_2, \dots, z_m$:

$$
\begin{aligned}
z_1 &= l_{11} x_1 + l_{12} x_2 + \dots + l_{1p} x_p \\
z_2 &= l_{21} x_1 + l_{22} x_2 + \dots + l_{2p} x_p \\
&\;\;\vdots \\
z_m &= l_{m1} x_1 + l_{m2} x_2 + \dots + l_{mp} x_p
\end{aligned}
$$

The synthetic indicators $z_1, z_2, \dots, z_m$ obtained in this way are called the first, second, …, $m$-th principal components. $z_1$ accounts for the largest share of the total variance, and the variances of the remaining components $z_2, z_3, \dots, z_m$ decrease in turn. In practice, only the first few principal components, which account for most of the variance, are retained. This reduces the number of indicators, captures the main aspects of the problem, and simplifies the relationships among the indicators.

From a geometric point of view, determining the principal components amounts to finding the principal axes of an ellipsoid in $p$-dimensional space, that is, finding the eigenvectors corresponding to the $m$ largest eigenvalues of the correlation matrix of $x_1, x_2, \dots, x_p$. The eigenvalues and eigenvectors are usually computed with the Jacobi method.
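The eigenvalue problem just described can be written compactly as follows. The symbols $R$, $\lambda_i$, and $l_i$ are notation introduced here for illustration: $R$ is the $p \times p$ correlation matrix, and $l_i$ is the unit-length coefficient vector of the $i$-th principal component.

$$
R\,l_i = \lambda_i\,l_i, \qquad l_i = (l_{i1}, l_{i2}, \dots, l_{ip})^{\mathsf T}, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0
$$

Because the trace of a correlation matrix equals $p$, the share of the total variance carried by $z_i$ is

$$
\frac{\lambda_i}{\sum_{k=1}^{p} \lambda_k} = \frac{\lambda_i}{p},
$$

and $m$ is typically chosen so that the cumulative share of the first $m$ components is judged large enough for the application.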

Principal component analysis is thus a powerful tool for reducing data to a manageable level and for transforming complex data into simple categories that are easy to store and manage.
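As a rough illustration of the procedure, the sketch below computes the correlation matrix, diagonalizes it with cyclic Jacobi rotations, and returns the leading components. It uses only the Python standard library; the function names and the sample data are hypothetical, and in practice a numerical library would normally be used instead of a hand-written eigenvalue routine.

```python
import math


def correlation_matrix(rows):
    """Correlation matrix of the columns of rows (n samples x p variables)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
            for j in range(p)]
    # Standardize each variable so the cross-products below are correlations.
    z = [[(r[j] - means[j]) / stds[j] for j in range(p)] for r in rows]
    return [[sum(zr[a] * zr[b] for zr in z) / (n - 1) for b in range(p)]
            for a in range(p)]


def jacobi_eigen(matrix, tol=1e-20, max_sweeps=50):
    """Eigenvalues and eigenvectors (as columns) of a symmetric matrix."""
    p = len(matrix)
    a = [row[:] for row in matrix]
    v = [[float(i == j) for j in range(p)] for i in range(p)]
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(p) for j in range(p) if i != j)
        if off < tol:
            break
        for i in range(p - 1):
            for j in range(i + 1, p):
                if a[i][j] == 0.0:
                    continue
                # Rotation angle that zeroes the off-diagonal entry a[i][j].
                if a[i][i] == a[j][j]:
                    theta = math.pi / 4
                else:
                    theta = 0.5 * math.atan(2 * a[i][j] / (a[j][j] - a[i][i]))
                c, s = math.cos(theta), math.sin(theta)
                for k in range(p):  # a <- a J (columns i and j)
                    aki, akj = a[k][i], a[k][j]
                    a[k][i], a[k][j] = c * aki - s * akj, s * aki + c * akj
                for k in range(p):  # a <- J^T a (rows i and j)
                    aik, ajk = a[i][k], a[j][k]
                    a[i][k], a[j][k] = c * aik - s * ajk, s * aik + c * ajk
                for k in range(p):  # accumulate eigenvectors: v <- v J
                    vki, vkj = v[k][i], v[k][j]
                    v[k][i], v[k][j] = c * vki - s * vkj, s * vki + c * vkj
    return [a[i][i] for i in range(p)], v


def pca(rows, m):
    """Return the m largest eigenvalues, their coefficient vectors, and variance shares."""
    values, vectors = jacobi_eigen(correlation_matrix(rows))
    p = len(values)
    order = sorted(range(p), key=lambda k: values[k], reverse=True)[:m]
    eigenvalues = [values[k] for k in order]
    coefficients = [[vectors[i][k] for i in range(p)] for k in order]
    shares = [ev / p for ev in eigenvalues]  # trace of a correlation matrix is p
    return eigenvalues, coefficients, shares


# Hypothetical data: 5 samples, 3 variables (the first two are strongly correlated).
samples = [[2.0, 4.1, 1.0], [3.0, 6.2, 0.5], [4.0, 7.9, 1.5],
           [5.0, 10.1, 0.7], [6.0, 12.0, 1.2]]
eigenvalues, coefficients, shares = pca(samples, 2)
print(eigenvalues, shares)
```

Each row of `coefficients` holds the values $l_{i1}, \dots, l_{ip}$ for one principal component, to be applied to the standardized variables.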

## Analytic hierarchy process

The Analytic Hierarchy Process (AHP) is one of the mathematical tools of systems analysis. It structures the human thinking process into levels and quantifies it, providing a quantitative basis for analysis, decision-making, prediction, or control. In effect, it combines qualitative and quantitative analysis. When a model involves a large number of interrelated and mutually constraining factors, each factor has a different importance in the analysis of the problem, and determining the order of their importance with respect to the target is essential to building the model.

The AHP method divides the interrelated factors into several levels according to their subordinate relations. Experienced experts are invited to give quantitative judgments of the relative importance of the factors at each level. Mathematical methods are then used to synthesize the expert opinions into a weight for each factor at each level, and these weights serve as the basis of the comprehensive analysis.
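The text does not say which mathematical method is used to turn expert judgments into weights. One widely used formulation, assumed here, is Saaty's: experts fill in a pairwise comparison matrix, the weights are taken from its principal eigenvector, and a consistency ratio indicates whether the judgments contradict each other. The sketch below follows that assumption for a single level; the function name and the judgment matrix are hypothetical.

```python
# Commonly cited random consistency index values for matrices of order 1 to 9.
RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12,
                6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}


def ahp_weights(matrix, iterations=100):
    """Weights from a pairwise comparison matrix via power iteration.

    Returns (weights, lambda_max, consistency_ratio).
    """
    n = len(matrix)
    w = [1.0 / n] * n
    for _ in range(iterations):
        aw = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
        total = sum(aw)
        w = [x / total for x in aw]  # keep the weights summing to 1
    aw = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(aw[i] / w[i] for i in range(n)) / n
    ci = (lambda_max - n) / (n - 1) if n > 1 else 0.0
    ri = RANDOM_INDEX[n]
    cr = ci / ri if ri > 0 else 0.0
    return w, lambda_max, cr


# Hypothetical judgments for three factors: entry [i][j] states how much more
# important factor i is than factor j; the matrix is reciprocal by construction.
judgments = [[1.0, 3.0, 5.0],
             [1 / 3, 1.0, 2.0],
             [1 / 5, 1 / 2, 1.0]]
weights, lambda_max, consistency_ratio = ahp_weights(judgments)
print(weights, lambda_max, consistency_ratio)
```

Under this formulation, a consistency ratio below about 0.1 is conventionally taken to mean the judgments are acceptably consistent; otherwise the experts are asked to revise the matrix.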

## Systematic clustering analysis

Systematic clustering is a method for classifying geographic entities according to multiple geographic factors. Classifications based on different factors often reflect hierarchical sequences for different objectives, such as land classification and grading or soil erosion intensity grading.

Systematic clustering generally proceeds by successively merging classes according to the degree of similarity between entities, where similarity is defined by a distance or a similarity coefficient. The merging criterion is to maximize the differences between classes and minimize the differences within classes.
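The merging procedure can be sketched as a small agglomerative (hierarchical) clustering routine. The example below is a minimal illustration, assuming Euclidean distance between entities and average linkage between classes; other distance measures or linkage rules fit the same loop. The function names and the sample points are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerative(points, n_clusters):
    """Merge the two closest classes repeatedly until n_clusters remain.

    Returns a list of clusters, each a list of indices into points.
    """
    clusters = [[i] for i in range(len(points))]

    def average_linkage(c1, c2):
        # Mean pairwise distance between the members of two classes.
        total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        a, b = min(pairs,
                   key=lambda ij: average_linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return clusters


# Hypothetical entities described by two standardized factors.
entities = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (50.0, 50.0)]
print(agglomerative(entities, 3))
```

This brute-force version recomputes all class distances at every step, which is adequate for a small illustration but not for large data sets.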

## Discriminant analysis

Discriminant analysis and cluster analysis both address classification problems. The difference is that discriminant analysis first fixes, on the basis of theory and practice, the factor criteria that define a sequence of grades, and then assigns each geographical entity under analysis to its proper position in that sequence. It is better suited to grading problems whose classification system has an established theoretical basis, such as soil erosion evaluation and land suitability evaluation.

According to the number of classes and the method used, discriminant analysis can be divided into two-class discrimination, multi-class discrimination, and stepwise discrimination.

In two-class discriminant analysis, the known geographical feature values are usually combined linearly to form a linear discriminant function $Y$:

$$
Y = c_1 x_1 + c_2 x_2 + \dots + c_m x_m
$$

In the formula, $c_k$ ($k = 1, 2, \dots, m$) are the discriminant coefficients, which reflect the direction of influence, the discriminating power, and the contribution of each factor or feature value. Once the $c_k$ are determined, the discriminant function $Y$ is determined as well. The value of the discriminant function is then calculated for each sample, and the sample is assigned to the corresponding category. Commonly used discriminant methods include the distance discriminant method, the Bayes minimum-risk discriminant, and the Fisher criterion discriminant.
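As one possible illustration of the two-class case, the sketch below estimates the coefficients with the Fisher criterion: the coefficient vector is obtained by solving the pooled within-class scatter system against the difference of the class means. The threshold is placed at the midpoint of the two projected class means, which is a simplifying assumption, and the function names and training samples are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mean_vector(rows):
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(len(rows[0]))]


def scatter_matrix(rows, mu):
    """Sum of outer products of deviations from the class mean."""
    m = len(mu)
    return [[sum((r[a] - mu[a]) * (r[b] - mu[b]) for r in rows)
             for b in range(m)] for a in range(m)]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= f * aug[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (aug[i][n] - sum(aug[i][k] * x[k] for k in range(i + 1, n))) / aug[i][i]
    return x


def fisher_discriminant(class_a, class_b):
    """Return (coefficients c, threshold) of the function Y = c . x."""
    mu_a, mu_b = mean_vector(class_a), mean_vector(class_b)
    sa, sb = scatter_matrix(class_a, mu_a), scatter_matrix(class_b, mu_b)
    m = len(mu_a)
    sw = [[sa[i][j] + sb[i][j] for j in range(m)] for i in range(m)]
    c = solve(sw, [x - y for x, y in zip(mu_a, mu_b)])
    # Assumed decision rule: midpoint between the projected class means.
    threshold = (dot(c, mu_a) + dot(c, mu_b)) / 2
    return c, threshold


def classify(x, c, threshold):
    """Class A projects above the threshold by construction of c."""
    return "A" if dot(c, x) > threshold else "B"


# Hypothetical training samples with two feature values each.
class_a = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [2.0, 1.0]]
class_b = [[6.0, 5.0], [7.0, 8.0], [8.0, 6.0], [7.0, 7.0]]
c, threshold = fisher_discriminant(class_a, class_b)
print(classify([2.5, 2.0], c, threshold), classify([7.5, 6.0], c, threshold))
```

With unequal class sizes or prior probabilities, the threshold would usually be shifted rather than left at the midpoint.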