# Questionnaire and LGBM model to assess health literacy levels of Mongolians in China | BMC Public Health

HL is an essential factor affecting health [26]. People with low HL have poor self-management skills [27]. Poor HL can also lead to high health care costs. This paper aims to introduce a dedicated HL assessment questionnaire and an LGBM model for Mongolians in China, in order to improve the accuracy of assessing their HL level and to find the factors influencing HL through a quantitative analysis of each question. This can provide a new idea for the HL evaluation of other ethnic minorities in China or of ethnic minorities in other countries.

Four dimensions are considered in the design of the HL questionnaire: health concepts and knowledge literacy, healthy lifestyle and behavior, health skills, and health status and disease history. This differs from the existing three-dimensional methods [28,29,30] and the five-dimensional approach [31] in China, because the health status and disease history of the respondents are not considered in [28,29,30,31]. The HL questionnaire, with 68 questions, is designed both to improve on the HLS-EU-Q47 and to analyze the characteristics of Mongolians in China. To verify the presented HL assessment method on a cross-sectional data set, 742 Mongolians in Inner Mongolia, China, were invited to answer the HL questionnaire.

Based on the HL questionnaires completed by the 742 Mongolians, the reliability and validity of the designed HL questionnaire are tested using Cronbach’s $$\alpha$$ coefficient, the mutual information score (MIS), the KMO measure, and the Bartlett spherical test chi-square value (BSTCV). The results show that the designed HL questionnaire has high reliability and validity: using our Python programs, we obtain Cronbach’s $$\alpha =0.807$$, MIS = 0.803, KMO = 0.765, and BSTCV = 2486 ($$p<0.001$$). The MIS method is better than Pearson’s correlation coefficient method [32] because the latter can only handle linear correlations, whereas the former can handle both linear and non-linear correlations.
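The two ideas in this paragraph can be illustrated with a minimal, standard-library-only sketch. It is not the authors' program: the function names (`cronbach_alpha`, `mutual_information`, `pearson`) and the toy data are assumptions made for illustration, and the MIS reported above may be a normalized variant of the raw mutual information computed here.

```python
import math
from collections import Counter
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha; `rows` holds one list of item scores per respondent."""
    k = len(rows[0])  # number of questionnaire items
    item_vars = [variance([r[i] for r in rows]) for i in range(k)]
    total_var = variance([sum(r) for r in rows])  # variance of total scores
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete answer sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def pearson(xs, ys):
    """Pearson's correlation coefficient, which captures linear dependence only."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Toy data (hypothetical): y depends on x, but not linearly.
xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]
linear_score = pearson(xs, ys)                # zero: no linear relationship
nonlinear_score = mutual_information(xs, ys)  # positive: dependence detected

# Toy questionnaire (hypothetical): three items that always agree.
alpha = cronbach_alpha([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
```

On the toy data, Pearson's coefficient is zero while the mutual information is positive, which is the kind of non-linear dependence the paragraph refers to.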

A data set with 742 samples is constructed, where each sample has 68 features and 1 target. The 68 features correspond to the 68 questions in the HL questionnaire, and the target corresponds to the HL score that each respondent obtained when answering the questionnaire. Based on this data set, XGB and LGBM regression models are built to predict HL. 80% of the samples are used as training samples and the rest as test samples: both models are trained on 594 (80%) samples and then tested on the remaining 148 (20%) samples. The $$R^2$$ index is chosen as the evaluation accuracy index; a larger $$R^2$$ means higher evaluation accuracy. The results show that the $$R^2$$ index and the absolute error when using the LGBM regression model are 0.98347 and 11, respectively, which are better than those obtained when applying XGB. It can be seen that the LGBM-based HL assessment model can achieve assessment results with high accuracy.
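The split sizes and the evaluation indices described above can be reproduced with a short sketch. This is an illustration under stated assumptions, not the authors' code: the helper names are hypothetical, the shuffling seed is arbitrary, and because the text does not say which absolute error is reported, the mean absolute error is assumed here.

```python
import random


def train_test_split(samples, train_fraction=0.8, seed=0):
    """Shuffle a copy of `samples` and split it into training and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# 742 placeholder sample indices stand in for the questionnaire data.
train, test = train_test_split(range(742))
split_sizes = (len(train), len(test))  # 594 training and 148 test samples
```

An $$R^2$$ of 1 corresponds to perfect predictions, and a model that always predicts the mean score has an $$R^2$$ of 0, which is why a larger value indicates a more accurate assessment.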

Also, existing correlation analysis methods such as the covariance method, Pearson’s correlation coefficient, and the MIS approach can only give qualitative results when analyzing the correlation between questionnaire questions. This does not meet the growing demand for highly accurate HL assessments. Therefore, we quantitatively analyzed the influence of each questionnaire question on the HL assessment results using the feature-importance function of the LGBM-based HL assessment model. The quantitative results for all questions are shown in Fig. 6. It can be seen that the largest impact index is 1105 and the smallest is 23. Age has the greatest influence on the HL level. This shows that there is a strong correlation between age and HL level, which is consistent with other studies [28,29,30,31, 33]. For example, the Japanese HL survey [33] concluded that the HL level of the Japanese increased with age, whereas HL surveys in European countries and Turkey showed that older people tended to have lower HL [33]. The impact index of the salary level of the respondents ($$Column_{-}27$$) is 286, which ranks second but is much lower than that of age. This result is consistent with the conclusions of [28,29,30]. The impact index of the respondents’ ability to judge health-related information in the media ($$Column_{-}36$$) is 270, which ranks third. The impact indices of the probability of medical assistance ($$Column_{-}25$$), knowing about vaccinations and check-ups ($$Column_{-}43$$), and obtaining information on healthy eating ($$Column_{-}53$$) rank fourth, fifth, and sixth, at 256, 254, and 253, respectively. These analyses are not found in existing results.

The impact index of gender ($$Column_{-}1$$) on the HL level is 69. The scores of the respondents show that the HL of men is higher than that of women, which is consistent with [29, 34], but the quantification of influencing factors was not investigated in [29, 34]. The impact indices of Territory ($$Column_{-}2$$), Educational background ($$Column_{-}20$$), and Profession ($$Column_{-}21$$) are 96, 69, and 71, respectively. The respondents’ scores also show that the HL levels of respondents living in cities are higher than those of residents in towns, and that there is a positive linear correlation between the HL level and the educational background of the respondents. These results for Territory and Educational background are consistent with those of [29]. The fourth dimension (health status and disease history) of the HL questionnaire is reflected in $$Column_{-}3,4,5,7,8,9,10$$ in Fig. 6, where the impact index of health status ($$Column_{-}3$$) is the largest, at 168. However, these factors are not considered in [28,29,30]. The impact indices of the other questions are not discussed individually; they can be found in Fig. 6. It is worth mentioning that the question that least influences the final result of the HL evaluation is the type of insurance ($$Column_{-}6$$), with an impact index of 23. However, this factor is not investigated in other works.
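To make the ordering of the impact indices discussed above easier to follow, the sketch below simply sorts the values quoted in the text. The dictionary labels are descriptive names chosen for this illustration, and the column behind "age" is not named in the text. Only the twelve indices quoted here are included, so positions after sixth are relative to this subset rather than to all 68 questions in Fig. 6.

```python
# Impact indices as quoted in the text; Fig. 6 itself is not reproduced here.
# Labels are illustrative, not identifiers from the source.
reported = {
    "age": 1105,
    "salary level (Column_27)": 286,
    "judging health information in the media (Column_36)": 270,
    "probability of medical assistance (Column_25)": 256,
    "vaccinations and check-ups (Column_43)": 254,
    "healthy eating information (Column_53)": 253,
    "health status (Column_3)": 168,
    "territory (Column_2)": 96,
    "profession (Column_21)": 71,
    "educational background (Column_20)": 69,
    "gender (Column_1)": 69,
    "type of insurance (Column_6)": 23,
}

# Sort from most to least influential among the quoted questions.
ranked = sorted(reported.items(), key=lambda item: item[1], reverse=True)
for rank, (label, index) in enumerate(ranked, start=1):
    print(f"{rank:2d}. {label}: {index}")

# How far age stands out from the runner-up (salary level).
age_to_salary = reported["age"] / reported["salary level (Column_27)"]
```

The ratio between the first and second entries is close to four, which matches the remark that the salary index is much lower than that of age.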

From Fig. 6 and the discussion above, it can be seen that the designed questionnaire is reasonable, because every feature contributes to the assessment of health literacy. It is worth mentioning that the LGBM model of HL assessment and the quantitative analysis method for each question are also suitable for assessing the HL of other populations.