Investigating Differential Item Functioning and Test Performance: A Comparison of Logistic Regression, the Rasch Model, and Mantel-Haenszel

Article type: Scientific-Research (Regular)

Authors

1 Assistant Professor, Faculty of Foreign Languages and Literatures, University of Tehran

2 University of Tehran

Abstract

Differential item functioning (DIF) is one of the tools for examining test performance. This method can identify the factors affecting test takers' performance and prevent bias from arising in a test. Over the past two decades, many methods have been proposed for detecting DIF. The multiplicity of DIF detection methods sometimes confuses researchers; it also makes it difficult to compare the findings of studies that have investigated DIF with different methods. The present study examined and compared the results obtained from three DIF detection methods: the Rasch model, logistic regression, and Mantel-Haenszel (MH). The data used in the analyses come from the University of Tehran English Proficiency Test (UTEPT), a high-stakes test administered annually to PhD applicants. Analysis of uniform DIF with the three methods showed that the items do not differ greatly in their functioning. Logistic regression flagged two items for DIF, matching the Mantel-Haenszel results. Likewise, the items identified as strong DIF indicators in the Rasch model were the same items flagged by the other two methods. The results of the present study show that using different methods to investigate DIF does not necessarily yield different results, and any of the methods used in this study can be employed.
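As a rough illustration of the logistic-regression procedure compared in this study, the following is a minimal sketch, not the study's actual analysis: for each item, a model predicting item correctness from the matching variable (total score) is compared, via a likelihood-ratio test, against a model that also includes group membership. The function names and the toy data are hypothetical.

```python
import numpy as np

def fit_logistic(X, y, iters=50):
    """Fit a logistic regression by Newton-Raphson; return (coefficients, log-likelihood)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - p)                          # gradient of the log-likelihood
        hess = X.T @ (X * (p * (1.0 - p))[:, None])   # (negative) Hessian
        beta = beta + np.linalg.solve(hess, grad)
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return beta, float(np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def lr_uniform_dif(total_score, group, correct):
    """Likelihood-ratio chi-square (1 df) for uniform DIF on a single item.

    Compares logit P(correct) = b0 + b1*score against the model that adds
    a group term; a large statistic (e.g. > 3.84 at alpha = .05) flags the
    item for uniform DIF.
    """
    ones = np.ones(len(correct))
    _, ll_base = fit_logistic(np.column_stack([ones, total_score]), correct)
    _, ll_group = fit_logistic(np.column_stack([ones, total_score, group]), correct)
    return 2.0 * (ll_group - ll_base)
```

Non-uniform DIF would be tested the same way by adding a score-by-group interaction term.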

Keywords


Title [English]

Differential Item Functioning and Test Performance: A Comparison of the Rasch Model, Logistic Regression, and Mantel-Haenszel

Author [English]

  • Ali Khodi 2
2 University of Tehran
Abstract [English]

Differential item functioning (DIF) is considered one of the tools for examining test performance. This method can identify the factors affecting test takers' performance and prevent the occurrence of bias in a test. A plethora of methods for detecting DIF has been suggested over the last couple of decades. This multiplicity of methods sometimes confuses researchers and complicates the comparison of findings across methods. This study investigated the comparability of results from three widely used DIF detection techniques: the Rasch model, logistic regression, and Mantel-Haenszel (MH). The data come from an administration of the University of Tehran English Proficiency Test (UTEPT), a high-stakes test administered annually to PhD candidates. An analysis of DIF by the three techniques indicated that the items did not differ significantly in their performance. The Mantel-Haenszel procedure flagged two items as having DIF, matching the findings of the logistic regression model. Likewise, the items detected as strong-DIF items by the Rasch model were the same items detected by the two aforementioned models. Thus, logistic regression and the Rasch model can be counted among the best models for the assessment of DIF in language tests. Applying such methods in the validation process of tests promises to increase the quality of assessment and meet the need for fair and justifiable results.
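For comparison, the Mantel-Haenszel procedure named in the abstract can be sketched as follows. This is a minimal illustration, not the study's analysis, and the function name and toy tables are hypothetical: examinees are matched on total score, and for each score stratum a 2x2 group-by-correctness table is tallied.

```python
def mantel_haenszel_dif(strata):
    """Mantel-Haenszel DIF statistics for one item.

    `strata` is a list of 2x2 tables, one per matched ability level
    (e.g. total-score stratum), given as (A, B, C, D) where
      A = reference-group correct,  B = reference-group incorrect,
      C = focal-group correct,      D = focal-group incorrect.
    Returns (alpha_MH, chi2_MH): the common odds-ratio estimate and the
    continuity-corrected MH chi-square statistic (1 df; > 3.84 is
    significant at alpha = .05).
    """
    num = den = 0.0               # components of the common odds ratio
    a_sum = e_sum = v_sum = 0.0   # components of the chi-square
    for a, b, c, d in strata:
        t = a + b + c + d
        if t < 2:
            continue              # stratum too small to contribute
        num += a * d / t
        den += b * c / t
        n_ref, n_foc = a + b, c + d
        m1, m0 = a + c, b + d     # correct / incorrect margins
        a_sum += a
        e_sum += n_ref * m1 / t   # expected A under the no-DIF hypothesis
        v_sum += n_ref * n_foc * m1 * m0 / (t * t * (t - 1))
    alpha = num / den if den else float("inf")
    chi2 = (abs(a_sum - e_sum) - 0.5) ** 2 / v_sum if v_sum else 0.0
    return alpha, chi2
```

In ETS practice the odds ratio is usually re-expressed on the delta scale as ΔMH = -2.35 ln(alpha_MH), with larger |ΔMH| values (together with a significant chi-square) indicating stronger DIF.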

Keywords [English]

  • Differential Item Functioning
  • Fairness
  • Bias
  • Validity
  • Measurement