
15 Helpful Scoring Rubric Examples for All Grades and Subjects

In the end, they actually make grading easier.


When it comes to student assessment and evaluation, there are a lot of methods to consider. In some cases, testing is the best way to assess a student’s knowledge, and the answers are either right or wrong. But often, assessing a student’s performance is much less clear-cut. In these situations, a scoring rubric is often the way to go, especially if you’re using standards-based grading. Here’s what you need to know about this useful tool, along with lots of rubric examples to get you started.

What is a scoring rubric?

In the United States, a rubric is a guide that lays out the performance expectations for an assignment. It helps students understand what’s required of them, and guides teachers through the evaluation process. (Note that in other countries, the term “rubric” may instead refer to the set of instructions at the beginning of an exam. To avoid confusion, some people use the term “scoring rubric” instead.)

A rubric generally has three parts:

  • Performance criteria: These are the various aspects on which the assignment will be evaluated. They should align with the desired learning outcomes for the assignment.
  • Rating scale: This could be a number system (often 1 to 4) or words like “exceeds expectations, meets expectations, below expectations,” etc.
  • Indicators: These describe the qualities needed to earn a specific rating for each of the performance criteria. The level of detail may vary depending on the assignment and the purpose of the rubric itself.
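
To make these three parts concrete, here is a minimal sketch (in Python, with invented criteria, scale labels, and indicator wording) of how the pieces fit together as a simple data structure:

```python
# A minimal sketch of a scoring rubric as a data structure.
# The criteria, scale labels, and indicator wording are invented examples.

rating_scale = {4: "Exceeds expectations", 3: "Meets expectations",
                2: "Approaching expectations", 1: "Below expectations"}

rubric = {
    "Organization": {  # performance criterion
        4: "Clear introduction, body, and conclusion; ideas flow logically.",
        3: "Clear structure with minor lapses in flow.",
        2: "Some structure, but ideas are hard to follow in places.",
        1: "Little or no discernible structure.",
    },
    "Evidence": {
        4: "Every claim is supported by relevant, well-explained evidence.",
        3: "Most claims are supported by relevant evidence.",
        2: "Some claims are supported; evidence is thin or loosely connected.",
        1: "Claims are largely unsupported.",
    },
}

# Look up the indicator a student needs to hit for a given rating.
print(rating_scale[3], "-", rubric["Organization"][3])
```

Laying a rubric out this way also makes it easy to reuse the same rating scale across assignments.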

Rubrics take more time to develop up front, but they help ensure more consistent assessment, especially when the skills being assessed are more subjective. A well-developed rubric can actually save teachers a lot of time when it comes to grading. What’s more, sharing your scoring rubric with students in advance often helps improve performance. This way, students have a clear picture of what’s expected of them and what they need to do to achieve a specific grade or performance rating.

Learn more about why and how to use a rubric here.

Types of Rubrics

There are three basic rubric categories, each with its own purpose.

Holistic Rubric

A holistic scoring rubric laying out the criteria for a rating of 1 to 4 when creating an infographic

Source: Cambrian College

This type of rubric combines all the scoring criteria in a single scale. Holistic rubrics are quick to create and use, but they have drawbacks. If a student’s work spans different levels, it can be difficult to decide which score to assign. They also make it harder to provide feedback on specific aspects.

Traditional letter grades are a type of holistic rubric. So are the popular “hamburger rubric” and “cupcake rubric” examples. Learn more about holistic rubrics here.

Analytic Rubric

Layout of an analytic scoring rubric, describing the different sections like criteria, rating, and indicators

Source: University of Nebraska

Analytic rubrics are much more complex and generally take a great deal more time up front to design. They include specific details of the expected learning outcomes, and descriptions of what criteria are required to meet various performance ratings in each. Each rating is assigned a point value, and the total number of points earned determines the overall grade for the assignment.

Though they’re more time-intensive to create, analytic rubrics actually save time while grading. Teachers can simply circle or highlight any relevant phrases in each rating, and add a comment or two if needed. They also help ensure consistency in grading, and make it much easier for students to understand what’s expected of them.
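
To illustrate how the point values roll up into a grade, here is a short sketch; the ratings, the 1-4 scale, and the percentage cutoffs are all invented for the example:

```python
# Sketch: turning per-criterion ratings from an analytic rubric into a grade.
# The ratings, the 1-4 scale, and the grade cutoffs are invented for illustration.

ratings = {"Organization": 3, "Evidence": 4, "Grammar": 2}  # one rating per criterion
max_per_criterion = 4

earned = sum(ratings.values())
possible = max_per_criterion * len(ratings)
percent = 100 * earned / possible

def letter_grade(percent: float) -> str:
    """Map a percentage to a letter grade using invented cutoffs."""
    for floor, grade in [(90, "A"), (80, "B"), (70, "C"), (60, "D")]:
        if percent >= floor:
            return grade
    return "F"

print(f"{earned}/{possible} points -> {percent:.0f}% -> {letter_grade(percent)}")
```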

Learn more about analytic rubrics here.

Developmental Rubric

A developmental rubric for kindergarten skills, with illustrations to describe the indicators of criteria

Source: Deb’s Data Digest

A developmental rubric is a type of analytic rubric, but it’s used to assess progress along the way rather than determining a final score on an assignment. The details in these rubrics help students understand their achievements, as well as highlight the specific skills they still need to improve.

Developmental rubrics are essentially a subset of analytic rubrics. They leave off the point values, though, and focus instead on giving feedback using the criteria and indicators of performance.

Learn how to use developmental rubrics here.

Ready to create your own rubrics? Find general tips on designing rubrics here. Then, check out these examples across all grades and subjects to inspire you.

Elementary School Rubric Examples

These elementary school rubric examples come from real teachers who use them with their students. Adapt them to fit your needs and grade level.

Reading Fluency Rubric

A developmental rubric example for reading fluency

You can use this one as an analytic rubric by counting up points to earn a final score, or just to provide developmental feedback. There’s a second rubric page available specifically to assess prosody (reading with expression).

Learn more: Teacher Thrive

Reading Comprehension Rubric

Reading comprehension rubric, with criteria and indicators for different comprehension skills

The nice thing about this rubric is that you can use it at any grade level, for any text. If you like this style, you can get a reading fluency rubric here too.

Learn more: Pawprints Resource Center

Written Response Rubric

Two anchor charts showing written response rubrics

Rubrics aren’t just for huge projects. They can also help kids work on very specific skills, like this one for improving written responses on assessments.

Learn more: Dianna Radcliffe: Teaching Upper Elementary and More

Interactive Notebook Rubric

Interactive Notebook rubric example, with criteria and indicators for assessment

If you use interactive notebooks as a learning tool, this rubric can help kids stay on track and meet your expectations.

Learn more: Classroom Nook

Project Rubric

Rubric that can be used for assessing any elementary school project

Use this simple rubric as it is, or tweak it to include more specific indicators for the project you have in mind.

Learn more: Tales of a Title One Teacher

Behavior Rubric

Rubric for assessing student behavior in school and classroom

Developmental rubrics are perfect for assessing behavior and helping students identify opportunities for improvement. Send these home regularly to keep parents in the loop.

Learn more: Teachers.net Gazette

Middle School Rubric Examples

In middle school, use rubrics to offer detailed feedback on projects, presentations, and more. Be sure to share them with students in advance, and encourage them to use them as they work so they’ll know if they’re meeting expectations.

Argumentative Writing Rubric

An argumentative rubric example to use with middle school students

Argumentative writing is a part of language arts, social studies, science, and more. That makes this rubric especially useful.

Learn more: Dr. Caitlyn Tucker

Role-Play Rubric

A rubric example for assessing student role play in the classroom

Role-plays can be really useful when teaching social and critical thinking skills, but it’s hard to assess them. Try a rubric like this one to evaluate and provide useful feedback.

Learn more: A Question of Influence

Art Project Rubric

A rubric used to grade middle school art projects

Art is one of those subjects where grading can feel very subjective. Bring some objectivity to the process with a rubric like this.

Source: Art Ed Guru

Diorama Project Rubric

A rubric for grading middle school diorama projects

You can use diorama projects in almost any subject, and they’re a great chance to encourage creativity. Simplify the grading process and help kids know how to make their projects shine with this scoring rubric.

Learn more: Historyourstory.com

Oral Presentation Rubric

Rubric example for grading oral presentations given by middle school students

Rubrics are terrific for grading presentations, since you can include a variety of skills and other criteria. Consider letting students use a rubric like this to offer peer feedback too.

Learn more: Bright Hub Education

High School Rubric Examples

In high school, it’s important to include your grading rubrics when you give assignments like presentations, research projects, or essays. Kids who go on to college will definitely encounter rubrics, so helping them become familiar with them now will help in the future.

Presentation Rubric

Example of a rubric used to grade a high school project presentation

Analyze a student’s presentation both for content and communication skills with a rubric like this one. If needed, create a separate one for content knowledge with even more criteria and indicators.

Learn more: Michael A. Pena Jr.

Debate Rubric

A rubric for assessing a student's performance in a high school debate

Debate is a valuable learning tool that encourages critical thinking and oral communication skills. This rubric can help you assess those skills objectively.

Learn more: Education World

Project-Based Learning Rubric

A rubric for assessing high school project based learning assignments

Implementing project-based learning can be time-intensive, but the payoffs are worth it. Try this rubric to make student expectations clear and end-of-project assessment easier.

Learn more: Free Technology for Teachers

100-Point Essay Rubric

Rubric for scoring an essay with a final score out of 100 points

Need an easy way to convert a scoring rubric to a letter grade? This example for essay writing earns students a final score out of 100 points.

Learn more: Learn for Your Life

Drama Performance Rubric

A rubric teachers can use to evaluate a student's participation and performance in a theater production

If you’re unsure how to grade a student’s participation and performance in drama class, consider this example. It offers lots of objective criteria and indicators to evaluate.

Learn more: Chase March

How do you use rubrics in your classroom? Come share your thoughts and exchange ideas in the WeAreTeachers HELPLINE group on Facebook.

Plus, 25 of the best alternative assessment ideas.

Scoring rubrics help establish expectations and ensure assessment consistency. Use these rubric examples to help you design your own.


Rubric Best Practices, Examples, and Templates

A rubric is a scoring tool that identifies the different criteria relevant to an assignment, assessment, or learning outcome and states the possible levels of achievement in a specific, clear, and objective way. Use rubrics to assess project-based student work including essays, group projects, creative endeavors, and oral presentations.

Rubrics can help instructors communicate expectations to students and assess student work fairly, consistently and efficiently. Rubrics can provide students with informative feedback on their strengths and weaknesses so that they can reflect on their performance and work on areas that need improvement.

How to Get Started


Step 1: Analyze the assignment

The first step in the rubric creation process is to analyze the assignment or assessment for which you are creating a rubric. To do this, consider the following questions:

  • What is the purpose of the assignment and your feedback? What do you want students to demonstrate through the completion of this assignment (i.e. what are the learning objectives measured by it)? Is it a summative assessment, or will students use the feedback to create an improved product?
  • Does the assignment break down into different or smaller tasks? Are these tasks as important as the main assignment?
  • What would an “excellent” assignment look like? An “acceptable” assignment? One that still needs major work?
  • How detailed do you want the feedback you give students to be? Do you want/need to give them a grade?

Step 2: Decide what kind of rubric you will use

Types of rubrics: holistic, analytic/descriptive, single-point

Holistic Rubric. A holistic rubric includes all the criteria (such as clarity, organization, mechanics, etc.) to be considered together and included in a single evaluation. With a holistic rubric, the rater or grader assigns a single score based on an overall judgment of the student’s work, using descriptions of each performance level to assign the score.

Advantages of holistic rubrics:

  • Can place an emphasis on what learners can demonstrate rather than what they cannot
  • Save grader time by minimizing the number of evaluations to be made for each student
  • Can be used consistently across raters, provided they have all been trained

Disadvantages of holistic rubrics:

  • Provide less specific feedback than analytic/descriptive rubrics
  • Can be difficult to choose a score when a student’s work is at varying levels across the criteria
  • Any weighting of criteria cannot be indicated in the rubric

Analytic/Descriptive Rubric. An analytic or descriptive rubric often takes the form of a table with the criteria listed in the left column and with levels of performance listed across the top row. Each cell contains a description of what the specified criterion looks like at a given level of performance. Each of the criteria is scored individually.

Advantages of analytic rubrics:

  • Provide detailed feedback on areas of strength or weakness
  • Each criterion can be weighted to reflect its relative importance

Disadvantages of analytic rubrics:

  • More time-consuming to create and use than a holistic rubric
  • May not be used consistently across raters unless the cells are well defined
  • May result in giving less personalized feedback

Single-Point Rubric. A single-point rubric breaks down the components of an assignment into different criteria, but instead of describing different levels of performance, only the “proficient” level is described. Feedback space is provided for instructors to give individualized comments to help students improve and/or show where they excelled beyond the proficiency descriptors.

Advantages of single-point rubrics:

  • Easier to create than an analytic/descriptive rubric
  • Perhaps more likely that students will read the descriptors
  • Areas of concern and excellence are open-ended
  • May remove the focus on grades/points
  • May increase student creativity in project-based assignments

Disadvantage of single-point rubrics: Requires more work for instructors when writing feedback
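
For comparison, here is one hedged sketch of how a single-point rubric might be represented: only the proficient descriptor is written out, and the grader fills in open-ended comments on either side. The criteria and descriptor text are invented examples.

```python
# Sketch of a single-point rubric: only the "proficient" level is described,
# and the grader fills in open-ended comments on either side.
# The criteria and descriptor text are invented examples.

single_point_rubric = [
    {"criterion": "Thesis",
     "proficient": "States a clear, arguable thesis that frames the essay.",
     "evidence_of_exceeding": "",  # filled in by the grader
     "concerns": ""},              # filled in by the grader
    {"criterion": "Use of sources",
     "proficient": "Integrates at least two credible sources with citations.",
     "evidence_of_exceeding": "",
     "concerns": ""},
]

# Recording feedback for one student:
single_point_rubric[0]["concerns"] = "The thesis appears only in the conclusion."
```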

Step 3 (Optional): Look for templates and examples.

You might Google, “Rubric for persuasive essay at the college level” and see if there are any publicly available examples to start from. Ask your colleagues if they have used a rubric for a similar assignment. Some examples are also available at the end of this article. These rubrics can be a great starting point for you, but consider steps 4, 5, and 6 below to ensure that the rubric matches your assignment description, learning objectives and expectations.

Step 4: Define the assignment criteria

Make a list of the knowledge and skills you are measuring with the assignment/assessment. Refer to your stated learning objectives, the assignment instructions, past examples of student work, etc. for help.

  Helpful strategies for defining grading criteria:

  • Collaborate with co-instructors, teaching assistants, and other colleagues
  • Brainstorm and discuss with students
  • Check each criterion: Can it be observed and measured? Is it important and essential? Is it distinct from the other criteria? Is it phrased in precise, unambiguous language?
  • Revise the criteria as needed
  • Consider whether some are more important than others, and how you will weight them.

Step 5: Design the rating scale

Most rating scales include between 3 and 5 levels. Consider the following questions when designing your rating scale:

  • Given what students are able to demonstrate in this assignment/assessment, what are the possible levels of achievement?
  • How many levels would you like to include? (More levels mean more detailed descriptions.)
  • Will you use numbers and/or descriptive labels for each level of performance? (for example 5, 4, 3, 2, 1 and/or Exceeds expectations, Accomplished, Proficient, Developing, Beginning, etc.)
  • Don’t use too many columns, and recognize that some criteria can have more columns than others. The rubric needs to be comprehensible and organized. Pick the right number of columns so that the criteria flow logically and naturally across levels.

Step 6: Write descriptions for each level of the rating scale

Artificial intelligence tools like ChatGPT can be useful for creating a rubric. You will want to engineer the prompt you provide to the AI assistant to ensure you get what you want. For example, you might include the assignment description, the criteria you feel are important, and the number of levels of performance you want in your prompt. Use the results as a starting point, and adjust the descriptions as needed.
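
If you prefer to script that prompt rather than retype it for every assignment, the sketch below simply assembles the pieces mentioned above (assignment description, criteria, number of levels) into a single prompt string. The assignment text and criteria are invented placeholders, and sending the prompt to ChatGPT or another assistant is left to whatever interface or client you normally use.

```python
# Sketch: assembling a rubric-drafting prompt for an AI assistant.
# The assignment text and criteria are invented placeholders; paste the
# resulting prompt into whatever chat interface or API client you already use.

assignment = "A five-paragraph persuasive essay on a local environmental issue."
criteria = ["Thesis and argument", "Use of evidence", "Organization", "Grammar and mechanics"]
levels = 4

prompt = (
    f"Create an analytic scoring rubric for this assignment: {assignment}\n"
    f"Use these criteria: {', '.join(criteria)}.\n"
    f"Use {levels} performance levels, from exemplary to beginning, and write an "
    f"observable, measurable descriptor for every criterion at every level. "
    f"Format the rubric as a table."
)

print(prompt)
```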

Building a rubric from scratch

For a single-point rubric, describe what would be considered “proficient” (i.e., B-level work). You might also include suggestions for students outside of the actual rubric about how they might surpass proficient-level work.

For analytic and holistic rubrics, create statements of expected performance at each level of the rubric.

  • Consider what descriptor is appropriate for each criterion, e.g., presence vs. absence, complete vs. incomplete, many vs. none, major vs. minor, consistent vs. inconsistent, always vs. never. If you have an indicator described in one level, it will need to be described in each level.
  • You might start with the top/exemplary level. What does it look like when a student has achieved excellence for each/every criterion? Then, look at the “bottom” level. What does it look like when a student has not achieved the learning goals in any way? Then, complete the in-between levels.
  • For an analytic rubric , do this for each particular criterion of the rubric so that every cell in the table is filled. These descriptions help students understand your expectations and their performance in regard to those expectations.

Well-written descriptions:

  • Describe observable and measurable behavior
  • Use parallel language across the scale
  • Indicate the degree to which the standards are met

Step 7: Create your rubric

Create your rubric in a table or spreadsheet in Word, Google Docs, Sheets, etc., and then transfer it by typing it into Moodle. You can also use online tools to create the rubric, but you will still have to type the criteria, indicators, levels, etc., into Moodle. Rubric creators: Rubistar, iRubric
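
If you would rather generate the table than build it by hand, a small sketch like this one (with invented criteria and descriptors) writes the rubric out as a CSV file you can open in Sheets or Excel before typing it into Moodle:

```python
# Sketch: writing a draft rubric out as a CSV file that opens in Sheets or Excel.
# The criteria and descriptors are invented examples.
import csv

levels = ["4 - Exemplary", "3 - Proficient", "2 - Developing", "1 - Beginning"]
rubric = {
    "Organization": ["Flows logically throughout", "Mostly logical", "Some lapses in flow", "Hard to follow"],
    "Evidence": ["Strong and well explained", "Relevant", "Thin or loosely connected", "Missing"],
}

with open("rubric.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Criterion"] + levels)
    for criterion, descriptors in rubric.items():
        writer.writerow([criterion] + descriptors)
```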

Step 8: Pilot-test your rubric

Prior to implementing your rubric on a live course, obtain feedback from:

  • Teaching assistants

Try out your new rubric on a sample of student work. After you pilot-test your rubric, analyze the results to consider its effectiveness and revise accordingly.

  • Limit the rubric to a single page for reading and grading ease
  • Use parallel language. Use similar language and syntax/wording from column to column. Make sure that the rubric can be easily read from left to right or vice versa.
  • Use student-friendly language. Make sure the language is learning-level appropriate. If you use academic language or concepts, you will need to teach those concepts.
  • Share and discuss the rubric with your students. Students should understand that the rubric is there to help them learn, reflect, and self-assess. If students use a rubric, they will understand the expectations and their relevance to learning.
  • Consider scalability and reusability of rubrics. Create rubric templates that you can alter as needed for multiple assignments.
  • Maximize the descriptiveness of your language. Avoid words like “good” and “excellent.” For example, instead of saying, “uses excellent sources,” you might describe what makes a resource excellent so that students will know. You might also consider reducing the reliance on quantity, such as a number of allowable misspelled words. Focus instead, for example, on how distracting any spelling errors are.

Example of an analytic rubric for a final paper

Thesis (supported by relevant information and ideas)
  • Above Average (4): The central purpose of the student work is clear, and supporting ideas are always well focused. Details are relevant and enrich the work.
  • Sufficient (3): The central purpose of the student work is clear, and ideas are almost always focused in a way that supports the thesis. Relevant details illustrate the author’s ideas.
  • Developing (2): The central purpose of the student work is identified. Ideas are mostly focused in a way that supports the thesis.
  • Needs Improvement (1): The purpose of the student work is not well defined. A number of central ideas do not support the thesis. Thoughts appear disconnected.

Sequencing of elements/ideas
  • Above Average (4): Information and ideas are presented in a logical sequence which flows naturally and is engaging to the audience.
  • Sufficient (3): Information and ideas are presented in a logical sequence which is followed by the reader with little or no difficulty.
  • Developing (2): Information and ideas are presented in an order that the audience can mostly follow.
  • Needs Improvement (1): Information and ideas are poorly sequenced. The audience has difficulty following the thread of thought.

Correctness of grammar and spelling
  • Above Average (4): Minimal to no distracting errors in grammar and spelling.
  • Sufficient (3): The readability of the work is only slightly interrupted by spelling and/or grammatical errors.
  • Developing (2): Grammatical and/or spelling errors distract from the work.
  • Needs Improvement (1): The readability of the work is seriously hampered by spelling and/or grammatical errors.

Example of a holistic rubric for a final paper

  • 4: The audience is able to easily identify the central message of the work and is engaged by the paper’s clear focus and relevant details. Information is presented logically and naturally. There are minimal to no distracting errors in grammar and spelling.
  • 3: The audience is easily able to identify the focus of the student work, which is supported by relevant ideas and supporting details. Information is presented in a logical manner that is easily followed. The readability of the work is only slightly interrupted by errors.
  • 2: The audience can identify the central purpose of the student work with little difficulty, and supporting ideas are present and clear. The information is presented in an orderly fashion that can be followed with little difficulty. Grammatical and spelling errors distract from the work.
  • 1: The audience cannot clearly or easily identify the central ideas or purpose of the student work. Information is presented in a disorganized fashion, causing the audience to have difficulty following the author’s ideas. The readability of the work is seriously hampered by errors.

Single-Point Rubric

Advanced (evidence of exceeding standards) | Criteria described at a proficient level | Concerns (things that need work)
(feedback space) | Criterion #1: Description reflecting achievement of proficient level of performance | (feedback space)
(feedback space) | Criterion #2: Description reflecting achievement of proficient level of performance | (feedback space)
(feedback space) | Criterion #3: Description reflecting achievement of proficient level of performance | (feedback space)
(feedback space) | Criterion #4: Description reflecting achievement of proficient level of performance | (feedback space)
90-100 points | 80-90 points | <80 points

More examples:

  • Single Point Rubric Template (variation)
  • Analytic Rubric Template (make a copy to edit)
  • A Rubric for Rubrics
  • Bank of Online Discussion Rubrics in different formats
  • Mathematical Presentations Descriptive Rubric
  • Math Proof Assessment Rubric
  • Kansas State Sample Rubrics
  • Design Single Point Rubric

Technology Tools: Rubrics in Moodle

  • Moodle Docs: Rubrics
  • Moodle Docs: Grading Guide (use for single-point rubrics)

Tools with rubrics (other than Moodle)

  • Google Assignments
  • Turnitin Assignments: Rubric or Grading Form

Other resources

  • DePaul University (n.d.). Rubrics .
  • Gonzalez, J. (2014). Know your terms: Holistic, Analytic, and Single-Point Rubrics . Cult of Pedagogy.
  • Goodrich, H. (1996). Understanding rubrics. Teaching for Authentic Student Performance, 54(4), 14-17.
  • Miller, A. (2012). Tame the beast: tips for designing and using rubrics.
  • Ragupathi, K., Lee, A. (2020). Beyond Fairness and Consistency in Grading: The Role of Rubrics in Higher Education. In: Sanger, C., Gleason, N. (eds) Diversity and Inclusion in Global Higher Education. Palgrave Macmillan, Singapore.
  • AI Essay Grader


About Our AI Essay Grader

Welcome to the future of education assessment with ClassX’s AI Essay Grader! In an era defined by technological advancements, educators are constantly seeking innovative ways to streamline their tasks while maintaining the quality of education. ClassX’s AI Essay Grader is a revolutionary tool designed to significantly alleviate the burden on teachers, offering a seamless and efficient solution to evaluate students’ essays.

Traditionally, assessing essays has been a time-consuming process, requiring educators to meticulously read through each piece of writing, analyze its content, and apply complex rubrics to assign grades. With the advent of AI, however, the landscape of education evaluation is rapidly changing. ClassX’s AI Essay Grader empowers teachers by automating the grading process without compromising on accuracy or fairness.

The concept is elegantly simple: teachers input or copy the students’ essays into the provided text box, select the appropriate grade level and subject, and ClassX’s AI Essay Grader takes it from there. Leveraging the cutting-edge technology of ChatGPT, the AI system meticulously evaluates essays against a predefined rubric. The rubric encompasses various criteria, ranging from content depth and structure to grammar and style, ensuring a comprehensive assessment of the writing.

Organization
  • Score 4: Writing has a clear introduction, body, and conclusion with appropriate use of paragraphs.
  • Score 3: Writing has a clear introduction and conclusion but may have some inconsistencies in paragraphing.
  • Score 2: Writing has some attempt at organization but lacks a clear introduction, body, or conclusion.
  • Score 1: Writing is disorganized and lacks clear structure.

Content
  • Score 4: Writing includes relevant details, facts, or examples that support the main idea.
  • Score 3: Writing includes some relevant details, facts, or examples, but may lack consistency or specificity.
  • Score 2: Writing includes limited or unrelated details, facts, or examples.
  • Score 1: Writing lacks relevant content or is off-topic.

Grammar and Mechanics
  • Score 4: Writing demonstrates correct use of punctuation, capitalization, and verb tense.
  • Score 3: Writing has some errors in punctuation, capitalization, or verb tense, but does not significantly impact readability.
  • Score 2: Writing has frequent errors in punctuation, capitalization, or verb tense that may impact readability.
  • Score 1: Writing has pervasive errors in punctuation, capitalization, or verb tense that significantly impact readability.

Vocabulary
  • Score 4: Writing uses a variety of age-appropriate vocabulary with precise word choices.
  • Score 3: Writing uses some age-appropriate vocabulary but may lack variety or precision.
  • Score 2: Writing uses limited or basic vocabulary that may not be age-appropriate.
  • Score 1: Writing lacks appropriate vocabulary or word choices.

Overall Impression
  • Score 4: Writing is engaging, well-crafted, and demonstrates strong effort and creativity.
  • Score 3: Writing is generally engaging and shows effort, but may have some areas for improvement.
  • Score 2: Writing is somewhat engaging but lacks polish or effort.
  • Score 1: Writing is not engaging, poorly crafted, or lacks effort.

Teachers can now allocate more time to personalized instruction, classroom engagement, and curriculum development, rather than being bogged down by the time-consuming task of manual essay evaluation. The AI’s rapid and consistent grading also means that students receive prompt feedback on their work, enabling them to learn from their mistakes and improve their writing skills at an accelerated pace.

Moreover, the AI Essay Grader enhances objectivity in grading. By removing potential biases and inconsistencies inherent in manual grading, educators can ensure that every student receives a fair and unbiased evaluation of their work. This contributes to a more equitable educational environment where all students have an equal chance to succeed.

In summary, ClassX’s AI Essay Grader represents a groundbreaking leap in the evolution of educational assessment. By seamlessly integrating advanced AI technology with the art of teaching, this tool unburdens educators from the arduous task of essay grading, while maintaining the highest standards of accuracy and fairness. As we embrace the potential of AI in education, ClassX is leading the way in revolutionizing the classroom experience for both teachers and students.


What Is an Essay Grading Rubric & 11 Tools to Streamline Your Grading


Mastering essay writing is a crucial skill, and grading these essays can be challenging for teachers. That's where essay grading rubrics come into play. These tools for teachers provide a clear framework for assessing student work, making the process more efficient and fair. This blog explores the ins and outs of essay grading rubrics, how they benefit both teachers and students, and how they improve the overall learning process. Dive into this essential aspect of learning assessment if you're looking to streamline your grading process and provide valuable feedback to your students.

What Is an Essay Grading Rubric?


In simple terms, an Essay Grading Rubric is an assessment tool used by educators to evaluate and grade students' essays. A rubric breaks down the essay into various components or criteria that the student needs to fulfill to achieve a certain score. These criteria usually cover aspects like organization, content, language use, and mechanics. The rubric typically includes four performance levels - excellent, good, fair, and poor - to categorize the quality of the student's work. Each performance level comes with a set of descriptors that clearly define what is expected at that level. 

Components of an Essay Grading Rubric

1. Criteria

The criteria in an essay grading rubric outline the specific areas that a student's essay will be assessed on. These criteria vary depending on the teacher's goals for the assignment. They may include elements like thesis statement, organization, supporting evidence, analysis, language use, and mechanics. Each criterion is usually given a numerical value that corresponds to a specific performance level.

2. Performance Levels

Performance levels, or grading scales, provide a framework for evaluating student work in a rubric. These levels are often labeled with phrases like "excellent," "good," "fair," and "poor." By associating these levels with numerical scores, educators can assign a grade to a student's essay based on where their performance falls on the scale.

3. Descriptors

Descriptors are the detailed explanations that accompany each performance level. They give students a clear understanding of what is expected at each level of performance. For example, a rubric might state that an "excellent" analysis has a clear and insightful interpretation of the text, while a "fair" analysis may lack depth or be unclear. Essay grading rubrics help teachers evaluate student work objectively and transparently. They break down the grading process into manageable components and provide students with clear expectations for their assignments.

Types of Essay Grading Rubrics


An analytic rubric breaks down all the components of a project, presentation, or essay into different criteria. Each criterion is scored individually, which can be helpful for providing detailed feedback on specific areas of strength or weakness. While it may be more time-consuming to create and use than a holistic rubric, an analytic rubric allows for each criterion to be weighted to reflect its relative importance. It may require more work for instructors to write feedback for each criterion, but it can help provide students with specific areas for improvement.  On the other hand, a holistic rubric includes all the criteria to be considered together and included in a single evaluation. With a holistic rubric, the rater or grader assigns a single score based on an overall judgment of the student’s work, using descriptions of each performance level to assign the score. Holistic rubrics may save grader time by minimizing the number of evaluations to be made for each student, but they provide less specific feedback than analytic rubrics.

General vs. Task-Specific Essay Grading Rubric

The use of general versus task-specific rubrics depends on the learning objectives and the assessment task. General rubrics are typically used for broad assessments where the criteria can be applied to a wide variety of tasks. In contrast, task-specific rubrics are tailored to the requirements of a particular performance task or an outcome. General rubrics can be used efficiently across a wide range of assignments, providing consistency in evaluation.  They can serve as a guide for what is expected of students without overwhelming them with overly specific criteria. On the other hand, task-specific rubrics are designed to assess a particular task or performance outcome, providing detailed guidelines and expectations for the students. Task-specific rubrics can provide targeted feedback that helps students understand their strengths and areas for improvement in relation to a specific task or learning objective.

6 Benefits of Using an Essay Grading Rubric


Rubrics are essential tools for grading essays. They bring a level of objectivity and fairness that is hard to achieve with other grading methods. Rubrics serve as a guide for what teachers are looking for in the student’s work. This eliminates potential bias from teachers, creating a standard that every student’s work is measured against.  The rubric can also help teachers move quicker through grading essays as they can simply refer to the guide when grading and awarding points. Rubrics can provide a level of transparency for students. This means they understand why they received the grade they did and what could have been done better. In the end, rubrics can save time and ensure more accurate grading for teachers while providing a clear roadmap for students to follow to help them achieve a better grade. 

1. Students know what is expected.

Rubrics outline what students are expected to learn from a particular assignment. They highlight the knowledge, skills, and attitudes students should gain from their work. This means that students know what they need to do to get the grade they desire. They can check off each item as they demonstrate their understanding of the concepts across the rubric.

2. Students see that learning is about gaining specific knowledge, skills, and attitudes

Rubrics can be an essential tool in helping students to understand concepts. Instead of seeing the grade as an endpoint, rubrics can be an opportunity for students to see what they have learned and accomplished. This can be a motivating factor for students to do better in school.

3. Students may self-assess to reflect on their learning

Rubrics can be an essential tool for students, not just teachers. Students can use the rubric to assess themselves and get an idea of what they need to work on. This can help students work on the areas where they need improvement and can be a powerful tool for self-improvement.

4. Teachers and students are clear on what is being assessed

Rubrics are an essential tool for teachers as well. Teachers can use the rubric to let students know what is expected of them. Rubrics make it clear what the teacher is looking for in the student’s work, making it easier for teachers to assess work. 

5. Teachers may consistently assess student work without having to re-write similar comments

Rubrics make it easier for teachers to assess student work. Instead of having to write the same comments over and over, teachers can use the rubric to quickly assess students’ work. This can save teachers time and frustration and make it easier to provide feedback to students.

6. Teachers with high marking loads save considerable time

Rubrics can be a powerful tool for saving teachers time. Instead of having to spend hours grading individual assignments, teachers can use the rubric to quickly assess student work. This can save teachers hours of work and frustration and can make it easier for teachers to assess student work.

Essay Grading Efficiency with EssayGrader's AI Platform

EssayGrader is the most accurate AI grading platform, trusted by 30,000+ educators worldwide. On average, it takes a teacher 10 minutes to grade a single essay; with EssayGrader, that time is cut down to 30 seconds. That's a 95% reduction in the time it takes to grade an essay, with the same results. With EssayGrader, teachers can:

  • Replicate their grading rubrics (so AI doesn't have to do the guesswork to set the grading criteria)
  • Set up fully custom rubrics
  • Grade essays by class
  • Bulk upload of essays
  • Use our AI detector to catch essays written by AI
  • Summarize essays with our Essay summarizer 

Primary school, high school, and even college professors grade their students' essays with the help of our AI tool. Over half a million essays were graded by 30,000+ teachers on our platform. Save 95% of your time grading schoolwork with our tool and get high-quality, specific, and accurate writing feedback for essays in seconds. Get started for free today!


How to Design a Grading Rubric


1. Analyze the assignment

When creating a rubric from scratch, the first step is to thoroughly analyze the assignment you are grading. What are the main objectives of the assignment? What do you want your students to demonstrate through their work? Are there different smaller tasks within the assignment that need to be evaluated separately? By considering these questions, you can better determine the criteria and performance levels to include in your rubric.

2. Decide what kind of rubric you will use

There are various types of rubrics to choose from, including holistic, analytic/descriptive, and single-point rubrics. Deciding which type best suits your assessment will help you structure your grading criteria effectively. Holistic rubrics assess overall performance, analytic/descriptive rubrics break down performance into specific criteria, and single-point rubrics focus on proficiency levels.

3. Look for templates and examples 

Before creating your rubric from scratch, it can be helpful to look for templates or examples online. These resources can give you a starting point and help you align your rubric with the assignment's expectations and learning objectives. Collaborating with colleagues or asking students for input can also provide valuable insights.

4. Define the assignment criteria

Next, define the specific criteria you will use to evaluate the assignment. These criteria should align with the learning objectives and expectations of the assignment. Collaborating with colleagues, teaching assistants, and students can help you brainstorm effective grading criteria and ensure they are precise and unambiguous.

5. Design the rating scale

Consider the number of levels you want to include in your rating scale and whether you will use numbers or descriptive labels. The rating scale should provide a clear assessment of student performance and align with the assignment requirements. Ensure the rubric is organized logically and comprehensible to students.

6. Write descriptions for each level of the rating scale

For each level of the rating scale, provide clear descriptions of what constitutes performance at that level. These descriptions should be observable and measurable, using parallel language across the scale. Consider what distinguishes each level of performance and how it aligns with the assignment's expectations.

7. Create your rubric

Once you have established the criteria and rating scale, create your rubric in a table or spreadsheet format. Online tools can assist in creating a rubric, but you will likely need to transfer the details to your grading platform manually. Ensure the rubric is clear and accessible to students.

8. Pilot-test your rubric

Before using your rubric to grade actual student work, pilot-test it with colleagues, teaching assistants, and students. Collect feedback on the rubric's effectiveness and make necessary revisions based on the results. Piloting the rubric can help ensure it aligns with the assignment objectives and provides valuable feedback to students.
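
One simple way to look at pilot results is to compare how two raters scored the same sample of essays with the draft rubric. The sketch below uses invented scores and reports how often the raters land on exactly the same level for each criterion; low agreement on a criterion is a hint that its descriptors need revision.

```python
# Sketch: checking how often two raters agree during a pilot test.
# All scores below are invented.

rater_a = {"Essay 1": {"Organization": 3, "Evidence": 4},
           "Essay 2": {"Organization": 2, "Evidence": 2}}
rater_b = {"Essay 1": {"Organization": 3, "Evidence": 3},
           "Essay 2": {"Organization": 2, "Evidence": 2}}

matches = 0
total = 0
for essay, scores in rater_a.items():
    for criterion, score in scores.items():
        total += 1
        matches += int(score == rater_b[essay][criterion])

print(f"Exact agreement: {matches}/{total} ({100 * matches / total:.0f}%)")
```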

11 Best AI Essay Grading Rubric Platforms


1. EssayGrader

EssayGrader is an exceptional tool that streamlines the grading process for educators. Through AI technology, this platform significantly reduces the time taken to grade essays while maintaining high accuracy. It allows teachers to replicate their grading rubrics and provides the flexibility of setting custom rubrics. Educators can:

  • Upload essays in bulk
  • Detect AI-written essays
  • Summarize essays efficiently

With over half a million essays graded by 30,000+ teachers, EssayGrader is a reliable tool for grading essays across different educational levels.

2. Gradescope

Gradescope stands out as one of the best AI graders, offering a sophisticated platform for automated essay evaluation. It allows for quick and accurate grading, customizable rubrics, and insightful analytics to identify common misconceptions among students. This tool enhances the efficiency and effectiveness of the grading process.

3. Turnitin

Renowned for its plagiarism detection capabilities, Turnitin extends its functionality to include AI essay grading. By leveraging AI technology, Turnitin ensures academic integrity, enhances originality, and offers in-depth feedback on writing quality. This tool supports the automated grading of various writing formats, making the assessment process more efficient.

4. PaperRater

PaperRater utilizes AI for consistent and unbiased grading across various subjects. It provides immediate feedback essential for large-scale online courses and improves scalability. With this tool, educators can ensure fair and efficient grading processes.

5. AI For Teachers

Known for offering free AI graders and tools for teachers, AI For Teachers provides an opportunity to leverage AI technology effortlessly. It comes with personalized rubrics, easy-to-use AI graders, and reliable customer support. This tool enhances the grading experience for educators.

6. Canvas SpeedGrader

Integrated within the Canvas Learning Management System, SpeedGrader utilizes AI for efficient and seamless essay grading. It offers a streamlined workflow, multimedia feedback options, and collaborative grading among educators. This tool enhances the grading experience within the Canvas platform.

7. CoGrader

CoGrader simplifies grading by integrating seamlessly with Google Classroom for easy import and export of assignments. It provides instant feedback on assignments, automates grading, reduces time spent, and ensures fairness by removing biases. This tool enhances the efficiency and accuracy of the grading process.

8. AI For Teachers

For ChatGPT Plus users, AI For Teachers offers free access to exceptional value without additional cost. By integrating ChatGPT capabilities, educators can create interactive learning experiences and tailor AI bots for specific classroom needs. This tool enhances student engagement and learning outcomes.

9. Smodin

Smodin offers AI-powered writing assistance tools for rewriting, plagiarism detection, summarizing, and AI writing. With multilingual support and integration with various useful tools, Smodin caters to students, teachers, and content creators. This tool enhances writing and grading processes across different languages.

10. GradeCam

GradeCam facilitates quick scanning and grading of paper tests, providing timely feedback crucial for students' learning adjustments. It offers analytics tools that give insights into student progress and areas for improvement, supporting data-driven instructional decisions. This tool enhances the grading process and student feedback.

11. SnapGrader

SnapGrader enables teachers to quickly digitize paper tests and quizzes, eliminating manual data entry. It provides instant feedback to students, offers insights into grading accuracy and consistency, and supports various question formats. With SnapGrader, educators can efficiently assess student performance and provide timely feedback to enhance learning.

How to Choose the Best Essay Grading Rubric


Adaptability

When choosing a rubric for grading essays, it's vital to consider its adaptability. You want a rubric that can easily cater to different types of assignments and essays. Flexibility is key, especially if you assess an array of genres or topics. Being able to customize the rubric to suit the specific needs of your students and course objectives is imperative.

Accuracy and Specificity

The rubric you choose must accurately assess the essays. It should provide specific feedback on grammar, coherence, clarity, and writing style errors. Detailed error reports should be generated, highlighting the mistakes made by students. This specific feedback is crucial as it allows you to provide targeted feedback to help students improve their writing skills.

Ease of Use

You need a rubric that is not only accurate but also user-friendly. The rubric should be intuitive and easy to navigate. A clutter-free interface ensures that grading is efficient and that you can focus on providing valuable feedback to your students.

Bulk Uploading

If you are handling numerous essays, consider a rubric that allows for bulk uploading. This feature streamlines the grading process by enabling you to evaluate an entire class's worth of essays at once. This can be a huge time-saver and helps maintain consistency in grading.

Improvement Suggestions

Look for a rubric that does not just identify errors but also offers suggestions for improvement. Constructive feedback is crucial in helping students enhance their writing skills. A rubric that not only points out mistakes but also provides insights on how to rectify those mistakes can be incredibly beneficial.

Alignment with Traditional Rubrics

If you have been using traditional rubrics, ensure that the AI rubric aligns with your existing grading practices. You should have the flexibility to create custom rubrics based on your preferred criteria. This ensures that the rubric meets the specific needs of your course and students.

Integration

Select AI grading tools that seamlessly integrate with existing Learning Management Systems. This integration fosters efficiency in your overall workflow and helps streamline the grading process. A rubric that easily integrates with your existing systems can save you time and effort in managing assignments and grading student work.

Save Time While Grading Schoolwork — Join 30,000+ Educators Worldwide & Use EssayGrader AI, The Original AI Essay Grader

EssayGrader is an innovative AI grading platform designed to transform the way educators grade essays. By leveraging cutting-edge technology, we help teachers save time and provide accurate and effective feedback to students. With the power of AI, EssayGrader offers several key features that make the grading process more efficient and precise.

Replication of Grading Rubrics

Our platform allows educators to replicate their grading rubrics, ensuring that AI accurately assesses essays based on the established criteria. This feature eliminates guesswork and guarantees consistent and fair grading for all students.

Custom Rubrics

EssayGrader enables teachers to create fully customized rubrics tailored to their specific requirements. This flexibility allows educators to adapt the grading criteria to the unique needs of their classes and assignments.

Bulk Upload and Class-based Grading

Teachers can streamline the grading process by uploading multiple essays in bulk and grading them by class. This functionality simplifies the workflow for educators who need to assess numerous essays efficiently.

AI Detector

EssayGrader includes an AI detector that can identify essays generated by AI programs. This feature helps maintain academic integrity by flagging essays that may not reflect authentic student work.

Essay Summarizer

Our platform offers an essay summarizer tool that condenses essays into concise summaries. This feature enables educators to quickly grasp the main points of an essay and provide targeted feedback to students.

Educators around the world trust EssayGrader to enhance their grading processes and provide valuable feedback to students. By leveraging AI technology, our platform empowers teachers to save time, improve grading accuracy, and deliver high-quality feedback to support student learning and growth. Join the thousands of educators who have already embraced this innovative tool and experience the benefits of efficient and effective essay grading with EssayGrader.


Writing Rubrics [Examples, Best Practices, & Free Templates]

Writing rubrics are essential tools for teachers.

Rubrics can improve both teaching and learning. This guide will explain writing rubrics, their benefits, and how to create and use them effectively.

What Is a Writing Rubric?



A writing rubric is a scoring guide used to evaluate written work.

It lists criteria and describes levels of quality from excellent to poor. Rubrics provide a standardized way to assess writing.

They make expectations clear and grading consistent.

Key Components of a Writing Rubric

  • Criteria : Specific aspects of writing being evaluated (e.g., grammar, organization).
  • Descriptors : Detailed descriptions of what each level of performance looks like.
  • Scoring Levels : Typically, a range (e.g., 1-4 or 1-6) showing levels of mastery.

Example Breakdown

Criteria | 4 (Excellent) | 3 (Good) | 2 (Fair) | 1 (Poor)
Grammar | No errors | Few minor errors | Several errors | Many errors
Organization | Clear and logical | Mostly clear | Somewhat clear | Not clear
Content | Thorough and insightful | Good, but not thorough | Basic, lacks insight | Incomplete or off-topic
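Analytic rubrics like this one lend themselves to simple bookkeeping: each criterion earns a level from 1 to 4, and the levels add up to a total score. The short Python sketch below illustrates that idea; the `score_essay` helper and the percentage conversion are illustrative assumptions, not part of any particular rubric.

```python
# Minimal sketch of scoring an essay against the analytic rubric above.
# Criterion names and levels mirror the example table; the percentage
# conversion is an illustrative assumption, not part of the rubric itself.

RUBRIC = {
    "Grammar":      {4: "No errors", 3: "Few minor errors", 2: "Several errors", 1: "Many errors"},
    "Organization": {4: "Clear and logical", 3: "Mostly clear", 2: "Somewhat clear", 1: "Not clear"},
    "Content":      {4: "Thorough and insightful", 3: "Good, but not thorough",
                     2: "Basic, lacks insight", 1: "Incomplete or off-topic"},
}

def score_essay(ratings: dict[str, int]) -> tuple[int, float]:
    """Sum the level awarded for each criterion and return (points, percent)."""
    for criterion, level in ratings.items():
        if criterion not in RUBRIC or level not in RUBRIC[criterion]:
            raise ValueError(f"Invalid rating: {criterion}={level}")
    total = sum(ratings.values())
    max_points = 4 * len(RUBRIC)          # every criterion tops out at level 4
    return total, round(100 * total / max_points, 1)

# Example: a paper rated 3 on grammar, 4 on organization, 2 on content.
points, percent = score_essay({"Grammar": 3, "Organization": 4, "Content": 2})
print(points, percent)   # -> 9 75.0
```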

Benefits of Using Writing Rubrics

Writing rubrics offer many advantages:

  • Clarity : Rubrics clarify expectations for students. They know what is required for each level of performance.
  • Consistency : Rubrics standardize grading. This ensures fairness and consistency across different students and assignments.
  • Feedback : Rubrics provide detailed feedback. Students understand their strengths and areas for improvement.
  • Efficiency : Rubrics streamline the grading process. Teachers can evaluate work more quickly and systematically.
  • Self-Assessment : Students can use rubrics to self-assess. This promotes reflection and responsibility for their learning.

Examples of Writing Rubrics

Here are some examples of writing rubrics.

Narrative Writing Rubric

Criteria | 4 (Excellent) | 3 (Good) | 2 (Fair) | 1 (Poor)
Story Elements | Well-developed | Developed, some details | Basic, missing details | Underdeveloped
Creativity | Highly creative | Creative | Some creativity | Lacks creativity
Grammar | No errors | Few minor errors | Several errors | Many errors
Organization | Clear and logical | Mostly clear | Somewhat clear | Not clear
Language Use | Rich and varied | Varied | Limited | Basic or inappropriate

Persuasive Writing Rubric

Criteria | 4 (Excellent) | 3 (Good) | 2 (Fair) | 1 (Poor)
Argument | Strong and convincing | Convincing, some gaps | Basic, lacks support | Weak or unsupported
Evidence | Strong and relevant | Relevant, but not strong | Some relevant, weak | Irrelevant or missing
Grammar | No errors | Few minor errors | Several errors | Many errors
Organization | Clear and logical | Mostly clear | Somewhat clear | Not clear
Language Use | Persuasive and engaging | Engaging | Somewhat engaging | Not engaging

Best Practices for Creating Writing Rubrics

Let’s look at some best practices for creating useful writing rubrics.

1. Define Clear Criteria

Identify specific aspects of writing to evaluate. Be clear and precise.

The criteria should reflect the key components of the writing task. For example, for a narrative essay, criteria might include plot development, character depth, and use of descriptive language.

Clear criteria help students understand what is expected and allow teachers to provide targeted feedback.

Insider Tip : Collaborate with colleagues to establish consistent criteria across grade levels. This ensures uniformity in expectations and assessments.

2. Use Detailed Descriptors

Describe what each level of performance looks like.

This ensures transparency and clarity. Avoid vague language. Instead of saying “good,” describe what “good” entails. For example, “Few minor grammatical errors that do not impede readability.”

Detailed descriptors help students gauge their performance accurately.

Insider Tip : Use student work samples to illustrate each performance level. This provides concrete examples and helps students visualize expectations.

3. Involve Students

Involve students in the rubric creation process. This increases their understanding and buy-in.

Ask for their input on what they think is important in their writing.

This collaborative approach not only demystifies the grading process but also fosters a sense of ownership and responsibility in students.

Insider Tip : Conduct a workshop where students help create a rubric for an upcoming assignment. This interactive session can clarify doubts and make students more invested in their work.

4. Align with Objectives

Ensure the rubric aligns with learning objectives. This ensures relevance and focus.

If the objective is to enhance persuasive writing skills, the rubric should emphasize argument strength, evidence quality, and persuasive techniques.

Alignment ensures that the assessment directly supports instructional goals.

Insider Tip : Regularly revisit and update rubrics to reflect changes in curriculum and instructional priorities. This keeps the rubrics relevant and effective.

5. Review and Revise

Regularly review and revise rubrics. Ensure they remain accurate and effective.

Solicit feedback from students and colleagues. Continuous improvement of rubrics ensures they remain a valuable tool for both assessment and instruction.

Insider Tip : After using a rubric, take notes on its effectiveness. Were students confused by any criteria? Did the rubric cover all necessary aspects of the assignment? Use these observations to make adjustments.

6. Be Consistent

Use the rubric consistently across all assignments.

This ensures fairness and reliability. Consistency in applying the rubric helps build trust with students and maintains the integrity of the assessment process.

Insider Tip : Develop a grading checklist to accompany the rubric. This can help ensure that all criteria are consistently applied and none are overlooked during the grading process.

7. Provide Examples

Provide examples of each performance level.

This helps students understand expectations. Use annotated examples to show why a particular piece of writing meets a specific level.

This visual and practical demonstration can be more effective than descriptions alone.

Insider Tip : Create a portfolio of exemplar works for different assignments. This can be a valuable resource for both new and experienced teachers to standardize grading.

How to Use Writing Rubrics Effectively

Here is how to use writing rubrics like the pros.

1. Introduce Rubrics Early

Introduce rubrics at the beginning of the assignment.

Explain each criterion and performance level. This upfront clarity helps students understand what is expected and guides their work from the start.

Insider Tip : Conduct a rubric walkthrough session where you discuss each part of the rubric in detail. Allow students to ask questions and provide examples to illustrate each criterion.

2. Use Rubrics as a Teaching Tool

Use rubrics to teach writing skills. Discuss what constitutes good writing and why.

This can be an opportunity to reinforce lessons on grammar, organization, and other writing components.

Insider Tip : Pair the rubric with writing workshops. Use the rubric to critique sample essays and show students how to apply the rubric to improve their own writing.

3. Provide Feedback

Use the rubric to give detailed feedback. Highlight strengths and areas for improvement.

This targeted feedback helps students understand their performance and learn how to improve.

Insider Tip : Instead of just marking scores, add comments next to each criterion on the rubric. This personalized feedback can be more impactful and instructive for students.

4. Encourage Self-Assessment

Encourage students to use rubrics to self-assess.

This promotes reflection and growth. Before submitting their work, ask students to evaluate their own writing against the rubric.

This practice fosters self-awareness and critical thinking.

Insider Tip : Incorporate self-assessment as a mandatory step in the assignment process. Provide a simplified version of the rubric for students to use during self-assessment.

5. Use Rubrics for Peer Assessment

Use rubrics for peer assessment. This allows students to learn from each other.

Peer assessments can provide new perspectives and reinforce learning.

Insider Tip : Conduct a peer assessment workshop. Train students on how to use the rubric to evaluate each other’s work constructively. This can improve the quality of peer feedback.

6. Reflect and Improve

Reflect on the effectiveness of the rubric. Make adjustments as needed for future assignments.

Continuous reflection ensures that rubrics remain relevant and effective tools for assessment and learning.

Insider Tip : After an assignment, hold a debrief session with students to gather their feedback on the rubric. Use their insights to make improvements.


Common Mistakes with Writing Rubrics

Creating and using writing rubrics can be incredibly effective, but there are common mistakes that can undermine their effectiveness.

Here are some pitfalls to avoid:

1. Vague Criteria

Vague criteria can confuse students and lead to inconsistent grading.

Ensure that each criterion is specific and clearly defined. Ambiguous terms like “good” or “satisfactory” should be replaced with concrete descriptions of what those levels of performance look like.

2. Overly Complex Rubrics

While detail is important, overly complex rubrics can be overwhelming for both students and teachers.

Too many criteria and performance levels can complicate the grading process and make it difficult for students to understand what is expected.

Keep rubrics concise and focused on the most important aspects of the assignment.

3. Inconsistent Application

Applying the rubric inconsistently can lead to unfair grading.

Ensure that you apply the rubric in the same way for all students and all assignments. Consistency builds trust and ensures that grades accurately reflect student performance.

4. Ignoring Student Input

Ignoring student input when creating rubrics can result in criteria that do not align with student understanding or priorities.

Involving students in the creation process can enhance their understanding and engagement with the rubric.

5. Failing to Update Rubrics

Rubrics should evolve to reflect changes in instructional goals and student needs.

Failing to update rubrics can result in outdated criteria that no longer align with current teaching objectives.

Regularly review and revise rubrics to keep them relevant and effective.

6. Lack of Examples

Without examples, students may struggle to understand the expectations for each performance level.

Providing annotated examples of work that meets each criterion can help students visualize what is required and guide their efforts more effectively.

7. Not Providing Feedback

Rubrics should be used as a tool for feedback, not just scoring.

Simply assigning a score without providing detailed feedback can leave students unclear about their strengths and areas for improvement.

Use the rubric to give comprehensive feedback that guides students’ growth.

8. Overlooking Self-Assessment and Peer Assessment

Self-assessment and peer assessment are valuable components of the learning process.

Overlooking these opportunities can limit students’ ability to reflect on their own work and learn from their peers.

Encourage students to use the rubric for self and peer assessment to deepen their understanding and enhance their skills.

What Is a Holistic Scoring Rubric for Writing?

A holistic scoring rubric for writing is a type of rubric that evaluates a piece of writing as a whole rather than breaking it down into separate criteria.

This approach provides a single overall score based on the general impression of the writing’s quality and effectiveness.

Here’s a closer look at holistic scoring rubrics.

Key Features of Holistic Scoring Rubrics

  • Single Overall Score : Assigns one score based on the overall quality of the writing.
  • General Criteria : Focuses on the overall effectiveness, coherence, and impact of the writing.
  • Descriptors : Uses broad descriptors for each score level to capture the general characteristics of the writing.

Example Holistic Scoring Rubric

Score | Description
5 | Exceptionally clear, engaging, and well-organized writing. Demonstrates excellent control of language, grammar, and style.
4 | Clear and well-organized writing. Minor errors do not detract from the overall quality. Demonstrates good control of language and style.
3 | Satisfactory writing with some organizational issues. Contains a few errors that may distract but do not impede understanding.
2 | Basic writing that lacks organization and contains several errors. Demonstrates limited control of language and style.
1 | Unclear and poorly organized writing. Contains numerous errors that impede understanding. Demonstrates poor control of language and style.
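To make the contrast with analytic scoring concrete, here is a minimal Python sketch of holistic scoring under the 1-5 scale above: the grader reads the whole piece and records a single level, with nothing to add up. The `record_holistic_score` helper, the essay identifier, and the abbreviated descriptors are illustrative assumptions.

```python
# Minimal sketch of holistic scoring on the 1-5 scale above: one overall
# judgment per essay, no per-criterion arithmetic. Descriptors are abbreviated.
HOLISTIC_LEVELS = {
    5: "Exceptionally clear, engaging, and well-organized writing.",
    4: "Clear and well-organized writing; minor errors do not detract.",
    3: "Satisfactory writing with some organizational issues.",
    2: "Basic writing; lacks organization and contains several errors.",
    1: "Unclear, poorly organized writing with numerous errors.",
}

def record_holistic_score(essay_id: str, level: int) -> str:
    """Attach the single overall level (and its descriptor) to an essay."""
    if level not in HOLISTIC_LEVELS:
        raise ValueError("Holistic score must be a whole number from 1 to 5")
    return f"{essay_id}: {level} ({HOLISTIC_LEVELS[level]})"

print(record_holistic_score("essay_014", 4))
# -> essay_014: 4 (Clear and well-organized writing; minor errors do not detract.)
```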

Advantages of Holistic Scoring Rubrics

  • Efficiency : Faster to use because it involves a single overall judgment rather than multiple criteria.
  • Flexibility : Allows for a more intuitive assessment of the writing’s overall impact and effectiveness.
  • Comprehensiveness : Captures the overall quality of writing, considering all elements together.

Disadvantages of Holistic Scoring Rubrics

  • Less Detailed Feedback : Provides a general score without specific feedback on individual aspects of writing.
  • Subjectivity : Can be more subjective, as it relies on the assessor’s overall impression rather than specific criteria.
  • Limited Diagnostic Use : Less useful for identifying specific areas of strength and weakness for instructional purposes.

When to Use Holistic Scoring Rubrics

  • Quick Assessments : When a quick, overall evaluation is needed.
  • Standardized Testing : Often used in standardized testing scenarios where consistency and efficiency are priorities.
  • Initial Impressions : Useful for providing an initial overall impression before more detailed analysis.

Free Writing Rubric Templates

Feel free to use the following writing rubric templates.

You can easily copy and paste them into a Word Document. Please do credit this website on any written, printed, or published use.

Otherwise, go wild.

Narrative Writing Rubric

Criteria | 4 (Excellent) | 3 (Good) | 2 (Fair) | 1 (Poor)
Story Elements | Well-developed, engaging, and clear plot, characters, and setting. | Developed plot, characters, and setting with some details missing. | Basic plot, characters, and setting; lacks details. | Underdeveloped plot, characters, and setting.
Creativity | Highly creative and original. | Creative with some originality. | Some creativity but lacks originality. | Lacks creativity and originality.
Grammar | No grammatical errors. | Few minor grammatical errors. | Several grammatical errors. | Numerous grammatical errors.
Organization | Clear and logical structure. | Mostly clear structure. | Somewhat clear structure. | Lacks clear structure.
Language Use | Rich, varied, and appropriate language. | Varied and appropriate language. | Limited language variety. | Basic or inappropriate language.

Persuasive Writing Rubric

Criteria | 4 (Excellent) | 3 (Good) | 2 (Fair) | 1 (Poor)
Argument | Strong, clear, and convincing argument. | Convincing argument with minor gaps. | Basic argument; lacks strong support. | Weak or unsupported argument.
Evidence | Strong, relevant, and well-integrated evidence. | Relevant evidence but not strong. | Some relevant evidence, but weak. | Irrelevant or missing evidence.
Grammar | No grammatical errors. | Few minor grammatical errors. | Several grammatical errors. | Numerous grammatical errors.
Organization | Clear and logical structure. | Mostly clear structure. | Somewhat clear structure. | Lacks clear structure.
Language Use | Persuasive and engaging language. | Engaging language. | Somewhat engaging language. | Not engaging language.

Expository Writing Rubric

Criteria | 4 (Excellent) | 3 (Good) | 2 (Fair) | 1 (Poor)
Content | Thorough, accurate, and insightful content. | Accurate content with some details missing. | Basic content; lacks depth. | Incomplete or inaccurate content.
Clarity | Clear and concise explanations. | Mostly clear explanations. | Somewhat clear explanations. | Unclear explanations.
Grammar | No grammatical errors. | Few minor grammatical errors. | Several grammatical errors. | Numerous grammatical errors.
Organization | Clear and logical structure. | Mostly clear structure. | Somewhat clear structure. | Lacks clear structure.
Language Use | Precise and appropriate language. | Appropriate language. | Limited language variety. | Basic or inappropriate language.

Descriptive Writing Rubric

Criteria | 4 (Excellent) | 3 (Good) | 2 (Fair) | 1 (Poor)
Imagery | Vivid and detailed imagery that engages the senses. | Detailed imagery with minor gaps. | Basic imagery; lacks vivid details. | Little to no imagery.
Creativity | Highly creative and original descriptions. | Creative with some originality. | Some creativity but lacks originality. | Lacks creativity and originality.
Grammar | No grammatical errors. | Few minor grammatical errors. | Several grammatical errors. | Numerous grammatical errors.
Organization | Clear and logical structure. | Mostly clear structure. | Somewhat clear structure. | Lacks clear structure.
Language Use | Rich, varied, and appropriate language. | Varied and appropriate language. | Limited language variety. | Basic or inappropriate language.

Analytical Writing Rubric

Criteria | 4 (Excellent) | 3 (Good) | 2 (Fair) | 1 (Poor)
Analysis | Insightful, thorough, and well-supported analysis. | Good analysis with some depth. | Basic analysis; lacks depth. | Weak or unsupported analysis.
Evidence | Strong, relevant, and well-integrated evidence. | Relevant evidence but not strong. | Some relevant evidence, but weak. | Irrelevant or missing evidence.
Grammar | No grammatical errors. | Few minor grammatical errors. | Several grammatical errors. | Numerous grammatical errors.
Organization | Clear and logical structure. | Mostly clear structure. | Somewhat clear structure. | Lacks clear structure.
Language Use | Precise and appropriate language. | Appropriate language. | Limited language variety. | Basic or inappropriate language.
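If you would rather generate one of these templates than copy and paste it, the sketch below writes the narrative template to a CSV file that opens cleanly in a spreadsheet or pastes into a document. It uses only Python's standard `csv` module; the output file name and the choice to show just two of the five rows are illustrative assumptions.

```python
# Sketch: export a rubric template to CSV so it can be opened in a spreadsheet.
# Rows follow the narrative writing template above (two of five rows shown).
import csv

HEADER = ["Criteria", "4 (Excellent)", "3 (Good)", "2 (Fair)", "1 (Poor)"]
ROWS = [
    ["Story Elements",
     "Well-developed, engaging, and clear plot, characters, and setting.",
     "Developed plot, characters, and setting with some details missing.",
     "Basic plot, characters, and setting; lacks details.",
     "Underdeveloped plot, characters, and setting."],
    ["Grammar",
     "No grammatical errors.", "Few minor grammatical errors.",
     "Several grammatical errors.", "Numerous grammatical errors."],
]

with open("narrative_writing_rubric.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(HEADER)
    writer.writerows(ROWS)

print("Wrote narrative_writing_rubric.csv")
```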

Final Thoughts: Writing Rubrics

I have a lot more resources for teaching on this site.

Check out some of the blog posts I’ve listed below. I think you might enjoy them.

Read This Next:

  • Narrative Writing Graphic Organizer [Guide + Free Templates]
  • 100 Best A Words for Kids (+ How to Use Them)
  • 100 Best B Words For Kids (+How to Teach Them)
  • 100 Dictation Word Ideas for Students and Kids
  • 50 Tricky Words to Pronounce and Spell (How to Teach Them)



9th-12th grade narrative writing rubric


Rubric suitable for formative and summative assignments with tasks involving writing that tells a story, either personal or literary. Use this rubric when asking students to write a personal essay or a fictional narrative.

Consider using the 9th-12th Grade Narrative QuickMark set with this rubric. These drag-and-drop comments were tailor-made by veteran educators to give actionable, formative feedback directly to students. While they were explicitly aligned to this particular rubric, you can edit or add your own content to any QuickMark.

This rubric is available and ready to use in your Feedback Studio account. However, if you would like to customize its criteria, you can "Duplicate this rubric" in your Feedback Studio account and then edit the rubric as needed. Or, you can download this .rbc file and then import to your account to begin editing the content.

Find Writing resources | TPT

Learn more about writing resources.

Writing worksheets can help your child develop essential writing and literacy skills needed for school and life. If you’re a teacher or parent looking for printable and digital writing resources to help your student learn a writing concept, look no further! TPT has an extensive collection of resources, created by other teachers, that are designed to help with any need across grade levels.

For elementary students who are just learning to write, you can use worksheets to practice letter formation. Students in middle and high school can use learning stations to learn how to write and revise essays. With plenty of TPT resources at your fingertips, you can sharpen your student's writing skills in no time. Extend writing activities beyond the classroom and observe as your child nurtures their imagination, enriches their vocabulary, and enhances their storytelling prowess.

Fun and engaging writing activities to try

Here are a few ideas for writing activities — from our teacher-created resources — that you can find on TPT and that are designed to teach students how to write effectively. (Pro tip: These worksheets serve as an excellent complement to our reading materials.)

Journaling

Encourage students to keep daily journals where they can freely express their thoughts, feelings, and experiences. This practice helps them develop their writing style and build the habit of writing regularly.

Writing Prompts

Provide engaging prompts that encourage imaginative storytelling. For instance, you could ask students to write about a world without the internet, or ask them to describe something only using one of their five senses (sight, sound, smell, touch, or taste).

Peer Editing

Have students exchange their written work with a peer for feedback. This helps them strengthen their ability to identify and correct mistakes in grammar, punctuation, and spelling; give constructive criticism; and revise their writing based on feedback.

Sentence and Paragraph Construction

Provide sentence and paragraph building exercises to help students understand the basic structure of writing and how to organize their ideas coherently.

Letter Writing

Ask students to write letters to real or fictional recipients. They could compose formal letters, persuasive letters on specific topics, thank-you notes, or postcards.

Classroom Blogging

Create a classroom blog where students can publish their writing for a wider audience. This teaches them to write for a purpose and consider their audience's perspective.

Research Papers

Guide students through the process of researching and writing informative or argumentative essays. Teach them how to construct persuasive arguments and counterarguments on various topics, include evidence, and cite sources.

Poetry Writing

Explore different forms of poetry, such as haikus, sonnets, and free verse. Encourage students to experiment with imagery, rhythm, and metaphor.

By incorporating these (and other!) writing activities into your lesson plans, you can nurture a love for writing.

Frequently asked questions about teaching writing

What types of writing resources are available on TPT?

There are many different types of writing resources sold by Sellers on TPT. Some popular writing lessons include creative writing, poetry, essay writing, expository writing, and handwriting.

How do I find writing lessons on TPT?

Educators can save time preparing writing lessons with resources created by experienced teachers. Simply start a search for writing resources on the TPT marketplace, and filter by grade level, price, and/or resource type to find materials that have been proven to work in classrooms like yours. No matter what you’re teaching, there are plenty of writing lessons and activities sold by Sellers on TPT that are tailored to meet your students' skill levels.


PrepScholar


SAT Essay Rubric: Full Analysis and Writing Strategies


We're about to dive deep into the details of that least beloved* of SAT sections, the SAT essay. Prepare for a discussion of the SAT essay rubric and how the SAT essay is graded based on that. I'll break down what each item on the rubric means and what you need to do to meet those requirements.

On the SAT, the last section you'll encounter is the (optional) essay. You have 50 minutes to read a passage, analyze the author's argument, and write an essay. If you don’t write on the assignment, plagiarize, or don't use your own original work, you'll get a 0 on your essay. Otherwise, your essay scoring is done by two graders - each one grades you on a scale of 1-4 in Reading, Analysis, and Writing, for a total essay score out of 8 in each of those three areas. But how do these graders assign your writing a numerical grade? By using an essay scoring guide, or rubric.

*may not actually be the least belovèd.

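To make the arithmetic above concrete, here is a minimal Python sketch of how the two graders' marks might combine, assuming simple addition of the 1-4 marks in each of the three dimensions as described; the `combine_scores` function name and the example scores are illustrative assumptions.

```python
# Minimal sketch of combining two graders' SAT essay marks, following the
# description above: each grader awards 1-4 per dimension, and the two marks
# are added for a 2-8 score in each dimension.

DIMENSIONS = ("Reading", "Analysis", "Writing")

def combine_scores(grader_a: dict[str, int], grader_b: dict[str, int]) -> dict[str, int]:
    """Add the two graders' 1-4 marks to get a 2-8 score per dimension."""
    combined = {}
    for dim in DIMENSIONS:
        a, b = grader_a[dim], grader_b[dim]
        if not (1 <= a <= 4 and 1 <= b <= 4):
            raise ValueError(f"Scores for {dim} must be between 1 and 4")
        combined[dim] = a + b
    return combined

print(combine_scores(
    {"Reading": 3, "Analysis": 2, "Writing": 3},
    {"Reading": 4, "Analysis": 3, "Writing": 3},
))   # -> {'Reading': 7, 'Analysis': 5, 'Writing': 6}
```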

UPDATE: SAT Essay No Longer Offered


In January 2021, the College Board announced that after June 2021, it would no longer offer the Essay portion of the SAT (except at schools who opt in during School Day Testing). It is now no longer possible to take the SAT Essay, unless your school is one of the small number who choose to offer it during SAT School Day Testing.

While most colleges had already made SAT Essay scores optional, this move by the College Board means no colleges now require the SAT Essay. It will also likely lead to additional college application changes, such as not looking at essay scores at all for the SAT or ACT, as well as potentially requiring additional writing samples for placement.

What does the end of the SAT Essay mean for your college applications? Check out our article on the College Board's SAT Essay decision for everything you need to know.

The Complete SAT Essay Grading Rubric: Item-by-Item Breakdown

Based on the College Board’s stated Reading, Analysis, and Writing criteria, I've created the below charts (for easier comparison across score points). For the purpose of going deeper into just what the SAT is looking for in your essay, I've then broken down each category further (with examples).

The information in all three charts is taken from the College Board site.

The biggest change to the SAT essay (and the thing that really distinguishes it from the ACT essay) is that you are required to read and analyze a text, then write about your analysis of the author's argument in your essay. Your "Reading" grade on the SAT essay reflects how well you were able to demonstrate your understanding of the text and the author's argument in your essay.

Score of 1 (Inadequate)

The response demonstrates little or no comprehension of the source text.

The response fails to show an understanding of the text’s central idea(s), and may include only details without reference to central idea(s).

The response may contain numerous errors of fact and/or interpretation with regard to the text.

The response makes little or no use of textual evidence (quotations, paraphrases, or both), demonstrating little or no understanding of the source text.

Score of 2 (Partial)

The response demonstrates some comprehension of the source text.

The response shows an understanding of the text’s central idea(s) but not of important details.

The response may contain errors of fact and/or interpretation with regard to the text.  

The response makes limited and/or haphazard use of textual evidence (quotations, paraphrases, or both), demonstrating some understanding of the source text.

Score of 3 (Proficient)

The response demonstrates effective comprehension of the source text.

The response shows an understanding of the text’s central idea(s) and important details.

The response is free of substantive errors of fact and interpretation with regard to the text.

The response makes appropriate use of textual evidence (quotations, paraphrases, or both), demonstrating an understanding of the source text.

Score of 4 (Advanced)

The response demonstrates thorough comprehension of the source text.

The response shows an understanding of the text’s central idea(s) and of most important details and how they interrelate, demonstrating a comprehensive understanding of the text.

The response is free of errors of fact or interpretation with regard to the text.

The response makes skillful use of textual evidence (quotations, paraphrases, or both), demonstrating a complete understanding of the source text.

You'll need to show your understanding of the text on two different levels: the surface level of getting your facts right and the deeper level of getting the relationship of the details and the central ideas right.

Surface Level: Factual Accuracy

One of the most important ways you can show you've actually read the passage is making sure you stick to what is said in the text. If you’re writing about things the author didn’t say, or things that contradict other things the author said, your argument will be fundamentally flawed.

For instance, take this quotation from a (made-up) passage about why a hot dog is not a sandwich:

“The fact that you can’t, or wouldn’t, cut a hot dog in half and eat it that way, proves that a hot dog is once and for all NOT a sandwich”

Here's an example of a factually inaccurate paraphrasing of this quotation:

The author builds his argument by discussing how, since hot-dogs are often served cut in half, this makes them different from sandwiches.

The paraphrase contradicts the passage, and so would negatively affect your reading score. Now let's look at an accurate paraphrasing of the quotation:

The author builds his argument by discussing how, since hot-dogs are never served cut in half, they are therefore different from sandwiches.

It's also important to be faithful to the text when you're using direct quotations from the passage. Misquoting or badly paraphrasing the author’s words weakens your essay, because the evidence you’re using to support your points is faulty.

Higher Level: Understanding of Central Ideas

The next step beyond being factually accurate about the passage is showing that you understand the central ideas of the text and how details of the passage relate back to this central idea.

Why does this matter? In order to be able to explain why the author is persuasive, you need to be able to explain the structure of the argument. And you can’t deconstruct the author's argument if you don’t understand the central idea of the passage and how the details relate to it.

Here's an example of a statement about our fictional "hot dogs are sandwiches" passage that shows understanding of the central idea of the passage:

Hodgman’s third primary defense of why hot dogs are not sandwiches is that a hot dog is not a subset of any other type of food. He uses the analogy of asking the question “is cereal milk a broth, sauce, or gravy?” to show that making such a comparison between hot dogs and sandwiches is patently illogical.

The above statement takes one step beyond merely being factually accurate to explain the relation between different parts of the passage (in this case, the relation between the "what is cereal milk?" analogy and the hot dog/sandwich debate).

Of course, if you want to score well in all three essay areas, you’ll need to do more in your essay than merely summarizing the author’s argument. This leads directly into the next grading area of the SAT Essay.

The items covered under this criterion are the most important when it comes to writing a strong essay. You can use well-spelled vocabulary in sentences with varied structure all you want, but if you don't analyze the author's argument, demonstrate critical thinking, and support your position, you will not get a high Analysis score.

Score of 1 (Inadequate)

The response offers little or no analysis or ineffective analysis of the source text and demonstrates little or no understanding of the analytic task.

The response identifies without explanation some aspects of the author’s use of evidence, reasoning, and/or stylistic and persuasive elements, and/or feature(s) of the student’s choosing, or numerous aspects of the response’s analysis are unwarranted based on the text.

The response contains little or no support for claim(s) or point(s) made, or support is largely irrelevant.

The response may not focus on features of the text that are relevant to addressing the task, or the response offers no discernible analysis (e.g., is largely or exclusively summary).

Score of 2 (Partial)

The response offers limited analysis of the source text and demonstrates only partial understanding of the analytical task.

The response identifies and attempts to describe the author’s use of evidence, reasoning, and/or stylistic and persuasive elements, and/or feature(s) of the student’s own choosing, but merely asserts rather than explains their importance, or one or more aspects of the response’s analysis are unwarranted based on the text.

The response contains little or no support for claim(s) or point(s) made.

The response may lack a clear focus on those features of the text that are most relevant to addressing the task.

Score of 3 (Proficient)

The response offers an effective analysis of the source text and demonstrates an understanding of the analytical task.

The response competently evaluates the author’s use of evidence, reasoning, and/or stylistic and persuasive elements, and/or feature(s) of the student’s own choosing.

The response contains relevant and sufficient support for claim(s) or point(s) made.

The response focuses primarily on those features of the text that are most relevant to addressing the task.

Score of 4 (Advanced)

The response offers an insightful analysis of the source text and demonstrates a sophisticated understanding of the analytical task.

The response offers a thorough, well-considered evaluation of the author’s use of evidence, reasoning, and/or stylistic and persuasive elements, and/or feature(s) of the student’s own choosing.

The response contains relevant, sufficient, and strategically chosen support for claim(s) or point(s) made.

The response focuses consistently on those features of the text that are most relevant to addressing the task.

Because this category is so important, I've broken it down even further into its two different (but equally important) component parts to make sure everything is as clearly explained as possible.

Part I: Critical Thinking (Logic)

Critical thinking, also known as critical reasoning, also known as logic, is the skill that SAT essay graders are really looking to see displayed in the essay. You need to be able to evaluate and analyze the claim put forward in the prompt. This is where a lot of students may get tripped up, because they think “oh, well, if I can just write a lot, then I’ll do well.” While there is some truth to the assertion that longer essays tend to score higher, if you don’t display critical thinking you won’t be able to get a top score on your essay.

What do I mean by critical thinking? Let's take the previous prompt example:

Write an essay in which you explain how Hodgman builds an argument to persuade his audience that the hot dog cannot, and never should be, considered a sandwich.

An answer to this prompt that does not display critical thinking (and would fall into a 1 or 2 on the rubric) would be something like:

The author argues that hot dogs aren’t sandwiches, which is persuasive to the reader.

While this does evaluate the prompt (by providing a statement that the author's claim "is persuasive to the reader"), there is no corresponding analysis. An answer to this prompt that displays critical thinking (and would net a higher score on the rubric) could be something like this:

The author uses analogies to hammer home his point that hot dogs are not sandwiches. Because the readers will readily believe the first part of the analogy is true, they will be more likely to accept that the second part (that hot dogs aren't sandwiches) is true as well.

See the difference? Critical thinking involves reasoning your way through a situation (analysis) as well as making a judgement (evaluation). On the SAT essay, however, you can’t just stop at abstract critical reasoning - analysis involves one more crucial step...

Part II: Examples, Reasons, and Other Evidence (Support)

The other piece of the puzzle (apparently this is a tiny puzzle) is making sure you are able to back up your point of view and critical thinking with concrete evidence. The SAT essay rubric says that the best (that is, 4-scoring) essay uses “relevant, sufficient, and strategically chosen support for claim(s) or point(s) made.” This means you can’t just stick to abstract reasoning.

Abstract reasoning is a good starting point, but if you don't back up your point of view with quoted or paraphrased information from the text to support your discussion of the way the author builds his/her argument, you will not be able to get above a 3 on the Analysis portion of the essay (and possibly the Reading portion as well, if you don't show you've read the passage). Let's take a look at an example of how you might support an interpretation of the author's effect on the reader using facts from the passage:

The author’s reference to the Biblical story about King Solomon elevates the debate about hot dogs from a petty squabble between friends to a life-or-death disagreement. The reader cannot help but see the parallels between the two situations and thus find themselves agreeing with the author on this point.

Does the author's reference to King Solomon actually "elevate the debate," causing the reader to agree with the author? From the sentences above, it certainly seems plausible that it might. While your facts do need to be correct,  you get a little more leeway with your interpretations of how the author’s persuasive techniques might affect the audience. As long as you can make a convincing argument for the effect a technique the author uses might have on the reader, you’ll be good.


Did I just blow your mind? Read more about the secrets the SAT doesn’t want you to know in this article.

Your Writing score on the SAT essay is not just a reflection of your grasp of the conventions of written English (although it is that as well). You'll also need to be focused, organized, and precise.

Score of 1 (Inadequate)

The response demonstrates little or no cohesion and inadequate skill in the use and control of language.

The response may lack a clear central claim or controlling idea.

The response lacks a recognizable introduction and conclusion. The response does not have a discernible progression of ideas.

The response lacks variety in sentence structures; sentence structures may be repetitive. The response demonstrates general and vague word choice; word choice may be poor or inaccurate. The response may lack a formal style and objective tone.

The response shows a weak control of the conventions of standard written English and may contain numerous errors that undermine the quality of writing.

Score of 2 (Partial)

The response demonstrates little or no cohesion and limited skill in the use and control of language.

The response may lack a clear central claim or controlling idea or may deviate from the claim or idea over the course of the response.

The response may include an ineffective introduction and/or conclusion. The response may demonstrate some progression of ideas within paragraphs but not throughout the response.

The response has limited variety in sentence structures; sentence structures may be repetitive.

The response demonstrates general or vague word choice; word choice may be repetitive. The response may deviate noticeably from a formal style and objective tone.

The response shows a limited control of the conventions of standard written English and contains errors that detract from the quality of writing and may impede understanding.

Score of 3 (Proficient)

The response is mostly cohesive and demonstrates effective use and control of language.

The response includes a central claim or implicit controlling idea.

The response includes an effective introduction and conclusion. The response demonstrates a clear progression of ideas both within paragraphs and throughout the essay.

The response has variety in sentence structures. The response demonstrates some precise word choice. The response maintains a formal style and objective tone.

The response shows a good control of the conventions of standard written English and is free of significant errors that detract from the quality of writing.

Score of 4 (Advanced)

The response is cohesive and demonstrates a highly effective use and command of language.

The response includes a precise central claim.

The response includes a skillful introduction and conclusion. The response demonstrates a deliberate and highly effective progression of ideas both within paragraphs and throughout the essay.

The response has a wide variety in sentence structures. The response demonstrates a consistent use of precise word choice. The response maintains a formal style and objective tone.

The response shows a strong command of the conventions of standard written English and is free or virtually free of errors.

Because there are a lot of different factors that go into calculating your Writing score, I've divided the discussion of this rubric area into five separate items:

  • Precise central claim
  • Organization
  • Vocabulary and word choice
  • Sentence structure
  • Grammar, punctuation, and spelling

One of the most basic rules of the SAT essay is that you need to express a clear opinion on the "assignment" (the prompt). While in school (and everywhere else in life, pretty much) you’re encouraged to take into account all sides of a topic, it behooves you to NOT do this on the SAT essay. Why? Because you only have 50 minutes to read the passage, analyze the author's argument, and write the essay, there's no way you can discuss every single way in which the author builds his/her argument, every single detail of the passage, or a nuanced argument about what works and what doesn't work.

Instead, I recommend focusing your discussion on a few key ways the author is successful in persuading his/her audience of his/her claim.

Let’s go back to the assignment we've been using as an example throughout this article:

"Write an essay in which you explain how Hodgman builds an argument to persuade his audience that the hot dog cannot, and never should be, considered a sandwich."

Your instinct (trained from many years of schooling) might be to answer:

"There are a variety of ways in which the author builds his argument."

This is a nice, vague statement that leaves you a lot of wiggle room. If you disagree with the author, it's also a way of avoiding having to say that the author is persuasive. Don't fall into this trap! You do not necessarily have to agree with the author's claim in order to analyze how the author persuades his/her readers that the claim is true.

Here's an example of a precise central claim about the example assignment:

The author effectively builds his argument that hot dogs are not sandwiches by using logic, allusions to history and mythology, and factual evidence.

In contrast to the vague claim that "There are a variety of ways in which the author builds his argument," this thesis both specifies what the author's argument is and the ways in which he builds the argument (that you'll be discussing in the essay).

While it's extremely important to make sure your essay has a clear point of view, strong critical reasoning, and support for your position, that's not enough to get you a top score. You need to make sure that your essay  "demonstrates a deliberate and highly effective progression of ideas both within paragraphs and throughout the essay."

What does this mean? Part of the way you can make sure your essay is "well organized" has to do with following standard essay construction points. Don't write your essay in one huge paragraph; instead, include an introduction (with your thesis stating your point of view), body paragraphs (one for each example, usually), and a conclusion. This structure might seem boring, but it really works to keep your essay organized, and the more clearly organized your essay is, the easier it will be for the essay grader to understand your critical reasoning.

The second part of this criterion has to do with keeping your essay focused, making sure it contains "a deliberate and highly effective progression of ideas." You can't just say "well, I have an introduction, body paragraphs, and a conclusion, so I guess my essay is organized" and expect to get a 4/4 on your essay. You need to make sure that each paragraph is also organized. Recall the sample prompt:

“Write an essay in which you explain how Hodgman builds an argument to persuade his audience that the hot dog cannot, and never should be, considered a sandwich.”

And recall our hypothetical thesis: the author effectively builds his argument that hot dogs are not sandwiches by using logic, allusions to history and mythology, and factual evidence.

Let's say that you're writing the paragraph about the author's use of logic to persuade his reader that hot dogs aren't sandwiches. You should NOT just list ways that the author is logical in support of his claim, then explain why logic in general is an effective persuasive device. While your points might all be valid, your essay would be better served by connecting each instance of logic in the passage with an explanation of how that example of logic persuades the reader to agree with the author.

Above all, it is imperative that you make your thesis (your central claim) clear in the opening paragraph of your essay - this helps the grader keep track of your argument. There's no reason you’d want to make following your reasoning more difficult for the person grading your essay (unless you’re cranky and don’t want to do well on the essay. Listen, I don’t want to tell you how to live your life).

In your essay, you should use a wide array of vocabulary (and use it correctly). An essay that scores a 4 in Writing on the grading rubric “demonstrates a consistent use of precise word choice.”

You’re allowed a few errors, even on a 4-scoring essay, so you can sometimes get away with misusing a word or two. In general, though, it’s best to stick to using words you are certain you not only know the meaning of, but also know how to use. If you’ve been studying up on vocab, make sure you practice using the words you’ve learned in sentences, and have those sentences checked by someone who is good at writing (in English), before you use those words in an SAT essay.

Creating elegant, non-awkward sentences is the thing I struggle most with under time pressure. For instance, here’s my first try at the previous sentence: “Making sure a sentence structure makes sense is the thing that I have the most problems with when I’m writing in a short amount of time” (hahaha NOPE - way too convoluted and wordy, self). As another example, take a look at these two excerpts from the hypothetical essay discussing how the author persuaded his readers that a hot dog is not a sandwich:

Score of 2: "The author makes his point by critiquing the argument against him. The author pointed out the logical fallacy of saying a hot dog was a sandwich because it was meat "sandwiched" between two breads. The author thus persuades the reader his point makes sense to be agreed with and convinces them."

The above sentences lack variety in structure (they all begin with the words "the author"), and the last sentence has serious flaws in its structure (it makes no sense).

Score of 4: "The author's rigorous examination of his opponent's position invites the reader, too, to consider this issue seriously. By laying out his reasoning, step by step, Hodgman makes it easy for the reader to follow along with his train of thought and arrive at the same destination that he has. This destination is Hodgman's claim that a hot dog is not a sandwich."

The above sentences demonstrate variety in sentence structure (they don't all begin with the same word and don't have the same underlying structure) that presumably forward the point of the essay.

In general, if you're doing well in all the other Writing areas, your sentence structures will also naturally vary. If you're really worried that your sentences are not varied enough, however, my advice for working on "demonstrating meaningful variety in sentence structure" (without ending up with terribly worded sentences) is twofold:

  • Read over what you’ve written before you hand it in and change any wordings that seem awkward, clunky, or just plain incorrect.
  • As you’re doing practice essays, have a friend, family member, or teacher who is good at (English) writing look over your essays and point out any issues that arise. 

This part of the Writing grade is all about the nitty gritty details of writing: grammar, punctuation, and spelling . It's rare that an essay with serious flaws in this area can score a 4/4 in Reading, Analysis, or Writing, because such persistent errors often "interfere with meaning" (that is, persistent errors make it difficult for the grader to understand what you're trying to get across).

On the other hand, if they occur in small quantities, grammar/punctuation/spelling errors are also the things that are most likely to be overlooked. If two essays are otherwise of equal quality, but one writer misspells "definitely" as "definately" and the other writer fails to explain how one of her examples supports her thesis, the first writer will receive a higher essay score. It's only when poor grammar, use of punctuation, and spelling start to make it difficult to understand your essay that the graders start penalizing you.

My advice for working on this rubric area is the same advice as for sentence structure: look over what you’ve written to double check for mistakes, and ask someone who’s good at writing to look over your practice essays and point out your errors. If you're really struggling with spelling, simply typing up your (handwritten) essay into a program like Microsoft Word and running spellcheck can alert you to problems. We've also got a great set of articles up on our blog about SAT Writing questions that may help you better understand any grammatical errors you are making.

How Do I Use The SAT Essay Grading Rubric?

Now that you understand the SAT essay rubric, how can you use it in your SAT prep? There are a couple of different ways.

Use The SAT Essay Rubric To...Shape Your Essays

Since you know what the SAT is looking for in an essay, you can now use that knowledge to guide what you write about in your essays!

A tale from my youth: when I was preparing to take the SAT for the first time, I did not really know what the essay was looking for, and assumed that since I was a good writer, I’d be fine.

Not true! The most important part of the SAT essay is using specific examples from the passage and explaining how they convince the reader of the author's point. By reading this article and realizing there's more to the essay than "being a strong writer," you’re already doing better than high school me.

body_readsleeping

Change the object in that girl’s left hand from a mirror to a textbook and you have a pretty good sketch of what my junior year of high school looked like.

Use The SAT Essay Rubric To...Grade Your Practice Essays

The SAT can’t exactly give you an answer key to the essay. Even when an example of an essay that scored a particular score is provided, that essay will probably use different examples than you did, make different arguments, maybe even argue different interpretations of the text...making it difficult to compare the two. The SAT essay rubric is the next best thing to an answer key for the essay - use it as a lens through which to view and assess your essay.

Of course, you don’t have the time to become an expert SAT essay grader - that’s not your job. You just have to apply the rubric as best as you can to your essays and work on fixing your weak areas . For the sentence structure, grammar, usage, and mechanics stuff I highly recommend asking a friend, teacher, or family member who is really good at (English) writing to take a look over your practice essays and point out the mistakes.

If you really want custom feedback on your practice essays from experienced essay graders, may I also suggest the PrepScholar test prep platform? I manage the essay grading and so happen to know quite a bit about the essay part of this platform, which gives you both an essay grade and custom feedback for each essay you complete. Learn more about how it all works here.

What’s Next?

Are you so excited by this article that you want to read even more articles on the SAT essay? Of course you are. Don't worry, I’ve got you covered. Learn how to write an SAT essay step-by-step and read about the 6 types of SAT essay prompts.

Want to go even more in depth with the SAT essay? We have a complete list of past SAT essay prompts as well as tips and strategies for how to get a 12 on the SAT essay.

Still not satisfied? Maybe a five-day free trial of our very own PrepScholar test prep platform (which includes essay practice and feedback) is just what you need.

Trying to figure out whether the old or new SAT essay is better for you? Take a look at our article on the new SAT essay assignment to find out!

Want to improve your SAT score by 160 points? Check out our best-in-class online SAT prep classes. We guarantee your money back if you don't improve your SAT score by 160 points or more. Our classes are entirely online, and they're taught by SAT experts. If you liked this article, you'll love our classes. Along with expert-led classes, you'll get personalized homework with thousands of practice problems organized by individual skills so you learn most effectively. We'll also give you a step-by-step, custom program to follow so you'll never be confused about what to study next. Try it risk-free today.

Laura graduated magna cum laude from Wellesley College with a BA in Music and Psychology, and earned a Master's degree in Composition from the Longy School of Music of Bard College. She scored 99 percentile scores on the SAT and GRE and loves advising students on how to excel in high school.

Ask a Question Below

Have any questions about this article or other topics? Ask below and we'll reply!


IELTS Band Scores

When you get your IELTS test results you will be given an overall test score of between 0 and 9 on the IELTS scale. You will also get a score for each of the four sections of the test – Listening, Reading, Writing, and Speaking. The IELTS scale helps organisations understand your level of English.

IELTS Band Score:

9 Skill Level: Expert

  • The test taker has fully operational command of the language.
  • Their use of English is appropriate, accurate and fluent, and shows complete understanding.

8 Skill Level: Very good

  • The test taker has fully operational command of the language with only occasional unsystematic inaccuracies and inappropriate usage.
  • They may misunderstand some things in unfamiliar situations. They handle complex and detailed argumentation well.

7 Skill Level: Good

  • The test taker has operational command of the language, though with occasional inaccuracies, inappropriate usage and misunderstandings in some situations.
  • They generally handle complex language well and understand detailed reasoning.

6 Skill Level: Competent

  • The test taker has an effective command of the language despite some inaccuracies, inappropriate usage and misunderstandings.
  • They can use and understand reasonably complex language, particularly in familiar situations.

5 Skill Level: Modest

  • The test taker has a partial command of the language and copes with overall meaning in most situations, although they are likely to make many mistakes.
  • They should be able to handle basic communication in their own field.

4 Skill Level: Limited

  • The test taker's basic competence is limited to familiar situations.
  • They frequently show problems in understanding and expression.

3 Skill Level: Extremely limited

  • The test taker conveys and understands only general meaning in very familiar situations.
  • There are frequent breakdowns in communication.

2 Skill Level: Intermittent

  • The test taker has great difficulty understanding spoken and written English.

1 Skill Level: Non-user

  • The test taker has no ability to use the language except a few isolated words.

0 Skill Level: Did not attempt the test

  • The test taker did not answer the questions.


Calculating your overall score

For each section of the test – Listening, Reading, Writing, and Speaking – you will be awarded a score on the IELTS scale. Your overall score is the average of your section scores, rounded to the closest half-point on the scale.
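As a concrete illustration, here is a minimal Python sketch of that calculation: it averages the four section scores and snaps the result to the nearest half band, rounding halves upward. Treat it as an approximation of the description above rather than the authoritative IELTS procedure, since IELTS publishes its own rules for borderline averages.

```python
import math

def overall_band(listening: float, reading: float, writing: float, speaking: float) -> float:
    """Average the four section scores and round to the nearest half band.

    Rounds halves upward (e.g. an average of 6.25 is reported as 6.5), which matches
    the description above; check official IELTS guidance for edge cases.
    """
    average = (listening + reading + writing + speaking) / 4
    return math.floor(average * 2 + 0.5) / 2  # snap to 0.5-band resolution

# Example: 6.5, 6.5, 5.0 and 7.0 average to 6.25, reported as an overall band of 6.5
print(overall_band(6.5, 6.5, 5.0, 7.0))
```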

How long is my IELTS score valid?

We recommend that IELTS test results are considered valid for two years after a test has been taken. This is based on research into second-language loss. Organisations that accept IELTS can choose to accept results for a longer period, so you should check with them directly.

What score do I need for UKVI visa applications?

Find out what CEFR (Common European Framework of Reference for Languages) level you need for UKVI and the IELTS band score equivalent.


Human-AI Collaborative Essay Scoring: A Dual-Process Framework with LLMs

Receiving timely and personalized feedback is essential for second-language learners, especially when human instructors are unavailable. This study explores the effectiveness of Large Language Models (LLMs), including both proprietary and open-source models, for Automated Essay Scoring (AES). Through extensive experiments with public and private datasets, we find that while LLMs do not surpass conventional state-of-the-art (SOTA) grading models in performance, they exhibit notable consistency, generalizability, and explainability. We propose an open-source LLM-based AES system, inspired by the dual-process theory. Our system offers accurate grading and high-quality feedback, at least comparable to that of fine-tuned proprietary LLMs, in addition to its ability to alleviate misgrading. Furthermore, we conduct human-AI co-grading experiments with both novice and expert graders. We find that our system not only automates the grading process but also enhances the performance and efficiency of human graders, particularly for essays where the model has lower confidence. These results highlight the potential of LLMs to facilitate effective human-AI collaboration in the educational context, potentially transforming learning experiences through AI-generated feedback.

Changrong Xiao¹, Wenxing Ma¹, Qingping Song², Sean Xin Xu¹, Kunpeng Zhang³, Yufang Wang⁴, Qi Fu⁴
¹School of Economics and Management, Tsinghua University; ²Department of Information Systems, City University of Hong Kong; ³Department of Decision, Operations & Information Technologies, University of Maryland; ⁴Beijing Xicheng Educational Research Institute

1 Introduction


Writing practice is an essential component of second-language learning, but the provision of timely and reliable feedback poses a considerable challenge for educators in China due to the high student-teacher ratio. This limitation hampers students’ academic progress, especially for those who are keen on self-directed learning. Automated Essay Scoring (AES) systems provide valuable assistance by offering students immediate and consistent feedback on their work while also simplifying the grading process for educators.

However, implementing AES systems effectively in real-world educational scenarios presents several challenges. First, the diverse range of exercise contexts and the inherent ambiguity in scoring rubrics complicate the ability of traditional models to deliver accurate scores. Second, interviews with high school teachers indicate that despite receiving accurate score predictions, they must still review essays to mitigate potential errors from the models. Consequently, relying exclusively on this system without human supervision is impractical in real-world scenarios. Thus, there is a clear need for AES systems that not only predict scores accurately but also facilitate effective human-AI collaboration. This should be supported by natural language explanations and additional assistive features to enhance usability.

To effectively tackle these challenges, it is crucial to highlight the latest advancements in the field of Natural Language Processing (NLP), particularly focusing on the development of large language models (LLMs). LLMs, such as OpenAI’s ChatGPT (https://chat.openai.com), not only showcase impressive capabilities of robust logical reasoning but also exhibit a remarkable ability to comprehend and faithfully follow human instructions (Ouyang et al., 2022). Furthermore, recent studies have highlighted the potential of leveraging LLMs in AES tasks (Mizumoto and Eguchi, 2023; Yancey et al., 2023; Naismith et al., 2023).

In this study, we explore the potential of proprietary and open-source LLMs such as GPT-3.5, GPT-4, and LLaMA3 for AES tasks. We conducted extensive experiments with public essay-scoring datasets as well as a private collection of student essays to assess the zero-shot and few-shot performance of these models. Additionally, we enhanced their effectiveness through supervised fine-tuning (SFT). Drawing inspiration from dual-process theory, we developed an AES system based on LLaMA3 that matches the grading accuracy and feedback quality of fine-tuned proprietary LLMs. Our human-LLM co-grading experiment further revealed that this system significantly improves the performance and efficiency of both novice and expert graders, offering valuable insights into the educational impacts and potential for effective human-AI collaboration. Overall, our study contributes three major advancements to the field:

We pioneer the exploration of LLMs’ capabilities as AES systems, especially in complex scenarios featuring tailored grading criteria. Leveraging dual-process theory, our novel AES framework demonstrates remarkable accuracy, efficiency, and explainability.

We introduce an extensive essay-scoring dataset, which includes 13,372 essays written by Chinese high school students and evaluated with multi-dimensional scores by expert educators. This dataset significantly enhances the resources available for AI in Education (AIEd). Code and resources can be found at https://github.com/Xiaochr/LLM-AES.

Our findings from the human-LLM co-grading task highlight the potential of LLM-generated feedback to elevate the proficiency of individuals with limited domain expertise to a level akin to that of experts. Additionally, it enhances the efficiency and robustness of human graders by integrating model confidence scores and explanations. These insights set the stage for future investigation into human-AI collaboration and AI-assisted learning within educational contexts.

2 Related Work

2.1 Automated Essay Scoring (AES): Traditional Methods

Automated Essay Scoring (AES) stands as a pivotal research area at the intersection of NLP and education. Traditional AES methods are usually regression-based or classification-based machine learning models (Sultan et al., 2016 ; Mathias and Bhattacharyya, 2018b , a ; Salim et al., 2019 ) trained with textual features extracted from the target essays. With the advancement of deep learning, AES has witnessed the integration of advanced techniques such as convolutional neural networks (CNNs) (Dong and Zhang, 2016 ) , long short-term memory networks (LSTMs) (Taghipour and Ng, 2016 ) , and also pre-trained language models (Rodriguez et al., 2019 ; Lun et al., 2020 ) . These innovations have led to more precise score predictions, and state-of-the-art methods are primarily based on Bidirectional Encoder Representations from Transformers (BERT) (Yang et al., 2020 ; Wang et al., 2022 ; Boquio and Naval, 2024 ) .

LLM Applications in AES

Recent studies have explored the potential of leveraging the capabilities of modern LLMs in AES tasks. Mizumoto and Eguchi (2023) provided ChatGPT with specific IELTS scoring rubrics for essay evaluation but found limited improvements when incorporating GPT scores into the regression model. In a different approach, Yancey et al. (2023) used GPT-4’s few-shot capabilities to predict Common European Framework of Reference for Languages (CEFR) levels for short essays written by second-language learners. However, the Quadratic Weighted Kappa (QWK) scores still did not surpass those achieved by the XGBoost baseline model or human annotators. Similarly, Han et al. (2023) and Stahl et al. (2024) introduced prompting frameworks that did not outperform the conventional baselines.

2.2 AI-Assisted Decision Making

Researchers have extensively investigated human-AI teams, in which AI supports the decision-making process by providing recommendations or suggestions, while the human remains responsible for the final decision (van den Bosch et al., 2019 ) . The objective of such human-AI collaboration is to achieve complementary performance, where the combined team performance exceeds that of either party operating independently (Bansal et al., 2021 ) . To realize this, it is crucial to design an AI-assisted decision-making process that allows humans to effectively monitor and counteract any unpredictable or undesirable behavior exhibited by AI models (Eigner and Händler, 2024 ) . This design aims to leverage the strengths of both humans and AI to enhance overall performance (Holstein and Aleven, 2022 ) . To our knowledge, no studies have yet investigated AES systems from this angle of collaborative co-grading.

2.3 Dual-Process Theory

Recent studies have developed architectures that imitate human cognitive processes to enhance the capabilities of LLMs, particularly in reasoning and planning (Benfeghoul et al., 2024 ) . According to dual-process theory in psychology (Wason and Evans, 1974 ; Kahneman, 2011 ) , human cognition operates via two distinct systems: System 1 involves rapid, intuitive "Fast Thinking", while System 2 entails conscious and deliberate "Slow Thinking" processes. LLM architectures inspired by this theory have been implemented in complex interactive tasks (Lin et al., 2024 ; Tian et al., 2023 ) , aiming to mitigate issues like social biases (Kamruzzaman and Kim, 2024 ) and hallucination (Bellini-Leite, 2023 ) . These adaptations have demonstrated improved performances in various areas.

3 Datasets

ASAP Dataset

Our Chinese Student English Essay (CSEE) Dataset


In this section, we present the details of the models used in this study, including traditional AES baselines, LLM-based approaches, and our proposed Fast and Slow Thinking AES framework.

4.1 Traditional Essay Scoring Baselines

BERT Classifier

Similar to the models used by Yang et al. (2020) and Han et al. (2023), we implemented a simple yet effective BERT-based baseline for score prediction. This model integrated a fully connected prediction layer following the BERT output, and the BERT parameters remained unfrozen during training. Both the BERT encoder and the prediction layer were jointly trained on the training essay set (details in Appendix B).

SOTA Baselines

We also incorporate models such as R²BERT (Yang et al., 2020) and Tran-BERT-MS-ML-R (Wang et al., 2022), which represent the highest levels of performance on the ASAP AES task. These models serve as the high-level benchmarks against which we evaluate the performance of our LLM-based models.

4.2 Prompting LLMs

We considered various prompting strategies, including prompts with and without detailed rubrics, in both zero-shot and few-shot settings. An illustrative example of a prompt and its corresponding model-generated output can be found in Table 9 in the Appendices.

GPT-4, zero-shot, without rubrics

In this setting, we simply provide the prompt and the target essay to GPT-4. The model then evaluates the essay and assigns a score based on its comprehension within the specified score range.

GPT-4, zero-shot, with rubrics

Alongside the prompt and the target essay, we also provide GPT-4 with explicit scoring rubrics, guiding its evaluation.

GPT-4, few-shot, with rubrics

In addition to the zero-shot settings, the few-shot prompts include sample essays and their corresponding scores. This assists GPT-4 in understanding the latent scoring patterns. Given the prompt, target essay, scoring rubrics, and a set of k essay examples, GPT-4 provides an appropriate score reflecting this enriched context. See Appendix C for details.

In all these configurations, we adopted the Chain-of-Thought (CoT) (Wei et al., 2022 ) strategy. This approach instructed the LLM to analyze and explain the provided materials before making final score determinations. Studies (Lampinen et al., 2022 ; Zhou et al., 2023 ; Li et al., 2023 ) have shown that this structured approach significantly enhances the capabilities of the LLM, optimizing performance in tasks that require inference and reasoning.
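For readers who want to see what such a setup looks like in code, the sketch below assembles a rubric-grounded, few-shot prompt with a Chain-of-Thought instruction and sends it to GPT-4 through the OpenAI Python client. The prompt wording and the helper names (`build_prompt`, `score_essay`) are illustrative assumptions, not the authors' exact prompts; their actual templates are shown in Table 9 in the appendices.

```python
from openai import OpenAI  # assumes openai>=1.0 and OPENAI_API_KEY set in the environment

client = OpenAI()

def build_prompt(rubrics: str, examples: list[tuple[str, int]], essay: str) -> str:
    """Assemble a rubric-grounded, few-shot prompt with a chain-of-thought instruction."""
    shots = "\n\n".join(
        f"Example essay:\n{text}\nScore: {score}" for text, score in examples
    )
    return (
        "You are an experienced essay grader.\n\n"
        f"Scoring rubrics:\n{rubrics}\n\n"
        f"Graded examples:\n{shots}\n\n"
        f"Essay to evaluate:\n{essay}\n\n"
        # Chain-of-Thought: ask for the analysis before the final score.
        "First analyse the essay against each rubric dimension, then give the final "
        "score on the last line in the form 'Score: <integer>'."
    )

def score_essay(rubrics: str, examples: list[tuple[str, int]], essay: str) -> str:
    """Send the assembled prompt to GPT-4 and return the raw model response."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": build_prompt(rubrics, examples, essay)}],
        temperature=0,  # the paper reports results with temperature set to 0
    )
    return response.choices[0].message.content
```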

4.3 Fine-tuning LLMs

We conducted additional investigations into the effectiveness of supervised fine-tuning methods. Given that the ASAP and our CSEE dataset only include scores without expert explanations, we augmented these original datasets with explanations generated by GPT-4. To guide the explanation generation process, we provided GPT-4 with a few expert-curated explanations and a structured template. By organizing the data into an instructional format, we created fine-tuning inputs that enable the LLMs to not only generate accurate scores but also provide high-quality feedback.

We first fine-tuned OpenAI’s GPT-3.5-turbo, one of the best-performing LLMs. However, due to the proprietary nature of GPT-3.5 and considerations such as data privacy, training and inference costs, and flexibility in fine-tuning, we also fine-tuned an LLaMA3-8B model (https://llama.meta.com/llama3/) with both the original and augmented datasets. This recent open-source model mitigates these concerns and has remarkable capabilities, making it a more practical choice for use in educational scenarios.

4.4 Our Proposed Method

As previously mentioned, score prediction and explanation generation are distinct but interrelated tasks within the context of AES. Explanation generation, which covers the evaluation of content, language, and structure, necessitates deliberate and meticulous reasoning. On the other hand, score prediction can either be a swift process based on intuition and experiences or concluded after step-by-step inference. These features align with the idea of dual-process theory. Consequently, we have designed an integrated system that includes separate modules for each task: the Fast Module for quick score prediction and the Slow Module for detailed explanation generation. The framework of our proposed AES system is shown in Figure 2 .

Slow Module: Fine-tuned LLM

The Slow Module forms the core of our AES system, capable of analyzing essays in depth, providing evidence based on specific rubrics, and deriving appropriate scores. This detailed process is time-intensive but yields valuable natural language reasoning that informs the final grading decision. In this study, we implemented the fine-tuned LLaMA3-8B as the Slow Module. It is worth noting that this module is interchangeable with any other qualified LLM, demonstrating the flexibility of our framework.

Fast Module: Lightweight Classifier

In many cases, swift score prediction is preferable to detailed reasoning. To reduce the time and computational cost associated with generating detailed explanations, we introduced a simple fully connected layer as a bypass before the initiation of text generation by the Slow Module. By using only the embeddings of the input text, the Fast Module not only conserves resources but also leverages the latent knowledge acquired during the fine-tuning of the Slow Module, aligning with the ’intuitive’ facet of Fast Thinking.

Deciding when to switch from the Fast to the Slow Module is one of the challenges in designing dual-process LLM systems. Previous frameworks employed heuristic rules or error feedback as the triggering criteria (Lin et al., 2024; Tian et al., 2023), which might be impractical in real-world cases. Our Fast Module also calculates the probabilities of each possible output score, which we standardize and treat as confidence scores. Predictions with low confidence are considered unreliable, triggering the Slow Module for self-reflection or passing the essay to external judges (either human or AI). This design aims to enhance essay scoring accuracy and foster effective human-AI collaboration, potentially elevating the complementary team performance.

For training, we first fine-tune the Slow Module using our explanation-augmented dataset. Subsequently, we employ the Slow Module to derive input embeddings, which, paired with the rated scores, are used to train the Fast Classifier from scratch. During inference, essay inputs initially pass through the fine-tuned LLM and are transformed into the embedding format. They are then processed by the Fast Module to quickly derive scores. The Slow Module is activated only when prediction confidence is low or based on specific additional requirements.
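The routing logic described above can be summarized in a short PyTorch-style sketch. The class and function names are schematic stand-ins for the authors' released implementation, and the 0.2 confidence threshold is the value reported later in Section 5.2.

```python
import torch
import torch.nn.functional as F

CONFIDENCE_THRESHOLD = 0.2  # below this, defer to the Slow Module (or a human grader)

class FastModule(torch.nn.Module):
    """Lightweight classifier over embeddings taken from the fine-tuned LLM."""
    def __init__(self, hidden_size: int, num_scores: int):
        super().__init__()
        self.head = torch.nn.Linear(hidden_size, num_scores)

    def forward(self, essay_embedding: torch.Tensor) -> torch.Tensor:
        return self.head(essay_embedding)  # logits over the possible scores

def grade(essay_embedding, fast_module, slow_module_fn):
    """Route an essay: quick score if confident, detailed LLM reasoning otherwise."""
    logits = fast_module(essay_embedding)
    probs = F.softmax(logits, dim=-1)
    confidence, predicted_score = probs.max(dim=-1)
    if confidence.item() >= CONFIDENCE_THRESHOLD:
        return {"score": predicted_score.item(),
                "confidence": confidence.item(),
                "explanation": None}
    # Low confidence: fall back to the fine-tuned LLM for full reasoning and a score.
    return slow_module_fn(essay_embedding)
```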

5 Experimental Results

Table 1: QWK scores on the eight ASAP subsets.

| Method | Set 1 | Set 2 | Set 3 | Set 4 | Set 5 | Set 6 | Set 7 | Set 8 | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| BERT Classifier | 0.6486 | 0.6284 | 0.7327 | 0.7669 | 0.7432 | 0.6810 | 0.7165 | 0.4624 | 0.6725 |
| Tran-BERT-MS-ML-R | 0.8340 | 0.7160 | 0.7140 | 0.8120 | 0.8130 | 0.8360 | 0.8390 | 0.7660 | 0.7910 |
| R²BERT | 0.8170 | 0.7190 | 0.6980 | 0.8450 | 0.8410 | 0.8470 | 0.8390 | 0.7440 | 0.7940 |
| GPT-4, zero-shot, w/o rubrics | 0.0423 | 0.4017 | 0.2805 | 0.5571 | 0.3659 | 0.5021 | 0.0809 | 0.4188 | 0.3312 |
| GPT-4, zero-shot, with rubrics | 0.0715 | 0.3003 | 0.3661 | 0.6266 | 0.5227 | 0.3448 | 0.1101 | 0.4072 | 0.3437 |
| GPT-4, few-shot, with rubrics | 0.2801 | 0.3376 | 0.3308 | 0.7839 | 0.6226 | 0.7284 | 0.2570 | 0.4541 | 0.4743 |
| Fine-tuned GPT-3.5 | 0.7406 | 0.6183 | 0.7041 | 0.8593 | 0.7959 | 0.8480 | 0.7271 | 0.6135 | 0.7384 |
| Fine-tuned LLaMA3 | 0.7137 | 0.6696 | 0.6558 | 0.7712 | 0.7452 | 0.7489 | 0.6938 | 0.2952 | 0.6617 |
| Ours | 0.7612 | 0.6517 | 0.7238 | 0.8093 | 0.8118 | 0.7764 | 0.7071 | 0.4885 | 0.7162 |
| Ours: Fast module | 0.7580 | 0.6395 | 0.7228 | 0.7995 | 0.8023 | 0.7753 | 0.7157 | 0.5075 | 0.7151 |
| Ours: Slow module | 0.6048 | 0.5621 | 0.5700 | 0.6992 | 0.6774 | 0.5943 | 0.5772 | 0.2677 | 0.5691 |

Table 2: QWK scores on our CSEE dataset.

| Method | Overall | Content | Language | Structure |
|---|---|---|---|---|
| BERT Classifier | 0.7674 | 0.7312 | 0.7203 | 0.6650 |
| GPT-4, zero-shot, w/o rubrics | 0.4688 | 0.4412 | 0.3081 | 0.5757 |
| GPT-4, zero-shot, with rubrics | 0.5344 | 0.5391 | 0.4660 | 0.4256 |
| GPT-4, few-shot, with rubrics | 0.6729 | 0.6484 | 0.6278 | 0.4661 |
| Fine-tuned GPT-3.5 | 0.7532 | 0.7241 | 0.7513 | 0.6576 |
| Fine-tuned LLaMA3 | 0.7544 | 0.7321 | 0.7084 | 0.6461 |
| Ours | 0.7634 | 0.7347 | 0.7192 | 0.6656 |
| Ours: Fast module | 0.7364 | 0.7272 | 0.7072 | 0.6627 |
| Ours: Slow module | 0.7310 | 0.6810 | 0.6990 | 0.6412 |

5.1 Performance of LLM-based Methods

We conducted experiments across all eight subsets of the ASAP dataset using both the LLM-based methods and baseline approaches. We adopted Cohen’s Quadratic Weighted Kappa (QWK) as our primary evaluation metric, which is the most widely recognized automatic metric in AES tasks (Ramesh and Sanampudi, 2022 ) . A higher QWK value indicates a greater degree of agreement between the predicted score and the ground truth. For methods requiring a training dataset, we divided the data for each subset using an 80:20 split ratio between training and testing.
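QWK is straightforward to compute with scikit-learn's `cohen_kappa_score` by passing quadratic weights; the scores in the snippet below are made up purely for illustration.

```python
from sklearn.metrics import cohen_kappa_score

human_scores = [8, 6, 7, 9, 5, 6]   # ground-truth ratings (illustrative)
model_scores = [8, 7, 7, 8, 5, 6]   # predicted ratings (illustrative)

# Quadratic weights penalize large disagreements more heavily than small ones.
qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(f"QWK = {qwk:.4f}")  # 1.0 means perfect agreement, 0 means chance-level agreement
```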

Our extensive experiments, as detailed in Table 1 , revealed that despite using carefully curated prompts and providing detailed context, the zero-shot and few-shot capabilities of GPT-4 did not yield high QWK scores on the ASAP dataset. In zero-shot scenarios, GPT-4’s performance was notably low, with some subsets scoring nearly as poorly as random guessing. For instance, Set 1 recorded a QWK of 0.0423 and Set 7 a QWK of 0.0809. This underperformance may be due to the broad scoring ranges and complex rubrics in ASAP, suggesting that even advanced LLMs like GPT-4 may struggle to fully comprehend and adhere to complicated human instructions. In few-shot settings, although there was an improvement in scoring performance, particularly for Sets 4-6, GPT-4 still significantly lagged behind SOTA grading methods. This is consistent with findings from recent studies that utilize LLMs for essay scoring.

When fine-tuned with the training dataset, the LLMs demonstrated significantly improved performance compared to the zero-shot and few-shot results, with QWK scores generally exceeding 0.7. However, these fine-tuned LLMs still did not surpass traditional SOTA methods. Within our framework, the performance of the fine-tuned open-source LLaMA3-8B was comparable to that of fine-tuned proprietary models. Even simple supervised fine-tuning (SFT) of LLaMA3 achieved notable results, suggesting that open-source LLMs might be a cost-effective choice for AES tasks. The findings from our CSEE dataset (see Table 2 ) align with those on the ASAP dataset, indicating that our framework predicts reliable scores across content, language, and structure dimensions.

Although LLMs do not match traditional methods in terms of scoring accuracy, they excel at generating detailed explanations, benefiting both educators and students. Notably, when trained to produce both scores and explanations in a single output (our proposed Slow Module), LLaMA3-8B experienced a performance drop in grading accuracy. This decrease may be attributed to the model’s optimization process, where numerical score values are treated similarly to textual data in the output, leading to suboptimal accuracy. In our Fast and Slow Thinking framework, however, separating numerical from textual outputs and integrating them based on a trigger condition improved the QWK scores, enhancing collaborative performance. Additionally, we evaluated the quality of explanations generated by our AES system against those produced by GPT-4. Through a comparison competition among crowdsourced workers, analyzing 20 sets of paired essay grading explanations, our system achieved a win rate of 35%, a tie rate of 40%, and a loss rate of 25%. These results demonstrate that our explanations are of high quality and comparable to those generated by GPT-4.

5.2 Further Analyses

Consistency.

To assess the consistency of scores predicted by LLM-based methods, we conducted the same experiment three times, each with the temperature parameter of the LLMs set to 0. We observed that over 80% of the ratings remained unchanged across these trials, indicating a high level of consistency. We then computed the average of these three values to determine the final results.

Generalizability

The eight subsets of the ASAP dataset, featuring diverse scoring criteria and ranges, serve as an excellent framework for evaluating the generalization capabilities of models. For methods such as fine-tuning and traditional baselines that require training data, we first trained the models on one subset and then assessed their performance across the remaining datasets. For example, we trained on Set 1 and tested on Sets 2-8, keeping the model weights fixed. We selected fine-tuned GPT-3.5 and the BERT Classifier to represent LLM-based and traditional methods, respectively. As detailed in Table 7 , our fine-tuned GPT-3.5 generally outperformed the BERT classifier, although there were instances of underperformance, notably when trained on Set 4 and tested on Sets 1 and 7. The BERT classifier showed particularly weak generalization when trained on Sets 7 and 8, performing close to random guessing.

Prediction Confidence and Self-Reflection

To assess the reliability of confidence scores, we segmented the test samples based on the output confidence and observed a strong correlation between these scores and model performance in Figure 3 . The trigger condition for switching from the Fast to the Slow Module is set when the confidence score falls below 0.2. Although the Slow Module generally exhibits lower performance compared to the Fast Module, the overall performance of the integrated system improved. This enhancement suggests that employing detailed reasoning for cases with low confidence is an effective grading strategy.

Time Efficiency

Training the Slow Module for each epoch with our explanation-augmented dataset requires around 2 hours using an RTX 4090 24GB GPU, and the inference process consumes about 10 GPU hours. Meanwhile, training the Fast Module takes less than 0.5 hours, and scoring predictions are completed in just 0.2 hours. Our proposed framework, which incorporates a confidence trigger condition, offers an effective trade-off by enhancing both accuracy and efficiency.


6 Human-AI Co-Grading Experiment

Given that the AES system not only provides score predictions but also functions as a teammate to educators, we further explore the effectiveness of our proposed system in assisting human grading.

6.1 Experiment Design

To investigate the performance of human-only, AI-only, and human-AI team collaboration, we conducted a two-stage within-group experiment. We randomly selected 50 essays from the test set of our CSEE dataset, all on the same topic. We recruited 10 college students from a Normal University in Beijing, who are prospective high school teachers with no current grading experience, to serve as novice evaluators. Additionally, 5 experienced high school English teachers participated as expert evaluators. Initially, all evaluators graded the essays independently using standard rubrics. Subsequently, they were provided with the scores, prediction confidence levels, and explanations generated by our AES system and had the option to revise their initial scores based on this augmented information. To gather feedback on the process, we distributed questionnaires where evaluators rated their experience on a 5-point Likert scale, with higher scores indicating better-perceived performance.

In short, we mainly focus on the following research questions:

Can novice and expert human evaluators achieve complementary performance in terms of accuracy and efficiency using the proposed AES system and collaborative workflow?

Does the design of prediction confidence and explanation generation contribute to performance improvements?

6.2 Results

Feedback generated by LLM elevates novice evaluators to expert level.

As depicted in Figure 4 and Table 3 , our findings reveal that novice graders, with the assistance of LLM-generated feedback (including both scores and explanations), achieved a significant improvement in performance. Their average QWK improved from 0.5256 to 0.6609, with a p-value of less than 0.01. Furthermore, when comparing the performance of LLM-assisted novice evaluators (mean QWK of 0.6609) to that of expert graders (mean QWK of 0.7117), no statistical difference was found between the two groups (p-value = 0.27). This indicates that with LLM support, novice evaluators achieved a level of grading proficiency comparable to that of experienced experts. Similar trends were observed in the scores for content, language, and structure, with detailed results presented in Table 5 .

Table 3: Comparisons of grading performance (QWK differences).

| Comparison | Diff. | Statistic | p-value |
|---|---|---|---|
| Expert vs. Novice | 0.1860*** | 3.2152 | 0.0068 |
| Novice+LLM vs. Novice | 0.1353*** | 2.8882 | 0.0098 |
| Expert+LLM vs. Expert | 0.0617 | 1.7128 | 0.1251 |
| Novice+LLM vs. Expert | -0.0508 | -1.1566 | 0.2682 |


Feedback generated by LLM boosts expert efficiency and consistency.

The integration of LLM-generated feedback into the expert grading process led to an increase in the average QWK from 0.7117 to 0.7734, which also surpassed the performance of AES systems (a QWK of 0.7302) for these essay samples, thereby achieving superior complementary performance (where the Human-AI team outperforms both individual human and AI). Although this improvement is not statistically significant (p-value = 0.13), the benefits of LLM augmentation for experts were evident in several other aspects. According to self-report questionnaires (refer to Table 4), experts required less time to complete grading tasks when assisted by the LLM. Furthermore, a reduction in the standard deviation of expert ratings was observed, indicating a higher level of consensus among experts. This suggests that LLM-generated feedback leads to more consistent evaluations of student essays. Experienced domain experts also commended the accuracy and practicality of the LLM-generated feedback, particularly praising the prediction confidence mechanism which alerted them to scrutinize more challenging cases. These findings highlight the potential to augment the human grading process with our AES system in real-world educational environments.

Table 4: Self-reported evaluator feedback (5-point Likert scale).

| Item | Score |
|---|---|
| Perceived accuracy of LLM overall score | 4.3/5 |
| Perceived accuracy of LLM content score | 4.0/5 |
| Perceived accuracy of LLM language score | 3.9/5 |
| Perceived accuracy of LLM structure score | 3.8/5 |
| Helpfulness of the predicted scores | 4.6/5 |
| Helpfulness of the confidence scores | 4.8/5 |
| Helpfulness of LLM explanations | 4.7/5 |
| Efficiency of LLM assistance | 4.4/5 |
| Willingness to use our AES system | 4.3/5 |

The Importance of Prediction Confidence and Explanations

We previously assessed the reliability of prediction confidence from our Fast Module and noted a modest improvement in model performance after self-reflection by the Slow Module (as shown in the LLM column of Figure 4 ). In the context of human-AI collaboration, we focused on cases where the predicted scores presented to human evaluators were of low confidence (below 0.2). We observed that the overall QWK scores for expert and novice evaluators were 0.6809 and 0.5680. These QWK values, lower than the average human performances, suggest that these essays are inherently challenging to grade, even for humans. However, human performances exceeded that of the LLM Slow Module’s 0.5478 QWK, achieving complementary team performance. These findings support a practical, intuitive LLM-assisted decision-making workflow: the model manages routine cases with high confidence and minimal human intervention, while low-confidence cases are presented to human collaborators for in-depth analysis and final decision-making.

7 Conclusion

In this study, we explored the capabilities of LLMs within AES systems. With detailed contexts, clear rubrics, and high-quality examples, GPT-4 demonstrated satisfactory performance, consistency, and generalizability. Further accuracy enhancements were achieved through supervised fine-tuning using task-specific instruction datasets, bringing LLM performance close to conventional SOTA methods. To leverage the LLMs’ ability to generate natural language explanations along with predicted scores, we introduced an open-source Fast and Slow Thinking AES framework. This framework not only matches the quality of proprietary models but also offers greater efficiency.

Our research extended into human-AI co-grading experiments within this new framework. A notable finding was that LLMs not only automated the grading process but also augmented the grading skills of human evaluators. Novice graders, with support from our AES framework, reached accuracy levels comparable to those of experienced graders, while expert graders showed gains in efficiency and consistency. The collaboration between humans and AI particularly enhanced performance in handling low-confidence cases, demonstrating a significant synergy that approached the upper limits of team performance. These results highlight the transformative potential of AI-assisted and human-centered decision-making workflows, especially in elevating those with limited domain knowledge to expert-level proficiency. This study illuminates promising future directions for human-AI collaboration and underscores the evolving role of AI in educational contexts.

Limitations

This study has certain limitations. Firstly, although our CSEE dataset includes a substantial number of student essays, these essays originate from only two final exams designed for high school English learners in China. This raises concerns about the robustness of our proposed AES system when applied to a broader range of topics and diverse student demographics. Secondly, our human-AI collaboration experiment, while indicative of promising directions for future human-AI co-grading tasks, is a pilot study that yields general results. Further experiments are necessary to thoroughly explore the mechanisms of complementary team performance, such as identifying circumstances under which humans are likely to recognize and correct their errors following AI feedback, or instances where unreliable AI feedback could potentially mislead them. A deeper understanding of these collaboration mechanisms will enable researchers to develop AES systems that offer more effective support to educators.

Ethical Considerations

We secured Institutional Review Board (IRB) approval for both the data collection and the human-AI co-grading experiment (details provided in the online materials). In our CSEE dataset, all personal information concerning the students has been anonymized to safeguard their privacy. The dataset consists solely of essays and the corresponding scores, omitting any additional information that might raise ethical concerns. However, details of the data annotation process remain undisclosed to us, including the number of teachers involved in the scoring and the level of inter-annotator agreement among them. We have also obtained explicit consent to use the data exclusively for research purposes from both teachers and students.

References

  • Bansal et al. (2021) Gagan Bansal, Tongshuang Wu, Joyce Zhou, Raymond Fok, Besmira Nushi, Ece Kamar, Marco Tulio Ribeiro, and Daniel Weld. 2021. Does the whole exceed its parts? the effect of ai explanations on complementary team performance. In Proceedings of the 2021 CHI conference on human factors in computing systems , pages 1–16.
  • Bellini-Leite (2023) Samuel C Bellini-Leite. 2023. Dual process theory for large language models: An overview of using psychology to address hallucination and reliability issues . Adaptive Behavior .
  • Benfeghoul et al. (2024) Martin Benfeghoul, Umais Zahid, Qinghai Guo, and Zafeirios Fountas. 2024. When in doubt, think slow: Iterative reasoning with latent imagination. arXiv preprint arXiv:2402.15283 .
  • Boquio and Naval (2024) Eujene Nikka V. Boquio and Prospero C. Naval, Jr. 2024. Beyond canonical fine-tuning: Leveraging hybrid multi-layer pooled representations of BERT for automated essay scoring . In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) , pages 2285–2295, Torino, Italia. ELRA and ICCL.
  • Dong and Zhang (2016) Fei Dong and Yue Zhang. 2016. Automatic features for essay scoring–an empirical study. In Proceedings of the 2016 conference on empirical methods in natural language processing , pages 1072–1077.
  • Eigner and Händler (2024) Eva Eigner and Thorsten Händler. 2024. Determinants of llm-assisted decision-making. arXiv preprint arXiv:2402.17385 .
  • Han et al. (2023) Jieun Han, Haneul Yoo, Junho Myung, Minsun Kim, Hyunseung Lim, Yoonsu Kim, Tak Yeon Lee, Hwajung Hong, Juho Kim, So-Yeon Ahn, et al. 2023. Fabric: Automated scoring and feedback generation for essays. arXiv preprint arXiv:2310.05191 .
  • Holstein and Aleven (2022) Kenneth Holstein and Vincent Aleven. 2022. Designing for human–ai complementarity in k-12 education. AI Magazine , 43(2):239–248.
  • Kahneman (2011) D. Kahneman. 2011. Thinking, Fast and Slow . Harvard Library e-reader collection. Farrar, Straus and Giroux.
  • Kamruzzaman and Kim (2024) Mahammed Kamruzzaman and Gene Louis Kim. 2024. Prompting techniques for reducing social bias in llms through system 1 and system 2 cognitive processes. arXiv preprint arXiv:2404.17218 .
  • Khandelwal et al. (2020) Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2020. Generalization through memorization: Nearest neighbor language models . In International Conference on Learning Representations .
  • Lampinen et al. (2022) Andrew Lampinen, Ishita Dasgupta, Stephanie Chan, Kory Mathewson, Mh Tessler, Antonia Creswell, James McClelland, Jane Wang, and Felix Hill. 2022. Can language models learn from explanations in context? In Findings of the Association for Computational Linguistics: EMNLP 2022 , pages 537–563, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Li et al. (2023) Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. 2023. Making language models better reasoners with step-aware verifier . In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 5315–5333, Toronto, Canada. Association for Computational Linguistics.
  • Lin et al. (2024) Bill Yuchen Lin, Yicheng Fu, Karina Yang, Faeze Brahman, Shiyu Huang, Chandra Bhagavatula, Prithviraj Ammanabrolu, Yejin Choi, and Xiang Ren. 2024. Swiftsage: A generative agent with fast and slow thinking for complex interactive tasks. Advances in Neural Information Processing Systems , 36.
  • Lun et al. (2020) Jiaqi Lun, Jia Zhu, Yong Tang, and Min Yang. 2020. Multiple data augmentation strategies for improving performance on automatic short answer scoring. In Proceedings of the AAAI Conference on Artificial Intelligence , volume 34, pages 13389–13396.
  • Mathias and Bhattacharyya (2018a) Sandeep Mathias and Pushpak Bhattacharyya. 2018a. Asap++: Enriching the asap automated essay grading dataset with essay attribute scores. In Proceedings of the eleventh international conference on language resources and evaluation (LREC 2018) .
  • Mathias and Bhattacharyya (2018b) Sandeep Mathias and Pushpak Bhattacharyya. 2018b. Thank “goodness”! a way to measure style in student essays. In Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications , pages 35–41.
  • Mizumoto and Eguchi (2023) Atsushi Mizumoto and Masaki Eguchi. 2023. Exploring the potential of using an ai language model for automated essay scoring. Research Methods in Applied Linguistics , 2(2):100050.
  • Naismith et al. (2023) Ben Naismith, Phoebe Mulcaire, and Jill Burstein. 2023. Automated evaluation of written discourse coherence using GPT-4 . In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023) , pages 394–403, Toronto, Canada. Association for Computational Linguistics.
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems , 35:27730–27744.
  • Ram et al. (2023) Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. 2023. In-context retrieval-augmented language models. arXiv preprint arXiv:2302.00083 .
  • Ramesh and Sanampudi (2022) Dadi Ramesh and Suresh Kumar Sanampudi. 2022. An automated essay scoring systems: a systematic literature review. Artificial Intelligence Review , 55(3):2495–2527.
  • Rodriguez et al. (2019) Pedro Uria Rodriguez, Amir Jafari, and Christopher M Ormerod. 2019. Language models and automated essay scoring. arXiv preprint arXiv:1909.09482 .
  • Salim et al. (2019) Yafet Salim, Valdi Stevanus, Edwardo Barlian, Azani Cempaka Sari, and Derwin Suhartono. 2019. Automated english digital essay grader using machine learning. In 2019 IEEE International Conference on Engineering, Technology and Education (TALE) , pages 1–6. IEEE.
  • Shi et al. (2023) Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. 2023. Replug: Retrieval-augmented black-box language models. arXiv preprint arXiv:2301.12652 .
  • Stahl et al. (2024) Maja Stahl, Leon Biermann, Andreas Nehring, and Henning Wachsmuth. 2024. Exploring llm prompting strategies for joint essay scoring and feedback generation. arXiv preprint arXiv:2404.15845 .
  • Sultan et al. (2016) Md Arafat Sultan, Cristobal Salazar, and Tamara Sumner. 2016. Fast and easy short answer grading with high accuracy. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , pages 1070–1075.
  • Taghipour and Ng (2016) Kaveh Taghipour and Hwee Tou Ng. 2016. A neural approach to automated essay scoring. In Proceedings of the 2016 conference on empirical methods in natural language processing , pages 1882–1891.
  • Tian et al. (2023) Xiaoyu Tian, Liangyu Chen, Na Liu, Yaxuan Liu, Wei Zou, Kaijiang Chen, and Ming Cui. 2023. Duma: a dual-mind conversational agent with fast and slow thinking. arXiv preprint arXiv:2310.18075 .
  • van den Bosch et al. (2019) Karel van den Bosch, Tjeerd Schoonderwoerd, Romy Blankendaal, and Mark Neerincx. 2019. Six challenges for human-ai co-learning. In Adaptive Instructional Systems: First International Conference, AIS 2019, Held as Part of the 21st HCI International Conference, HCII 2019, Orlando, FL, USA, July 26–31, 2019, Proceedings 21 , pages 572–589. Springer.
  • Wang et al. (2022) Yongjie Wang, Chuang Wang, Ruobing Li, and Hui Lin. 2022. On the use of bert for automated essay scoring: Joint learning of multi-scale essay representation . In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , pages 3416–3425, Seattle, United States. Association for Computational Linguistics.
  • Wason and Evans (1974) Peter C Wason and J St BT Evans. 1974. Dual processes in reasoning? Cognition , 3(2):141–154.
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems , 35:24824–24837.
  • Yancey et al. (2023) Kevin P. Yancey, Geoffrey Laflair, Anthony Verardi, and Jill Burstein. 2023. Rating short L2 essays on the CEFR scale with GPT-4 . In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023) , pages 576–584, Toronto, Canada. Association for Computational Linguistics.
  • Yang et al. (2020) Ruosong Yang, Jiannong Cao, Zhiyuan Wen, Youzheng Wu, and Xiaodong He. 2020. Enhancing automated essay scoring performance via fine-tuning pre-trained language models with combination of regression and ranking. In Findings of the Association for Computational Linguistics: EMNLP 2020 , pages 1560–1569.
  • Zhou et al. (2023) Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V Le, and Ed H. Chi. 2023. Least-to-most prompting enables complex reasoning in large language models . In The Eleventh International Conference on Learning Representations .

Appendix A Datasets

The details of the ASAP dataset are presented in Table 6 . As previously mentioned, this dataset is composed of 8 subsets, each with unique prompts and scoring rubrics. Our Chinese Student English Essay (CSEE) dataset consists of 13,372 essays, along with their corresponding scores carefully rated by experienced English teachers based on the scoring standards in the Chinese National College Entrance Examination (Table 8 ). The basic statistics of this dataset are outlined in Table 5 .

Table 5: Basic statistics of the Chinese Student English Essay (CSEE) dataset.

| Statistic | Value |
|---|---|
| # of schools | 29 |
| # of essay prompts | 2 |
| # of student essays | 13,372 |
| Avg. essay length | 124.74 |
| Avg. Overall score | 10.72 |
| Avg. Content score | 4.13 |
| Avg. Language score | 4.05 |
| Avg. Structure score | 2.55 |

Table 6: Details of the ASAP dataset.

| Essay Set | Essay Type | Grade Level | # of Essays | Avg. Length | Score Range |
|---|---|---|---|---|---|
| 1 | Persuasive/Narrative/Expository | 8 | 1783 | 350 | [2, 12] |
| 2 | Persuasive/Narrative/Expository | 10 | 1800 | 350 | [1, 6] |
| 3 | Source Dependent Responses | 10 | 1726 | 150 | [0, 3] |
| 4 | Source Dependent Responses | 10 | 1772 | 150 | [0, 3] |
| 5 | Source Dependent Responses | 8 | 1805 | 150 | [0, 4] |
| 6 | Source Dependent Responses | 10 | 1800 | 150 | [0, 4] |
| 7 | Persuasive/Narrative/Expository | 7 | 1569 | 300 | [0, 12] |
| 8 | Persuasive/Narrative/Expository | 10 | 723 | 650 | [0, 36] |

Appendix B Details of BERT Classifier Baseline

We employed the bert-base-uncased BERT model from the Hugging Face transformers library (https://huggingface.co/docs/transformers/) using PyTorch. A simple fully connected layer was added to perform the classification task. The datasets were divided into training and testing sets at an 8:2 ratio. To ensure better reproducibility, we set all random seeds, including those for dataset splitting and model training, to the value 42. During training, we used cross-entropy loss as our loss function, and the BERT parameters were fine-tuned rather than frozen. AdamW was chosen as the optimizer, with a learning rate of 1e-5 and an epsilon of 1e-6. With a batch size of 16 and a maximum of 10 training epochs, we also integrated an early stopping strategy to mitigate potential overfitting. All the BERT baseline experiments were run on 2 RTX A4000 16GB GPUs over around one week.
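For reference, the following is a condensed sketch of a BERT-plus-linear-head classifier along the lines described above (unfrozen bert-base-uncased, cross-entropy loss, AdamW with a 1e-5 learning rate). It is a schematic reconstruction under those stated hyperparameters, not the authors' training script, and the single-example "batch" exists only to show one forward/backward pass.

```python
import torch
from transformers import BertModel, BertTokenizerFast

class BertEssayClassifier(torch.nn.Module):
    def __init__(self, num_scores: int):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")  # left unfrozen
        self.classifier = torch.nn.Linear(self.bert.config.hidden_size, num_scores)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.classifier(outputs.pooler_output)  # logits over score classes

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertEssayClassifier(num_scores=13)          # e.g. 13 classes for a 0-12 score range
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, eps=1e-6)
loss_fn = torch.nn.CrossEntropyLoss()

# One illustrative training step on a single essay.
batch = tokenizer(["A sample essay ..."], truncation=True, padding=True,
                  max_length=512, return_tensors="pt")
optimizer.zero_grad()
logits = model(batch["input_ids"], batch["attention_mask"])
loss = loss_fn(logits, torch.tensor([7]))           # ground-truth score for this essay
loss.backward()
optimizer.step()
```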

Appendix C Details of LLM-based Methods

C.1 LLM Prompts

The prompts used for LLMs in our study fall into two distinct categories: first, the zero-shot and few-shot configurations of GPT-4; second, the instructions for fine-tuning and inference of GPT-3.5 and LLaMA3-8B. The prompts for the few-shot scenario incorporate those used in the zero-shot setting and overlap with the fine-tuning prompts. Therefore, for clarity and conciseness, we present examples of the GPT-4 few-shot prompt with rubrics and the inputs of the fine-tuned LLaMA3-8B in Table 9.

C.2 Few-Shot GPT-4

In the few-shot setting of GPT-4 with k essay examples, as indicated by prior studies on AES tasks (Yancey et al., 2023), increasing the value of k did not consistently yield better results, showing a trend of diminishing marginal returns. Therefore, we chose k = 3 in this study.

We explored two sampling approaches. The first involved randomly selecting essays from various levels of quality to help the LLM understand the approximate level of the target essay. The second adopted a retrieval-based approach, which has been proven effective in enhancing LLM performance (Khandelwal et al., 2020; Shi et al., 2023; Ram et al., 2023). Leveraging OpenAI’s text-embedding-ada-002 model, we calculated the embedding of each essay, which allowed us to identify the top k most similar essays based on cosine similarity (excluding the target essay). Our experiments demonstrated that this retrieval strategy consistently yielded superior results, so we focused on the latter approach in this study.
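A minimal sketch of that retrieval step is shown below: embed the graded pool and the target essay with text-embedding-ada-002, rank by cosine similarity, and keep the top k. The function names are illustrative, and the sketch assumes a configured OpenAI Python client (v1).

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    """Return one embedding vector per input text."""
    response = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([item.embedding for item in response.data])

def top_k_similar(target_essay: str, graded_pool: list[tuple[str, int]], k: int = 3):
    """Return the k graded essays most similar to the target by cosine similarity."""
    pool_vecs = embed([essay for essay, _ in graded_pool])
    target_vec = embed([target_essay])[0]
    sims = pool_vecs @ target_vec / (
        np.linalg.norm(pool_vecs, axis=1) * np.linalg.norm(target_vec)
    )
    best = np.argsort(sims)[::-1][:k]
    return [graded_pool[i] for i in best]
```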

C.3 Fine-tuning LLaMA3

We fine-tuned the llama-3-8b-bnb-4bit model using the unsloth framework (https://github.com/unslothai/unsloth). For this process, we employed a Parameter-Efficient Fine-Tuning (PEFT) approach with a rank of 16 and a LoRA alpha value of 16. We utilized an 8-bit AdamW optimizer, starting with an initial learning rate of 2e-4. After 50 warm-up steps, the learning rate was scheduled to decay linearly, with the weight decay parameter set at 0.01. We maintained all random seeds at 3407 and completed the fine-tuning over 2 epochs. All experiments involving the fine-tuned LLaMA3-8B were conducted on a single RTX 4090 24GB GPU, spanning approximately three weeks.
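As a rough illustration of this setup, the sketch below configures an equivalent LoRA fine-tune with the standard peft/transformers APIs rather than the unsloth wrapper used by the authors. The base checkpoint name, the target_modules, and the omitted dataset/trainer wiring are assumptions; the LoRA rank, alpha, learning rate, scheduler, weight decay, seed, and epoch count follow the values stated above.

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig, get_peft_model

base = "meta-llama/Meta-Llama-3-8B"  # assumed base checkpoint; the authors used a 4-bit unsloth variant
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # 4-bit loading as in the paper
    device_map="auto",
)

# LoRA configuration mirroring the reported rank and alpha of 16.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed LLaMA attention projections
)
model = get_peft_model(model, lora_config)

# Training hyperparameters as reported: 8-bit AdamW, 2e-4 with linear decay after
# 50 warm-up steps, weight decay 0.01, seed 3407, and 2 epochs.
training_args = TrainingArguments(
    output_dir="llama3-aes-sft",
    num_train_epochs=2,
    learning_rate=2e-4,
    warmup_steps=50,
    lr_scheduler_type="linear",
    weight_decay=0.01,
    optim="adamw_bnb_8bit",
    seed=3407,
)
# The instruction-formatted essays and a trainer (e.g. trl's SFTTrainer) would be wired up here.
```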

Table 7: Cross-subset generalization (QWK) on the ASAP dataset: models trained on one subset and tested on the others.

| Trained on | Model | Set 1 | Set 2 | Set 3 | Set 4 | Set 5 | Set 6 | Set 7 | Set 8 |
|---|---|---|---|---|---|---|---|---|---|
| Set 1 | BERT Classifier | - | 0.3299 | 0.1680 | 0.1380 | 0.3045 | 0.1234 | 0.3002 | 0.1541 |
| Set 1 | Fine-tuned GPT-3.5 | - | 0.5216 | 0.5405 | 0.4891 | 0.5076 | 0.6344 | 0.6306 | 0.3126 |
| Set 2 | BERT Classifier | 0.2776 | - | 0.1975 | 0.2392 | 0.1750 | 0.1453 | 0.2474 | 0.3783 |
| Set 2 | Fine-tuned GPT-3.5 | 0.4270 | - | 0.4131 | 0.4619 | 0.5958 | 0.5579 | 0.5438 | 0.6684 |
| Set 3 | BERT Classifier | 0.3468 | 0.4444 | - | 0.6230 | 0.6319 | 0.5299 | 0.4368 | 0.2427 |
| Set 3 | Fine-tuned GPT-3.5 | 0.3991 | 0.2488 | - | 0.7674 | 0.7714 | 0.7150 | 0.4964 | 0.1134 |
| Set 4 | BERT Classifier | 0.3257 | 0.5332 | 0.6267 | - | 0.5483 | 0.4959 | 0.4659 | 0.3204 |
| Set 4 | Fine-tuned GPT-3.5 | 0.0631 | 0.3493 | 0.4908 | - | 0.6515 | 0.7420 | 0.0865 | 0.3419 |
| Set 5 | BERT Classifier | 0.4051 | 0.3341 | 0.4264 | 0.4202 | - | 0.5243 | 0.3255 | 0.2035 |
| Set 5 | Fine-tuned GPT-3.5 | 0.4354 | 0.4301 | 0.5765 | 0.6877 | - | 0.7368 | 0.1061 | 0.3118 |
| Set 6 | BERT Classifier | 0.3164 | 0.3462 | 0.4000 | 0.3067 | 0.4882 | - | 0.2303 | 0.3047 |
| Set 6 | Fine-tuned GPT-3.5 | 0.1342 | 0.3607 | 0.4579 | 0.3157 | 0.3734 | - | 0.0061 | 0.0859 |
| Set 7 | BERT Classifier | 0.0975 | 0.0086 | 0.1854 | 0.0328 | 0.0554 | 0.1244 | - | 0.2917 |
| Set 7 | Fine-tuned GPT-3.5 | 0.5862 | 0.3993 | 0.4865 | 0.4425 | 0.4494 | 0.4417 | - | 0.2157 |
| Set 8 | BERT Classifier | 0.0560 | 0.1102 | 0.0110 | 0.0164 | 0.0371 | 0.0454 | 0.1777 | - |
| Set 8 | Fine-tuned GPT-3.5 | 0.2714 | 0.4822 | 0.4768 | 0.6009 | 0.4199 | 0.3231 | 0.5460 | - |

Appendix D Human-AI Co-Grading Details

In our LLM-assisted human grading experiment, the 10 college students were all from a Normal University in Beijing, with a male-to-female ratio of 4:6 and ages ranging from 19 to 23 years (from freshmen to seniors). Their English proficiency was certified by China’s College English Test (CET). None of the novices had prior experience grading student essays. The 5 expert evaluators were experienced English teachers from Beijing high schools, with teaching tenures ranging from 8 to 20 years. Before the evaluation, all participants received training on the standard scoring rubrics. They were also incentivized with appropriate remuneration for their participation.

The instructions for the evaluators included the standard scoring rubrics of the College Entrance Examination in China and several grading examples. The writing exercise and the essays designated for assessment were presented to the evaluators, and supplementary feedback (scores, output confidences, and explanations) was provided to the experimental groups. To enhance the evaluators’ comprehension and avoid possible misunderstandings, all the information provided was translated into Chinese.

The results for the Overall, Content, Language, and Structure scores in the human-AI co-grading experiment are presented in Figure 5. We observed that the Content and Language scores exhibit a trend similar to the Overall score discussed in the Results section. The expert evaluators noted that the Structure dimension is the most ambiguous and difficult part of the grading task, which is consistent with it having the lowest QWK values among the three dimensions.

Table 8: Scoring rubrics used for the CSEE dataset.

Overall Score (20 points) = Content Score (8 points) + Language Score (8 points) + Structure Score (4 points)

Content Dimension (8 points in total)
  • 6-8 points: Content is complete with appropriate details; expression is closely related to the topic
  • 3-5 points: Content is mostly complete; expression is fundamentally related to the topic
  • 0-2 points: Content is incomplete; expression is barely related or completely unrelated to the topic

Language Dimension (8 points in total)
  • 6-8 points: Language is accurate with diverse sentence structures and little or no errors (2 errors or fewer, 8 points; 3-4 errors, 7 points; 5-6 errors, 6 points); language expression is mostly appropriate
  • 3-5 points: Language is not quite accurate, with some variation in sentence structures and several errors, but they don’t impede understanding (7-8 errors, 5 points; 9-10 errors, 4 points; 11-12 errors, 3 points); language expression is somewhat inappropriate
  • 0-2 points: Language is hopelessly inaccurate with numerous language errors, hindering understanding (more than 12 errors); language expression is completely inappropriate

Structure Dimension (4 points in total)
  • 3-4 points: Clearly and logically structured; smooth and coherent transitions
  • 1-2 points: Mostly clearly and logically structured; relatively smooth and coherent transitions
  • 0-1 points: Not clearly and logically structured; fragmented and disconnected structures and sentences
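The Language dimension above ties part of the score directly to a count of spelling and grammar errors. A tiny helper makes that mapping explicit; note that the rubric does not subdivide the 0-2 band by error count, so returning 2 for anything above 12 errors is an assumption, and the appropriateness criteria still have to be judged separately.

```python
def language_score_from_errors(error_count: int) -> int:
    """Map a counted number of language errors to the Language score band.

    Follows the error-count thresholds in the rubric above; essays with more than
    12 errors fall in the 0-2 band, and returning 2 there is an assumption since
    the rubric does not subdivide that band by error count.
    """
    if error_count <= 2:
        return 8
    if error_count <= 4:
        return 7
    if error_count <= 6:
        return 6
    if error_count <= 8:
        return 5
    if error_count <= 10:
        return 4
    if error_count <= 12:
        return 3
    return 2

assert language_score_from_errors(3) == 7
assert language_score_from_errors(13) == 2
```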
Example Prompt 1:
As a virtual evaluator with expertise in English composition, your role is to critically analyze and grade student essays according to a predetermined set of rubrics. You are to act as an impartial judge and evaluate the essays based on the quality of the writing and adherence to the essay prompt. Here are the specific guidelines for each score: {Rubrics}
Sample Essay Prompt: {Essay Prompt}
The graded example essays: {Examples}
Student’s Essay to Evaluate: {Target Student Essay}
Task Breakdown:
1. Carefully read the provided essay prompt, scoring guidelines, and the student’s essay.
2. In the Explanations part, identify specific elements in the essay referring to the rubrics. In the language dimension, list all the spelling and grammar errors, and count the number of them to determine the Language Score. The Explanations for each dimension should be as detailed as possible.
3. Determine the appropriate scores according to the analysis above.
Please present your evaluation in the following manner:
Explanations: …, Content Score: …
Explanations: …, Language Score: …
Explanations: …, Structure Score: …
Explanations: …, Total Score: …
Your final evaluation: [Total Score: …, Content Score: …, Language Score: …, Structure Score: …]

Example Output 1:
Explanations: The student’s essay responds accurately to Jim’s request for suggestions on how to learn Chinese, providing thoughtful advice and support. Detailed suggestions include listening to Chinese radio, reading Chinese literature, and using mobile applications. The tone is hopeful and supportive. Content Score: 8
Explanations: There are no language errors; the grammar and spelling are all accurate. The student used formal, respectful language suitable for this kind of communication and also included diversity in sentence structures. Language Score: 8
Explanations: The essay is well-structured and organized. It follows the accepted structure of a reply email, with an opening greeting, a body containing the main content, and a closing remark. Transitions are also smooth and coherent, logically linking one thought to another. Structure Score: 4
Total Score: This is the sum of the content, language, and structure scores. Total Score: 20
Your final evaluation: [Total Score: 20, Content Score: 8, Language Score: 8, Structure Score: 4]

Example Prompt 2:
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction: You are an experienced English teacher analyzing high school students’ essays according to a specific rubric. Evaluate the following essay based on three dimensions: Content, Language, and Structure, and provide the overall assessment. Please provide your evaluation in the following JSON format: {Output Format}
### Input: Scoring rubrics: {Rubrics} Essay Prompt: {Essay Prompt} Student’s Essay to Evaluate: {Target Student Essay}
### Response: {}

Example Output 2:
{ "content": {"completeness": "The essay covers the basic requirements of the prompt by introducing design ideas for the graduation ceremony and explaining the reasons behind these designs. However, the explanations are quite brief and lack depth.", "topic_relevance": "The essay is related to the given topic, as it discusses ideas for a graduation ceremony design.", "content_details": "The details provided are minimal and lack depth. For example, the student mentions launching a ball in the dining hall and writing gratitude letters but does not elaborate on how these activities contribute to the overall design or the reasons behind these choices in detail.", "score_level": "Level 2", "score_point": 4.5}, "language": {…}, "structure": {…}, "overall": {…} }
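The placeholders in the prompts above ({Rubrics}, {Essay Prompt}, {Examples}, {Target Student Essay}) suggest a simple templating step, and the bracketed final evaluation line can be parsed back into numeric scores. The sketch below is a hypothetical illustration, not the authors' code: the template is abbreviated, and the parser only targets the final-evaluation format shown in Example Output 1.

```python
import re

# Abbreviated stand-in for the full grading prompt shown above (illustrative only).
PROMPT_TEMPLATE = (
    "As a virtual evaluator with expertise in English composition, critically analyze "
    "and grade the student essay according to the rubrics below.\n"
    "Here are the specific guidelines for each score: {rubrics}\n"
    "Sample Essay Prompt: {essay_prompt}\n"
    "The graded example essays: {examples}\n"
    "Student's Essay to Evaluate: {target_essay}\n"
    "Your final evaluation: [Total Score: ..., Content Score: ..., "
    "Language Score: ..., Structure Score: ...]"
)

def build_prompt(rubrics: str, essay_prompt: str, examples: str, target_essay: str) -> str:
    """Fill the placeholders of the grading prompt."""
    return PROMPT_TEMPLATE.format(
        rubrics=rubrics, essay_prompt=essay_prompt,
        examples=examples, target_essay=target_essay,
    )

def parse_final_evaluation(model_output: str) -> dict:
    """Extract the bracketed scores from the model's final evaluation line."""
    match = re.search(
        r"\[Total Score:\s*(\d+),\s*Content Score:\s*(\d+),"
        r"\s*Language Score:\s*(\d+),\s*Structure Score:\s*(\d+)\]",
        model_output,
    )
    if match is None:
        raise ValueError("No final evaluation line found in model output")
    total, content, language, structure = map(int, match.groups())
    return {"total": total, "content": content, "language": language, "structure": structure}

# Parsing the sample output shown above:
sample = "Your final evaluation: [Total Score: 20, Content Score: 8, Language Score: 8, Structure Score: 4]"
print(parse_final_evaluation(sample))  # {'total': 20, 'content': 8, 'language': 8, 'structure': 4}
```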

[Figure 5: Overall, Content, Language, and Structure scores in the human-AI co-grading experiment.]

  • Open access
  • Published: 03 June 2024

Applying large language models for automated essay scoring for non-native Japanese

  • Wenchao Li
  • Haitao Liu

Humanities and Social Sciences Communications, volume 11, Article number: 723 (2024)


Subjects: Language and linguistics

Recent advancements in artificial intelligence (AI) have led to an increased use of large language models (LLMs) for language assessment tasks such as automated essay scoring (AES), automated listening tests, and automated oral proficiency assessments. The application of LLMs for AES in the context of non-native Japanese, however, remains limited. This study explores the potential of LLM-based AES by comparing the efficiency of different models, i.e. two conventional machine learning-based methods (Jess and JWriter), two LLMs (GPT and BERT), and one Japanese local LLM (the Open-Calm large model). To conduct the evaluation, a dataset consisting of 1400 story-writing scripts authored by learners with 12 different first languages was used. Statistical analysis revealed that GPT-4 outperforms Jess, JWriter, BERT, and the Japanese language-specific trained Open-Calm large model in terms of annotation accuracy and the prediction of learners’ proficiency levels. Furthermore, by comparing 18 different models that utilize various prompts, the study emphasized the significance of prompts in achieving accurate and reliable evaluations with LLMs.


Conventional machine learning technology in AES

AES has experienced significant growth with the advancement of machine learning technologies in recent decades. In the earlier stages of AES development, conventional machine learning-based approaches were commonly used. These approaches involved the following procedures: (a) feeding the machine with a dataset. In this step, a dataset of essays is provided to the machine learning system. The dataset serves as the basis for training the model and establishing patterns and correlations between linguistic features and human ratings. (b) training the machine learning model on the linguistic features that best represent human ratings and can effectively discriminate learners’ writing proficiency. These features include lexical richness (Lu, 2012; Kyle and Crossley, 2015; Kyle et al. 2021), syntactic complexity (Lu, 2010; Liu, 2008), and text cohesion (Crossley and McNamara, 2016), among others. Conventional machine learning approaches in AES require human intervention, such as manual correction and annotation of essays. This human involvement was necessary to create a labeled dataset for training the model. Several AES systems have been developed using conventional machine learning technologies. These include the Intelligent Essay Assessor (Landauer et al. 2003), the e-rater engine by Educational Testing Service (Attali and Burstein, 2006; Burstein, 2003), MyAccess with the IntelliMetric scoring engine by Vantage Learning (Elliot, 2003), and the Bayesian Essay Test Scoring system (Rudner and Liang, 2002). These systems have played a significant role in automating the essay scoring process and providing quick and consistent feedback to learners. However, as touched upon earlier, conventional machine learning approaches rely on predetermined linguistic features and often require manual intervention, making them less flexible and potentially limiting their generalizability to different contexts.

In the context of the Japanese language, conventional machine learning-incorporated AES tools include Jess (Ishioka and Kameda, 2006) and JWriter (Lee and Hasebe, 2017). Jess assesses essays by deducting points from the perfect score, utilizing the Mainichi Daily News newspaper as a database. The evaluation criteria employed by Jess encompass various aspects, such as rhetorical elements (e.g., reading comprehension, vocabulary diversity, percentage of complex words, and percentage of passive sentences), organizational structures (e.g., forward and reverse connection structures), and content analysis (e.g., latent semantic indexing). JWriter employs linear regression analysis to assign weights to various measurement indices, such as average sentence length and total number of characters. These weights are then combined to derive the overall score. A pilot study involving the Jess model was conducted on 1320 essays at different proficiency levels, including primary, intermediate, and advanced. However, the results indicated that the Jess model failed to significantly distinguish between these essay levels. Out of the 16 measures used, four measures, namely median sentence length, median clause length, median number of phrases, and maximum number of phrases, did not show statistically significant differences between the levels. Additionally, two measures exhibited between-level differences but lacked linear progression: the number of attributive declined words and the Kanji/kana ratio. On the other hand, the remaining measures, including maximum sentence length, maximum clause length, number of attributive conjugated words, maximum number of consecutive infinitive forms, maximum number of conjunctive-particle clauses, k characteristic value, percentage of big words, and percentage of passive sentences, demonstrated statistically significant between-level differences and displayed linear progression.

Both Jess and JWriter exhibit notable limitations, including the manual selection of feature parameters and weights, which can introduce biases into the scoring process. The reliance on human annotators to label non-native language essays also introduces potential noise and variability in the scoring. Furthermore, an important concern is the possibility of system manipulation and cheating by learners who are aware of the regression equation utilized by the models (Hirao et al. 2020 ). These limitations emphasize the need for further advancements in AES systems to address these challenges.

Deep learning technology in AES

Deep learning has emerged as one of the approaches for improving the accuracy and effectiveness of AES. Deep learning-based AES methods utilize artificial neural networks that mimic the human brain’s functioning through layered algorithms and computational units. Unlike conventional machine learning, deep learning autonomously learns from the environment and past errors without human intervention. This enables deep learning models to establish nonlinear correlations, resulting in higher accuracy. Recent advancements in deep learning have led to the development of transformers, which are particularly effective in learning text representations. Noteworthy examples include bidirectional encoder representations from transformers (BERT) (Devlin et al. 2019 ) and the generative pretrained transformer (GPT) (OpenAI).

BERT is a linguistic representation model that utilizes a transformer architecture and is trained on two tasks: masked linguistic modeling and next-sentence prediction (Hirao et al. 2020 ; Vaswani et al. 2017 ). In the context of AES, BERT follows specific procedures, as illustrated in Fig. 1 : (a) the tokenized prompts and essays are taken as input; (b) special tokens, such as [CLS] and [SEP], are added to mark the beginning and separation of prompts and essays; (c) the transformer encoder processes the prompt and essay sequences, resulting in hidden layer sequences; (d) the hidden layers corresponding to the [CLS] tokens (T[CLS]) represent distributed representations of the prompts and essays; and (e) a multilayer perceptron uses these distributed representations as input to obtain the final score (Hirao et al. 2020 ).

Figure 1: AES system with BERT (Hirao et al. 2020).
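To make steps (a)–(e) concrete, here is a minimal sketch using the Hugging Face transformers library. It is not Hirao et al.'s implementation; the encoder checkpoint and the regressor sizes are illustrative assumptions, and the regressor is untrained here, so the printed score is meaningless until the model is fine-tuned on scored essays.

```python
import torch
from torch import nn
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "bert-base-multilingual-cased"  # illustrative checkpoint choice

class BertEssayScorer(nn.Module):
    """Encode "[CLS] prompt [SEP] essay [SEP]", take the [CLS] hidden state,
    and map it to a score with a small multilayer perceptron."""
    def __init__(self, model_name: str = MODEL_NAME):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.regressor = nn.Sequential(nn.Linear(hidden, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        outputs = self.encoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
        )
        cls_state = outputs.last_hidden_state[:, 0, :]  # hidden state of the [CLS] token
        return self.regressor(cls_state).squeeze(-1)    # predicted score (untrained here)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = BertEssayScorer()

# The tokenizer inserts [CLS] and [SEP] automatically when given a text pair;
# essays longer than 512 tokens are truncated, which is the length limit noted below.
batch = tokenizer(
    ["Prompt text for the story-writing task"],  # essay prompt
    ["Learner essay text ..."],                  # learner essay
    truncation=True, max_length=512, padding=True, return_tensors="pt",
)
with torch.no_grad():
    print(model(**batch))
```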

The training of BERT using a substantial amount of sentence data through the Masked Language Model (MLM) allows it to capture contextual information within the hidden layers. Consequently, BERT is expected to be capable of identifying artificial essays as invalid and assigning them lower scores (Mizumoto and Eguchi, 2023). In the context of AES for nonnative Japanese learners, Hirao et al. (2020) combined the long short-term memory (LSTM) model proposed by Hochreiter and Schmidhuber (1997) with BERT to develop a tailored automated essay scoring system. The findings of their study revealed that the BERT model outperformed both the conventional machine learning approach utilizing character-type features such as “kanji” and “hiragana” and the standalone LSTM model. Takeuchi et al. (2021) presented an approach to Japanese AES that eliminates the requirement for pre-scored essays by relying solely on reference texts or a model answer for the essay task. They investigated multiple similarity evaluation methods, including frequency of morphemes, idf values calculated on Wikipedia, LSI, LDA, word-embedding vectors, and document vectors produced by BERT. The experimental findings revealed that the method utilizing the frequency of morphemes with idf values exhibited the strongest correlation with human-annotated scores across different essay tasks. The utilization of BERT in AES encounters several limitations. First, essays often exceed the model’s maximum input length. Second, only score labels are available for training, which restricts access to additional information.

Mizumoto and Eguchi (2023) were pioneers in employing the GPT model for AES in non-native English writing. Their study focused on evaluating the accuracy and reliability of AES using the GPT-3 text-davinci-003 model, analyzing a dataset of 12,100 essays from the corpus of nonnative written English (TOEFL11). The findings indicated that AES utilizing the GPT-3 model exhibited a certain degree of accuracy and reliability, and they suggest that GPT-3-based AES systems hold the potential to support human ratings. However, applying the GPT model to AES presents a unique natural language processing (NLP) task that involves considerations such as nonnative language proficiency, the influence of the learner’s first language on the output in the target language, and identifying the linguistic features that best indicate writing quality in a specific language. These linguistic features may differ morphologically or syntactically from those present in the learners’ first language, as observed in (1)–(3).

我-送了-他-一本-书

Wǒ-sòngle-tā-yī běn-shū

1SG-give.PAST-him-one.CL-book

“I gave him a book.”

Agglutinative

彼-に-本-を-あげ-まし-た

Kare-ni-hon-o-age-mashi-ta

3SG-DAT-book-ACC-give-HON-PAST

Inflectional

give, give-s, gave, given, giving

Additionally, the morphological agglutination and subject-object-verb (SOV) order in Japanese, along with its idiomatic expressions, pose additional challenges for applying language models in AES tasks (4).

足-が 棒-に なり-ました

Ashi-ga bo-ni nari-mashita

leg-NOM stick-DAT become-PAST

“My leg became like a stick (I am extremely tired).”

The example sentence demonstrates the morpho-syntactic structure of Japanese and the presence of an idiomatic expression. In this sentence, the verb “なる” (naru), meaning “to become”, appears at the end of the sentence. The verb stem “なり” (nari) is followed by morphemes indicating honorification (“ます”, masu) and tense (“た”, ta), showcasing agglutination. While the sentence can be literally translated as “my leg became like a stick”, it carries an idiomatic interpretation that implies “I am extremely tired”.

To overcome this issue, CyberAgent Inc. (2023) has developed the Open-Calm series of language models specifically designed for Japanese. Open-Calm consists of pre-trained models available in various sizes, such as Small, Medium, Large, and 7B. Figure 2 depicts the fundamental structure of the Open-Calm model. A key feature of this architecture is the incorporation of the LoRA adapter and the GPT-NeoX framework, which can enhance its language processing capabilities.

Figure 2: GPT-NeoX Model Architecture (Okgetheng and Takeuchi 2024).

In a recent study conducted by Okgetheng and Takeuchi ( 2024 ), they assessed the efficacy of Open-Calm language models in grading Japanese essays. The research utilized a dataset of approximately 300 essays, which were annotated by native Japanese educators. The findings of the study demonstrate the considerable potential of Open-Calm language models in automated Japanese essay scoring. Specifically, among the Open-Calm family, the Open-Calm Large model (referred to as OCLL) exhibited the highest performance. However, it is important to note that, as of the current date, the Open-Calm Large model does not offer public access to its server. Consequently, users are required to independently deploy and operate the environment for OCLL. In order to utilize OCLL, users must have a PC equipped with an NVIDIA GeForce RTX 3060 (8 or 12 GB VRAM).

In summary, while the potential of LLMs in automated scoring of nonnative Japanese essays has been demonstrated in two studies—BERT-driven AES (Hirao et al. 2020 ) and OCLL-based AES (Okgetheng and Takeuchi, 2024 )—the number of research efforts in this area remains limited.

Another significant challenge in applying LLMs to AES lies in prompt engineering and ensuring its reliability and effectiveness (Brown et al. 2020 ; Rae et al. 2021 ; Zhang et al. 2021 ). Various prompting strategies have been proposed, such as the zero-shot chain of thought (CoT) approach (Kojima et al. 2022 ), which involves manually crafting diverse and effective examples. However, manual efforts can lead to mistakes. To address this, Zhang et al. ( 2021 ) introduced an automatic CoT prompting method called Auto-CoT, which demonstrates matching or superior performance compared to the CoT paradigm. Another prompt framework is trees of thoughts, enabling a model to self-evaluate its progress at intermediate stages of problem-solving through deliberate reasoning (Yao et al. 2023 ).

Beyond linguistic studies, there has been a noticeable increase in the number of foreign workers in Japan and Japanese learners worldwide (Ministry of Health, Labor, and Welfare of Japan, 2022 ; Japan Foundation, 2021 ). However, existing assessment methods, such as the Japanese Language Proficiency Test (JLPT), J-CAT, and TTBJ Footnote 1 , primarily focus on reading, listening, vocabulary, and grammar skills, neglecting the evaluation of writing proficiency. As the number of workers and language learners continues to grow, there is a rising demand for an efficient AES system that can reduce costs and time for raters and be utilized for employment, examinations, and self-study purposes.

This study aims to explore the potential of LLM-based AES by comparing the effectiveness of five models: two LLMs (GPT Footnote 2 and BERT), one Japanese local LLM (OCLL), and two conventional machine learning-based methods (linguistic feature-based scoring tools - Jess and JWriter).

The research questions addressed in this study are as follows:

To what extent do the LLM-driven AES and linguistic feature-based AES, when used as automated tools to support human rating, accurately reflect test takers’ actual performance?

What influence does the prompt have on the accuracy and performance of LLM-based AES methods?

The subsequent sections of the manuscript cover the methodology, including the assessment measures for nonnative Japanese writing proficiency, criteria for prompts, and the dataset. The evaluation section focuses on the analysis of annotations and rating scores generated by LLM-driven and linguistic feature-based AES methods.

Methodology

The dataset utilized in this study was obtained from the International Corpus of Japanese as a Second Language (I-JAS) Footnote 3 . This corpus consisted of 1000 participants who represented 12 different first languages. For the study, the participants were given a story-writing task on a personal computer. They were required to write two stories based on the 4-panel illustrations titled “Picnic” and “The key” (see Appendix A). Background information for the participants was provided by the corpus, including their Japanese language proficiency levels assessed through two online tests: J-CAT and SPOT. These tests evaluated their reading, listening, vocabulary, and grammar abilities. The learners’ proficiency levels were categorized into six levels aligned with the Common European Framework of Reference for Languages (CEFR) and the Reference Framework for Japanese Language Education (RFJLE): A1, A2, B1, B2, C1, and C2. According to Lee et al. ( 2015 ), there is a high level of agreement (r = 0.86) between the J-CAT and SPOT assessments, indicating that the proficiency certifications provided by J-CAT are consistent with those of SPOT. However, it is important to note that the scores of J-CAT and SPOT do not have a one-to-one correspondence. In this study, the J-CAT scores were used as a benchmark to differentiate learners of different proficiency levels. A total of 1400 essays were utilized, representing the beginner (aligned with A1), A2, B1, B2, C1, and C2 levels based on the J-CAT scores. Table 1 provides information about the learners’ proficiency levels and their corresponding J-CAT and SPOT scores.

A dataset comprising a total of 1400 essays from the story writing tasks was collected. Among these, 714 essays were utilized to evaluate the reliability of the LLM-based AES method, while the remaining 686 essays were designated as development data to assess the LLM-based AES’s capability to distinguish participants with varying proficiency levels. The GPT 4 API was used in this study. A detailed explanation of the prompt-assessment criteria is provided in Section Prompt . All essays were sent to the model for measurement and scoring.

Measures of writing proficiency for nonnative Japanese

Japanese exhibits a morphologically agglutinative structure where morphemes are attached to the word stem to convey grammatical functions such as tense, aspect, voice, and honorifics, e.g. (5).

食べ-させ-られ-まし-た-か

tabe-sase-rare-mashi-ta-ka

[eat (stem)-causative-passive voice-honorification-tense. past-question marker]

Japanese employs nine case particles to indicate grammatical functions: the nominative case particle が (ga), the accusative case particle を (o), the genitive case particle の (no), the dative case particle に (ni), the locative/instrumental case particle で (de), the ablative case particle から (kara), the directional case particle へ (e), and the comitative case particle と (to). The agglutinative nature of the language, combined with the case particle system, provides an efficient means of distinguishing between active and passive voice, either through morphemes or case particles, e.g. 食べる taberu “eat (conclusive form)” (active voice); 食べられる taberareru “be eaten (conclusive form)” (passive voice). In the active voice, “パン を 食べる” (pan o taberu) translates to “to eat bread”. In the passive voice, it becomes “パン が 食べられた” (pan ga taberareta), which means “(the) bread was eaten”. Additionally, it is important to note that different conjugations of the same lemma are counted as one type in order to ensure a comprehensive assessment of the language features; for example, 食べる taberu “eat (conclusive)”, 食べている tabeteiru “eat (progressive)”, and 食べた tabeta “ate (past)” are treated as one type.

To incorporate these features, previous research (Suzuki, 1999 ; Watanabe et al. 1988 ; Ishioka, 2001 ; Ishioka and Kameda, 2006 ; Hirao et al. 2020 ) has identified complexity, fluency, and accuracy as crucial factors for evaluating writing quality. These criteria are assessed through various aspects, including lexical richness (lexical density, diversity, and sophistication), syntactic complexity, and cohesion (Kyle et al. 2021 ; Mizumoto and Eguchi, 2023 ; Ure, 1971 ; Halliday, 1985 ; Barkaoui and Hadidi, 2020 ; Zenker and Kyle, 2021 ; Kim et al. 2018 ; Lu, 2017 ; Ortega, 2015 ). Therefore, this study proposes five scoring categories: lexical richness, syntactic complexity, cohesion, content elaboration, and grammatical accuracy. A total of 16 measures were employed to capture these categories. The calculation process and specific details of these measures can be found in Table 2 .

T-unit, first introduced by Hunt ( 1966 ), is a measure used for evaluating speech and composition. It serves as an indicator of syntactic development and represents the shortest units into which a piece of discourse can be divided without leaving any sentence fragments. In the context of Japanese language assessment, Sakoda and Hosoi ( 2020 ) utilized T-unit as the basic unit to assess the accuracy and complexity of Japanese learners’ speaking and storytelling. The calculation of T-units in Japanese follows the following principles:

A single main clause constitutes 1 T-unit, regardless of the presence or absence of dependent clauses, e.g. (6).

ケンとマリはピクニックに行きました (main clause): 1 T-unit.

If a sentence contains a main clause along with subclauses, each subclause is considered part of the same T-unit, e.g. (7).

天気が良かった の で (subclause)、ケンとマリはピクニックに行きました (main clause): 1 T-unit.

In the case of coordinate clauses, where multiple clauses are connected, each coordinated clause is counted separately. Thus, a sentence with coordinate clauses may have 2 T-units or more, e.g. (8).

ケンは地図で場所を探して (coordinate clause)、マリはサンドイッチを作りました (coordinate clause): 2 T-units.

Lexical diversity refers to the range of words used within a text (Engber, 1995 ; Kyle et al. 2021 ) and is considered a useful measure of the breadth of vocabulary in L n production (Jarvis, 2013a , 2013b ).

The type/token ratio (TTR) is widely recognized as a straightforward measure for calculating lexical diversity and has been employed in numerous studies. These studies have demonstrated a strong correlation between TTR and other methods of measuring lexical diversity (e.g., Bentz et al. 2016 ; Čech and Miroslav, 2018 ; Çöltekin and Taraka, 2018 ). TTR is computed by considering both the number of unique words (types) and the total number of words (tokens) in a given text. Given that the length of learners’ writing texts can vary, this study employs the moving average type-token ratio (MATTR) to mitigate the influence of text length. MATTR is calculated using a 50-word moving window. Initially, a TTR is determined for words 1–50 in an essay, followed by words 2–51, 3–52, and so on until the end of the essay is reached (Díez-Ortega and Kyle, 2023 ). The final MATTR scores were obtained by averaging the TTR scores for all 50-word windows. The following formula was employed to derive MATTR:

\(\mathrm{MATTR}(W)=\dfrac{\sum_{i=1}^{N-W+1}F_i}{W\,(N-W+1)}\)

Here, N refers to the number of tokens in the text, W is the selected window size in tokens (W < N), and \(F_i\) is the number of types in window i. \(\mathrm{MATTR}(W)\) is therefore the mean of the type-token ratios (TTRs), based on word forms, over all windows. It is expected that individuals with higher language proficiency will produce texts with greater lexical diversity, as indicated by higher MATTR scores.
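As a minimal sketch (assuming the essay has already been tokenized, e.g. by a Japanese morphological analyzer), MATTR with the 50-token window described above can be computed as follows; the toy example uses a 5-token window only so the numbers are easy to follow.

```python
def mattr(tokens, window: int = 50) -> float:
    """Moving-average type-token ratio: the mean TTR over all windows of
    `window` consecutive tokens (falls back to plain TTR for short texts)."""
    n = len(tokens)
    if n == 0:
        return 0.0
    if n < window:
        return len(set(tokens)) / n
    ttrs = [
        len(set(tokens[i:i + window])) / window
        for i in range(n - window + 1)
    ]
    return sum(ttrs) / len(ttrs)

# Toy example with a 5-token window; a real analysis would use morphologically
# segmented Japanese tokens and window=50.
print(round(mattr("the dog saw the dog and the cat".split(), window=5), 3))  # 0.75
```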

Lexical density was captured by the ratio of the number of lexical words to the total number of words (Lu, 2012). Lexical sophistication refers to the utilization of advanced vocabulary, often evaluated through word frequency indices (Crossley et al. 2013; Haberman, 2008; Kyle and Crossley, 2015; Laufer and Nation, 1995; Lu, 2012; Read, 2000). In writing, lexical sophistication can be interpreted as vocabulary breadth, which entails the appropriate usage of vocabulary items across various lexico-grammatical contexts and registers (Garner et al. 2019; Kim et al. 2018; Kyle et al. 2018). In Japanese specifically, words are considered lexically sophisticated if they are not included in the “Japanese Education Vocabulary List Ver 1.0”. Footnote 4 Consequently, lexical sophistication was calculated as the number of sophisticated word types relative to the total number of words per essay. Furthermore, it has been suggested that, in Japanese writing, sentences should ideally have a length of no more than 40 to 50 characters, as this promotes readability. Therefore, the median and maximum sentence lengths can be considered useful indices for assessment (Ishioka and Kameda, 2006).

Syntactic complexity was assessed based on several measures, including the mean length of clauses, verb phrases per T-unit, clauses per T-unit, dependent clauses per T-unit, complex nominals per clause, adverbial clauses per clause, coordinate phrases per clause, and mean dependency distance (MDD). The MDD reflects the distance between the governor and dependent positions in a sentence. A larger dependency distance indicates a higher cognitive load and greater complexity in syntactic processing (Liu, 2008 ; Liu et al. 2017 ). The MDD has been established as an efficient metric for measuring syntactic complexity (Jiang, Quyang, and Liu, 2019 ; Li and Yan, 2021 ). To calculate the MDD, the position numbers of the governor and dependent are subtracted, assuming that words in a sentence are assigned in a linear order, such as W1 … Wi … Wn. In any dependency relationship between words Wa and Wb, Wa is the governor and Wb is the dependent. The MDD of the entire sentence was obtained by taking the absolute value of governor – dependent:

\(\mathrm{MDD}=\dfrac{1}{n}\sum_{i=1}^{n}\lvert \mathrm{DD}_i\rvert\)

In this formula, \(n\) represents the number of words in the sentence, and \(\mathrm{DD}_i\) is the dependency distance of the i-th dependency relationship in the sentence. For example, for the sentence “Mary-ga John-ni keshigomu-o watashita” [Mary-NOM John-DAT eraser-ACC give-PAST], the MDD is 2. Table 3 provides the CSV file used as a prompt for GPT-4.
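A minimal sketch of the MDD computation follows. The dependency structure used in the example is an illustrative assumption (all three case-marked arguments attached to the sentence-final verb), chosen so that it reproduces the MDD of 2 mentioned above.

```python
def mean_dependency_distance(dependencies) -> float:
    """Mean dependency distance: the average absolute distance between governor
    and dependent positions over all dependency relations in a sentence.
    `dependencies` is a list of (governor_position, dependent_position) pairs,
    with word positions counted from 1 in linear order."""
    if not dependencies:
        return 0.0
    return sum(abs(g - d) for g, d in dependencies) / len(dependencies)

# Illustrative parse (an assumption, not the paper's annotation) for
# "Mary-ga John-ni keshigomu-o watashita": the three case-marked arguments
# (positions 1-3) all depend on the verb (position 4).
deps = [(4, 1), (4, 2), (4, 3)]
print(mean_dependency_distance(deps))  # (3 + 2 + 1) / 3 = 2.0
```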

Cohesion (semantic similarity) and content elaboration aim to capture the ideas presented in test takers’ essays. Cohesion was assessed using three measures: synonym overlap per paragraph (topic), synonym overlap per paragraph (keywords), and word2vec cosine similarity. Content elaboration and development were measured as the number of metadiscourse marker types divided by the number of words. To capture content more closely, this study proposed a novel distance-based representation that encodes the cosine distance between i-vectors of the learner’s essay and of the essay task (topic and keywords). The learner’s essay is decoded into a word sequence and aligned to the essay task’s topic and keywords for log-likelihood measurement; the cosine distance then yields the content elaboration score for the learner’s essay. The cosine similarity between the target and reference vectors is given in (11), where \((L_1,\ldots,L_n)\) and \((N_1,\ldots,N_n)\) are the vectors representing the learner’s essay and the task’s topic and keywords, respectively. The content elaboration distance between L and N was calculated as follows:

\(\cos(\theta)=\dfrac{\mathbf{L}\cdot\mathbf{N}}{\lvert\mathbf{L}\rvert\,\lvert\mathbf{N}\rvert}=\dfrac{\sum_{i=1}^{n}L_i N_i}{\sqrt{\sum_{i=1}^{n}L_i^{2}}\,\sqrt{\sum_{i=1}^{n}N_i^{2}}}\)

A high similarity value indicates a small difference between the two vectors, which in turn suggests a high level of proficiency in content elaboration.
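Equation (11) is ordinary cosine similarity between two vectors; a minimal sketch is shown below, with toy vectors standing in for the actual word2vec/i-vector representations.

```python
import math

def cosine_similarity(l, n) -> float:
    """Cosine similarity between a learner-essay vector and a task
    (topic/keyword) vector, as in equation (11)."""
    dot = sum(li * ni for li, ni in zip(l, n))
    norm_l = math.sqrt(sum(li * li for li in l))
    norm_n = math.sqrt(sum(ni * ni for ni in n))
    if norm_l == 0 or norm_n == 0:
        return 0.0
    return dot / (norm_l * norm_n)

# Toy vectors standing in for embeddings of an essay and of its task topic/keywords.
essay_vec = [0.2, 0.8, 0.1]
task_vec = [0.3, 0.7, 0.0]
print(round(cosine_similarity(essay_vec, task_vec), 3))  # ~0.98
```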

To evaluate the effectiveness of the proposed measures in distinguishing different proficiency levels among nonnative Japanese speakers’ writing, we conducted a multi-faceted Rasch measurement analysis (Linacre, 1994 ). This approach applies measurement models to thoroughly analyze various factors that can influence test outcomes, including test takers’ proficiency, item difficulty, and rater severity, among others. The underlying principles and functionality of multi-faceted Rasch measurement are illustrated in (12).

\(\log\!\left(\dfrac{P_{nijk}}{P_{nij(k-1)}}\right)=B_n-D_i-C_j-F_k\)

(12) defines the logarithmic transformation of the probability ratio \(P_{nijk}/P_{nij(k-1)}\) as a function of multiple parameters. Here, n represents the test taker, i denotes a writing proficiency measure, j corresponds to the human rater, and k represents the proficiency score. The parameter \(B_n\) signifies the proficiency level of test taker n (where n ranges from 1 to N). \(D_i\) represents the difficulty parameter of test item i (where i ranges from 1 to L), while \(C_j\) represents the severity of rater j (where j ranges from 1 to J). Additionally, \(F_k\) represents the step difficulty for a test taker to move from score k−1 to k. \(P_{nijk}\) refers to the probability of rater j assigning score k to test taker n for test item i, and \(P_{nij(k-1)}\) represents the likelihood of test taker n being assigned score k−1 by rater j for test item i. Each facet within the test is treated as an independent parameter and estimated within the same reference framework. To evaluate the consistency of scores obtained through both human and computer analysis, we utilized the Infit mean-square statistic. This statistic is a chi-square measure divided by the degrees of freedom, weighted with information, and it demonstrates higher sensitivity to unexpected patterns in responses to items near a person’s proficiency level (Linacre, 2002). Fit statistics are assessed against predefined thresholds for acceptable fit. For the Infit MNSQ, which has a mean of 1.00, different thresholds have been suggested: some propose stricter thresholds ranging from 0.7 to 1.3 (Bond et al. 2021), while others suggest more lenient thresholds ranging from 0.5 to 1.5 (Eckes, 2009). In this study, we adopted the criterion of 0.70–1.30 for the Infit MNSQ.

We can now proceed to assess the effectiveness of the 16 proposed measures, grouped under five criteria, in accurately distinguishing various levels of writing proficiency among non-native Japanese speakers. To conduct this evaluation, we utilized the development dataset from the I-JAS corpus, as described in Section Dataset. Table 4 provides a measurement report that presents the performance details of the 16 metrics under consideration. The measure separation was found to be 4.02, indicating a clear differentiation among the measures. The reliability index for the measure separation was 0.891, suggesting consistency in the measurement. Similarly, the person separation reliability index was 0.802, indicating the accuracy of the assessment in distinguishing between individuals. All 16 measures demonstrated Infit mean squares within a reasonable range, from 0.76 to 1.28. The Synonym overlap/paragraph (topic) measure exhibited a relatively high outfit mean square of 1.46, although its Infit mean square falls within the acceptable range. The standard errors for the measures ranged from 0.13 to 0.28, indicating the precision of the estimates.

Table 5 further illustrates the weights assigned to the different linguistic measures for score prediction, with higher weights indicating stronger correlations between those measures and higher scores. The following measures exhibited higher weights than the others: the moving average type-token ratio per essay had a weight of 0.0391; mean dependency distance had a weight of 0.0388; complex nominals per T-unit, calculated by dividing the number of complex nominals by the number of T-units, had a weight of 0.0379; mean length of clause, calculated by dividing the number of words by the number of clauses, had a weight of 0.0374; the coordinate phrases rate, calculated by dividing the number of coordinate phrases by the number of clauses, had a weight of 0.0325; and the grammatical error rate, representing the number of errors per essay, had a weight of 0.0322.

Criteria (output indicator)

The criteria used to evaluate the writing ability in this study were based on CEFR, which follows a six-point scale ranging from A1 to C2. To assess the quality of Japanese writing, the scoring criteria from Table 6 were utilized. These criteria were derived from the IELTS writing standards and served as assessment guidelines and prompts for the written output.

A prompt is a question or detailed instruction that is provided to the model to obtain a proper response. After several pilot experiments, we decided to provide the measures (Section Measures of writing proficiency for nonnative Japanese) as the input prompt and use the criteria (Section Criteria (output indicator)) as the output indicator. Regarding the prompt language, considering that the LLM was tasked with rating Japanese essays, would prompting in Japanese work better (Footnote 5)? We conducted experiments comparing the performance of GPT-4 using both English and Japanese prompts. Additionally, we utilized the Japanese local model OCLL with Japanese prompts. Multiple trials were conducted using the same sample. Regardless of the prompt language used, we consistently obtained the same grading results with GPT-4, which assigned a grade of B1 to the writing sample. This suggested that GPT-4 is reliable and capable of producing consistent ratings regardless of the prompt language. On the other hand, when we used Japanese prompts with the Japanese local model OCLL, we encountered inconsistent grading results: out of 10 attempts with OCLL, only 6 yielded consistent grading results (B1), while the remaining 4 showed different outcomes, including A1 and B2 grades. These findings indicated that the language of the prompt was not the determining factor for reliable AES. Instead, the size of the training data and the model parameters played crucial roles in achieving consistent and reliable AES results for the language model.

The following is the utilized prompt, which details all measures and requires the LLM to score the essays using holistic and trait scores.

Please evaluate Japanese essays written by Japanese learners and assign a score to each essay on a six-point scale, ranging from A1, A2, B1, B2, C1 to C2. Additionally, please provide trait scores and display the calculation process for each trait score. The scoring should be based on the following criteria:

Moving average type-token ratio.

Number of lexical words (token) divided by the total number of words per essay.

Number of sophisticated word types divided by the total number of words per essay.

Mean length of clause.

Verb phrases per T-unit.

Clauses per T-unit.

Dependent clauses per T-unit.

Complex nominals per clause.

Adverbial clauses per clause.

Coordinate phrases per clause.

Mean dependency distance.

Synonym overlap paragraph (topic and keywords).

Word2vec cosine similarity.

Connectives per essay.

Conjunctions per essay.

Number of metadiscourse markers (types) divided by the total number of words.

Number of errors per essay.

Japanese essay text

出かける前に二人が地図を見ている間に、サンドイッチを入れたバスケットに犬が入ってしまいました。それに気づかずに二人は楽しそうに出かけて行きました。やがて突然犬がバスケットから飛び出し、二人は驚きました。バスケットの中を見ると、食べ物はすべて犬に食べられていて、二人は困ってしまいました。(ID_JJJ01_SW1) (English translation: Before setting out, while the two of them were looking at the map, the dog got into the basket in which they had put the sandwiches. Without noticing, the two happily set off. Before long, the dog suddenly jumped out of the basket, and the two were surprised. When they looked inside the basket, all the food had been eaten by the dog, and they were at a loss.)

The score of the example above was B1. Figure 3 provides an example of holistic and trait scores provided by GPT-4 (with a prompt indicating all measures) via Bing Footnote 6 .

Figure 3: Example of GPT-4 AES and feedback (with a prompt indicating all measures).

Statistical analysis

The aim of this study is to investigate the potential use of LLM for nonnative Japanese AES. It seeks to compare the scoring outcomes obtained from feature-based AES tools, which rely on conventional machine learning technology (i.e. Jess, JWriter), with those generated by AI-driven AES tools utilizing deep learning technology (BERT, GPT, OCLL). To assess the reliability of a computer-assisted annotation tool, the study initially established human-human agreement as the benchmark measure. Subsequently, the performance of the LLM-based method was evaluated by comparing it to human-human agreement.

To assess annotation agreement, the study employed standard measures such as precision, recall, and F-score (Brants 2000 ; Lu 2010 ), along with the quadratically weighted kappa (QWK) to evaluate the consistency and agreement in the annotation process. Assume A and B represent human annotators. When comparing the annotations of the two annotators, the following results are obtained. The evaluation of precision, recall, and F-score metrics was illustrated in equations (13) to (15).

\(\mathrm{Recall}(A,B)=\dfrac{\text{Number of identical nodes in }A\text{ and }B}{\text{Number of nodes in }A}\)

\(\mathrm{Precision}(A,B)=\dfrac{\text{Number of identical nodes in }A\text{ and }B}{\text{Number of nodes in }B}\)

The F-score is the harmonic mean of recall and precision:

\(\text{F-score}=\dfrac{2\times\mathrm{Precision}\times\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}\)

The highest possible value of an F-score is 1.0, indicating perfect precision and recall; the lowest possible value is 0, which occurs if either precision or recall is zero.
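Equations (13)–(15) can be computed directly from the two annotators' node sets; the following is a minimal sketch with toy data (the node sets are illustrative, not taken from the study).

```python
def agreement_scores(nodes_a, nodes_b):
    """Precision, recall, and F-score for two annotators' node sets,
    following equations (13)-(15): recall is computed against annotator A,
    precision against annotator B."""
    identical = len(set(nodes_a) & set(nodes_b))
    recall = identical / len(nodes_a) if nodes_a else 0.0
    precision = identical / len(nodes_b) if nodes_b else 0.0
    f_score = (
        2 * precision * recall / (precision + recall)
        if (precision + recall) > 0 else 0.0
    )
    return precision, recall, f_score

# Toy example: annotator A marks 5 clause boundaries, annotator B marks 4,
# and they agree on 4 of them.
a = {1, 2, 3, 4, 5}
b = {1, 2, 3, 4}
print(agreement_scores(a, b))  # (1.0, 0.8, 0.888...)
```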

In accordance with Taghipour and Ng ( 2016 ), the calculation of QWK involves two steps:

Step 1: Construct a weight matrix W as follows:

\(W_{ij}=\dfrac{(i-j)^{2}}{(N-1)^{2}}\)

Here, i represents the annotation made by the tool, j represents the annotation made by a human rater, and N denotes the total number of possible annotations. The matrix O is subsequently computed, where \(O_{i,j}\) represents the number of essays annotated as i by the tool and as j by the human annotator. E refers to the expected count matrix, which is normalized so that the sum of its elements matches the sum of the elements in O.

Step 2: With matrices O and E, the QWK is obtained as follows:

\(\kappa = 1-\dfrac{\sum_{i,j}W_{i,j}\,O_{i,j}}{\sum_{i,j}W_{i,j}\,E_{i,j}}\)
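A minimal sketch of the two-step QWK computation follows (using NumPy; the toy labels are illustrative, and E is formed from the marginal label counts and rescaled to match the total of O, as described in Step 1).

```python
import numpy as np

def quadratic_weighted_kappa(tool_labels, human_labels, num_categories):
    """Quadratic weighted kappa for two sets of integer annotations in
    {0, ..., num_categories - 1}, following the two steps above."""
    tool_labels = np.asarray(tool_labels)
    human_labels = np.asarray(human_labels)

    # Step 1: weight matrix W and observed matrix O.
    i_idx, j_idx = np.meshgrid(
        np.arange(num_categories), np.arange(num_categories), indexing="ij"
    )
    w = (i_idx - j_idx) ** 2 / (num_categories - 1) ** 2
    o = np.zeros((num_categories, num_categories))
    for t, h in zip(tool_labels, human_labels):
        o[t, h] += 1

    # Expected matrix E from the marginal label counts, normalized so that
    # its elements sum to the same total as O.
    e = np.outer(
        np.bincount(tool_labels, minlength=num_categories),
        np.bincount(human_labels, minlength=num_categories),
    ).astype(float)
    e *= o.sum() / e.sum()

    # Step 2: kappa.
    return 1.0 - (w * o).sum() / (w * e).sum()

# Toy example with four score categories (0-3).
tool = [0, 1, 2, 3, 2, 1]
human = [0, 1, 2, 3, 1, 1]
print(round(quadratic_weighted_kappa(tool, human, num_categories=4), 3))  # 0.909
```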

The value of the quadratic weighted kappa increases as the level of agreement improves. Further, to assess the accuracy of LLM scoring, the proportional reduction of mean squared error (PRMSE) was employed. The PRMSE approach takes into account the variability observed in human ratings to estimate the rater error, which is then subtracted from the variance of the human labels. This calculation provides an overall measure of agreement between the automated scores and the true scores (Haberman et al. 2015; Loukina et al. 2020; Taghipour and Ng, 2016). The computation of PRMSE involves the following steps:

Step 1: Calculate the mean squared errors (MSEs) for the scoring outcomes of the computer-assisted tool (MSE tool) and the human scoring outcomes (MSE human).

Step 2: Determine the PRMSE by comparing the MSE of the computer-assisted tool (MSE tool) with the MSE from human raters (MSE human), using the following formula:

\(\mathrm{PRMSE}=1-\dfrac{\mathrm{MSE}_{\mathrm{tool}}}{\mathrm{MSE}_{\mathrm{human}}}=1-\dfrac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^{2}}{\sum_{i=1}^{n}(y_i-\bar{y})^{2}}\)

In the numerator, \(\hat{y}_i\) represents the score predicted by the LLM-driven AES system for essay i, so \(y_i-\hat{y}_i\) is the difference between the human score and that prediction. In the denominator, \(\bar{y}\) is the mean of the human raters’ scores, so \(y_i-\bar{y}\) measures how far each human score lies from that mean. The PRMSE is then calculated by subtracting the ratio of the tool’s MSE to the human MSE from 1. PRMSE falls within the range of 0 to 1, with larger values indicating reduced errors in the LLM’s scoring compared to those of human raters. In other words, a higher PRMSE implies that the LLM’s scoring demonstrates greater accuracy in predicting the true scores (Loukina et al. 2020). The interpretation of kappa values, ranging from −1 to 1, is based on the work of Landis and Koch (1977): −1 indicates complete inconsistency, 0 indicates random agreement, 0.01–0.20 a slight level of agreement, 0.21–0.40 a fair level, 0.41–0.60 a moderate level, 0.61–0.80 a substantial level, and 0.81–1.00 an almost perfect level of agreement. All statistical analyses were executed using Python scripts.
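A minimal sketch of the PRMSE formula above (the simplified form given in the text, without the explicit rater-error adjustment mentioned in the prose; the toy scores are illustrative):

```python
import numpy as np

def prmse(human_scores, tool_scores) -> float:
    """Proportional reduction of mean squared error of the tool's scores
    relative to the human scores, following the formula above:
    1 - sum((y_i - yhat_i)^2) / sum((y_i - ybar)^2)."""
    y = np.asarray(human_scores, dtype=float)
    y_hat = np.asarray(tool_scores, dtype=float)
    mse_tool = np.sum((y - y_hat) ** 2)
    mse_human = np.sum((y - y.mean()) ** 2)
    return 1.0 - mse_tool / mse_human

# Toy example: the tool's predictions track the human scores closely.
human = [2, 3, 4, 5, 3, 4]
tool = [2, 3, 4, 4, 3, 4]
print(round(prmse(human, tool), 3))  # ~0.818
```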

Results and discussion

Annotation reliability of the LLM

This section focuses on assessing the reliability of the LLM’s annotation and scoring capabilities. To evaluate the reliability, several tests were conducted simultaneously, aiming to achieve the following objectives:

Assess the LLM’s ability to differentiate between test takers with varying levels of writing proficiency.

Determine the level of agreement between the annotations and scoring performed by the LLM and those done by human raters.

The evaluation of the results encompassed several metrics, including: precision, recall, F-Score, quadratically-weighted kappa, proportional reduction of mean squared error, Pearson correlation, and multi-faceted Rasch measurement.

Inter-annotator agreement (human–human annotator agreement)

We started with an agreement test of the two human annotators. Two trained annotators were recruited to annotate the measures for the writing task data. A total of 714 scripts were used as the test data, and each analysis lasted 300–360 min. Inter-annotator agreement was evaluated using the standard measures of precision, recall, F-score, and QWK. Table 7 presents the inter-annotator agreement for the various indicators. As shown, the inter-annotator agreement was fairly high, with F-scores ranging from 1.0 for sentence and word number to 0.666 for grammatical errors.

The findings from the QWK analysis provided further confirmation of the inter-annotator agreement. The QWK values covered a range from 0.950 ( p  = 0.000) for sentence and word number to 0.695 for synonym overlap number (keyword) and grammatical errors ( p  = 0.001).

Agreement of annotation outcomes between human and LLM

To evaluate the consistency between human annotators and LLM annotators (BERT, GPT, OCLL) across the indices, the same test was conducted. The results of the inter-annotator agreement (F-score) between LLM and human annotation are provided in Appendix B-D. The F-scores ranged from 0.706 for Grammatical error # for OCLL-human to a perfect 1.000 for GPT-human, for sentences, clauses, T-units, and words. These findings were further supported by the QWK analysis, which showed agreement levels ranging from 0.807 ( p  = 0.001) for metadiscourse markers for OCLL-human to 0.962 for words ( p  = 0.000) for GPT-human. The findings demonstrated that the LLM annotation achieved a significant level of accuracy in identifying measurement units and counts.

Reliability of LLM-driven AES’s scoring and discriminating proficiency levels

This section examines the reliability of the LLM-driven AES scoring through a comparison of the scoring outcomes produced by human raters and the LLM ( Reliability of LLM-driven AES scoring ). It also assesses the effectiveness of the LLM-based AES system in differentiating participants with varying proficiency levels ( Reliability of LLM-driven AES discriminating proficiency levels ).

Reliability of LLM-driven AES scoring

Table 8 summarizes the QWK coefficient analysis between the scores computed by the human raters and the GPT-4 for the individual essays from I-JAS Footnote 7 . As shown, the QWK of all measures ranged from k  = 0.819 for lexical density (number of lexical words (tokens)/number of words per essay) to k  = 0.644 for word2vec cosine similarity. Table 9 further presents the Pearson correlations between the 16 writing proficiency measures scored by human raters and GPT 4 for the individual essays. The correlations ranged from 0.672 for syntactic complexity to 0.734 for grammatical accuracy. The correlations between the writing proficiency scores assigned by human raters and the BERT-based AES system were found to range from 0.661 for syntactic complexity to 0.713 for grammatical accuracy. The correlations between the writing proficiency scores given by human raters and the OCLL-based AES system ranged from 0.654 for cohesion to 0.721 for grammatical accuracy. These findings indicated an alignment between the assessments made by human raters and both the BERT-based and OCLL-based AES systems in terms of various aspects of writing proficiency.

Reliability of LLM-driven AES discriminating proficiency levels

After validating the reliability of the LLM’s annotation and scoring, the subsequent objective was to evaluate its ability to distinguish between various proficiency levels. For this analysis, a dataset of 686 individual essays was utilized. Table 10 presents a sample of the results, summarizing the means, standard deviations, and the outcomes of the one-way ANOVAs based on the measures assessed by the GPT-4 model. A post hoc multiple comparison test, specifically the Bonferroni test, was conducted to identify any potential differences between pairs of levels.

As the results reveal, seven measures presented linear upward or downward progress across the three proficiency levels. These are marked in bold in Table 10 and comprise one measure of lexical richness, i.e. MATTR (lexical diversity); four measures of syntactic complexity, i.e. MDD (mean dependency distance), MLC (mean length of clause), CNT (complex nominals per T-unit), and CPC (coordinate phrases rate); one cohesion measure, i.e. word2vec cosine similarity; and GER (grammatical error rate). Regarding the ability of the sixteen measures to distinguish adjacent proficiency levels, the Bonferroni tests indicated that statistically significant differences exist between the primary level and the intermediate level for MLC and GER. One measure of lexical richness, namely LD, along with four measures of syntactic complexity (VPT, CT, DCT, ACC), two measures of cohesion (SOPT, SOPK), and one measure of content elaboration (IMM), exhibited statistically significant differences between proficiency levels; however, these differences did not demonstrate a linear progression between adjacent proficiency levels. No significant difference was observed in lexical sophistication between proficiency levels.

To summarize, our study aimed to evaluate the reliability and differentiation capabilities of the LLM-driven AES method. For the first objective, we assessed the LLM’s ability to differentiate between test takers with varying levels of writing proficiency using precision, recall, F-score, and quadratically weighted kappa. Regarding the second objective, we compared the scoring outcomes generated by human raters and the LLM to determine the level of agreement, employing quadratically weighted kappa and Pearson correlations across the 16 writing proficiency measures for the individual essays. The results confirmed the feasibility of using the LLM for annotation and scoring in AES for nonnative Japanese. As a result, Research Question 1 has been addressed.

Comparison of BERT-, GPT-, OCLL-based AES, and linguistic-feature-based computation methods

This section aims to compare the effectiveness of five AES methods for nonnative Japanese writing: LLM-driven approaches utilizing BERT, GPT, and OCLL, and linguistic feature-based approaches using Jess and JWriter. The comparison was conducted by comparing the ratings obtained from each approach with human ratings. All ratings were derived from the dataset introduced in Section Dataset. To facilitate the comparison, the agreement between the automated methods and human ratings was assessed using QWK and PRMSE. The performance of each approach is summarized in Table 11.

The QWK coefficient values indicate that LLMs (GPT, BERT, OCLL) and human rating outcomes demonstrated higher agreement compared to feature-based AES methods (Jess and JWriter) in assessing writing proficiency criteria, including lexical richness, syntactic complexity, content, and grammatical accuracy. Among the LLMs, the GPT-4 driven AES and human rating outcomes showed the highest agreement in all criteria, except for syntactic complexity. The PRMSE values suggest that the GPT-based method outperformed linguistic feature-based methods and other LLM-based approaches. Moreover, an interesting finding emerged during the study: the agreement coefficient between GPT-4 and human scoring was even higher than the agreement between different human raters themselves. This discovery highlights the advantage of GPT-based AES over human rating. Ratings involve a series of processes, including reading the learners’ writing, evaluating the content and language, and assigning scores. Within this chain of processes, various biases can be introduced, stemming from factors such as rater biases, test design, and rating scales. These biases can impact the consistency and objectivity of human ratings. GPT-based AES may benefit from its ability to apply consistent and objective evaluation criteria. By prompting the GPT model with detailed writing scoring rubrics and linguistic features, potential biases in human ratings can be mitigated. The model follows a predefined set of guidelines and does not possess the same subjective biases that human raters may exhibit. This standardization in the evaluation process contributes to the higher agreement observed between GPT-4 and human scoring. Section Prompt strategy of the study delves further into the role of prompts in the application of LLMs to AES. It explores how the choice and implementation of prompts can impact the performance and reliability of LLM-based AES methods. Furthermore, it is important to acknowledge the strengths of the local model, i.e. the Japanese local model OCLL, which excels in processing certain idiomatic expressions. Nevertheless, our analysis indicated that GPT-4 surpasses local models in AES. This superior performance can be attributed to the larger parameter size of GPT-4, estimated to be between 500 billion and 1 trillion, which exceeds the sizes of both BERT and the local model OCLL.

Prompt strategy

In the context of prompt strategy, Mizumoto and Eguchi ( 2023 ) conducted a study where they applied the GPT-3 model to automatically score English essays in the TOEFL test. They found that the accuracy of the GPT model alone was moderate to fair. However, when they incorporated linguistic measures such as cohesion, syntactic complexity, and lexical features alongside the GPT model, the accuracy significantly improved. This highlights the importance of prompt engineering and providing the model with specific instructions to enhance its performance. In this study, a similar approach was taken to optimize the performance of LLMs. GPT-4, which outperformed BERT and OCLL, was selected as the candidate model. Model 1 was used as the baseline, representing GPT-4 without any additional prompting. Model 2, on the other hand, involved GPT-4 prompted with 16 measures that included scoring criteria, efficient linguistic features for writing assessment, and detailed measurement units and calculation formulas. The remaining models (Models 3 to 18) utilized GPT-4 prompted with individual measures. The performance of these 18 different models was assessed using the output indicators described in Section Criteria (output indicator) . By comparing the performances of these models, the study aimed to understand the impact of prompt engineering on the accuracy and effectiveness of GPT-4 in AES tasks.

  

Model 1: GPT-4 (baseline, no additional prompting)
Model 2: GPT-4 + all 16 measures
Model 3: GPT-4 + MATTR
Model 4: GPT-4 + LD
Model 5: GPT-4 + LS
Model 6: GPT-4 + MLC
Model 7: GPT-4 + VPT
Model 8: GPT-4 + CT
Model 9: GPT-4 + DCT
Model 10: GPT-4 + CNT
Model 11: GPT-4 + ACC
Model 12: GPT-4 + CPC
Model 13: GPT-4 + MDD
Model 14: GPT-4 + SOPT
Model 15: GPT-4 + SOPK
Model 16: GPT-4 + word2vec
Model 17: GPT-4 + IMM
Model 18: GPT-4 + GER
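To make the prompting setup concrete, the sketch below shows how a scoring rubric and a single linguistic measure might be combined into a request sent to GPT-4 through the OpenAI Python client. The rubric wording, the six-point scale, the MATTR window size, and the model name are illustrative assumptions, not the exact prompts or settings used in the study.

```python
# Illustrative sketch of prompting GPT-4 with a scoring rubric plus one
# linguistic measure (here MATTR). The rubric text, scale, window size, and
# model name are assumptions for illustration, not the study's prompts.
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

RUBRIC = (
    "You are an essay rater. Score the following learner essay from 1 (lowest) "
    "to 6 (highest) for overall writing quality and return only the score."
)
MEASURE = (
    "When scoring, also consider lexical diversity, operationalized as the "
    "moving-average type-token ratio (MATTR) computed over 50-token windows."
)

def score_essay(essay_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic behaviour for scoring
        messages=[
            {"role": "system", "content": f"{RUBRIC}\n\n{MEASURE}"},
            {"role": "user", "content": essay_text},
        ],
    )
    return response.choices[0].message.content

print(score_essay("<learner essay text goes here>"))
```

In the all-measures condition (Model 2), the system message would simply concatenate the rubric with the full set of measure descriptions, units, and calculation formulas.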

 

Based on the PRMSE scores presented in Fig. 4, Model 1 (GPT-4 without any additional prompting) achieved a fair level of performance, whereas Model 2 (GPT-4 prompted with all measures) outperformed every other model, achieving a PRMSE score of 0.681. These results indicate that the inclusion of specific measures and prompts significantly enhanced the performance of GPT-4 in AES. Among the measures, syntactic complexity played a particularly significant role in improving GPT-4's accuracy in assessing writing quality, followed by lexical diversity as another important contributor. The study suggests that a well-prompted GPT-4 can serve as a valuable tool to support human assessors in evaluating writing quality. Using GPT-4 as an automated scoring tool can minimize the evaluation biases associated with human raters, and it has the potential to empower teachers by allowing them to focus on designing writing tasks and guiding writing strategies while leveraging GPT-4 for efficient and reliable scoring.

Figure 4: PRMSE scores of the 18 AES models.
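PRMSE (proportional reduction in mean squared error) estimates how well system scores predict essays' latent true scores after accounting for human rater error. The sketch below follows the general formulation described by Loukina et al. (2020) for the common case of two human ratings per essay; all score arrays are hypothetical, and this is not the study's evaluation code.

```python
# Illustrative PRMSE calculation for a double-rated essay set, following the
# general formulation in Loukina et al. (2020). All scores are hypothetical.
import numpy as np

h1 = np.array([3, 4, 2, 5, 4, 3, 1, 4, 5, 2], dtype=float)   # human rater 1
h2 = np.array([3, 4, 3, 5, 5, 3, 2, 4, 5, 2], dtype=float)   # human rater 2
sys_scores = np.array([3.4, 3.6, 2.9, 4.5, 4.0, 3.3, 1.9, 4.3, 4.5, 2.4])  # system

k = 2                           # number of human raters per essay
h_mean = (h1 + h2) / k          # observed average human score

# Rater error variance, estimated from the disagreement between the two raters
var_err = np.mean((h1 - h2) ** 2) / 2

# Variance of the latent true scores: variance of the rating means,
# minus the portion attributable to rater error
var_true = np.var(h_mean, ddof=1) - var_err / k

# Mean squared error of the system scores against the latent true scores
mse_true = np.mean((h_mean - sys_scores) ** 2) - var_err / k

prmse = 1 - mse_true / var_true
print(f"PRMSE = {prmse:.3f}")
```

Values closer to 1 indicate that the system scores recover more of the true-score variance.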

This study aimed to investigate two main research questions: the feasibility of utilizing LLMs for AES and the impact of prompt engineering on the application of LLMs in AES.

To address the first objective, the study compared the effectiveness of five different models: GPT, BERT, the Japanese local LLM (OCLL), and two conventional machine learning-based AES tools (Jess and JWriter). The PRMSE values indicated that the GPT-4-based method outperformed the other LLMs (BERT, OCLL) and the linguistic feature-based methods (Jess and JWriter) across the writing proficiency criteria. Furthermore, the agreement coefficient between GPT-4 and human scoring surpassed the agreement among the human raters themselves, highlighting the potential of GPT-4 to enhance AES by reducing bias and subjectivity, saving time, labor, and cost, and providing valuable feedback for self-study. Regarding the second goal, GPT-4, which outperformed BERT and OCLL, was selected as the candidate model, and the role of prompt design was investigated by comparing 18 models: a baseline model, a model prompted with all measures, and 16 models each prompted with a single measure. The PRMSE scores showed that GPT-4 prompted with all measures achieved the best performance, surpassing the baseline and all other models.

In conclusion, this study has demonstrated the potential of LLMs in supporting human rating in assessments. By incorporating automation, we can save time and resources while reducing biases and subjectivity inherent in human rating processes. Automated language assessments offer the advantage of accessibility, providing equal opportunities and economic feasibility for individuals who lack access to traditional assessment centers or necessary resources. LLM-based language assessments provide valuable feedback and support to learners, aiding in the enhancement of their language proficiency and the achievement of their goals. This personalized feedback can cater to individual learner needs, facilitating a more tailored and effective language-learning experience.

Three areas merit further exploration. First, prompt engineering requires attention to ensure optimal performance of LLM-based AES across different language types. This study showed that GPT-4 prompted with all measures outperformed models prompted with fewer measures, so investigating and refining prompt strategies can further enhance the effectiveness of LLMs in automated language assessment. Second, the application of LLMs to second-language assessment and learning should be extended to oral proficiency. Third, their potential for under-resourced languages deserves attention. Recent advances in self-supervised machine learning have substantially improved automatic speech recognition (ASR), opening up new possibilities for building reliable ASR systems, particularly for under-resourced languages with limited data. Challenges nevertheless persist. ASR-based pronunciation evaluation assumes correct word pronunciation, which is problematic for learners in the early stages of acquisition whose accents are shaped by their native languages, and accurately segmenting short words becomes difficult in such cases. Producing precise audio-text transcriptions for non-native accented speech remains a formidable task. Finally, assessing oral proficiency requires capturing linguistic features such as fluency, pronunciation, accuracy, and complexity, which current NLP technology does not easily capture.

Data availability

The dataset utilized was obtained from the International Corpus of Japanese as a Second Language (I-JAS), available at https://www2.ninjal.ac.jp/jll/lsaj/ihome2.html.

Notes

J-CAT and TTBJ are two computerized adaptive tests used to assess Japanese language proficiency.

SPOT is a specific component of the TTBJ test.

J-CAT: https://www.j-cat2.org/html/ja/pages/interpret.html

SPOT: https://ttbj.cegloc.tsukuba.ac.jp/p1.html#SPOT

The study utilized a prompt-based GPT-4 model developed by OpenAI, reported to have 1.8 trillion parameters across 120 layers and to have been trained on roughly 13 trillion tokens in two stages: initial training on internet text datasets to predict the next token, followed by fine-tuning through reinforcement learning from human feedback.

https://www2.ninjal.ac.jp/jll/lsaj/ihome2-en.html

http://jhlee.sakura.ne.jp/JEV/ by Japanese Learning Dictionary Support Group 2015.

We express our sincere gratitude to the reviewer for bringing this matter to our attention.

On February 7, 2023, Microsoft began rolling out a major overhaul to Bing that included a new chatbot feature based on OpenAI’s GPT-4 (Bing.com).

Appendices E and F present the QWK coefficients between the scores assigned by the human raters and those produced by the BERT and OCLL models.

References

Attali Y, Burstein J (2006) Automated essay scoring with e-rater® V.2. J. Technol. Learn. Assess. 4

Barkaoui K, Hadidi A (2020) Assessing Change in English Second Language Writing Performance (1st ed.). Routledge, New York. https://doi.org/10.4324/9781003092346

Bentz C, Ruzsics T, Koplenig A, Samardžić T (2016) A comparison between morphological complexity measures: Typological data vs. language corpora. In: Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC), pp. 142–153. The COLING 2016 Organizing Committee, Osaka, Japan

Bond TG, Yan Z, Heene M (2021) Applying the Rasch model: Fundamental measurement in the human sciences (4th ed). Routledge

Brants T (2000) Inter-annotator agreement for a German newspaper corpus. Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00), Athens, Greece, 31 May-2 June, European Language Resources Association

Brown TB, Mann B, Ryder N, et al. (2020) Language models are few-shot learners. Advances in Neural Information Processing Systems, Online, 6–12 December, Curran Associates, Inc., Red Hook, NY

Burstein J (2003) The E-rater scoring engine: Automated essay scoring with natural language processing. In Shermis MD and Burstein JC (ed) Automated Essay Scoring: A Cross-Disciplinary Perspective. Lawrence Erlbaum Associates, Mahwah, NJ

Čech R, Kubát M (2018) Morphological richness of text. In: Fidler M, Cvrček V (ed) Taming the corpus: From inflection and lexis to interpretation, pp. 63–77. Springer Nature, Cham, Switzerland

Çöltekin Ç, Rama T (2018) Exploiting Universal Dependencies treebanks for measuring morphosyntactic complexity. In: Berdicevskis A, Bentz C (ed) Proceedings of the First Workshop on Measuring Language Complexity, pp. 1–7. Torun, Poland

Crossley SA, Cobb T, McNamara DS (2013) Comparing count-based and band-based indices of word frequency: Implications for active vocabulary research and pedagogical applications. System 41:965–981. https://doi.org/10.1016/j.system.2013.08.002

Crossley SA, McNamara DS (2016) Say more and be more coherent: How text elaboration and cohesion can increase writing quality. J. Writ. Res. 7:351–370

CyberAgent Inc (2023) Open-Calm series of Japanese language models. Retrieved from: https://www.cyberagent.co.jp/news/detail/id=28817

Devlin J, Chang MW, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, Minneapolis, Minnesota, 2–7 June, pp. 4171–4186. Association for Computational Linguistics

Diez-Ortega M, Kyle K (2023) Measuring the development of lexical richness of L2 Spanish: a longitudinal learner corpus study. Studies in Second Language Acquisition 1-31

Eckes T (2009) On common ground? How raters perceive scoring criteria in oral proficiency testing. In Brown A, Hill K (ed) Language testing and evaluation 13: Tasks and criteria in performance assessment (pp. 43–73). Peter Lang Publishing

Elliot S (2003) IntelliMetric: from here to validity. In: Shermis MD, Burstein JC (ed) Automated Essay Scoring: A Cross-Disciplinary Perspective. Lawrence Erlbaum Associates, Mahwah, NJ

Engber CA (1995) The relationship of lexical proficiency to the quality of ESL compositions. J. Second Lang. Writ. 4:139–155

Garner J, Crossley SA, Kyle K (2019) N-gram measures and L2 writing proficiency. System 80:176–187. https://doi.org/10.1016/j.system.2018.12.001

Haberman SJ (2008) When can subscores have value? J. Educat. Behav. Stat., 33:204–229

Haberman SJ, Yao L, Sinharay S (2015) Prediction of true test scores from observed item scores and ancillary data. Brit. J. Math. Stat. Psychol. 68:363–385

Halliday MAK (1985) Spoken and Written Language. Deakin University Press, Melbourne, Australia

Hirao R, Arai M, Shimanaka H et al. (2020) Automated essay scoring system for nonnative Japanese learners. Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pp. 1250–1257. European Language Resources Association

Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput. 9(8):1735–1780

Hunt KW (1966) Recent measures in syntactic development. Elementary English 43(7):732–739. http://www.jstor.org/stable/41386067

Ishioka T (2001) About e-rater, a computer-based automatic scoring system for essays [Konpyūta ni yoru essei no jidō saiten shisutemu e-rater ni tsuite]. University Entrance Examination Forum [Daigaku nyūshi fōramu] 24:71–76

Ishioka T, Kameda M (2006) Automated Japanese essay scoring system based on articles written by experts. In: Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia, 17–18 July 2006, pp. 233–240. Association for Computational Linguistics

Japan Foundation (2021) Retrieved from: https://www.jpf.gp.jp/j/project/japanese/survey/result/dl/survey2021/all.pdf

Jarvis S (2013a) Defining and measuring lexical diversity. In Jarvis S, Daller M (ed) Vocabulary knowledge: Human ratings and automated measures (Vol. 47, pp. 13–44). John Benjamins. https://doi.org/10.1075/sibil.47.03ch1

Jarvis S (2013b) Capturing the diversity in lexical diversity. Lang. Learn. 63:87–106. https://doi.org/10.1111/j.1467-9922.2012.00739.x

Jiang J, Ouyang J, Liu H (2019) Interlanguage: A perspective of quantitative linguistic typology. Lang. Sci. 74:85–97

Kim M, Crossley SA, Kyle K (2018) Lexical sophistication as a multidimensional phenomenon: Relations to second language lexical proficiency, development, and writing quality. Mod. Lang. J. 102(1):120–141. https://doi.org/10.1111/modl.12447

Kojima T, Gu S, Reid M et al. (2022) Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, New Orleans, LA, 29 November-1 December, Curran Associates, Inc., Red Hook, NY

Kyle K, Crossley SA (2015) Automatically assessing lexical sophistication: Indices, tools, findings, and application. TESOL Q 49:757–786

Kyle K, Crossley SA, Berger CM (2018) The tool for the automatic analysis of lexical sophistication (TAALES): Version 2.0. Behav. Res. Methods 50:1030–1046. https://doi.org/10.3758/s13428-017-0924-4

Kyle K, Crossley SA, Jarvis S (2021) Assessing the validity of lexical diversity using direct judgements. Lang. Assess. Q. 18:154–170. https://doi.org/10.1080/15434303.2020.1844205

Landauer TK, Laham D, Foltz PW (2003) Automated essay scoring and annotation of essays with the Intelligent Essay Assessor. In Shermis MD, Burstein JC (ed), Automated Essay Scoring: A Cross-Disciplinary Perspective. Lawrence Erlbaum Associates, Mahwah, NJ

Landis JR, Koch GG (1977) The measurement of observer agreement for categorical data. Biometrics 33:159–174

Laufer B, Nation P (1995) Vocabulary size and use: Lexical richness in L2 written production. Appl. Linguist. 16:307–322. https://doi.org/10.1093/applin/16.3.307

Lee J, Hasebe Y (2017) jWriter Learner Text Evaluator, URL: https://jreadability.net/jwriter/

Lee J, Kobayashi N, Sakai T, Sakota K (2015) A Comparison of SPOT and J-CAT Based on Test Analysis [Tesuto bunseki ni motozuku ‘SPOT’ to ‘J-CAT’ no hikaku]. Research on the Acquisition of Second Language Japanese [Dainigengo to shite no nihongo no shūtoku kenkyū] (18) 53–69

Li W, Yan J (2021) Probability distribution of dependency distance based on a treebank of Japanese EFL learners' interlanguage. J. Quant. Linguist. 28(2):172–186. https://doi.org/10.1080/09296174.2020.1754611

Linacre JM (2002) Optimizing rating scale category effectiveness. J. Appl. Meas. 3(1):85–106

Linacre JM (1994) Constructing measurement with a Many-Facet Rasch Model. In Wilson M (ed) Objective measurement: Theory into practice, Volume 2 (pp. 129–144). Norwood, NJ: Ablex

Liu H (2008) Dependency distance as a metric of language comprehension difficulty. J. Cognitive Sci. 9:159–191

Liu H, Xu C, Liang J (2017) Dependency distance: A new perspective on syntactic patterns in natural languages. Phys. Life Rev. 21. https://doi.org/10.1016/j.plrev.2017.03.002

Loukina A, Madnani N, Cahill A, et al. (2020) Using PRMSE to evaluate automated scoring systems in the presence of label noise. In: Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications, Online, 10 July, pp. 18–29. Association for Computational Linguistics

Lu X (2010) Automatic analysis of syntactic complexity in second language writing. Int. J. Corpus Linguist. 15:474–496

Lu X (2012) The relationship of lexical richness to the quality of ESL learners’ oral narratives. Mod. Lang. J. 96:190–208

Lu X (2017) Automated measurement of syntactic complexity in corpus-based L2 writing research and implications for writing assessment. Lang. Test. 34:493–511

Lu X, Hu R (2022) Sense-aware lexical sophistication indices and their relationship to second language writing quality. Behav. Res. Method. 54:1444–1460. https://doi.org/10.3758/s13428-021-01675-6

Ministry of Health, Labor, and Welfare of Japan (2022) Retrieved from: https://www.mhlw.go.jp/stf/newpage_30367.html

Mizumoto A, Eguchi M (2023) Exploring the potential of using an AI language model for automated essay scoring. Res. Methods Appl. Linguist. 3:100050

Okgetheng B, Takeuchi K (2024) Estimating Japanese Essay Grading Scores with Large Language Models. Proceedings of 30th Annual Conference of the Language Processing Society in Japan, March 2024

Ortega L (2015) Second language learning explained? SLA across 10 contemporary theories. In VanPatten B, Williams J (ed) Theories in Second Language Acquisition: An Introduction

Rae JW, Borgeaud S, Cai T, et al. (2021) Scaling Language Models: Methods, Analysis & Insights from Training Gopher. ArXiv, abs/2112.11446

Read J (2000) Assessing vocabulary. Cambridge University Press. https://doi.org/10.1017/CBO9780511732942

Rudner LM, Liang T (2002) Automated essay scoring using Bayes' theorem. J. Technol. Learn. Assess. 1(2)

Sakoda K, Hosoi Y (2020) Accuracy and complexity of Japanese Language usage by SLA learners in different learning environments based on the analysis of I-JAS, a learners’ corpus of Japanese as L2. Math. Linguist. 32(7):403–418. https://doi.org/10.24701/mathling.32.7_403

Suzuki N (1999) Summary of survey results regarding comprehensive essay questions. Final report of “Joint Research on Comprehensive Examinations for the Aim of Evaluating Applicability to Each Specialized Field of Universities” for 1996-2000 [shōronbun sōgō mondai ni kansuru chōsa kekka no gaiyō. Heisei 8 - Heisei 12-nendo daigaku no kaku senmon bun’ya e no tekisei no hyōka o mokuteki to suru sōgō shiken no arikata ni kansuru kyōdō kenkyū’ saishū hōkoku-sho]. University Entrance Examination Section Center Research and Development Department [Daigaku nyūshi sentā kenkyū kaihatsubu], 21–32

Taghipour K, Ng HT (2016) A neural approach to automated essay scoring. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, 1–5 November, pp. 1882–1891. Association for Computational Linguistics

Takeuchi K, Ohno M, Motojin K, Taguchi M, Inada Y, Iizuka M, Abo T, Ueda H (2021) Development of essay scoring methods based on reference texts with construction of research-available Japanese essay data. IPSJ J. 62(9):1586–1604

Ure J (1971) Lexical density: A computational technique and some findings. In Coultard M (ed) Talking about Text. English Language Research, University of Birmingham, Birmingham, England

Vaswani A, Shazeer N, Parmar N, et al. (2017) Attention is all you need. In Advances in Neural Information Processing Systems, Long Beach, CA, 4–7 December, pp. 5998–6008, Curran Associates, Inc., Red Hook, NY

Watanabe H, Taira Y, Inoue Y (1988) Analysis of essay evaluation data [Shōronbun hyōka dēta no kaiseki]. Bulletin of the Faculty of Education, University of Tokyo [Tōkyōdaigaku kyōiku gakubu kiyō], Vol. 28, 143–164

Yao S, Yu D, Zhao J, et al. (2023) Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36

Zenker F, Kyle K (2021) Investigating minimum text lengths for lexical diversity indices. Assess. Writ. 47:100505. https://doi.org/10.1016/j.asw.2020.100505

Zhang Y, Warstadt A, Li X, et al. (2021) When do you need billions of words of pretraining data? Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Online, pp. 1112-1125. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-long.90

Funding

This research was funded by the National Foundation of Social Sciences (grant 22BYY186) awarded to Wenchao Li.

Author information

Authors and Affiliations

Department of Japanese Studies, Zhejiang University, Hangzhou, China

Department of Linguistics and Applied Linguistics, Zhejiang University, Hangzhou, China

Contributions

Wenchao Li was responsible for conceptualization, validation, formal analysis, investigation, data curation, visualization, and writing of the original draft. Haitao Liu was responsible for supervision.

Corresponding author

Correspondence to Wenchao Li.

Ethics declarations

Competing interests

The authors declare no competing interests.

Ethical approval

Ethical approval was not required as the study did not involve human participants.

Informed consent

This article does not contain any studies with human participants performed by any of the authors.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplemental material file #1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article

Li, W., Liu, H. Applying large language models for automated essay scoring for non-native Japanese. Humanit Soc Sci Commun 11, 723 (2024). https://doi.org/10.1057/s41599-024-03209-9

Received: 02 February 2024

Accepted: 16 May 2024

Published: 03 June 2024

DOI: https://doi.org/10.1057/s41599-024-03209-9
