Generative AI (GenAI) holds significant promise for automating everyday image editing tasks, especially following the recent release of GPT-4o on March 25, 2025. However, what subjects do people most often want edited? What kinds of editing actions do they want performed (e.g., removing or stylizing the subject)? Do people prefer precise edits with predictable outcomes or highly creative ones? By understanding the characteristics of real-world requests and the corresponding edits made by freelance photo-editing wizards, can we draw lessons for improving AI-based editors and determine which types of requests can currently be handled successfully by AI editors? In this paper, we present a unique study addressing these questions by analyzing 83k requests posted over the past 12 years (2013–2025) on the /r/PhotoshopRequest Reddit community, together with 305k PSR-wizard edits. According to human ratings, only approximately 33% of requests can be fulfilled by the best AI editors (including GPT-4o, Gemini-2.0-Flash, and SeedEdit). Interestingly, AI editors perform worse on low-creativity requests that require precise editing than on more open-ended tasks. They often struggle to preserve the identity of people and animals, and frequently make non-requested touch-ups. On the evaluation side, VLM judges (e.g., o1) behave differently from human judges and may prefer AI edits over human edits.
We find that GenAI can satisfy 33.35% of everyday image editing requests, while 66.65% of the requests are better handled by human image editors.
GenAI for images has gained enormous research interest and created a market worth $300M in 2023, a figure that is estimated to multiply. In particular, text-based image editing is an increasingly high-demand task, especially after the recent release of the GPT-4o and Gemini-2.0-Flash image generators.
However, four important questions remain open:
Q1: What are the real everyday image editing requests and needs of users?
Q2: According to human judgment, what % of such requests can be satisfied by existing AIs?
Q3: What are the improvement areas for AI editors compared to human editors?
Q4: Are vision language models (VLMs) judging AI-edited images similarly to human judges?
Q1 and Q2 remain unanswered partly because many prior datasets contain made-up requests written by either human annotators or AIs based on the source image or the (source, target image) pair (see Table Dataset Comparison). Those request distributions may not reflect users' actual editing needs, nor the challenges posed by real requests, which may contain typos, distractions, or ambiguity (e.g., “I'd love to see how crazy this could get... Thank you in advance!!” in this post; Figure 1). On the other hand, some datasets feature images edited by AIs and therefore do not represent real edits by advanced photo editors.
We aim to answer these four questions by analyzing 305k tuples of (source image, text request, edited image) from the /r/PhotoshopRequest (PSR) Reddit channel, which is the largest public online community sharing diverse, everyday image-editing needs and the corresponding edits by PSR wizards (Footnote: Advanced image editors who are granted permission to handle paid editing requests in this particular subreddit.). PSR has 1.7M users and receives an average of 141 new requests per day (our statistics for 2025), with a peak as high as 226 per day.
To answer Q1, Q2, and Q3, our closed-loop study: (a) analyzes 305k tuples, i.e., 83k unique requests × 3.67 human-edited images per request; (b) sends all (request, source image) pairs to image-editing models to collect AI edits; and (c) performs a human study over a set of 328 requests (PSR-328), collecting over 4.5k ratings to compare how 1,644 PSR-wizard edits fare against 2,296 AI edits on the same requests and to identify areas where AIs perform well and fall short. Our work is the first to compare three state-of-the-art (SOTA) editors (GPT-4o, Gemini-2.0-Flash, and SeedEdit), as well as 46 other AI models on HuggingFace, for a total of 49 AI editors. Furthermore, to address Q4, we compare human ratings against those by 3 SOTA vision-language models (VLMs): GPT-4o, o1, and Gemini-2.0-Flash-Thinking. Our main findings are as follows:
Figure 1: We propose PSR—the largest dataset to date of real-world image-editing requests and their corresponding human-made edits. PSR enables the community (and our work) to identify the types of requests that can be automated using existing AIs and those that need improvement. PSR is the first dataset to tag all requests with WordNet subjects, real-world editing actions, and creativity levels.
Dataset | Year | No. of Requests | No. of Edits | Source Image | Edit Generator | Request Writer | Requests Based On | Creativity Low | Creativity Med | Creativity High
---|---|---|---|---|---|---|---|---|---|---
IER | 2019 | 4k | 4k | human ✓ | human | reddit ✓ | — | 78% | 15% | 7%
GIER | 2020 | 6k | 6k | human ✓ | human | reddit ✓ | — | 63% | 34% | 3%
MA5k-Req | 2021 | 24k | 24k | real | Ps | Amazon MT | image pairs | 100% | 0% | 0%
InstructPix2Pix | 2023 | 454k | 454k | SD | SD | GPT-3 | image pairs | 12% | 44% | 42%
MagicBrush | 2023 | 10k | 10k | MS COCO | DALL-E 2 | Amazon MT | source image | 62% | 25% | 12%
HIVE | 2024 | 1.1M | 1.1M | SD | SD | BLIP | image pairs | 48% | 30% | 22%
EmuEdit | 2024 | 10M | 10M | Emu | Emu | Llama 2 | source image & task | 54% | 20% | 26%
AURORA | 2025 | 280k | 280k | mixed | mixed | GPT-4o | image pairs | 96% | 4% | 0%
UltraEdit | 2025 | 4M | 4M | MS COCO | SD | GPT-4o, human | source image | 43% | 21% | 36%
RealEdit | 2025 | 57k | 94k | human ✓ | human | reddit ✓ | — | 58% | 36% | 6%
PSR (ours) | 2025 | 83k | 305k | human ✓ | human | reddit ✓ | — | 56% | 28% | 16%
PSR-328 (ours) | 2025 | 328 | 3.9k | human ✓ | human, AIs | reddit ✓ | — | 33% | 33% | 33%
We source our dataset from Reddit's /r/PhotoshopRequest community (2013–early 2025). Historical data (up to Nov 2022) comes from PushShift, while recent data (Oct 2024–Feb 2025) is collected daily via a custom crawler, as PushShift lacks data after Nov 2022.
We present a taxonomy capturing the spectrum of image editing requests across three dimensions: subject, action verb, and creativity level. The subject identifies the element modified, the action verb specifies the modification, and the creativity level distinguishes routine tasks from those allowing multiple, open-ended interpretations. This framework enables precise analysis of automated image editing tasks and highlights that even routine edits, such as object removal, vary significantly depending on whether the subject is a person, animal, or object. While low-creativity tasks often suit standard automation tools, high-creativity tasks (see Figure 2) demand models with greater flexibility.
The subject of an image editing request is the specific element being modified—e.g., in a request to remove a person, the subject is the person. Subjects may include objects, persons, or the entire image. To systematically classify subjects from user instructions, we leverage WordNet's taxonomy. We first extract subjects from raw instructions, then match each to the nearest synset (semantic category) in WordNet's structured lexical database, providing standardized classification and reduced ambiguity.
Action | Description and Sample Request | Action | Description and Sample Request |
---|---|---|---|
add | Insert new elements, objects, text, or effects. e.g. “Add a copyright watermark to the bottom right.” | adjust | Modify properties like tones, contrast, and saturation. e.g. “Increase saturation a bit on the elephants.” |
apply | Add filters, styles, or effects. e.g. “Apply a vintage film effect.” | clone | Duplicate elements inside the image. e.g. “Use cloning tool to blend grass over dirt patches.” |
crop | Trim edges for a smaller image. e.g. “Crop to square format for social media.” | delete | Remove elements, objects, or imperfections. e.g. “Remove the jacket hanging from the girl's side.” |
replace | Substitute objects or text. e.g. “Please change the pamphlet into a dictionary.” | transform | Flip, scale, rotate, or skew elements. e.g. “Fix the perspective of the building.” |
move | Reposition elements while keeping the rest unchanged. e.g. “Shift the logo 20 pixels up.” | merge | Combine elements or effects. e.g. “Create a panorama from these shots.” |
super-resolution | Increase resolution for clearer details. e.g. “Can someone upscale this image to 4K resolution?” | re-color | Change the color of an element, object, or text. e.g. “Can someone change the dog's fur to black?” |
relight | Adjust lighting conditions. e.g. “Can someone make lighting better / remove shadows?” | zoom | Change scale to focus or zoom out. e.g. “Zoom in on the man.” |
specialized | Advanced or composite editing tasks. e.g. “Can someone vectorize this logo without background?” | | |
Users often describe their editing intentions in vague terms, e.g., "Make this look better," rather than with more technically precise phrasing such as "Improve the lighting to make the subject stand out." To properly categorize user intent, we develop a diverse list of 15 action verbs that cover the various editing actions.
We find that previous efforts to create action-verb taxonomies are tied to specific, low-level tools in popular image editing software; they do not communicate high-level user intent and also omit specific edit actions such as super-resolution. This limitation motivated us to develop our own taxonomy to ensure comprehensive coverage of modern image editing techniques.
To develop our taxonomy, we feed a large random subset of the edit requests into GPT-4o-mini and prompt the model to summarize common editing actions. Additionally, we consult image editing experts in the field to refine our list of actions to accurately reflect image-synthesis tasks in computer vision. Table Action Verbs presents the final list of action verbs.
We categorize editing requests into three levels according to the creative freedom involved in the editing process and the potential variation in the final outcome.
This creativity classification helps differentiate between objective, technical edits and imaginative, open-ended transformations, ensuring that automated and human editing processes can be tailored accordingly. While creativity level is not a traditional dataset metric, we believe it is an underused tool for measuring model accuracy. When evaluating an image editing model, it is desirable to see pixel-level precision on low-creativity edits involving real-world objects. However, high-creativity edits, such as the one in Figure 2-f, may not need this level of realistic precision. Examples of the other creativity levels can also be found in Figure 2.
We use GPT-4o-mini and InternVL-2.5-38B to annotate our dataset. GPT-4o-mini generates taxonomy-related labels, while InternVL-2.5-38B performs image captioning and keyword extraction. We prompt InternVL-2.5-38B to summarize each image into 36 JSON keys, capturing essential attributes such as image type (e.g., photograph or digital art), location, weather conditions, presence of people, and lists of objects in the image.
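As an illustration of this captioning step, the sketch below assumes InternVL-2.5-38B is served behind an OpenAI-compatible endpoint (e.g., via vLLM); the endpoint URL, model identifier, and abbreviated key list are illustrative stand-ins rather than our exact deployment or the full 36-key schema.

```python
# Hedged sketch: image summarization into a small set of JSON keys.
# Assumptions: InternVL-2.5-38B is served behind an OpenAI-compatible endpoint
# (e.g., vLLM); URL, API key, model name, and key list are illustrative.
import base64
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed local server

def summarize_image(path: str) -> dict:
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    prompt = (
        "Summarize this image as a JSON object with keys such as: "
        "image_type, location, weather, people_present, objects. "
        "Return only valid JSON."
    )
    resp = client.chat.completions.create(
        model="OpenGVLab/InternVL2_5-38B",  # assumed model identifier on the server
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        temperature=0.0,
    )
    return json.loads(resp.choices[0].message.content)  # may need fence stripping in practice

print(summarize_image("source.jpg"))
```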
To identify posts unrelated to image editing (e.g., when users post an image seeking proof of authenticity rather than requesting an edit), we instruct GPT-4o-mini to output a binary flag, image_editing_relevant, indicating whether the post pertains to image editing.
We use zero-shot prompting to extract actions from each request. GPT-4o-mini is provided with a list of valid actions, their descriptions, the input image, and the user-provided request. The prompt instructs the model to first examine the image, then rewrite the instruction in clear and simplified language to eliminate ambiguity, and finally identify the subjects of the edits along with the corresponding editing actions.
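A minimal sketch of this zero-shot extraction step follows; the prompt wording and JSON schema are illustrative, and the real prompt also attaches the source image and all 15 actions, omitted here for brevity.

```python
# Hedged sketch of the zero-shot action-extraction prompt (wording and schema are
# illustrative; the real prompt also includes the source image and all 15 actions).
import json
from openai import OpenAI

ACTIONS = {
    "add": "Insert new elements, objects, text, or effects.",
    "delete": "Remove elements, objects, or imperfections.",
    "replace": "Substitute objects or text.",
    # ... the remaining 12 actions from our taxonomy table
}

client = OpenAI()

def extract_actions(request: str) -> dict:
    prompt = (
        "You label image-editing requests.\n"
        f"Valid actions and definitions: {json.dumps(ACTIONS)}\n"
        "1) Rewrite the request in clear, simplified language.\n"
        "2) List the subjects being edited and the action applied to each.\n"
        'Respond as JSON: {"rewritten": str, "edits": [{"subject": str, "action": str}]}\n\n'
        f"Request: {request}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0.0,
    )
    return json.loads(resp.choices[0].message.content)

print(extract_actions("Please get rid of the guy photobombing on the left, thanks!!"))
```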
We map the extracted subjects from the previous stage to WordNet's synsets. Once the subjects are identified, we provide GPT-4o-mini with the image, instruction, and subject, instructing it to select the closest WordNet synset based on the given context. Since the generated synset may not always be valid, we perform a search within the WordNet lexical database using NLTK to find the closest matching synset and assign a final synset to the subject.
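The synset-grounding fallback can be sketched with NLTK as follows; the function below is a simplified stand-in that ignores the image and instruction context our pipeline also conditions on.

```python
# Simplified sketch of the synset-grounding fallback (our pipeline also conditions
# on the image and instruction context, which is omitted here).
from typing import Optional

import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

def ground_synset(subject: str, proposed: Optional[str] = None):
    # Accept the model-proposed synset if it is a valid WordNet entry (e.g., "dog.n.01").
    if proposed:
        try:
            return wn.synset(proposed)
        except Exception:
            pass  # fall through to lexical search
    # Otherwise, search WordNet for the subject and take the most common noun sense.
    candidates = wn.synsets(subject.replace(" ", "_"), pos=wn.NOUN)
    return candidates[0] if candidates else None

print(ground_synset("golden retriever"))     # Synset('golden_retriever.n.01')
print(ground_synset("dog", "canine.n.999"))  # invalid proposal, falls back to Synset('dog.n.01')
```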
We use o1-pro to summarize WordNet subjects into higher-level semantic categories, organizing them into five main categories and twelve subcategories. Since the extracted WordNet synsets from the previous step vary in granularity, a reasoning model with long-context capabilities, such as o1-pro, effectively groups these synsets into structured categories. This approach results in more coherent and meaningful subject categories.
We use few-shot prompting to assign creativity levels. GPT-4o-mini receives the original image and request, along with examples annotated by creativity level, to classify the input accordingly.
Our final dataset consists of 83k posts and 305k edited images, categorized by creativity level as 56% low, 28% medium, and 16% high. It includes 49,134 unique subjects, predominantly falling under the People and Related category (53.5%). The primary action requested by users is delete (32.9%), typically involving the removal of individuals or visual clutter to enhance aesthetics or professionalism.
We conduct a series of studies comparing AI-generated edits with human edits to understand human preferences and to determine whether automated metrics can serve as reliable proxies for these preferences.
We process each request using three generalist SOTA image editing tools: SeedEdit, Gemini-2.0-Flash, and GPT-4o. For each request, we generate two images, one using the original instruction (OI) provided by the user and one using a simplified instruction (SI) generated by GPT-4o-mini. Since user-written instructions often include unnecessary details, we use GPT-4o-mini to refine them, focusing solely on the core image editing task.
Additionally, for each request, we generate three AI-based image edits using 46 off-the-shelf image editing models hosted on Hugging Face Spaces.
Via stratified sampling, we select a random PSR subset with roughly equal numbers of requests across the three creativity levels (114 low, 101 medium, and 113 high). Each request is edited by an average of five Reddit human editors, resulting in a total of 1,644 human edits. We also generate 7 different AI edits per request. In total, our human study has 10,405 unique 4-tuples (source image, request, AI edit, human edit) for evaluation, substantially larger than prior studies, while keeping the human annotation effort manageable.
We conduct a comparative evaluation study to assess whether AI-generated edits or those created by Reddit users better satisfy original image editing requests. To ensure unbiased evaluation, we present randomly paired AI and human edits in a blind setting to human raters who then vote on which edit best fulfills the given request.
We use two automated evaluation methods to complement our human study: LAION Aesthetic Score and VLM-as-a-Judge. The Aesthetic Score quantifies visual appeal based on large-scale human preference data, while VLM-as-a-Judge leverages VLMs to provide evaluative feedback on images, using reasoning capabilities to articulate specific visual qualities and explain judgments. Both metrics serve as proxies for human judgment, enabling scalable assessment of AI-generated and human-edited images. We use LAION's Aesthetic Score Predictor for calculating aesthetic metrics, and GPT-4o, o1, and Gemini-2.0-Flash-Thinking as the VLM judges.
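For reference, a LAION-style aesthetic score can be computed as a lightweight head on normalized CLIP ViT-L/14 image embeddings; the sketch below assumes open_clip is installed and that a compatible head checkpoint has been downloaded locally (the path is hypothetical, and the released predictor may be a small MLP rather than a linear layer).

```python
# Hedged sketch: LAION-style aesthetic scoring as a linear head on normalized
# CLIP ViT-L/14 image embeddings. "aesthetic_head.pt" is a hypothetical local
# checkpoint path; the released predictor may use a small MLP instead.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
model.eval()

head = torch.nn.Linear(768, 1)                          # 768-d CLIP embedding -> scalar score
head.load_state_dict(torch.load("aesthetic_head.pt"))   # assumed to be a plain state dict
head.eval()

@torch.no_grad()
def aesthetic_score(path: str) -> float:
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    emb = model.encode_image(image)
    emb = emb / emb.norm(dim=-1, keepdim=True)           # predictor expects L2-normalized features
    return head(emb.float()).item()

print(aesthetic_score("edited.png"))                     # higher = more aesthetically pleasing
```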
We provide a Hugging Face 🤗 space for the results of our study. You can filter the results by creativity level, action verb, and model.
We collect a total of 4,359 human votes from 122 different users, and Table 3 summarizes the preferences based on the collected votes. Human raters prefer human edits in 66.0% of cases, while AI edits are preferred in only 25.8% of cases.
Within the model groups, the preference for human edits remains consistent across different models, with preference rates ranging from 53.6% to 73.2% in favor of human edits. SeedEdit achieves the highest overall preference win rate at 37.8%, followed by GPT-4o, with a win rate of 32.8%.
Breaking down the ratings by creativity still shows a preference for human edits across all three creativity groups. Although human edits are consistently preferred, the gap between human and AI performance is smaller for high-creativity requests (60.9% vs. 30.7%) compared to medium-creativity (67.6% vs. 25.3%) and low-creativity requests (70.1% vs. 21.1%). We hypothesize that this is because, in high-creativity scenarios, human evaluators tend to prioritize originality over detail—unlike in low- or medium-creativity requests.
Analyzing model performance across image editing action verbs shows models struggle with actions like crop and zoom, yet perform competitively with humans on tasks such as apply and merge.
Furthermore, we conduct a qualitative analysis to identify common patterns in cases where AI edits either succeed or fail. When AI-generated edits outperform their human alternatives, it is typically because they reflect the user instructions more accurately; this precision in following the original request accounts for the majority (72%) of AI wins. In contrast, AI edits most often lose because they misunderstand or misinterpret user requests, with comprehension errors being the most common cause of losses (43%). Additionally, AI systems frequently introduce unintended changes or artifacts, most notably distortions of facial identity, which further diminish their reliability.
A major issue with the current generation of AI image editing models is how they struggle to preserve the identity of people in an image. This problem is more pronounced in GPT-4o , as shown in Figure. 3. In this example, using GPT-4o to make successive edits to an input image of a person results in a final image that loses its similarity to the original person.
Figure 3: GPT-4o can significantly alter both a person's identity and the overall image quality through iterative image-editing requests. GPT-4o was repeatedly instructed to modify the shirt color, with each step's output serving as the input for the next iteration. Over iterations, the facial identity shifted considerably from the original person.
Automated feedback from a judge model can be used to significantly boost a student model's performance through processes such as reinforcement learning from AI feedback (RLAIF). To determine the accuracy of VLM-as-a-Judge feedback, we collect over 10,300 ratings from three separate VLMs acting as judges between human edits and AI edits.
The results highlight a clear contrast in overall preferences between humans and VLMs, showing that VLM-as-a-Judge is a poor proxy for human preferences. While humans strongly prefer human-generated edits (as discussed in the section "Human Edits Are Strongly Preferred by Human Raters"), all three VLMs exhibit mixed preferences, splitting their selections approximately evenly (around 50%) between human and AI edits. Additionally, Cohen's κ scores further confirm weak agreement between human judgments and VLM assessments, with o1 achieving the highest agreement score of κ = 0.22, which nonetheless remains low.
When examining VLM ratings by individual model groups, we observe a clear preference among VLMs for edits produced by SeedEdit and GPT-4o. Notably, o1 selects edits from GPT-4o 83.9% of the time. Conversely, for edits generated by Gemini-2.0-Flash or Hugging Face models, VLMs generally favor human-made edits. Despite this, agreement with human judgments remains relatively low, with Cohen's κ ranging from 0.14 to 0.25, which is near random in some cases.
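The human/VLM agreement numbers above are standard Cohen's kappa values; a minimal sketch of the computation, with toy placeholder votes rather than our data, looks like this:

```python
# Toy sketch of the agreement computation: Cohen's kappa between human votes and
# a VLM judge's votes on the same (human edit vs. AI edit) pairs. Votes are placeholders.
from sklearn.metrics import cohen_kappa_score

human_votes = ["human", "human", "ai", "human", "ai", "human"]
vlm_votes   = ["ai",    "human", "ai", "ai",    "ai", "human"]

kappa = cohen_kappa_score(human_votes, vlm_votes)
print(f"Cohen's kappa = {kappa:.2f}")  # 0 is chance-level agreement; 1 is perfect agreement
```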
We analyze a subset of judges' ratings to better understand why their assessments differ from human judgments. Although VLMs can provide detailed verbal feedback, they often remain blind to critical details in images and may fail to accurately capture or articulate all differences between image pairs. Additionally, VLMs occasionally overlook significant aspects, such as changes in characters' identities, and sometimes even hallucinate nonexistent elements. These issues highlight ongoing challenges in effectively leveraging VLMs as reliable judges.
A common pattern among AI models is their tendency to enhance the aesthetic quality of images, even without explicit instructions. For instance, they often apply extra touch-ups to human faces, making the skin appear smoother and more polished. Similarly, the facial features of pets are improved, and even damaged areas, such as eyes, are restored—even when such changes were never requested. For example, when instructed simply to remove extraneous elements (Figure 4-a) or the image background (Figure 4-b), SeedEdit and GPT-4o nonetheless enhance the dog's overall facial features beyond the explicit instructions provided by the user.
We report the LAION aesthetic scores for the images in PSR-328 and summarize the results in Figure 4-c. The distribution of aesthetic scores confirms that AI models tend to increase overall image aesthetics. On average, AI-edited images have higher scores ($\mu = 5.56$) compared to human-edited images ($\mu = 5.18$) and the original source images ($\mu = 5.32$).
Moreover, we compare changes in aesthetic ratings from our human study to examine potential correlations between AI performance (winning or losing) and human preference. A detailed breakdown of human preferences based on variations in aesthetic scores is presented in Figure 4-d.
We observe that AI-generated edits typically achieve higher aesthetic scores than human edits, regardless of comparison outcome. AI edits become increasingly favored when their aesthetic scores surpass those of human edits. Nonetheless, human raters generally prefer human-generated edits overall, indicating that the LAION aesthetic score alone does not fully align with human preferences.
Different image editing requests necessitate varying types and levels of editing depending on the subject and required action, ranging from subtle, precise modifications (e.g., "remove the quote", Figure 2-a) to complete transformations (e.g., "cartoonize this image", Figure 2-f). While our results show human edits are generally preferred, AI models have shown they can handle a subset of these diverse requests at a rate competitive with human editors. This capability raises a question: what is the overall percentage of requests that AI models can already handle?
We estimate the overall percentage of real-world image editing requests that AI models can effectively handle by calculating a weighted average based on the distribution of action verbs in our dataset. Specifically, we define the effectiveness of AI as the percentage of tasks where AI-generated edits either outperform (win) or perform competitively with (tie) human edits. For each verb category, we multiply the combined AI win and tie rate by the proportion that the verb contributes to the overall dataset: $\sum_{v=1}^{V} D_v \, AI_v \times 100 = 33.35\%$, where $D_v$ represents the proportion of dataset requests associated with action verb $v$, $AI_v$ denotes the combined percentage of AI wins and ties for edits corresponding to verb $v$, and $V$ is the number of action verbs (15 in our taxonomy). Based on this calculation, we estimate that current AI tools effectively handle about one-third of real-world image editing requests.
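For concreteness, the weighted average above can be reproduced in a few lines; the per-verb shares and win+tie rates below are placeholders rather than our measured values.

```python
# Illustrative computation of the weighted estimate above. The per-verb shares (D_v)
# and AI win+tie rates (AI_v) are placeholders, not our measured values.
verb_share = {"delete": 0.329, "add": 0.15, "adjust": 0.12, "other": 0.401}   # D_v (sums to 1)
ai_win_or_tie = {"delete": 0.30, "add": 0.35, "adjust": 0.40, "other": 0.33}  # AI_v

handled = sum(verb_share[v] * ai_win_or_tie[v] for v in verb_share) * 100
print(f"Estimated share of requests AI can handle: {handled:.2f}%")
```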
In this study, we compared generative AI edits with human edits to understand the gap between the current capabilities of AI models and actual user needs. Current AI tools excel in specific tasks such as object removal and outpainting, effectively extending images and filling in missing details. However, in real-world scenarios, current-generation models can adequately fulfill only about one-third of user editing requests. Their main constraints stem from their tendency to introduce unintended modifications beyond the targeted editing region and to inadvertently alter essential characteristics, such as the identity of individuals.
Our dataset annotation and taxonomy generation rely on language models, potentially introducing biases and inaccuracies. Additionally, certain AI image editing models were inaccessible during our evaluation, leading to their absence from our human analysis.