Nora Holder, 2024
Does This Look Good on Me?
Forcing a Machine Learning Program to Pick My Outfits,
Because I’m Too Indecisive.
Abstract: Considering the base concept of the "Paradox of Choice" (giving a person more options delays their response and worsens their overall satisfaction), I noticed that picking a good outfit is a recurring problem for both myself and my peers. Starting from dozens of "outfit A or B" texts and evolving into a machine learning model, this paper walks through the logistics of building a decision tree model, asks what exactly counts as a "good" outfit, and uses a subjective accuracy measure to judge whether a prediction is truly a good fit or whether the model needs adjustments.
A majority of the population has some routine. Your alarm goes off at some time between 5 and 7 AM. You might work out in the morning and shower after, or maybe you shower in the evening. There are tons of variables, but one thing we are all required to do is figure out an outfit. For tons of people this is nothing major: random shirt, random pants, random shoes. It is a simple process until you consider other factors: what's the weather? Are you going to work? What clothes are comfortable for the occasion, and are they ready? It's menial to some, but I've noticed how much of a damper it can be on my routine. I wake up at 6, do all my showering and such by 6:45, and then I have until 7:50 to pick an outfit, put on makeup, and leave. I have been a few minutes late at least once a week simply because I don't know what I want to wear, or what I'd like to wear, so what happens when I take that choice away from myself?
To reiterate the problem: I need an optimized program that I can hand a few attributes and that hands me back an outfit (or multiple outfits). For this I need to decide what the attributes are, how to value them, and how to account for a user being indifferent about a given attribute. It also calls for an easy, non-manual way to insert new outfits into a given dataset (e.g. a secondary program).
And the solution is just that: two programs. newfit.py takes new outfits and inserts them into two datasets (the purpose of which will be discussed later), and generatefit.py generates at minimum two outfits, one from each dataset, from a given set of attributes.
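The scripts themselves are not reproduced in this paper, so the following is a minimal sketch of what newfit.py could look like. The column names, the attribute vocabulary, and the rule used to expand an outfit into its "Indifferent" variants are my assumptions for illustration, not the project's actual implementation.

# newfit.py -- illustrative sketch only (assumed column order:
# Style, NoJacket, LightJacket, HeavyJacket, Color, Occasion, Comfort, Label)
import csv
from itertools import combinations

def add_outfit(attrs, label,
               clothes_path="clothes.csv", full_path="fulldataset.csv"):
    """Append one outfit to clothes.csv (exact values only) and a set of
    'Indifferent' variants to fulldataset.csv."""
    # Exact-value entry for the small dataset.
    with open(clothes_path, "a", newline="") as f:
        csv.writer(f).writerow(attrs + [label])

    # One plausible expansion rule: every subset of attributes replaced by
    # "Indifferent" (the paper does not specify how the ~13,000 variant rows
    # in the full dataset are actually produced).
    with open(full_path, "a", newline="") as f:
        writer = csv.writer(f)
        for k in range(len(attrs) + 1):
            for idxs in combinations(range(len(attrs)), k):
                variant = [("Indifferent" if i in idxs else v)
                           for i, v in enumerate(attrs)]
                writer.writerow(variant + [label])

if __name__ == "__main__":
    # Example outfit: the seven attribute values plus a short label.
    add_outfit(["Typical", "Standard", "Fall", "Winter",
                "Yellow", "Everyday", "Standard"],
               "Y_BDown_Bl_Slacks_Y_Docs")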
The reason this decision is so paralyzing is often attributed to executive dysfunction, but that alone does not explain why these decisions paralyze us; that's where the paradox of choice appears. In a 2009 study by Antti Oulasvirta, Janne P. Hukkinen, and Barry Schwartz, participants were tested on how they responded to extended search results versus a shorter list. Those given 6 choices were able to complete the task faster and with significantly more satisfaction than their peers given 24 items. [1] Most of the work around choice paralysis sits in economic studies, such as the study by Kurien et al., which observed that extended choices for a consumer lead to delays on purchases [2]; in our case it would be a delay in a decision about clothing, or any general decision. This brings us back to the feeling of guilt and dissatisfaction with a result of one's own choosing: our fear to "choose wrong" becomes a deterrent in most eyes, because the reward per added choice does not grow at a rate that outweighs the rate at which the lows of guilt and consequence grow, creating a net loss [3]. Since this is a niche (and personal) project, there was no previous work contributing direct knowledge on the specific concept of "pick outfits for me".
[Figure: outfits from New York Fashion Week 2019, via Li, "From One-Legged Pants to No Pants at All, These Are the Weirdest NYFW Trends," Teen Vogue (see Citations).]
So, with our project come some concerns. How do we measure "accuracy" in something subjective? For starters, someone's opinion on a "good outfit" is subjective. The example to the left is not meant to gawk at or mock fashion; rather, I personally find all of these outfits lovely. I'm sure others have higher praise or harsher criticism, and some may find these downright ridiculous. I am not a fashion major, nor do I plan to become one for the sake of this project.
There are a plethora of ways one could approach this model. The questions I'm asking are (1) what attributes could be assigned to these outfits in a dataset, and (2) in which cases we would count a program run as a "positive" return. After observing a handful of my personal outfits along with general fashion, I reduced the attributes down to 7 values: (1) the "style"[1] of the outfit, (2-4) the weather the outfit suits without a jacket, with a light jacket, and with a heavy jacket, (5) the predominant color, (6) the occasion/event, and (7) the comfort. Every insert into the dataset also includes a label, which is a brief description of the main outfit pieces: shirt, pants, shoes[2]. In cases where two values of an attribute fit a given outfit interchangeably (maybe an outfit works both everyday and at a party), two outfit insertions are needed, one for each. One of the biggest concerns with this project is that, while it is a classification task, it is very rare for any given item to have a double entry (excluding the cases mentioned previously).
The solution I found was to give two outputs (a very minimal set of choices): one from the original dataset, which contains only exact values, and one from the full dataset, which accounts for variants with "Indifferent" in several attributes, giving more flexibility for an outfit. Both datasets are read into the system, which creates decision tree models using Python's scikit-learn library, trains one on each dataset, and returns a prediction (decision) from each. Consider the following example: we input the requested attributes and get the results below, along with my personal decision on whether I would wear each on that given day.
Input:
Style | Weather (No Jacket) | Weather (Light Jacket) | Weather (Heavy Jacket) | Color | Occasion | Comfort
Typical | Standard | Fall | Winter | Yellow | Everyday | Standard

Outputs:
Dataset | Label | Would I Wear?
clothes.csv | "W_Shirt_Cr_Slacks_Y_Docs" | Maybe
fulldataset.csv | "Y_BDown_Bl_Slacks_Y_Docs" | Yes
(In the original document, portions of each output label were color-coded as having no relation, partial relation, or correct relation to the requested attributes.)
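For concreteness, here is a minimal sketch of the generatefit.py pipeline described above: one scikit-learn decision tree trained per dataset, each returning a prediction for the requested attributes. The column names, file layout, and the ordinal encoding of the categorical values are assumptions made for illustration rather than the project's actual code.

# generatefit.py -- illustrative sketch only (assumed column names and layout)
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

FEATURES = ["Style", "NoJacket", "LightJacket", "HeavyJacket",
            "Color", "Occasion", "Comfort"]

def predict_outfit(csv_path, request):
    """Train a decision tree on one dataset and return its predicted label
    for a single requested set of attributes."""
    data = pd.read_csv(csv_path)

    # Decision trees need numeric features, so the categorical attribute
    # values are ordinal-encoded; request values never seen in the dataset
    # fall back to -1.
    encoder = OrdinalEncoder(handle_unknown="use_encoded_value",
                             unknown_value=-1)
    X = encoder.fit_transform(data[FEATURES])
    tree = DecisionTreeClassifier().fit(X, data["Label"])

    query = encoder.transform(pd.DataFrame([request], columns=FEATURES))
    return tree.predict(query)[0]

if __name__ == "__main__":
    request = ["Typical", "Standard", "Fall", "Winter",
               "Yellow", "Everyday", "Standard"]
    # One prediction from each dataset, as in the example above.
    for path in ("clothes.csv", "fulldataset.csv"):
        print(path, "->", predict_outfit(path, request))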
Measuring human desire/goal is flawed at best, so for evaluation purposes the two datasets were evaluated on the same responses with a point system. There were two scoring variants: Variant 1 treats a possible outfit the same as a definite one (1 point for a yes or a maybe, 0 for a no), while Variant 2 values definite answers significantly more (5 for a yes, 2 for a maybe, 0 for a no). For simplicity on a small-scale data model, only 30 tests were run (a month's worth of outfits); to emphasize, a perfect score under Variant 1 would be 30 points, compared to 150 points under Variant 2. The tests included one that was simply seven Indifferents, to see what outfit would be returned with zero true requests. It could be assumed this gives fullDataset an upper hand, as clothes.csv has no Indifferent categories. There were also prompts for which no entry fit the request "exactly", from which neither dataset would gain a clear advantage. It is also important to note that we did not take advantage of the option to give multiple choices for a given prompt; that option was not evaluated due to concerns about time and a reasonable evaluation model.
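As a small sketch of this scoring, assuming each day's reaction is recorded as a plain "yes"/"maybe"/"no" string (this helper is illustrative and not one of the project's scripts):

# Scoring the 30 daily responses under both variants (illustrative only).
def score_responses(responses):
    """responses: list of 'yes' / 'maybe' / 'no' strings, one per test day."""
    # Variant 1: a possible outfit counts the same as a definite one.
    variant1 = sum(1 for r in responses if r in ("yes", "maybe"))
    # Variant 2: definite answers are worth significantly more.
    points = {"yes": 5, "maybe": 2, "no": 0}
    variant2 = sum(points[r] for r in responses)
    return variant1, variant2  # maxima of 30 and 150 over a 30-day run

# 22 definite yeses and 8 nos reproduce clothes.csv's row in the table below.
print(score_responses(["yes"] * 22 + ["no"] * 8))  # (22, 110)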
Surprisingly, the results did not show a significant difference between the two datasets' accuracy, with only a 3% margin. It is important to note that while clothes.csv returned fewer wearable outfits (a yes or a maybe), all of its wearable outfits were definite yes values.

Dataset | Variant 1 | Variant 2 | "Accuracy"[3]
clothes.csv | 22/30 | 110/150 | 73.33%
fullDataset.csv | 25/30 | 83/150 | 76.33%
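As a worked check of the weighting in footnote [3], clothes.csv's "accuracy" is 0.75 × (22/30) + 0.25 × (110/150) ≈ 73.33%, and fullDataset.csv's is 0.75 × (25/30) + 0.25 × (83/150) ≈ 76.33%, matching the table above.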
Another crucial note is that none of these results would be considered "perfect", as their accuracy does not fall near 95%. This could be for several reasons. For starters, the generated test set did not guarantee that an entry exactly matching the requested attributes existed in either dataset, which could skew the results negatively; it also mirrors a form of user error, where a user has not inputted an outfit matching the given request and is thus returned an inaccurate response.
4.2 Viability, Logistics, & Retests With a Refined Experiment Set
On a rerun of our program using exclusively examples found within the set, the data displayed a significantly better outcome.

Dataset | Variant 1 | Variant 2 | "Accuracy"
clothes.csv | 28/30 | 140/150 | 93.33%
fullDataset.csv | 27/30 | 123/150 | 88%

The exact-value dataset (clothes.csv) returned only two inaccurate values, both of which were entirely inaccurate. The full dataset returned 27 viable items, but 4 of them were "maybe" values rather than all "yes" responses. The margin is still minimal, which justifies outputting predictions from models trained on both datasets on every program execution.
Considering the small scope of a personal project, this would be considered a working success. The base clothes.csv file only contains 200 values (and the fullDataset is around 13,000), and as time passes and the dataset gains diversity, future accuracy tests would likely report over 95%. There is also the question of whether this model is less optimal than another. I would advise against a neural network, due to the sheer size of the fullDataset, but if you have the time it could be considered. I could see switching to a random forest model to test for greater accuracy, or creating a model with 3 labels (although that would require rebuilding the dataset from the ground up).
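If the random forest route were taken, the change would be small; a hypothetical sketch, reusing the same assumed encoding pipeline as the decision tree sketch above:

# Hypothetical random forest variant of the same pipeline (not part of the
# project); only the estimator changes relative to the decision tree sketch.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OrdinalEncoder

FEATURES = ["Style", "NoJacket", "LightJacket", "HeavyJacket",
            "Color", "Occasion", "Comfort"]

def predict_outfit_rf(csv_path, request):
    data = pd.read_csv(csv_path)
    encoder = OrdinalEncoder(handle_unknown="use_encoded_value",
                             unknown_value=-1)
    X = encoder.fit_transform(data[FEATURES])
    # An ensemble of trees in place of the single decision tree.
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(X, data["Label"])
    query = encoder.transform(pd.DataFrame([request], columns=FEATURES))
    return forest.predict(query)[0]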
Overall, the program satisfies the objective, even if on occasion it may require multiple runs and/or attribute selections. Even then, a rather daunting (but basic) choice is reduced from tens of combinations to fewer than 5. The paradox of choice will not, in most cases, be entirely extinguished, but it is greatly mitigated; and if that speeds up your routine, that's all that matters.
Citations
Antti Oulasvirta, Janne P. Hukkinen, and Barry Schwartz. 2009. When more is less: the paradox of choice in search engine use. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval (SIGIR '09). Association for Computing Machinery, New York, NY, USA, 516–523. https://doi.org/10.1145/1571941.1572030
Rony Kurien, Anil Rao Paila, Asha Nagendra. Application of Paralysis Analysis Syndrome in Customer Decision Making. Procedia Economics and Finance, Volume 11, 2014, Pages 323-334, ISSN 2212-5671. https://doi.org/10.1016/S2212-5671(14)00200-7 (https://www.sciencedirect.com/science/article/pii/S2212567114002007)
Schwartz, B. (2004). The tyranny of choice. Scientific American, 290(4), 70–75. http://www.jstor.org/stable/26047678
Li, Michelle. “From One-Legged Pants to No Pants at All, These Are the Weirdest NYFW Trends.” Teen Vogue, Teen Vogue, 12 Sept. 2019, www.teenvogue.com/story/new-york-fashion-week-2019-weirdest-trends.
“1.10. Decision Trees.” Scikit, scikit-learn.org/1.5/modules/tree.html. Accessed 14 Dec. 2024.
[1] Note: style, like all of the given attributes, is subjective; every person's dataset will be personalized.
[2] Various extras like jewelry, makeup, etc. are significantly easier to choose once given an outfit, so they are not included for the sake of simplicity.
[3] Accuracy was based 75% on the first variant of testing, basing score out of 30, and 25% on the second variant, basing a score out of 150.