I’ve been fascinated with the problem of improving search results for a long time, but have never really had a chance to try any ideas out on an actual dataset. Recently my dad suggested that I grab the data for the Netflix Prize and play with it. This had not occurred to me before, but the data is very relevant to the problem of general web search: groups of users ranking items based on their relative personal value.
I have no pretensions about seriously competing for the million dollar prize, but I do think that playing with the dataset should (at worst) give me a bit of hands-on experience with data processing, and might allow me to solidify some of my ideas about the relative ranking of items, as determined by a group of similar users. Unfortunately, my formal statistics training is lacking, but it should still be fun to play with. I am considering finding a statistics class to audit next semester, since it’s a skill I should have, even if my degree doesn’t directly require it.
I’m interested in this as a personal exploratory project, so I do not plan to seriously scour the web for the techniques that other participants are using, or to studiously read the blogs of all the top teams. If something comes up, I’m not going to actively ignore it, but I want to develop my own ideas.
I plan to use Python for the data processing, but may need to be somewhat clever, since I’m likely to run into memory limitations quickly when working with a dataset this large.
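As a rough sketch of what "somewhat clever" might mean, one option is to avoid building a Python object per rating and instead keep the data in compact typed arrays from the standard library. The snippet assumes the training files look like a movie ID line ending in a colon followed by `customer_id,rating,date` lines; the function name `load_ratings` and the choice of array type codes are my own illustration, not anything prescribed by the dataset.

```python
from array import array


def load_ratings(lines):
    """Parse rating lines into three compact parallel arrays.

    Assumed input format (an assumption about the training files):
        "<movie_id>:"                      starts a new movie
        "<customer_id>,<rating>,<date>"    one rating for that movie
    """
    movie_ids = array("H")  # unsigned short: enough for a few tens of thousands of movies
    user_ids = array("I")   # unsigned int: typically 4 bytes per entry
    stars = array("B")      # unsigned char: ratings are 1 to 5
    current_movie = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.endswith(":"):
            current_movie = int(line[:-1])
            continue
        user, rating, _date = line.split(",")
        movie_ids.append(current_movie)
        user_ids.append(int(user))
        stars.append(int(rating))
    return movie_ids, user_ids, stars


# Hypothetical sample lines, just to show the shape of the input.
sample = [
    "1:",
    "1488844,3,2005-09-06",
    "822109,5,2005-05-13",
    "2:",
    "2059652,4,2005-09-05",
]
movie_ids, user_ids, stars = load_ratings(sample)
```

Because `load_ratings` accepts any iterable of lines, an open file object can be passed straight in and is read one line at a time. Storing each rating as a few bytes in parallel arrays, rather than as a tuple of Python integers, should cut memory use substantially, at the cost of dropping the dates in this simple version.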
Some of the initial questions and ideas I want to play with:
- How does the current average rating (as displayed to a user at the moment they rate the movie) affect users’ ratings? Can we correct for this trend?
- Which movies experience extremely disparate ratings (many 1 and 5 star ratings)? Can we use these movies as strong indicators of how we should group users?
- (Following from the previous question) How extreme are users? Should we weight the opinions of users who rate many movies 1 or 5 stars differently from those of users who tend towards the middle? Should we be figuring out how to correct for this?
- What happens if we binarize the data, and make everyone an extreme user? Can we do something useful with this, and then work the average ratings back into the picture?
- What IS the median rating for the data? Is it 3?
- Can we identify those people who seem to be unaffected by the average rating? Can we use their ratings in a valuable way? Should they form the basis for individual groups into which we can put the people who are swayed by the current star rating? Can we even categorize people into these two groups?
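Several of the questions above (the median rating, how extreme each user is, which movies polarize people, and what binarized data looks like) reduce to simple counting. Here is a minimal sketch on made-up toy data; the user names, the `extreme_fraction` helper, and the decision to treat 3 stars as a "dislike" when binarizing are all assumptions for illustration.

```python
from collections import Counter
from statistics import median

# Hypothetical toy data: (user, movie, stars) triples.
ratings = [
    ("ann", "m1", 5), ("ann", "m2", 1), ("ann", "m3", 5),
    ("bob", "m1", 3), ("bob", "m2", 4), ("bob", "m3", 3),
    ("cat", "m1", 1), ("cat", "m2", 2), ("cat", "m3", 5),
]


def overall_median(ratings):
    """Median star value across every rating."""
    return median(stars for _, _, stars in ratings)


def extreme_fraction(ratings, key):
    """Fraction of 1-star or 5-star ratings, grouped by key(rating)."""
    total = Counter()
    extreme = Counter()
    for rating in ratings:
        group = key(rating)
        total[group] += 1
        if rating[2] in (1, 5):
            extreme[group] += 1
    return {group: extreme[group] / total[group] for group in total}


def binarize(ratings, threshold=3):
    """Map stars to +1 (above threshold) or -1 (at or below it).

    Where to put the threshold is an arbitrary choice in this sketch.
    """
    return [(u, m, 1 if s > threshold else -1) for u, m, s in ratings]


user_extremeness = extreme_fraction(ratings, key=lambda r: r[0])
movie_polarization = extreme_fraction(ratings, key=lambda r: r[1])
binary_ratings = binarize(ratings)
```

On the toy data, `user_extremeness` separates a user who only gives 1s and 5s from one who never does, and `movie_polarization` applies the same measure per movie. Whether a plain fraction is the right notion of "extreme" is an open question; a variance-based measure might work better on the real data.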
Some of these questions are easier to answer than others, and some could probably be answered by a simple Google search, but I think they represent a good starting point for building analysis tools.
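One such tool, for the question about the average rating a user sees, could reconstruct a running average per movie from the rating dates. This is only a proxy: as far as I know the dataset does not record what average was actually displayed, and dates are given per day, so the order of same-day ratings is unknown. The function name and the tuple layout below are hypothetical.

```python
from collections import defaultdict


def ratings_with_prior_average(ratings):
    """Attach to each rating the mean of earlier ratings of the same movie.

    ratings: iterable of (movie, date, stars), with ISO dates (YYYY-MM-DD)
    so that string order matches chronological order.
    Returns a list of (movie, date, stars, prior_average); prior_average
    is None for the first rating of a movie.
    """
    by_movie = defaultdict(list)
    for movie, date, stars in ratings:
        by_movie[movie].append((date, stars))

    result = []
    for movie, rows in by_movie.items():
        rows.sort(key=lambda row: row[0])  # oldest first
        running_total = 0
        for count, (date, stars) in enumerate(rows):
            prior = running_total / count if count else None
            result.append((movie, date, stars, prior))
            running_total += stars
    return result


# Hypothetical sample, deliberately out of date order.
sample = [
    ("m1", "2005-01-03", 5),
    ("m1", "2005-01-01", 3),
    ("m1", "2005-01-02", 4),
]
rows = ratings_with_prior_average(sample)
```

Comparing each rating to its `prior_average` (for example, looking at the difference between the two per user) would be one way to start asking who tends to follow the crowd and who does not.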