AI/ML Projects of Robert J. Yates

Time Series Forecasting

Paper Poster

Project

A Wind Power Forecasting Problem
Kaggle project title: Global Energy Forecasting Competition "A wind power forecasting problem: predicting hourly power generation up to 48 hours ahead at 7 wind farms"
URL: https://www.kaggle.com/c/GEF2012-wind-forecasting
Programming Language: R

Setup:

Goal: predict power at 7 wind farms.
Inputs: 48 hour look ahead forecast for wind speed, wind direction, and the zonal and meridional wind components at each of the farms
Outputs: for each of the 7 wind farms, normalized wind power measurements between zero and one.
Model Identification and Training period: Sept. 1st, 2009 to Sept. 30th, 2010
Evaluation period: Oct. 1st, 2010 to Dec. 31st, 2010
In Evaluation period: Repeated 72 hour pattern. First 36 hours: we are given the generated wind power from 0 to 1. Remaining 48 hours: Missing data of generated wind power, to be predicted by our model.

Evaluation: Initial models

Used two time series prediction models: ARIMA and locally weighted regression.
ARIMA model: trains itself based on the hourly generated power over all of the training set
Locally Weighted Regression model: gives more regression weight to the training instances that are similar to the 36 hour given data in the prediction intervals. But, this often leads to huge error in the prediction.
ARIMA model predicts better than Locally Weighted Regression model, even though ARIMA model tries to average out the prediction.

Evaluation: Improved models

Improved these models by adding wind speed features that we chose in our feature selection procedure through correlation to the generated power.
New models, ARIMAX and Locally Weighted Regression with Wind Speed, had a quite reasonable performance improvement over original models, ARIMA and Locally Weighted Regression, and beat the benchmark method by almost a factor of 2.

Social Network Analysis

Paper Presentation Poster

Project

Trust and Helpfulness in Amazon Reviews
Reference 1: C. Danescu-Niculescu-Mizil, G. Kossinets, J. Kleinberg, and L. Lee, “How opinions are received by online communities: a case study on Amazon.com helpfulness votes,” in Proceedings of the 18th international conference on World wide web, pp. 141–150, ACM, 2009.
Reference 2: G. Wang, S. Xie, B. Liu, and P. S. Yu, “Review graph based online store review spammer detection,” in Data Mining (ICDM), 2011 IEEE 11th International Conference on, pp. 1242– 1247, IEEE, 2011.
Programming Language: Python
Data: “Web data: Amazon reviews” dataset available via the course website. The full dataset contains 34,686,770 reviews.
We focused on a particular dataset referred to as the “Fine Foods” dataset. The “Fine Foods” subset contains 568,454 reviews from 256,059 users.
Each review has a number of positive ratings and negative ratings called helpfulness votes.

Setup

Motivation: Amazon has over 34 million reviews. While many reviews are helpful, there are also fake reviews written to skew product ratings: Book authors may have fake reviews say how great their book is, and manufacturers may write to fake reviews trashing a competitor’s product. Though users can vote a review as either helpful or unhelpful, previous ratings on a product influence how a user votes: If high variance in review star ratings (reviewers disagree), then polarizing reviews (1 or 5 stars) are voted as most helpful. If low variance in review star ratings (reviewers agree), then average reviews (near product mean) are voted as most helpful.[1]
Problem Statement: How do we tell which reviews are authentic and identify fraudulent reviewers?
Use graph based algorithm to determine the honesty of users’ reviews and trustworthiness of the user performing the reviews.[2]
We modify the algorithm for our purposes to use a single store and apply it to the "Fine Foods" Amazon review data. We analyze the user trust ratings computed from the algorithm with the user helpfulness metric in order to classify fraudulent users.

Using Review Helpfulness Votes To Obtain a Helpfulness Metric for Users

Motivation: We want to determine the helpfulness rating of a user from all number of positive helpful and negative helpful (unhelpful) votes that all his/her reviews receive. Suppose a user wrote 2 reviews, the first one has 95 positive votes and 5 negative votes, the second has 1 positive votes and 3 negative votes.
Method 1 of User Helpfulness calculation: sum total of helpful votes only / sum total of both helpful and unhelpful votes. The user’s rating is 96/104 where the range is 0.0 to 1.0.
Method 2 of User Helpfulness calculation: sum total of helpful votes only - sum total of unhelpful votes only. The user described above now has a helpfulness rating of +88 where the range is negative infinity to positive infinity.
Using Method 1, users with a small number of total votes achieve perfect helpfulness score. In Method 2, we achieve the desired quality where most users now have ratings close to a mean (zero), and just a few outliers have a large positive or negative score. Method 2 solves the problem of rating users who have few helpfulness votes in their reviews.

Algorithm To Compute User Trust

User trust (-1 distrust to +1 trust) Users are trusted if their reviews are honest, users are untrusted if their reviews are dishonest.
Item reliability (-1 unreliable to +1 reliable) Items are reliable if they have high scores by trusted users, items are unreliable if they have low scores by trusted users.
Review honesty (-1 dishonest to +1 honest) First compute review agreement: it is high when review agrees with majority trusted opinion. Review agreement is low when it disagrees with majority trusted opinion. Second compute review honesty: Normalize review agreement and take into account item reliability.
User trust, product reliability and review honesty are computed through iterative updates: Each variable is initially set 1 and updated many times in a loop that interdependently computes their values.

Conclusion: Relating User Helpfulness (from votes) with User Trust (from our algorithm)

When just outlier values are considered, there is a strong correlation between user trust (computed from our algorithm) and user helpfulness (calculated from the second method). For example, 87% of users with helpfulness ratings less than -100 have trustworthiness ratings less than -0.9. So our algorithm using pure graph analysis could be helpful in fraudulent user detection.
Additionally, a bot that rates every product 5 stars out of 5 will fool our algorithm into thinking it is a helpful user. This is due to the skewed ratings in the data where the products, fine foods, are most commonly reviewed as 5 stars and second most commonly reviewed as 4 stars. Textual analysis of review text, such as searching for common spam phrases, would likely improve our results.

Evolutionary Dynamics

Paper

Project

Increased Frequency of Defectors Using Evolutionary Spatial Modeling of the Snowdrift game
Report on the snowdrift game and literature review of its evolutionary spatial modeling.
Reference 1: Hauert, Christoph; and Doebeli, Michael (2004). “Spatial structure often inhibits the evolution of cooperation in the snowdrift game.” Nature, 428(6983), 643-646.

Background

Economists rely on the process of rational decision making, while evolutionary biologists think of natural selection as the decision maker. Evolutionary game theory models the learning of a population of agents instead of individual agents.
Definition 1: Replicator dynamic of a single population = The agents within a population all play pure strategies, with the only difference between agents being which pure strategy they choose. After every time update, the agents reproduce according to their fitness or total payoff. The frequency of agents playing each pure strategy is analyzed to determine if these frequencies converge to fixed value as time approaches infinity.
Definition 2: Evolutionarily stable strategy (ESS) = An evolutionarily stable strategy is a mixed strategy that is resistant to invasion by new strategies. A strategy is an ESS if the payoff of just the ESS strategy is higher than the payoff of the ESS strategy and some new invader strategy. If all members of a population are playing an evolutionarily stable strategy (ESS), then another strategy could not successfully invade the population.

Snowdrift game

In the Snowdrift game, two agents work to obtain a common benefit (b) at some shared cost (c) to the cooperating agents.
The game can be explained as follows: Suppose there is snowdrift on the highway late at night. Two drivers on this road are returning to their respective homes and driving in opposite directions. One driver encounters the snowdrift on the north side, the other on the south side. If both drivers shovel out the snow (both cooperate, C), then they both get the benefit of going home (b) minus the split cost of shoveling (c/2). If neither driver shovels the snow (both defect, D), then neither gets to go home so the payoff for each is 0. If only one driver shovels while the other driver does nothing, the driver that shovels (cooperate C) gets the benefit of going home (b) minus the cost of shoveling (c), but the driver that does nothing (defect D) also gets the benefit of going home (b) without the cost.

	C	D
C	b-(c/2)	b-c
D	b	0

For the standard game, the best response to cooperating is defecting and the best response to defecting is cooperating.
The evolutionarily stable strategy of this game is for each player to play C (cooperate) with probability (2b−2c)/(2b-c) and each player to play D (defect) with probability c/(2b−c).
In the snowdrift game, we set the independent variable r = c/(2b-c), which we call the cost-benefit ratio. r also is the frequency of defectors in a population in a well-mixed population of agents. When the cost c is almost as large as the benefit b, then there is a high cost-benefit ratio, r. When the cost c is much smaller than the benefit b, then there is a low cost-benefit ratio, r.

Evolutionary spatial structure model

Hauert et al. investigate spatial structure in evolutionary snowdrift. [1] They model a population of agents over time in a spatial structure, where individuals in a population are at points in a regular lattice. They let individuals have two strategies, either always cooperate or always defect. Over time, they repeatedly perform synchronous or asynchronous updates of points. When each point is updated, then the point’s occupant and nearby neighbors compete in the snowdrift game. The individual that wins the point for a given update, will have its offspring occupy the site in the next round.
The results of the spatial model are as follows: When r, the frequency of defectors, is low, the spatial model results in a higher frequency of cooperators and a lower frequency of defectors than in a well-mixed population without the spatial model. However, when r is medium or high, the spatial model results in a higher frequency of defectors and a lower frequency of cooperators than in a well-mixed population without the spatial model. What is interesting is that prior research has shown that the spatial model promotes cooperation. Remember, for a high r = c/(2b−c), the cost is almost large as the benefit. When r is low, the cost is much less than the benefit. When the cost is greater than the benefit, the snowdrift game becomes an instance of the prisoner’s dilemma. Even with a high r, the spatial model should promote cooperation as it does for the prisoner’s dilemma, but it doesn't here for the snowdrift game.
Hauert et al. [1] set the independent cost-benefit ratio r = c/(2b−c) to determine the frequency of cooperators. There are two thresholds r1 = 0.2, r2 = 0.65. When they set the cost-benefit ratio r < r1, the frequency of cooperations increases to 100% while the defectors vanish. If r > r2, the frequency of cooperations decreases to 0% while the defectors dominate. A smaller N means a higher r1 and a lower r2. This occurs for both synchronous updates and asynchronous updates.
In the spatial structure, the prisoner’s dilemma game cooperators clump together into a few large clusters. This is due to the fact that in prisoner’s dilemma, the highest payoff for a cooperator agent is when his opponent is also a cooperator. In the snowdrift game cooperators are distanced themselves into a many smaller ”t” like structures. This is due to the fact that in the snowdrift game, the only payoff for a defector agent is when his opponent is a cooperator. Since the defector receives no playoff when he plays another defector, he is strongly incentivized to be near and play to a cooperator. Without loss of generality, the same results apply to neighborhood sizes 3 ≤ N ≤ 8 as well.

Conclusion

Spatial structure typically promotes the evolution of cooperation. However, for the special case of the snowdrift game with cost values slightly to moderately smaller than benefit values, the spatial structure actually promotes the evolution of defection instead. In a real-life situation, the “closer” you are interpersonal-wise to another player, the more likely you are to defect with the costs are slightly less than, but no larger than, the benefits. This makes sense since the optimal strategy is to do the opposite of what you think the other agent will do. (If you think the other agent will cooperate, you will defect. If you think the other agent will defect, you will cooperate.)
In short, you are better off cooperating when playing the snowdrift game against strangers in a well-mixed population. If you only play people you know well and the cost is almost as large as the benefit, then you are better off defecting.