Portfolio Website
A model to invest in the S&P 500 and find correlations between the top tech stocks and the S&P 500
Using a predictive analysis method, a time series model, to buy and sell stocks occasionally (not for day trading), and finding correlations between popular tech stocks and the S&P 500
Steps taken for this time series model
1. Download stock data for the S&P 500.
2. Clean and visualize the data.
3. Set up the machine learning target.
4. Train an initial model.
5. Evaluate error and create a way to backtest.
6. Accurately measure the data over a long period of time.
Step 1: Download the S&P 500
The stock information comes from Yahoo Finance, which is why I import yfinance.
I queried all available S&P 500 history using history(period="max").
Lastly, I looked at the index of the S&P 500 data to see how far back the data goes.
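A minimal sketch of this step, assuming the ^GSPC ticker is used for the S&P 500 index:

import yfinance as yf

# Query the full available history of the S&P 500 index (^GSPC)
sp500 = yf.Ticker("^GSPC")
sp500 = sp500.history(period="max")

# The index holds the date of each trading day; check how far back it goes
print(sp500.index)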
Step 2: Clean and visualize the data
We visualize the data by plotting the index (the dates) on the x-axis and the closing price on the y-axis.
We then deleted the Dividends and Stock Splits columns because we are not using them.
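A sketch of the cleanup and plot described above:

# Plot the closing price over time (the dates are the index)
sp500.plot.line(y="Close", use_index=True)

# Drop the columns we will not use
del sp500["Dividends"]
del sp500["Stock Splits"]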
Step 3: Set up the machine learning target
The Tomorrow column holds tomorrow's closing price; shift(-1) pulls each row's Close from the following day, because we are predicting tomorrow's price from today's data.
The Target column records whether the stock will go up or down; it is what we will be predicting with machine learning.
The comparison asks whether tomorrow's price is greater than today's price, which returns a boolean (True or False); we then convert it to an integer so the model can use the data.
Then, I removed all data before 1990, since market behavior from older eras is less relevant, using sp500.loc["1990-01-01":].copy(). The .copy() at the end avoids the copy warning that pandas sometimes raises.
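Putting the target setup together, a minimal sketch:

# Tomorrow's closing price, pulled from the next row
sp500["Tomorrow"] = sp500["Close"].shift(-1)

# Target: 1 if tomorrow's close is higher than today's, otherwise 0
sp500["Target"] = (sp500["Tomorrow"] > sp500["Close"]).astype(int)

# Keep only data from 1990 onward; .copy() avoids the pandas copy warning
sp500 = sp500.loc["1990-01-01":].copy()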
Step 4: Train the initial machine learning model
The standard machine learning package, scikit-learn (sklearn), is needed to build the random forest.
A Random Forest Classifier trains many individual decision trees with randomized parameters and averages the results from those trees. Averaging this way makes a random forest less prone to overfitting than many other models. Random forests also run quickly, and they can pick up non-linear tendencies in the data.
n_estimators is the number of decision trees you want to train; usually, the higher the better, up to a limit. min_samples_split protects against overfitting, which can occur if the trees are built too deep. The downside is that the higher min_samples_split is, the less closely the model can fit the data, so you need a middle ground. random_state=1 means that if we run the model twice on the same data, the results will be mostly repeatable ("mostly" because random forests still have some randomness to them).
I split the data into a train and test set in time order, since this is a time series. All of the rows except the last 100 go into the training set (iloc[:-100] drops the last 100 rows), and the last 100 rows form the test set (iloc[-100:]).
The predictors list contains just the columns the model will use as inputs, as in the sketch below.
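A sketch of the model setup and split (n_estimators=100 and min_samples_split=100 are assumed starting values, and the predictor columns are the standard price and volume columns):

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, min_samples_split=100, random_state=1)

# Time-ordered split: everything except the last 100 rows for training,
# the last 100 rows for testing
train = sp500.iloc[:-100]
test = sp500.iloc[-100:]

predictors = ["Close", "Volume", "Open", "High", "Low"]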
model.fit(train[predictors], train["Target"]) trains the model to predict the target from the predictor columns.
The precision score measures how often the market actually went up when the model predicted it would, which suits holding stocks occasionally rather than day trading.
preds = model.predict(test[predictors]) generates the predictions.
preds = pd.Series(preds, index=test.index) turns the raw array of 0s and 1s into a Series indexed by date: a 1 means the model predicts the price will go up, and a 0 means it predicts the price will go down.
When you put this in the precision_score function, you get about 0.636, which is good; anything over 50 percent is good. It means that on roughly 64 percent of the days the model predicted the price would go up, it actually did.
combined = pd.concat([test["Target"], preds], axis=1) combines the actual and predicted values so we can plot them; axis=1 treats each input as a column.
combined.plot() is the function we use to get the plot below from the combined DataFrame.
The orange line (labeled 0) is our predictions and the blue line is the Target, that is, what actually happened. The more closely the orange line tracks the blue line, the better.
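A sketch that ties the training, prediction, and evaluation steps above together:

import pandas as pd
from sklearn.metrics import precision_score

# Train on everything except the last 100 days
model.fit(train[predictors], train["Target"])

# Predict the held-out last 100 days
preds = model.predict(test[predictors])
preds = pd.Series(preds, index=test.index)

# Precision: of the days we predicted "up", how often did the market actually go up?
print(precision_score(test["Target"], preds))

# Plot actual (Target) vs. predicted values side by side
combined = pd.concat([test["Target"], preds], axis=1)
combined.plot()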
Step 5: Evaluate error and create a way to backtest
Building a backtesting system is a more robust way to test our algorithm. Before, we were only testing the last 100 days; now we test across multiple years, which gives the model far more experience. Backtesting wraps everything we did before into one function.
Backtesting requires a certain amount of data to start from: there are about 252 trading days in a year, so a start value of 2,520 rows of the S&P 500 data corresponds to roughly 10 years before the first prediction is made.
all_predictions = [] is a list that will hold one DataFrame of predictions per backtested year.
The training set is all the years prior to the current year
The test set is the current year
predict(train, test, predictors, model) is used to generate our predictions for that year.
return pd.concat(all_predictions) puts all of the predictions together into a single DataFrame.
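A sketch of the two functions described above, assuming a start of 2,520 rows (about ten years at 252 trading days per year) and a one-year step:

def predict(train, test, predictors, model):
    # Train on all prior years, then predict the current year
    model.fit(train[predictors], train["Target"])
    preds = model.predict(test[predictors])
    preds = pd.Series(preds, index=test.index, name="Predictions")
    return pd.concat([test["Target"], preds], axis=1)

def backtest(data, model, predictors, start=2520, step=252):
    all_predictions = []  # one DataFrame of predictions per backtested year
    for i in range(start, data.shape[0], step):
        train = data.iloc[0:i].copy()          # all years prior to the current year
        test = data.iloc[i:(i + step)].copy()  # the current year
        all_predictions.append(predict(train, test, predictors, model))
    return pd.concat(all_predictions)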
predictions["Predictions"].value_counts()
A 1 means the stock went up and a 0 means the stock went down.
When you compute the precision score comparing the Target to the Predictions, you get about 52% precision.
predictions["Target"].value_counts() / predictions.shape[0]<- to look at the percentage of days the market went up and down. This means if you bought a stock market went up and sold it at end of the day, everyday, you would make money
Step 6: Accurately measure the data over a long period of time
This section improves the accuracy of our predictions. The horizons list feeds the rolling means: 2 is 2 trading days, 5 is 5 trading days, 60 is 60 trading days, and so on. It drives the loop below, for horizon in horizons. For each horizon we compute the ratio between today's closing price and the average closing price over that period, to see whether the market has grown a lot (which can indicate a fall coming soon) or vice versa.
f"Close_Ratio_{horizon}" gives us column names such as "Close_Ratio_2", "Close_Ratio_5", "Close_Ratio_60", and so on.
sp500[trend_column] = sp500.shift(1).rolling(horizon).sum()["Target"]: compared to the last time we used shift, we use a positive 1 instead of a negative 1, so each day looks backwards rather than forwards. The rolling sum of the Target over the previous horizon days counts the number of recent days the stock went up.
new_predictors += [ratio_column, trend_column] adds the ratio and trend column names to the list of predictors the model will use, as sketched below.
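A sketch of the loop, assuming horizons of 2, 5, 60, 250, and 1,000 trading days:

horizons = [2, 5, 60, 250, 1000]
new_predictors = []

for horizon in horizons:
    rolling_averages = sp500.rolling(horizon).mean()

    # Ratio of today's close to the average close over this horizon
    ratio_column = f"Close_Ratio_{horizon}"
    sp500[ratio_column] = sp500["Close"] / rolling_averages["Close"]

    # Number of up days over the previous horizon days (shift(1) looks backwards)
    trend_column = f"Trend_{horizon}"
    sp500[trend_column] = sp500.shift(1).rolling(horizon).sum()["Target"]

    new_predictors += [ratio_column, trend_column]

# Drop rows where the rolling windows are not yet full
sp500 = sp500.dropna()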
The higher n_estimators is, the more accurate the model tends to be, up to a point.
predict_proba returns the model's probability that a row is a 0 or a 1, i.e., that the price goes down or up.
preds[preds >= .6] = 1 and preds[preds < .6] = 0 apply a custom threshold. Instead of predicting "up" whenever the probability is over 50%, the model now has to be at least 60% confident the price will go up. This reduces the number of trading days, meaning fewer days on which we predict the price will go up, but it increases the chance that the price actually goes up on those days. That fits our goal of making occasional trades rather than trading every day.
We then run the backtest again, but with the new predictors list, as sketched below.
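A sketch of the updated model and prediction step with the 60% confidence threshold (the higher n_estimators and lower min_samples_split values here are assumptions):

model = RandomForestClassifier(n_estimators=200, min_samples_split=50, random_state=1)

def predict(train, test, predictors, model):
    model.fit(train[predictors], train["Target"])
    # Probability that each row is a 1 (price goes up)
    preds = model.predict_proba(test[predictors])[:, 1]
    # Custom threshold: only predict "up" when at least 60% confident
    preds[preds >= .6] = 1
    preds[preds < .6] = 0
    preds = pd.Series(preds, index=test.index, name="Predictions")
    return pd.concat([test["Target"], preds], axis=1)

# Re-run the backtest with the new predictor columns
predictions = backtest(sp500, model, new_predictors)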
predictions["Predictions"].value_counts()<- compared to page 5. More days went down than up because we changed the threshold. This means we will be buying stocks on fewer days.
Compared to page 5, we are 6% more accurate.
The correlation between the five most popular growth stocks and the S&P 500
Pandas is a software library for data analysis
datetime.now() - pd.DateOffset(years=19) is used to pull stock data from the past 19 years.
The stocks list contains all of the tickers we will use.
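Setting this part up might look like the sketch below (the exact ticker list is an assumption based on the stocks compared later); the loop that follows then fills stock_list:

import pandas as pd
import yfinance as yf
from datetime import datetime

# Assumed tickers: the five tech stocks plus the S&P 500 index (^GSPC)
stocks = ['AAPL', 'AMZN', 'NVDA', 'MSFT', 'GOOGL', '^GSPC']

# Pull data from the past 19 years
end_date = datetime.now()
start_date = end_date - pd.DateOffset(years=19)

stock_list = []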
for stock in stocks:
    data = yf.download(stock, start=start_date, end=end_date)
    stock_list.append(data)
comparestocks = pd.concat(stock_list, keys=stocks, names=['Ticker', 'Date'])
This loop goes through each ticker in the stocks list, one by one, and downloads its data from Yahoo Finance. pd.concat then stacks all of the downloads into comparestocks (you can name it whatever you want). print(comparestocks.head()) and print(comparestocks.tail()) show the first and last few days of the 19-year window.
Next we compute moving averages. A moving average is the average of a data series over a sliding window of time, which is useful for forecasting; we compute 10-day and 20-day moving averages. The rolling mean is the mean of a set number of previous periods in the time series, here 10 and 20. We use reset_index because the earlier filtering removed rows with missing values, leaving an index that is not continuous, and here we need the index to be continuous.
for stock, group in comparestocks.groupby('Ticker'): is just another loop that computes the 10-day moving average and then the 20-day moving average for each ticker, as sketched below.
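A sketch of the moving-average loop (the MA10 and MA20 column names are assumptions):

# Make the index continuous after the earlier filtering
comparestocks = comparestocks.reset_index()

for stock, group in comparestocks.groupby('Ticker'):
    # 10-day and 20-day rolling means of the closing price for this ticker
    comparestocks.loc[group.index, 'MA10'] = group['Close'].rolling(window=10).mean()
    comparestocks.loc[group.index, 'MA20'] = group['Close'].rolling(window=20).mean()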
The comparestocks DataFrame holds all of the companies' tickers and dates, as previously mentioned. Now we use it to measure the volatility of these companies by creating a new column, comparestocks['Volatility']. As you can see, the most volatile stock is NVDA, largely because of the AI boom driving the growth it is having now.
.pct_change() finds the percent change from the previous row in a time series dataset, and .std() measures the variability of that series, so you can see how much variation or consistency there is in each stock's returns.
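A sketch of the volatility calculation (the Daily_Return column name is an assumption):

# Percent change of each ticker's closing price from the previous day
comparestocks['Daily_Return'] = comparestocks.groupby('Ticker')['Close'].pct_change()

# Volatility: standard deviation of each ticker's daily returns
comparestocks['Volatility'] = comparestocks.groupby('Ticker')['Daily_Return'].transform('std')
print(comparestocks.groupby('Ticker')['Volatility'].first().sort_values(ascending=False))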
This chart compares the S&P 500 index against the five most popular tech companies, and you can see the similarities. The index's level is higher because it aggregates about 500 of the largest U.S. companies, including these five tech stocks.
This is the same comparison as the last example, but drawn as an area chart for each ticker, so you can see that each one has roughly the same shape as the S&P 500, meaning they move very similarly.
How strong are the correlations between the top stocks and the S&P 500?
You will need this correlation coefficient scale to understand how strong the correlation is.
.loc selects rows from a dataset to make another dataset. So, I extracted AAPL (Apple) and ^GSPC (the S&P 500) from the comparestocks dataset. I merged AAPL and the S&P 500 with pd.merge, where on='Date' aligns the two series on their shared dates. I plotted the result as a scatter plot with a fitted line so you can see the correlation.
A correlation of 95% is nearly perfect. Yes, there is a near flawless correlation.
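A sketch of the Apple comparison; the plotting here uses matplotlib with a numpy-fitted line, which is an assumption about the exact charting approach:

import numpy as np
import matplotlib.pyplot as plt

# Pull the AAPL and ^GSPC rows out of comparestocks and align them on date
aapl = comparestocks.loc[comparestocks['Ticker'] == 'AAPL', ['Date', 'Close']]
gspc = comparestocks.loc[comparestocks['Ticker'] == '^GSPC', ['Date', 'Close']]
merged = pd.merge(aapl, gspc, on='Date', suffixes=('_AAPL', '_GSPC'))

# Pearson correlation between the two closing-price series
print(merged['Close_AAPL'].corr(merged['Close_GSPC']))

# Scatter plot with a fitted line to visualize the correlation
plt.scatter(merged['Close_GSPC'], merged['Close_AAPL'], s=5)
slope, intercept = np.polyfit(merged['Close_GSPC'], merged['Close_AAPL'], 1)
xs = np.linspace(merged['Close_GSPC'].min(), merged['Close_GSPC'].max(), 100)
plt.plot(xs, slope * xs + intercept)
plt.xlabel('S&P 500 close')
plt.ylabel('AAPL close')
plt.show()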
.loc selects rows from a dataset to make another dataset. So, I extracted AMZN (Amazon) and ^GSPC (the S&P 500) from the comparestocks dataset. I merged AMZN and the S&P 500 with pd.merge, where on='Date' aligns the two series on their shared dates. I plotted the result as a scatter plot with a fitted line so you can see the correlation.
A correlation of 95% is nearly perfect. Yes, there is a near flawless correlation.
.loc selects rows from a dataset to make another dataset. So, I extracted NVDA (Nvidia) and ^GSPC (the S&P 500) from the comparestocks dataset. I merged NVDA and the S&P 500 with pd.merge, where on='Date' aligns the two series on their shared dates. I plotted the result as a scatter plot with a fitted line so you can see the correlation.
A correlation of 90%, when rounded, is nearly perfect. Yes, there is a near flawless correlation.
.loc selects rows from a dataset to make another dataset. So, I extracted MSFT (Microsoft) and ^GSPC (the S&P 500) from the comparestocks dataset. I merged MSFT and the S&P 500 with pd.merge, where on='Date' aligns the two series on their shared dates. I plotted the result as a scatter plot with a fitted line so you can see the correlation.
A correlation of 96%, when rounded, is nearly perfect. Yes, there is a near flawless correlation.
.loc selects rows from a dataset to make another dataset. So, I extracted GOOGL (Google) and ^GSPC (the S&P 500) from the comparestocks dataset. I merged GOOGL and the S&P 500 with pd.merge, where on='Date' aligns the two series on their shared dates. I plotted the result as a scatter plot with a fitted line so you can see the correlation.
A correlation of 98%, when rounded, is nearly perfect. Yes, there is a near flawless correlation.
Conclusion: The S&P 500 is extremely dependent on the success of the five largest companies on the planet. These richest technology companies follow extremely similar trends to the index, so if the index goes up or down, these tech companies have historically followed, and vice versa. Remember that the S&P 500 is an index of America's largest publicly traded companies, so America's economy is being held up by these companies.