Analysis Report


In this project, we attempt to answer the question, "What 'types' of firms were hurt more or less by COVID-19?" First, we identify risk factors that may help define these firm 'types'. Then, we perform very basic natural language processing on the 10-Ks of S&P 500 firms to detect which of them are most susceptible to the risks we have identified. Finally, we perform the analysis and correlation below to draw our conclusions.

Load our resulting csv file

Risk Measurements

The risk measurements this project identifies are competition, litigation, and supply chain. To measure each, every S&P 500 firm's 10-K is searched for keywords that, when they appear near each other, imply that the company has that characteristic. Every time such a match is found, a 'hit' is recorded. The total number of hits is then saved in the firms_df data frame.

These risk measurements were chosen to identify weaknesses within companies that may have played a role in their success or decline during the first stage of the COVID-19 lockdown in March of 2020. I suspected that firms in highly competitive markets, with high litigation concerns, or with already-present weaknesses in their supply chains would likely be affected by the added pressure of the impending pandemic.

When the risk measurements were calculated from each firm's 10-K, the number of hits per firm varied by metric. Some risk measurements returned between about 0 and 6 hits, while others reached about 20. This seems reasonable, as some firms likely do not have significant problems with some of the metrics, so nothing regarding them appears in the firm's 10-K.

Validation Checks

Methodology for Competition

Let's discuss the methodology used to find keywords in each firm's 10-K. This process was repeated for every risk measure keyword search. We will use the search for competitive firms as an example.

To find out which firms have high competition, I search for the strings "competition", "competitor", or "compet" within 5 words of "with", "against", "great", "intens", "risk", or "susceptible", as seen in the line of code below.

[index, 'compet_hits'] = len(re.findall(

This uses the "NEAR_regex" function to build a regular expression that matches the two groups of words when they appear within 5 words of each other. re.findall then returns every match in the text of the 10-K, and len counts them. That total is stored in the "compet_hits" column of the firms_df data frame.

Let's look at an example from Accenture's 10-K. The phrase reads, "There is intense competition for scarce talent with market-leading skills and capabilities in new technologies...." The code above recognizes this sentence as an indication that Accenture is a highly competitive firm, so it adds a "hit" to the column, "compet_hits".
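The proximity search described above can be sketched as follows. This is not the project's NEAR_regex function, whose source does not appear in this report. The names near_pattern and count_hits are hypothetical, and the sketch is a simplified stand-in that handles exactly two groups of terms, allows partial-word matches, and ignores case.

```python
import re

def near_pattern(first, second, max_words_between=5):
    """Hypothetical stand-in for a proximity regex (not the project's NEAR_regex).

    `first` and `second` are regex alternations such as "litigat|lawsuit".
    The pattern matches a word containing one term followed, within
    `max_words_between` intervening words, by a word containing the other.
    """
    word_a = r"\w*(?:%s)\w*" % first
    word_b = r"\w*(?:%s)\w*" % second
    # Up to N whole words between the two terms, then the separating non-word characters.
    gap = r"(?:\W+\w+){0,%d}?\W+" % max_words_between
    # Accept either order: first...second or second...first.
    return r"\b(?:%s%s%s|%s%s%s)\b" % (word_a, gap, word_b, word_b, gap, word_a)

def count_hits(text, first, second, max_words_between=5):
    """Count non-overlapping proximity matches in `text`, ignoring case."""
    pattern = near_pattern(first, second, max_words_between)
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

# The two example sentences quoted in this report.
accenture = "There is intense competition for scarce talent with market-leading skills"
aes = "There is ongoing uncertainty, and significant litigation, regarding"

compet_hits = count_hits(accenture, "compet",
                         "with|against|great|intens|risk|susceptible")
litigat_hits = count_hits(aes, "litigat|lawsuit",
                          "significant|concern|weakness|liabil|vulnerab|against")
```

Under this sketch, each example sentence produces one hit. Because re.findall returns non-overlapping matches, a passage with several keywords packed closely together may count as fewer hits than a reader would tally by hand.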


To find out which firms have litigation concerns, I use the following line of code.

[index, 'litigat_hits'] = len(re.findall(NEAR_regex(['(litigat|lawsuit)','(significant|concern|weakness|liabil|vulnerab|against)'],5,partial=True),text))

This uses the same function as we saw above with competition. Let's look at an example of this working. AES's 10-K states, "There is ongoing uncertainty, and significant litigation, regarding...." This sentence produces an intended hit for the litigation hits column.

Supply Chain

To find out which firms have supply chain concerns, I make 3 separate searches to observe the risk measure in multiple ways. The code for this is below.

# find and save supply chain hits 1
[index, 'sply_ch_hits1'] = len(re.findall(NEAR_regex(['(suppl|supply chain)','(concern|weakness|liabil|vulnerab|risk|susceptible|challeng|chang)'],5,partial=True),text))
# find and save supply chain hits 2
[index, 'sply_ch_hits2'] = len(re.findall(NEAR_regex(['(resource|material)','(scarc|difficult|challeng|chang)'],5,partial=True),text))
# find and save supply chain hits 3
[index, 'sply_ch_hits3'] = len(re.findall(NEAR_regex(['(supply|suppli)','(bankrupt|fail|failure|failed|difficult|competition|challeng|chang)'],5,partial=True),text))

The first line of code looks for "supply" or "supply chain" near a word that indicates a weakness being described. The second line finds instances where a resource or material is described as scarce, difficult, or challenging to acquire. Finally, the third line looks for supply or a supplier mentioned near bankruptcy, failure, difficulty, or some other challenge. Each of these produces hits on phrases that indicate a supply chain issue, just as the searches did in the competition and litigation examples.

Final Sample

Our final sample is the data set in firms_df. Let's analyze this data to ensure that it makes sense. First, we will look at the first five rows and the shape of the table.

This data set has 505 rows, which makes sense, since it began with the original S&P 500 firm data from input/sp500_firms.csv, which also has 505 firms. Now, let's look at the risk measurement hit counts.

Given the count, almost every firm has a hit count recorded for it. Not all of them have a value because only 492 of the firms had 10-Ks to download. Of course, if there is no 10-K for a firm, there is no text to search.
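The difference between the number of rows and the number of recorded hit counts can be checked as in the minimal sketch below. The data frame toy_df and its tickers and values are made up for illustration and stand in for firms_df. Note that a firm with zero hits still counts as having a value, while a firm with no 10-K is missing.

```python
import numpy as np
import pandas as pd

# Toy stand-in for firms_df; tickers and hit counts are invented.
toy_df = pd.DataFrame({
    "Symbol": ["AAA", "BBB", "CCC", "DDD"],
    "compet_hits": [3, np.nan, 0, 7],  # NaN marks a firm with no 10-K
})

n_rows = toy_df.shape[0]                          # every firm, with or without a 10-K
n_with_counts = toy_df["compet_hits"].count()     # count() skips missing values
n_missing = toy_df["compet_hits"].isna().sum()    # firms with no hit count at all
```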

Next, let's look at some of the accounting data and our calculated weekly returns.

As seen here, there is a notable amount of accounting data missing. For weekly returns, 490 of the firms had the data available to calculate. A mean weekly return of -0.12 makes sense, since that is about how much the S&P 500 lost that week.


Missing data is certainly an issue for this project's goals. The program that iterates through the 10-Ks has to handle firms for which no 10-K exists. Missing data also has to be accounted for when merging tables throughout the code. In every instance where a merge takes place, the code performs a left merge into the original firms_df table. That way, no rows of firms_df are lost or altered when the other table lacks data for a firm. One issue that was fixed in development was that firms without 10-Ks were having hits recorded from the previous firm's hit count rather than simply having a null value. Also, in computing the correlation, not every firm has a weekly return, and not every firm has a hit count. The .corr() method handles this automatically, since it excludes missing values pairwise by default.
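The two safeguards described above can be illustrated with the sketch below. It is not the project's code: the tables, tickers, the texts dictionary, and the simple str.count search are all assumptions chosen to keep the example small. The left merge keeps every firm, and resetting the hit count to a null value at the top of each iteration prevents a firm without a 10-K from inheriting the previous firm's count.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the project's tables; all values are invented.
firms = pd.DataFrame({"Symbol": ["AAA", "BBB", "CCC"]})
returns = pd.DataFrame({"Symbol": ["AAA", "CCC"], "ret": [-0.10, -0.15]})

# A left merge keeps every row of `firms`; BBB has no return, so it gets NaN.
merged = firms.merge(returns, on="Symbol", how="left")

# Hypothetical 10-K texts; None means no 10-K was available for that firm.
texts = {"AAA": "intense competition", "BBB": None, "CCC": "no match here"}

merged["compet_hits"] = np.nan
for index, row in merged.iterrows():
    hits = np.nan                       # reset so nothing carries over between firms
    text = texts[row["Symbol"]]
    if text is not None:
        hits = text.count("competition")  # placeholder for the real keyword search
    merged.loc[index, "compet_hits"] = hits
```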


Let's calculate and look at our correlations.
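As a minimal sketch of how missing values behave in this calculation, the example below uses an invented toy table rather than firms_df. It compares the pandas .corr() result with the correlation computed by hand on only the rows where both columns are present, which is what pairwise exclusion amounts to for a single pair of columns.

```python
import numpy as np
import pandas as pd

# Invented values; one firm lacks a return and another lacks a hit count.
toy = pd.DataFrame({
    "ret": [-0.10, -0.20, np.nan, -0.05, -0.15],
    "compet_hits": [2, 5, 1, np.nan, 4],
})

# Pearson correlation matrix; rows with a missing value in either column are skipped.
corr_matrix = toy.corr()
corr_auto = corr_matrix.loc["ret", "compet_hits"]

# The same number computed explicitly on the complete rows only.
complete = toy.dropna()
corr_manual = complete["ret"].corr(complete["compet_hits"])
```

One consequence worth keeping in mind is that each correlation may be based on a slightly different set of firms, depending on which values are missing for that pair of columns.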


There is a very weak positive correlation between the returns of the week of March 9, 2020 and competitive firms. This may suggest that there is a small indication that more competitive firms had a more positive return that week than the majority of firms; however, this correlation is not strong enough to make such a conclusion.


The litigation measurement leads to a similar but even weaker conclusion than competition. There is a very weak positive correlation between the returns of the week of March 9, 2020 and firms with a higher litigation concern. This may suggest that there is a small indication that firms with more litigation concerns had a more positive return that week than the majority of firms; however, this correlation is by no means strong enough to make such a conclusion.

Supply Chain

There is a very weak negative correlation between the returns of the week of March 9, 2020 and firms with more supply chain concerns. This may suggest that there is a small indication that firms with weaker supply chain reliability had a more negative return that week than the majority of firms. Intuitively, such a conclusion may make sense, and the fact that all three supply chain measurements produce a negative correlation lends it some support. However, once again this correlation is not strong enough to make any kind of definitive conclusion.


While the results did not produce the kind of conclusion we may have been looking for, the project could be refined to gather better results in a number of ways. First, more time spent analyzing the language of the 10-Ks could help produce better keywords to use in text processing. Also, including more keywords and variations in keywords might help produce higher hit counts. With higher hit counts, perhaps a clearer correlation could be found, since fewer firms would have zero hits.

The other possibility is that these metrics did not have a very high impact on the variability of the stock prices of these firms. If a conclusion absolutely had to be drawn from this data, the most likely one concerns supply chains, since all three supply chain measurements resulted in a negative correlation, and the second measurement had a correlation of magnitude 0.15, which was the strongest I found. However, it is by far most accurate to simply say that no conclusions can safely be made.