After a Google engineer (let’s call them Stevie) has decided on a hypothesis for improving search results with a machine learning system, they will most likely pre-train the model with known data.
Basically, Stevie will give the system large volumes of data with known results. In the case of search, that means queries, their results, and a rating of each result’s success on a numeric scale. Is it a good result? Does it fulfill the needs of the user?
So right out of the gate, we see user intent coming into play. The entire system is built on the likelihood of a result meeting that intent.
One likely way to gather the data for this training stage would be to pull queries and their results, then determine which engagement factors mark a successful results page for each query. A SERP for “weather victoria bc” would have very different success metrics than one for “best rowing machines for a home gym”.
They would need vast amounts of data for large numbers of queries, along with a grading of the various elements on each results page, including featured snippets, knowledge panels, those ten blue links, and any other elements.
This would be fed into the system to pre-train it.
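To make the shape of that training data a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration only: the `LabeledResult` record, the element names, the 1-to-5 rating scale, and the example URLs and scores are made up, and nothing here reflects how Google actually stores or grades its data.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class LabeledResult:
    """One graded SERP element (hypothetical structure)."""
    query: str
    element: str   # e.g. "featured_snippet", "knowledge_panel", "blue_link"
    url: str
    rating: int    # assumed scale: 1 (poor) to 5 (fully satisfies the user)


def mean_rating_by_element(examples, query):
    """Average the ratings of each SERP element type for a single query."""
    ratings = {}
    for ex in examples:
        if ex.query == query:
            ratings.setdefault(ex.element, []).append(ex.rating)
    return {element: mean(values) for element, values in ratings.items()}


# Made-up examples; the ratings are invented purely for illustration.
EXAMPLES = [
    LabeledResult("weather victoria bc", "knowledge_panel", "https://example.com/weather", 5),
    LabeledResult("weather victoria bc", "blue_link", "https://example.com/forecast", 2),
    LabeledResult("weather victoria bc", "blue_link", "https://example.com/climate", 4),
    LabeledResult("best rowing machines for a home gym", "blue_link", "https://example.com/rowers", 4),
    LabeledResult("best rowing machines for a home gym", "featured_snippet", "https://example.com/top-10", 3),
]

weather_summary = mean_rating_by_element(EXAMPLES, "weather victoria bc")
rowing_summary = mean_rating_by_element(EXAMPLES, "best rowing machines for a home gym")
```

Summarizing per query matters because, as noted above, the two SERPs are judged on very different success metrics, so their ratings should not be pooled together.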
What is essentially going on during this stage is that Stevie has created a hypothesis about improving search results using machine learning, say, one about which entities are used on a page. For example, if a page is about tourism in New York, which entities should exist on a page likely to satisfy a user?
The hypothesis is that the presence of certain entities on a page or site could suggest that those resources will provide a higher level of user satisfaction. It might have started when Stevie noticed, while planning a trip, that most of the successful pages related to tourism in NYC mention the boroughs of Queens, the Bronx, Brooklyn, Manhattan, and Staten Island. From there, Stevie may have considered that there are likely other entities common to pages that satisfy users for the query, and that this is likely true for other cities and perhaps for other queries and query types.
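Stevie's hunch can be sketched as a simple comparison: which entities show up far more often on pages that satisfied users than on pages that did not? The Python below is only a toy illustration under stated assumptions. The function names, the `min_gap` threshold, and the page data are all hypothetical, and a real system would learn these signals from data rather than use a hand-set rule.

```python
from collections import Counter


def entity_coverage(pages):
    """Return {entity: share of pages that mention it}; each page is a set of entities."""
    counts = Counter(entity for entities in pages for entity in set(entities))
    return {entity: count / len(pages) for entity, count in counts.items()}


def candidate_entities(successful, unsuccessful, min_gap=0.5):
    """List entities whose coverage on successful pages exceeds their
    coverage on unsuccessful pages by at least min_gap (an assumed threshold)."""
    good = entity_coverage(successful)
    bad = entity_coverage(unsuccessful)
    return sorted(e for e, share in good.items() if share - bad.get(e, 0.0) >= min_gap)


# Made-up pages for an NYC tourism query, each reduced to the entities it mentions.
successful_pages = [
    {"Queens", "Bronx", "Brooklyn", "Manhattan", "Staten Island"},
    {"Queens", "Bronx", "Brooklyn", "Manhattan", "Staten Island", "Central Park"},
    {"Queens", "Bronx", "Brooklyn", "Manhattan", "Staten Island"},
]
unsuccessful_pages = [
    {"Manhattan", "Times Square"},
    {"Manhattan"},
]

candidates = candidate_entities(successful_pages, unsuccessful_pages)
```

In this toy data, Manhattan appears on every page, successful or not, so it does not separate the two groups, while the other four boroughs do. That is the kind of pattern the hypothesis is betting on.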
From there an algorithm would be trained with data like: