Let’s talk more about building training datasets for refining predictive models.  Public, generic data sources are rarely as good as the ones you can build and curate yourself.  A training dataset enables you to classify questions by the presumed intention of the user, which can be inferred via the kind of “deep learning” made possible by AI.  After logging into, say, Kaggle and downloading a suitable dataset, it can be loaded into a data frame in, say, Google Colaboratory (“Colab”).  A useful dataset typically groups questions by the kind of (anticipated) answers, culled from patterns noticed in the past.
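To make that concrete, here is a minimal sketch of loading such a file into a pandas DataFrame from within Colab; the file name and the “question”/“category” column names are hypothetical placeholders for whatever the downloaded dataset actually contains.

```python
# Minimal sketch: loading a Kaggle-style question dataset into a DataFrame in Colab.
import pandas as pd
from google.colab import files

uploaded = files.upload()              # pick the CSV you downloaded from Kaggle
df = pd.read_csv("questions.csv")      # hypothetical file name

# Hypothetical columns: the question text and the (anticipated) answer category.
print(df[["question", "category"]].head())
print(df["category"].value_counts())   # how the questions group by category
```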

When the output from the training dataset is evaluated, it’s important that each category is trained separately, evaluated separately, and then aggregated.  The resulting model can be tested with Google Search Console data: a Google Data Studio report can be put together, cloned, and used to extract long search queries from the GSC thereafter.  (The GSC data can be exported by clicking the three dots at the top right of the report in VIEW mode.)  The training data can be uploaded to Colab, which can then yield predictions.  In the training dataset used for any predictive model, the point is to classify the queries pulled from the GSC by INTENT.
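As a sketch of that evaluation step, assuming a held-out DataFrame `test_df` with a “category” column of true labels and a “predicted” column of model output (both names are placeholders), scikit-learn can report per-category metrics and then an aggregate score:

```python
# Sketch: per-category evaluation, then aggregation, with scikit-learn.
from sklearn.metrics import classification_report, f1_score

y_true = test_df["category"]    # assumed true intent labels on a held-out split
y_pred = test_df["predicted"]   # assumed model output for the same rows

# Each intent category gets its own precision, recall, and F1 score.
print(classification_report(y_true, y_pred))

# Then aggregate across categories; macro averaging weights small intents equally.
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
```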

You can use Google Data Studio to pull potential questions from the GSC.  The model can then be used to classify the questions exported from Data Studio.  For the training dataset, you can group queries by surmised intention (considering word vectors, embeddings, and encoders/decoders), then extract actionable insights to see how best to optimize content.
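A rough illustration, assuming the queries were exported from the Data Studio report to a CSV and that `model.predict` stands in for whatever classifier was trained above (the file and column names are hypothetical):

```python
# Sketch: classifying the exported queries and grouping them by surmised intent.
import pandas as pd

queries = pd.read_csv("gsc_queries.csv")              # hypothetical Data Studio export
queries["intent"] = model.predict(queries["query"])   # `model` stands in for the trained classifier

# Group by predicted intent to see where content opportunities cluster.
for intent, group in queries.groupby("intent"):
    print(intent, len(group), group["query"].head(3).tolist())
```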

There is a lot of room to improve accuracy by tweaking the model definition and increasing the quantity and quality of the training data; that is where most of the time in deep learning projects is spent.  For machine learning to be used to its full potential, it is also worth pulling CTRs (click-through rates) and search-impressions data from the GSC.  The system can then group keywords (by the thousands) according to their predicted categories while factoring in impressions and clicks.  The point is to find queries with high search impressions yet low clicks, which helps prioritize content development efforts.
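Here is one way that filtering might look in pandas, assuming the `queries` DataFrame above also carries “impressions” and “clicks” columns pulled from the GSC; the thresholds are arbitrary illustrations, not recommendations:

```python
# Sketch: surfacing queries with high impressions but low clicks, per predicted intent.
queries["ctr"] = queries["clicks"] / queries["impressions"].clip(lower=1)

# Arbitrary illustrative thresholds: more than 1,000 impressions, under 2% CTR.
opportunities = (
    queries[(queries["impressions"] > 1000) & (queries["ctr"] < 0.02)]
    .sort_values("impressions", ascending=False)
)

# Which intents hold the biggest untapped impression volume?
print(opportunities.groupby("intent")["impressions"].sum().sort_values(ascending=False))
```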

There are some do-it-yourself approaches to compiling a useful training dataset.  For instance, you can download a dataset from Kaggle, then upload it to Colab.  All that’s left to do is write the (Python) code to get predictions from the test data.  You can learn to build an automated intent classification model by leveraging a pre-trained BERT encoder and Google Data Studio.  With AI at your disposal, you can also build chatbots to automate tasks using Python and BigQuery.  (Recall that Google has the ability to execute TensorFlow models in BigQuery.)  A TensorFlow model trained in Colab can therefore be used to run predictions directly from BigQuery.  Here, one imports the (partially) trained model.
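As a hedged sketch of that BigQuery route: BigQuery ML can register a TensorFlow SavedModel exported from Colab and then run predictions over queries already stored in a table.  The project, dataset, table, bucket, and column names below are placeholders, and the model’s serving input is assumed to be named `input`.

```python
# Sketch: registering a TensorFlow SavedModel with BigQuery ML and predicting in BigQuery.
from google.colab import auth
from google.cloud import bigquery

auth.authenticate_user()                         # Colab authentication
client = bigquery.Client(project="my-project")   # hypothetical project id

# Import the exported model (stored in a Cloud Storage bucket) into BigQuery ML.
client.query("""
    CREATE OR REPLACE MODEL `my-project.seo.intent_model`
    OPTIONS (MODEL_TYPE='TENSORFLOW',
             MODEL_PATH='gs://my-bucket/intent_model/*')
""").result()

# Run predictions over queries already sitting in a BigQuery table.
rows = client.query("""
    SELECT *
    FROM ML.PREDICT(MODEL `my-project.seo.intent_model`,
                    (SELECT query AS input FROM `my-project.seo.gsc_queries`))
""").result()
for row in rows:
    print(dict(row))
```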

Thus, after availing yourself of Colab, you can automate the culling of insights (regarding user intent) with BigQuery and Data Studio.

Bear in mind that using deep learning generally requires writing advanced Python code.  The relevance of Python in SEO continues to grow.  That said, new AI tech enables you to classify text with deep learning without having to write a lot of code.

It is in this environment that SEO experts must conduct intent classification.  One encoder that can be used is BERT (Bidirectional Encoder Representations from Transformers).  There are two primary advantages of using BERT over traditional encoders: its bidirectional word embeddings and the language model it leverages through transfer learning.  Recall that BERT was one of the first models to harness the prodigious power of NLP (Natural Language Processing).  Note, though, that BERT has since been surpassed by a newer model called XLNet.
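For illustration, here is a fine-tuning sketch using the Hugging Face transformers library with a pre-trained BERT encoder; the label count, variable names, and hyperparameters are assumptions, not a prescribed recipe.

```python
# Sketch: fine-tuning a pre-trained BERT encoder for intent classification.
import torch
from transformers import (BertTokenizerFast, BertForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=5)

# `train_texts` / `train_labels` are assumed to come from the training DataFrame above.
encodings = tokenizer(train_texts, truncation=True, padding=True, return_tensors="pt")

class IntentDataset(torch.utils.data.Dataset):
    """Wraps the tokenized queries and their intent labels for the Trainer."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: v[idx] for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="intent-bert", num_train_epochs=3),
    train_dataset=IntentDataset(encodings, train_labels),
)
trainer.train()
```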

There is also Google’s Cloud TPU, a machine-learning ASIC designed to accelerate neural-network models on Google Cloud; it is what powers Google Translate, Search, and even Gmail.  Several pre-trained models typically take days to train, yet SEO experts can fine-tune them in hours (or even just minutes) by making use of Cloud TPUs.
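In Colab, attaching to a TPU runtime with TensorFlow 2.x looks roughly like the following; the toy Keras model is only there to show that anything built inside the strategy scope is replicated across the TPU cores.

```python
# Sketch: connecting a Colab notebook (with the TPU runtime selected) to a Cloud TPU.
import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

# Any Keras model built inside this scope runs across the TPU cores.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(5, activation="softmax"),   # 5 intent classes, for illustration
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```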

Once the model is trained, you can go ahead and test it on new questions, which can be grabbed from the GSC.  Thus the process of perpetual SEO improvement is set in motion, with user INTENT factored into content optimization.