Summary
In this project, I try to detect topics among Quora questions using unsupervised Natural Language Processing (NLP) machine learning methods. Using textual data from more than 10,000 questions, I implement a Latent Dirichlet Allocation (LDA) model and fit it to the pre-processed questions. After tuning the model for the optimal number of topics (n=20), the final topics, projected into a 2-dimensional space with t-SNE, can be visualized below together with the distribution of words representing them.
Quora question topics from the optimal-coherence LDA model (n=20, t-SNE projection)
Introduction
Quora, Quora, Quora… what is it, anyway?
To put it simply, Quora is a place to gain and share knowledge about anything. It's a platform to ask or answer questions and connect with people who contribute unique insights. This platform empowers people to learn from each other and to better understand themselves and the world. Based on today's (2022.04.29) Alexa audience metrics, Quora ranks #65 in global internet traffic and engagement over the past 90 days. As such, it is an invaluable source of information for understanding what people worldwide are asking themselves and others about.
In this project, the aim was to understand exactly this: what are people worldwide asking themselves and others about? Using a Natural Language Processing (NLP) and machine learning approach, I try to detect the topics that can be extracted from questions posted by Quora users.
Exploratory Data Analysis
The Quora dataset used for this analysis can be found here. Each row of the dataset is a unique question retrieved from Quora, represented by a unique id (qid) and the question's textual data (question_text). The dataset contains a total of 375,806 questions. Provided below is a glimpse of the first 10 questions in the dataset:
import pandas as pd

# Load the data
quora = pd.read_csv('./gdrive/MyDrive/test.csv.zip')
quora.head(10)
index | qid | question_text |
---|---|---|
0 | 0000163e3ea7c7a74cd7 | Why do so many women become so rude and arrogant when they get just a little bit of wealth and power? |
1 | 00002bd4fb5d505b9161 | When should I apply for RV college of engineering and BMS college of engineering? Should I wait for the COMEDK result or am I supposed to apply before the result? |
2 | 00007756b4a147d2b0b3 | What is it really like to be a nurse practitioner? |
3 | 000086e4b7e1c7146103 | Who are entrepreneurs? |
4 | 0000c4c3fbe8785a3090 | Is education really making good people nowadays? |
5 | 000101884c19f3515c1a | How do you train a pigeon to send messages? |
6 | 00010f62537781f44a47 | What is the currency in Langkawi? |
7 | 00012afbd27452239059 | What is the future for Pandora, can the business reduce its debt? |
8 | 00014894849d00ba98a9 | My voice range is A2-C5. My chest voice goes up to F4. Included sample in my higher chest range. What is my voice type? |
9 | 000156468431f09b3cae | How much does a tutor earn in Bangalore? |
Text pre-processing
Tokenization and stopword removal
Tokenization is the first step in any NLP pipeline. It describes the process of breaking a stream of textual data into words, terms, sentences, symbols, or other meaningful elements called tokens, the building blocks of natural language. Many open-source tools are available to perform tokenization. Here we use the gensim and spaCy Python libraries to pre-process and tokenize the textual data of the Quora questions.
import spacy
from spacy.tokenizer import Tokenizer
import nltk
from nltk.corpus import stopwords as nltk_stopwords

# Load spaCy's small English pipeline with spacy.load()
nlp = spacy.load('en_core_web_sm') # Builds a basic Language object to perform NLP
# Instantiate a tokenizer based on the pipeline's vocabulary
tokenizer = Tokenizer(nlp.vocab)
# Custom stopwords to add
custom_stopwords = ['hi','\n','\n\n', '&', ' ', '.', '-', 'got', "it's", 'it’s', "i'm", 'i’m', 'im', 'want', 'like', '$', '@', ':', '--', 'w/', "'s", "?", "people", "best", "good"] # Add custom stopwords here if you see any!
# Combine spaCy's default stopwords, NLTK's English stopwords, and the custom list
nltk.download('stopwords')
STOP_WORDS = nlp.Defaults.stop_words.union(custom_stopwords).union(nltk_stopwords.words('english'))
print("Number of stopwords: ", len(STOP_WORDS))
import gensim

# Define a custom function wrapping simple_preprocess() from gensim
def preprocess_data(txt):
    # Lowercase, tokenize, and remove accents (deacc=True)
    word_list = gensim.utils.simple_preprocess(txt, deacc=True)
    return ' '.join(word_list) # Join the list of words back into a sentence

# Apply to the question text column
quora['gensim_preprocess'] = quora['question_text'].apply(preprocess_data)
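As a quick sanity check of what simple_preprocess() does (lowercasing, de-accenting, and dropping very short or very long tokens), we can compare a raw question with its pre-processed version; note in the table further below how the single-character token "a" and the trailing "?" disappear:

# Compare the raw first question with its pre-processed version
print(quora['question_text'][0])
print(quora['gensim_preprocess'][0])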
# Get tokens from the questions, dropping stopwords
tokens = []
for doc in tokenizer.pipe(quora['gensim_preprocess'], batch_size=1000):
    doc_tokens = []
    for token in doc:
        if token.text.lower() not in STOP_WORDS:
            doc_tokens.append(token.text.lower())
    tokens.append(doc_tokens)

# Store the tokens in a new column
quora['tokens'] = tokens
# Visualize tokenization
quora.iloc[:,-3:]
index | question_text | gensim_preprocess | tokens |
---|---|---|---|
0 | Why do so many women become so rude and arrogant when they get just a little bit of wealth and power? | why do so many women become so rude and arrogant when they get just little bit of wealth and power | women,rude,arrogant,little,bit,wealth,power |
1 | When should I apply for RV college of engineering and BMS college of engineering? Should I wait for the COMEDK result or am I supposed to apply before the result? | when should apply for rv college of engineering and bms college of engineering should wait for the comedk result or am supposed to apply before the result | apply,rv,college,engineering,bms,college,engineering,wait,comedk,result,supposed,apply,result |
2 | What is it really like to be a nurse practitioner? | what is it really like to be nurse practitioner | nurse,practitioner |
3 | Who are entrepreneurs? | who are entrepreneurs | entrepreneurs |
4 | Is education really making good people nowadays? | is education really making good people nowadays | education,making,nowadays |
5 | How do you train a pigeon to send messages? | how do you train pigeon to send messages | train,pigeon,send,messages |
6 | What is the currency in Langkawi? | what is the currency in langkawi | currency,langkawi |
7 | What is the future for Pandora, can the business reduce its debt? | what is the future for pandora can the business reduce its debt | future,pandora,business,reduce,debt |
8 | My voice range is A2-C5. My chest voice goes up to F4. Included sample in my higher chest range. What is my voice type? | my voice range is my chest voice goes up to included sample in my higher chest range what is my voice type | voice,range,chest,voice,goes,included,sample,higher,chest,range,voice,type |
9 | How much does a tutor earn in Bangalore? | how much does tutor earn in bangalore | tutor,earn,bangalore |
Lemmatization
For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set. That’s precisely what lemmatization does, with the goal of reducing inflectional forms and sometimes derivationally related forms of a word to a common base form.
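Before wiring the lemmatizer into the pipeline, here is a quick illustration of why the part-of-speech (POS) tag matters, a minimal sketch using NLTK's WordNetLemmatizer (the same lemmatizer used below):

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()
# With the default POS (noun), verb forms are left untouched
print(lemmatizer.lemmatize('organizing'))               # organizing
print(lemmatizer.lemmatize('organizing', wordnet.VERB)) # organize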
# Write a lemmatization function based on nltk.stem.WordNetLemmatizer()
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

nltk.download('averaged_perceptron_tagger') # To use nltk.pos_tag()
nltk.download('wordnet') # To use WordNetLemmatizer
nltk.download('punkt') # NLTK tokenizer models

# Lemmatize with POS tag
def get_wordnet_pos(word):
    """Map POS tag to the first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

# Function to lemmatize a list of tokens
wordnet_lemmatizer = WordNetLemmatizer() # Instantiate once, outside the function calls
def lemmatize_text(txt):
    return [wordnet_lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in txt]

# Apply lemmatize_text() to the tokenized text
quora['lemma'] = quora['tokens'].apply(lemmatize_text)
# Visualize tokenization and lemmatization
quora.iloc[:,-4:]
index | question_text | gensim_preprocess | tokens | lemma |
---|---|---|---|---|
0 | Why do so many women become so rude and arrogant when they get just a little bit of wealth and power? | why do so many women become so rude and arrogant when they get just little bit of wealth and power | women,rude,arrogant,little,bit,wealth,power | woman,rude,arrogant,little,bit,wealth,power |
1 | When should I apply for RV college of engineering and BMS college of engineering? Should I wait for the COMEDK result or am I supposed to apply before the result? | when should apply for rv college of engineering and bms college of engineering should wait for the comedk result or am supposed to apply before the result | apply,rv,college,engineering,bms,college,engineering,wait,comedk,result,supposed,apply,result | apply,rv,college,engineering,bm,college,engineering,wait,comedk,result,suppose,apply,result |
2 | What is it really like to be a nurse practitioner? | what is it really like to be nurse practitioner | nurse,practitioner | nurse,practitioner |
3 | Who are entrepreneurs? | who are entrepreneurs | entrepreneurs | entrepreneur |
4 | Is education really making good people nowadays? | is education really making good people nowadays | education,making,nowadays | education,make,nowadays |
5 | How do you train a pigeon to send messages? | how do you train pigeon to send messages | train,pigeon,send,messages | train,pigeon,send,message |
6 | What is the currency in Langkawi? | what is the currency in langkawi | currency,langkawi | currency,langkawi |
7 | What is the future for Pandora, can the business reduce its debt? | what is the future for pandora can the business reduce its debt | future,pandora,business,reduce,debt | future,pandora,business,reduce,debt |
8 | My voice range is A2-C5. My chest voice goes up to F4. Included sample in my higher chest range. What is my voice type? | my voice range is my chest voice goes up to included sample in my higher chest range what is my voice type | voice,range,chest,voice,goes,included,sample,higher,chest,range,voice,type | voice,range,chest,voice,go,include,sample,high,chest,range,voice,type |
9 | How much does a tutor earn in Bangalore? | how much does tutor earn in bangalore | tutor,earn,bangalore | tutor,earn,bangalore |
Detecting topics in Quora questions: an LDA model approach
After the textual data has been correctly pre-processed, we fit an LDA model to the questions and tune the number of topics by maximizing the model's c_v topic coherence. After tuning, the optimal number of topics was found to be n=20. The resulting topics are visualized in the figure at the beginning of this post.
import gensim
from gensim import corpora
from gensim.models import CoherenceModel

# Define a function that loops over the number of topics to find an
# optimal number of topics
def tune_LDA_topics(dictionary, corpus, texts, limit, start=2, step=3):
    """
    Compute c_v coherence for various numbers of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the
    LDA model with the respective number of topics
    """
    coherence_values_topic = []
    model_list_topic = []
    for num_topics in range(start, limit, step):
        print('\nStarting num_topics = ', num_topics)
        lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                                    id2word=dictionary,
                                                    num_topics=num_topics,
                                                    random_state=47,
                                                    update_every=1,
                                                    chunksize=100,
                                                    passes=10,
                                                    alpha='auto',
                                                    per_word_topics=True)
        model_list_topic.append(lda_model)
        coherencemodel = CoherenceModel(model=lda_model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values_topic.append(coherencemodel.get_coherence())
        print('\nFinishing num_topics = ', num_topics)
    return model_list_topic, coherence_values_topic
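The tuning below runs on sub_quora, a subsample of the full dataset; the summary mentions more than 10,000 questions, but the exact sampling step is not shown in this post. A minimal, hypothetical sketch of how such a subsample could be drawn:

# Hypothetical subsample: the size (10,000) is an assumption based on the
# summary above; the actual sample used may differ
sub_quora = quora.sample(n=10000, random_state=47).reset_index(drop=True)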
# Create the dictionary (token -> id mapping)
id2word = corpora.Dictionary(sub_quora['lemma'])
# Create the corpus
texts = sub_quora['lemma']
# Bag-of-words representation: (token id, frequency) pairs per document
corpus = [id2word.doc2bow(text) for text in texts]

model_list_topic, coherence_values_topic = tune_LDA_topics(dictionary=id2word,
                                                           corpus=corpus,
                                                           texts=sub_quora['lemma'],
                                                           start=2, limit=52, step=3)
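Finally, to produce a figure like the one at the top of this post, one can pick the model with the highest coherence and project its document-topic distributions into two dimensions. This is a minimal sketch, assuming scikit-learn's TSNE and matplotlib; the exact plotting code behind the original figure is not shown here:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Select the model with the highest c_v coherence (n=20 in this post)
best_idx = int(np.argmax(coherence_values_topic))
best_model = model_list_topic[best_idx]
n_topics = best_model.num_topics

# Build a dense document-topic matrix from the fitted model
doc_topics = np.zeros((len(corpus), n_topics))
for i, bow in enumerate(corpus):
    for topic_id, prob in best_model.get_document_topics(bow, minimum_probability=0.0):
        doc_topics[i, topic_id] = prob

# Project the document-topic distributions to 2D with t-SNE and color
# each question by its dominant topic
tsne = TSNE(n_components=2, random_state=47)
embedding = tsne.fit_transform(doc_topics)
dominant_topic = doc_topics.argmax(axis=1)
plt.scatter(embedding[:, 0], embedding[:, 1], c=dominant_topic, cmap='tab20', s=3)
plt.title('Quora question topics (t-SNE projection)')
plt.show()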