Summary


In this project, I try to detect topics among Quora questions using unsupervised machine learning methods from Natural Language Processing (NLP). Using textual data from more than 10,000 questions, I implement a Latent Dirichlet Allocation (LDA) model and fit it to the pre-processed questions. After tuning the model for the optimal number of topics (n=20), the final topics are shown below, projected with t-SNE into a 2-dimensional space, together with the distribution of words representing them.


Quora questions topics from the optimal coherence LDA model (n=20, t-SNE projection)

Introduction


Quora, Quora, Quora… what is it, anyway?

To put it simply, Quora is a place to gain and share knowledge about anything. It’s a platform to ask or answer questions and connect with people who contribute unique insights. This platform empowers people to learn from each other and to better understand themselves and the world. Based on Alexa audience metrics as of 2022.04.29, Quora ranks #65 in global internet traffic and engagement over the past 90 days. As such, it is an invaluable source of information for understanding what people worldwide are asking themselves and others.

In this project, the aim was to understand exactly this: what are people worldwide asking themselves and others? Using Natural Language Processing (NLP) and machine learning, I try to detect which topics can be extracted from the questions of Quora users.

Exploratory Data Analysis


The Quora dataset used for this analysis can be found here. Each row of this dataset is a unique question retrieved from Quora, represented by a unique id (qid) and the question text (question_text). The dataset contains a total of 375,806 questions. Provided below is a glimpse of the first 10 questions in the dataset:

import pandas as pd

# Loading the data
quora = pd.read_csv('./gdrive/MyDrive/test.csv.zip')
quora.head(10)
index | qid | question_text
0 | 0000163e3ea7c7a74cd7 | Why do so many women become so rude and arrogant when they get just a little bit of wealth and power?
1 | 00002bd4fb5d505b9161 | When should I apply for RV college of engineering and BMS college of engineering? Should I wait for the COMEDK result or am I supposed to apply before the result?
2 | 00007756b4a147d2b0b3 | What is it really like to be a nurse practitioner?
3 | 000086e4b7e1c7146103 | Who are entrepreneurs?
4 | 0000c4c3fbe8785a3090 | Is education really making good people nowadays?
5 | 000101884c19f3515c1a | How do you train a pigeon to send messages?
6 | 00010f62537781f44a47 | What is the currency in Langkawi?
7 | 00012afbd27452239059 | What is the future for Pandora, can the business reduce its debt?
8 | 00014894849d00ba98a9 | My voice range is A2-C5. My chest voice goes up to F4. Included sample in my higher chest range. What is my voice type?
9 | 000156468431f09b3cae | How much does a tutor earn in Bangalore?

Text pre-processing


Tokenization and stopwords removal


Tokenization is the first step in most NLP pipelines. It breaks a stream of text into words, terms, sentences, symbols, or other meaningful elements called tokens, the building blocks of natural language. Many open-source tools can perform tokenization. Here we first normalize each question with simple_preprocess() from the gensim Python library, then split it into tokens with a spaCy tokenizer and drop the stopwords.

import gensim
import spacy

# Test use of spacy by using the spacy.load() function
Language = spacy.load('en_core_web_sm') # Loads spaCy's small English pipeline

# Instantiate tokenizer
tokenizer = spacy.tokenizer.Tokenizer(Language.vocab)

# Custom stopwords to add
custom_stopwords = ['hi','\n','\n\n', '&', ' ', '.', '-', 'got', "it's", 'it’s', "i'm", 'i’m', 'im', 'want', 'like', '$', '@', ':', '--', 'w/', "'s", "?", "people", "best", "good"] # Add custom stopwords here if you see any!!

# Customize stop words by adding to the default list.
# NOTE: `stopwords` is not defined in this snippet; it is assumed to be an extra
# collection of stop words created earlier in the notebook.
STOP_WORDS = Language.Defaults.stop_words.union(custom_stopwords).union(stopwords)
print("Number of stopwords: ", len(STOP_WORDS))

# Define custom function to wrap simple_preprocess() from gensim
def preprocess_data(txt):
    # Lowercase, strip accents (deacc=True) and split the text into words
    word_list = gensim.utils.simple_preprocess(txt, deacc=True)
    # Join list of words back into a sentence
    return ' '.join(word_list)

# Apply to column
quora['gensim_preprocess'] = quora['question_text'].apply(preprocess_data)

# Get tokens from questions
tokens = []

for doc in tokenizer.pipe(quora['gensim_preprocess'], batch_size=1000):
    # Keep lowercased tokens that are not stopwords
    doc_tokens = [token.text.lower() for token in doc
                  if token.text.lower() not in STOP_WORDS]
    tokens.append(doc_tokens)

# Makes tokens column
quora['tokens'] = tokens

# Visualize tokenization
quora.iloc[:,-3:]
index | question_text | gensim_preprocess | tokens
0 | Why do so many women become so rude and arrogant when they get just a little bit of wealth and power? | why do so many women become so rude and arrogant when they get just little bit of wealth and power | women,rude,arrogant,little,bit,wealth,power
1 | When should I apply for RV college of engineering and BMS college of engineering? Should I wait for the COMEDK result or am I supposed to apply before the result? | when should apply for rv college of engineering and bms college of engineering should wait for the comedk result or am supposed to apply before the result | apply,rv,college,engineering,bms,college,engineering,wait,comedk,result,supposed,apply,result
2 | What is it really like to be a nurse practitioner? | what is it really like to be nurse practitioner | nurse,practitioner
3 | Who are entrepreneurs? | who are entrepreneurs | entrepreneurs
4 | Is education really making good people nowadays? | is education really making good people nowadays | education,making,nowadays
5 | How do you train a pigeon to send messages? | how do you train pigeon to send messages | train,pigeon,send,messages
6 | What is the currency in Langkawi? | what is the currency in langkawi | currency,langkawi
7 | What is the future for Pandora, can the business reduce its debt? | what is the future for pandora can the business reduce its debt | future,pandora,business,reduce,debt
8 | My voice range is A2-C5. My chest voice goes up to F4. Included sample in my higher chest range. What is my voice type? | my voice range is my chest voice goes up to included sample in my higher chest range what is my voice type | voice,range,chest,voice,goes,included,sample,higher,chest,range,voice,type
9 | How much does a tutor earn in Bangalore? | how much does tutor earn in bangalore | tutor,earn,bangalore
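
The two steps above (normalization into tokens, then stopword filtering) can also be sketched without any NLP library. The snippet below is only an illustration under simplifying assumptions: simple_tokenize, remove_stopwords and TOY_STOP_WORDS are hypothetical names, the regular expression is a rough stand-in for what simple_preprocess() does, and the stopword set is far smaller than the spaCy list used in this post.

```python
import re

# Hypothetical, deliberately tiny stopword set (assumption for illustration);
# the post itself uses spaCy's default list plus custom additions.
TOY_STOP_WORDS = {"what", "is", "the", "in", "how", "do", "you", "to"}

def simple_tokenize(text):
    """Lowercase the text and keep alphabetic runs of at least two letters."""
    return [tok for tok in re.findall(r"[a-z]+", text.lower()) if len(tok) >= 2]

def remove_stopwords(tokens, stop_words=TOY_STOP_WORDS):
    """Drop every token that appears in the stopword set."""
    return [tok for tok in tokens if tok not in stop_words]

question = "What is the currency in Langkawi?"
tokens = simple_tokenize(question)
filtered = remove_stopwords(tokens)
print(tokens)    # all normalized tokens
print(filtered)  # tokens left after stopword removal
```

For this question, the filtered list keeps only currency and langkawi, which matches row 6 of the table above.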

Lemmatization


For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set. That’s precisely what lemmatization does, with the goal of reducing inflectional forms and sometimes derivationally related forms of a word to a common base form.

# Write a lemmatization function based on nltk.stem.WordNetLemmatizer()
import nltk
from nltk.stem import WordNetLemmatizer
# Lemmatize with POS Tag
from nltk.corpus import wordnet

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

nltk.download('averaged_perceptron_tagger') # To use nltk.pos_tag()
nltk.download('wordnet') # To use WordNetLemmatizer
nltk.download('punkt')   # Punkt tokenizer models used by NLTK's tokenizers

# Function to lemmatize text
def lemmatize_text(txt):

  wordnet_lemmatizer = WordNetLemmatizer() # Define word net lemmatizer to use
  txt = [wordnet_lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in txt]

  return txt
  
# Apply lemmatize_text() to tokenized text 
quora['lemma'] = quora['tokens'].apply(lemmatize_text)

# Visualize tokenization and lemmatization
quora.iloc[:,-4:]
index | question_text | gensim_preprocess | tokens | lemma
0 | Why do so many women become so rude and arrogant when they get just a little bit of wealth and power? | why do so many women become so rude and arrogant when they get just little bit of wealth and power | women,rude,arrogant,little,bit,wealth,power | woman,rude,arrogant,little,bit,wealth,power
1 | When should I apply for RV college of engineering and BMS college of engineering? Should I wait for the COMEDK result or am I supposed to apply before the result? | when should apply for rv college of engineering and bms college of engineering should wait for the comedk result or am supposed to apply before the result | apply,rv,college,engineering,bms,college,engineering,wait,comedk,result,supposed,apply,result | apply,rv,college,engineering,bm,college,engineering,wait,comedk,result,suppose,apply,result
2 | What is it really like to be a nurse practitioner? | what is it really like to be nurse practitioner | nurse,practitioner | nurse,practitioner
3 | Who are entrepreneurs? | who are entrepreneurs | entrepreneurs | entrepreneur
4 | Is education really making good people nowadays? | is education really making good people nowadays | education,making,nowadays | education,make,nowadays
5 | How do you train a pigeon to send messages? | how do you train pigeon to send messages | train,pigeon,send,messages | train,pigeon,send,message
6 | What is the currency in Langkawi? | what is the currency in langkawi | currency,langkawi | currency,langkawi
7 | What is the future for Pandora, can the business reduce its debt? | what is the future for pandora can the business reduce its debt | future,pandora,business,reduce,debt | future,pandora,business,reduce,debt
8 | My voice range is A2-C5. My chest voice goes up to F4. Included sample in my higher chest range. What is my voice type? | my voice range is my chest voice goes up to included sample in my higher chest range what is my voice type | voice,range,chest,voice,goes,included,sample,higher,chest,range,voice,type | voice,range,chest,voice,go,include,sample,high,chest,range,voice,type
9 | How much does a tutor earn in Bangalore? | how much does tutor earn in bangalore | tutor,earn,bangalore | tutor,earn,bangalore
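
To make the idea of mapping inflected forms to a common base form concrete, here is a toy sketch that does not rely on WordNet at all. It is not the method used in this post: TOY_LEMMAS and toy_lemmatize are hypothetical names, and the lookup table only covers a handful of forms taken from the sample questions.

```python
# Hypothetical lookup table standing in for WordNet (assumption for illustration);
# it only covers a few inflected forms seen in the sample questions.
TOY_LEMMAS = {
    "women": "woman",
    "entrepreneurs": "entrepreneur",
    "messages": "message",
    "making": "make",
    "goes": "go",
    "included": "include",
}

def toy_lemmatize(tokens, lemmas=TOY_LEMMAS):
    """Replace each token by its base form when known, otherwise keep it unchanged."""
    return [lemmas.get(tok, tok) for tok in tokens]

lemmatized = toy_lemmatize(["train", "pigeon", "send", "messages"])
print(lemmatized)
```

A real lemmatizer such as WordNetLemmatizer replaces the fixed table with a dictionary lookup guided by the part-of-speech tag, which is why the code above passes a POS tag for every word.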

Detecting topics in Quora questions: an LDA model approach

After the textual data has been pre-processed, we fit an LDA model to the questions and tune the number of topics by maximizing the c_v coherence score. After tuning, the optimal number of topics was found to be n=20. The topics can be visualized in the first figure at the beginning of this post.
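
Before fitting, each question has to be turned into a bag-of-words: a list of (token id, count) pairs. The sketch below illustrates that representation in plain Python. It is only an approximation of what a gensim dictionary and its doc2bow() method produce: build_vocabulary and to_bow are hypothetical helpers, and ids are assigned in order of first appearance here, which may differ from the ids gensim assigns.

```python
from collections import Counter

def build_vocabulary(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Convert a token list into sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(vocab[tok] for tok in doc if tok in vocab)
    return sorted(counts.items())

# Two short token lists in the style of the lemmatized questions
docs = [
    ["voice", "range", "chest", "voice"],
    ["tutor", "earn", "bangalore"],
]
vocab = build_vocabulary(docs)
bow_corpus = [to_bow(doc, vocab) for doc in docs]
print(bow_corpus)
```

The LDA model only sees these counts, so word order inside a question plays no role in the topics it finds.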



import gensim
from gensim import corpora
from gensim.models import CoherenceModel

# Define a function that loops over candidate numbers of topics to find
# the optimal number of topics
def tune_LDA_topics(dictionary, corpus, texts, limit, start=2, step=3):
    """
    Compute c_v coherence for various number of topics

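    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics (exclusive)
    start : Smallest num of topics to try
    step : Increment between successive nums of topics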

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the 
    LDA model with respective number of topics
    """
    coherence_values_topic = []
    model_list_topic = []

    for num_topics in range(start, limit, step):

        print('\nStarting num_topics = ', num_topics)
        
        # Use the `dictionary` argument rather than the global `id2word`
        lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                                    id2word=dictionary,
                                                    num_topics=num_topics, 
                                                    random_state=47,
                                                    update_every=1,
                                                    chunksize=100,
                                                    passes=10,
                                                    alpha='auto',
                                                    per_word_topics=True)
        
        model_list_topic.append(lda_model)
        coherencemodel = CoherenceModel(model=lda_model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values_topic.append(coherencemodel.get_coherence())

        print('\nFinishing num_topics = ', num_topics)

    return model_list_topic, coherence_values_topic
    
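# NOTE: `sub_quora` is not defined in this snippet; it is assumed to be a
# subset of the `quora` DataFrame that contains the 'lemma' column.

# Tokenized, lemmatized questions as a list of token lists
texts = sub_quora['lemma'].tolist()

# Create Dictionary (maps each unique token to an integer id)
id2word = corpora.Dictionary(texts)

# Create Corpus: bag-of-words (token id, term frequency) for each question
corpus = [id2word.doc2bow(text) for text in texts]

model_list_topic, coherence_values_topic = tune_LDA_topics(dictionary=id2word,
                                                           corpus=corpus,
                                                           texts=texts,
                                                           start=2, limit=52, step=3)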
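
Once the loop has finished, the number of topics with the highest coherence still has to be selected. A minimal sketch of that step is given below; pick_best_num_topics is a hypothetical helper that is not part of the original notebook, and the scores in the example are placeholders, not results from this analysis.

```python
def pick_best_num_topics(coherence_values, start, limit, step):
    """Return (num_topics, coherence) for the highest coherence score."""
    topic_counts = list(range(start, limit, step))
    if len(topic_counts) != len(coherence_values):
        raise ValueError("one coherence value is expected per candidate topic count")
    # Index of the largest coherence value (first one wins in case of ties)
    best_index = max(range(len(coherence_values)), key=coherence_values.__getitem__)
    return topic_counts[best_index], coherence_values[best_index]

# Placeholder scores for illustration only; these are NOT results from the post
example_scores = [0.31, 0.35, 0.42, 0.40]
best_n, best_score = pick_best_num_topics(example_scores, start=2, limit=14, step=3)
print(best_n, best_score)
```

With the values returned by tune_LDA_topics(), the same call would use coherence_values_topic together with start=2, limit=52 and step=3, a grid that includes the n=20 reported in this post.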