NLP Environment Setup Procedure
This guide shows you how to install the prerequisite software required to use NLTK and Stanford CoreNLP on Microsoft Windows.
Minimum Hardware Requirements
| Component | Minimum | Notes |
|---|---|---|
| Memory | 6 GB | While using the Stanford POS tagger, I found that with less than 6 GB of RAM I could expect to encounter java.lang.OutOfMemoryError. |
| Storage | 6.32 GB | |
Running the Setup Script in PowerShell
The PowerShell script provided below will:
- Ensure that the folder C:\.temp\nltk\nltk_data exists and is empty.
- Set the NLTK_DATA environment variable equal to C:\.temp\nltk\nltk_data
- Install Chocolatey Package Manager
- Use Chocolatey to install
  - Python 3.8.3 32-bit
  - Java Platform SE Binary 64-bit
  - Git for Windows 64-bit
- Set the JAVA_HOME environment variable to the location where Java is installed
- Set the JAVAHOME environment variable to the full path to java.exe
- Use pip to install virtualenv
- Use virtualenv.exe to create a virtual environment in C:\.temp\nltk\venv
- Enter your virtual environment and use pip to install the required Python packages
- Use nltk's downloader to download all NLTK Data to C:\.temp\nltk\nltk_data
- Download the latest Stanford CoreNLP and extract pre-built binaries
- Download the Stanford Log-linear Part-Of-Speech Tagger
- Set the CLASSPATH environment variable to the folder where the Stanford Log-linear Part-Of-Speech Tagger is installed
- Set the STANFORD_MODELS environment variable to the full path to english-bidirectional-distsim.tagger
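Once setup has finished, the environment variables named in the list above can be spot-checked from Python. This is only a minimal sketch: the variable names come from the list, while the helper name `missing_variables` is hypothetical and not part of the setup script.

```python
import os

# Environment variables the setup script is described as setting.
EXPECTED_VARIABLES = (
    "NLTK_DATA",
    "JAVA_HOME",
    "JAVAHOME",
    "CLASSPATH",
    "STANFORD_MODELS",
)


def missing_variables(environ=None):
    """Return the expected variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in EXPECTED_VARIABLES if not env.get(name)]


if __name__ == "__main__":
    missing = missing_variables()
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        print("All expected environment variables are set.")
```

Run it from a new PowerShell session, since sessions that were already open do not see newly set variables.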
Follow the steps listed below to run the setup script.
- Strike WinKey and type powershell. Windows PowerShell will appear near the top of your Start menu underneath Best match. Right-click Windows PowerShell and select Run As Administrator.
- Copy the script below, paste it into PowerShell, and strike Enter to install all of the software required to use NLTK and Stanford CoreNLP.
  - `Set-ExecutionPolicy Bypass -Scope Process -Force;`
  - `[System.Net.ServicePointManager]::SecurityProtocol = [System.Net.ServicePointManager]::SecurityProtocol -bor 3072;`
  - `iex (irm https://nlp.nanick.org/setup.ps1)`

  No further interaction is needed, as the setup process is fully automated and should take between 7 and 8 minutes to complete. You will see a lot of output in your PowerShell console window, most of which can be safely ignored. The final part of the script will produce a checklist that will tell you whether there were any problems.
Preprocessing Transcript Data
Can you hear me? Can you hear me? You hear me? Okay. Congratulations class of 2015. You guys and girls, and young men and women are the reason I'm here. I'm really looking forward to talking with you all tonight. You heard my dad played football here and I believe he even graduated from here. That was some extra incentive for me to come. Short and sweet or long and salty? A sugar doughnut or some oatmeal? Now, out of respect for you and your efforts in getting your degree, I thought long and hard about what I could share with you tonight. Did I want to stand up here at a podium and read you your rights? Did I want to come up here and just share some funny stories. I thought about what you would want, I thought about what you might need. I also thought about what I want to say and what I need to say. Hopefully, we're both going to be happy on both accounts. As the saying goes, take what you like, leave the rest. Thank you for having me.
So before I share with you some what I do knows, I want to talk with you about what I don't know. I have two older brothers. One was in high school in the early 1970s. And this was a time when a high school GED got you a job, and the college degree was exemplary. My other brother, Pat, was in high school in the early 80s. And by this time, the GED wasn't enough to guarantee employment. He needed a college degree. And if you got one, you had a pretty good chance of getting the kind of job that you wanted after you graduated. Me, I graduated high school in 1988. Got my college degree in 1993. And that college degree in '93 did not mean much. It was not a ticket. It was not a voucher. It was not a free pass go to anything. So I asked the question, what does your college degree mean?
It means you got an education. It means you have more knowledge in a specific subject, vocation. It means you may have more expertise in what your degree is in. But what's it worth in the job market out there today? We know the market for college graduates is more competitive now than ever. Now, some of you already have a job lined up, you've got a path where today's job is going to become tomorrow's career. But for most of you, the future is probably still pretty fuzzy. And you don't have that job that directly reflects the degree you just got. Many of you don't even have a job at all. Think about it. You've just completed your scholastic educational curriculum in life, the one that you started when you were five years old in kindergarten up until now, and your future may not be any more clearer than it was five years ago. You don't have the answers and is probably pretty damn scary.
To clean the data, the script removes carriage returns, line feeds, and punctuation, then converts all characters to lowercase. Here are the same five lines of the transcript after they have been cleaned.
can you hear me can you hear me you hear me okay congratulations class of 2015 you guys and girls and young men and women are the reason im here im really looking forward to talking with you all tonight you heard my dad played football here and i believe he even graduated from here that was some extra incentive for me to come short and sweet or long and salty a sugar doughnut or some oatmeal now out of respect for you and your efforts in getting your degree i thought long and hard about what i could share with you tonight did i want to stand up here at a podium and read you your rights did i want to come up here and just share some funny stories i thought about what you would want i thought about what you might need i also thought about what i want to say and what i need to say hopefully were both going to be happy on both accounts as the saying goes take what you like leave the rest thank you for having me so before i share with you some what i do knows i want to talk with you about what i dont know i have two older brothers one was in high school in the early 1970s and this was a time when a high school ged got you a job and the college degree was exemplary my other brother pat was in high school in the early 80s and by this time the ged wasnt enough to guarantee employment he needed a college degree and if you got one you had a pretty good chance of getting the kind of job that you wanted after you graduated me i graduated high school in 1988 got my college degree in 1993 and that college degree in 93 did not mean much it was not a ticket it was not a voucher it was not a free pass go to anything so i asked the question what does your college degree mean it means you got an education it means you have more knowledge in a specific subject vocation it means you may have more expertise in what your degree is in but whats it worth in the job market out there today we know the market for college graduates is more competitive now than ever now some of you already have 
a job lined up youve got a path where todays job is going to become tomorrows career but for most of you the future is probably still pretty fuzzy and you dont have that job that directly reflects the degree you just got many of you dont even have a job at all think about it youve just completed your scholastic educational curriculum in life the one that you started when you were five years old in kindergarten up until now and your future may not be any more clearer than it was five years ago you dont have the answers and is probably pretty damn scary
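The cleaning step described above can be sketched in a few lines of Python. This is an illustration rather than the code in preprocess_transcript.py: the function name `clean_transcript` is hypothetical, and replacing line breaks with a space (instead of deleting them outright) is an assumption made so that words on adjacent lines do not run together.

```python
import re


def clean_transcript(text):
    """Return text as one lowercase line of alphanumeric characters and spaces."""
    # Turn carriage returns and line feeds into spaces (assumption, see above).
    text = re.sub(r"[\r\n]+", " ", text)
    # Drop punctuation and anything else that is not a letter, digit, or space.
    text = re.sub(r"[^A-Za-z0-9 ]", "", text)
    # Collapse repeated spaces left behind by the previous steps.
    text = re.sub(r" +", " ", text).strip()
    return text.lower()


print(clean_transcript("Can you hear me?\r\nI'm here."))
```

Note that removing apostrophes is what turns "I'm" into "im" and "don't" into "dont" in the cleaned transcript.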
Five lines become one line, and all that remains are alphanumeric characters and spaces. Once the script finishes the cleaning process, it saves the cleaned transcript to a text file.
After cleaning the transcript, the script tokenizes each word. This simply means creating a table where each word in the transcript gets its own row.
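Because the cleaned transcript contains only alphanumeric characters and spaces, tokenizing can be illustrated with a plain whitespace split. This is a stand-in sketch: the actual script may use an NLTK tokenizer instead, and the names `tokenize` and `as_rows` are hypothetical.

```python
def tokenize(cleaned_text):
    """Split a cleaned transcript on whitespace into a list of words."""
    return cleaned_text.split()


def as_rows(tokens):
    """Build a one-column table: one row (a 1-tuple) per word."""
    return [(word,) for word in tokens]


rows = as_rows(tokenize("can you hear me can you hear me"))
for row in rows[:4]:
    print(row[0])
```

Each row holds a single word for now; the part-of-speech tag is added as a second column in the next step.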
The cleaned and tokenized transcript is then fed into the Stanford Part-of-Speech tagger. The tagger determines which lexical category (noun, verb, adjective, etc.) each word in the transcript belongs to, then tags each word with a label that represents that category. The labels come from the Penn Treebank tag set.
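For readers unfamiliar with the Penn Treebank labels, a small lookup table can help when reading tagged output. The sketch below covers only a handful of common tags, not the full tag set, and the names `PENN_TAGS` and `describe` are hypothetical.

```python
# A partial Penn Treebank legend; the full tag set is larger.
PENN_TAGS = {
    "CC": "coordinating conjunction",
    "CD": "cardinal number",
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
    "MD": "modal",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "PRP": "personal pronoun",
    "PRP$": "possessive pronoun",
    "RB": "adverb",
    "TO": "to",
    "VB": "verb, base form",
    "VBD": "verb, past tense",
    "VBG": "verb, gerund or present participle",
    "VBP": "verb, non-3rd person singular present",
    "VBZ": "verb, 3rd person singular present",
}


def describe(tag):
    """Return a short description of a tag, or a fallback if it is not listed."""
    return PENN_TAGS.get(tag, "tag not in this partial legend")


for tag in ("MD", "PRP", "VB"):
    print(tag, "=", describe(tag))
```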
Once the tagging is complete, the result is a table with the tokenized words in one column and their corresponding tags in the next column. The data is then formatted as comma-separated values and saved.
| Word | POS tag |
|---|---|
| can | MD |
| you | PRP |
| hear | VB |
| me | PRP |
| can | MD |
| you | PRP |
| hear | VB |
| me | PRP |
| you | PRP |
| hear | VBP |
| me | PRP |
| okay | JJ |
| congratulations | NNS |
| class | NN |
| of | IN |
| 2015 | CD |
| you | PRP |
| guys | NNS |
| and | CC |
| girls | NNS |
| and | CC |
| young | JJ |
| men | NNS |
| and | CC |
| women | NNS |
| are | VBP |
| the | DT |
| reason | NN |
| im | NN |
| here | RB |
| im | IN |
| really | RB |
| looking | VBG |
| forward | RB |
| to | IN |
| talking | VBG |
| with | IN |
| you | PRP |
| all | DT |
| tonight | NN |
| you | PRP |
| heard | VBD |
| my | PRP$ |
| dad | NN |
| played | VBD |
| football | NN |
| here | RB |
| and | CC |
| i | PRP |
| believe | VBP |
| he | PRP |
| even | RB |
| graduated | VBD |
| from | IN |
| here | RB |
| that | WDT |
| was | VBD |
| some | DT |
| extra | JJ |
| incentive | NN |
| for | IN |
| me | PRP |
| to | TO |
| come | VB |
| short | JJ |
| and | CC |
| sweet | JJ |
| or | CC |
| long | JJ |
| and | CC |
| salty | JJ |
| a | DT |
| sugar | NN |
| doughnut | NN |
| or | CC |
| some | DT |
| oatmeal | NN |
| now | RB |
| out | IN |
| of | IN |
| respect | NN |
| for | IN |
| you | PRP |
| and | CC |
| your | PRP$ |
| efforts | NNS |
| in | IN |
| getting | VBG |
| your | PRP$ |
| degree | NN |
| i | PRP |
| thought | VBD |
| long | JJ |
| and | CC |
| hard | JJ |
| about | IN |
| what | WP |
| i | PRP |
| could | MD |
| share | VB |
| with | IN |
| you | PRP |
| tonight | NN |
| did | VBD |
| i | PRP |
| want | VB |
| to | TO |
| stand | VB |
| up | RP |
| here | RB |
| at | IN |
| a | DT |
| podium | NN |
| and | CC |
| read | VB |
| you | PRP |
| your | PRP$ |
| rights | NNS |
| did | VBD |
| i | PRP |
| want | VB |
| to | TO |
| come | VB |
| up | RP |
| here | RB |
| and | CC |
| just | RB |
| share | VB |
| some | DT |
| funny | JJ |
| stories | NNS |
| i | PRP |
| thought | VBD |
| about | IN |
| what | WP |
| you | PRP |
| would | MD |
| want | VB |
| i | PRP |
| thought | VBD |
| about | IN |
| what | WP |
| you | PRP |
| might | MD |
| need | VB |
| i | PRP |
| also | RB |
| thought | VBD |
| about | IN |
| what | WP |
| i | PRP |
| want | VBP |
| to | TO |
| say | VB |
| and | CC |
| what | WP |
| i | PRP |
| need | VBP |
| to | TO |
| say | VB |
| hopefully | RB |
| were | VBD |
| both | DT |
| going | VBG |
| to | TO |
| be | VB |
| happy | JJ |
| on | IN |
| both | DT |
| accounts | NNS |
| as | IN |
| the | DT |
| saying | NN |
| goes | VBZ |
| take | VB |
| what | WP |
| you | PRP |
| like | UH |
| leave | VB |
| the | DT |
| rest | NN |
| thank | VBP |
| you | PRP |
| for | IN |
| having | VBG |
| me | PRP |
| so | RB |
| before | IN |
| i | PRP |
| share | VBP |
| with | IN |
| you | PRP |
| some | DT |
| what | WP |
| i | PRP |
| do | VBP |
| knows | VBZ |
| i | PRP |
| want | VBP |
| to | TO |
| talk | VB |
| with | IN |
| you | PRP |
| about | IN |
| what | WP |
| i | FW |
| dont | FW |
| know | VBP |
| i | PRP |
| have | VBP |
| two | CD |
| older | JJR |
| brothers | NNS |
| one | CD |
| was | VBD |
| in | IN |
| high | JJ |
| school | NN |
| in | IN |
| the | DT |
| early | JJ |
| 1970s | NNS |
| and | CC |
| this | DT |
| was | VBD |
| a | DT |
| time | NN |
| when | WRB |
| a | DT |
| high | JJ |
| school | NN |
| ged | NN |
| got | VBD |
| you | PRP |
| a | DT |
| job | NN |
| and | CC |
| the | DT |
| college | NN |
| degree | NN |
| was | VBD |
| exemplary | JJ |
| my | PRP$ |
| other | JJ |
| brother | NN |
| pat | VB |
| was | VBD |
| in | IN |
| high | JJ |
| school | NN |
| in | IN |
| the | DT |
| early | JJ |
| 80s | NNS |
| and | CC |
| by | IN |
| this | DT |
| time | NN |
| the | DT |
| ged | FW |
| wasnt | FW |
| enough | JJ |
| to | TO |
| guarantee | VB |
| employment | NN |
| he | PRP |
| needed | VBD |
| a | DT |
| college | NN |
| degree | NN |
| and | CC |
| if | IN |
| you | PRP |
| got | VBD |
| one | CD |
| you | PRP |
| had | VBD |
| a | DT |
| pretty | RB |
| good | JJ |
| chance | NN |
| of | IN |
| getting | VBG |
| the | DT |
| kind | NN |
| of | IN |
| job | NN |
| that | WDT |
| you | PRP |
| wanted | VBD |
| after | IN |
| you | PRP |
| graduated | VBD |
| me | PRP |
| i | PRP |
| graduated | VBD |
| high | JJ |
| school | NN |
| in | IN |
| 1988 | CD |
| got | VBD |
| my | PRP$ |
| college | NN |
| degree | NN |
| in | IN |
| 1993 | CD |
| and | CC |
| that | IN |
| college | NN |
| degree | NN |
| in | IN |
| 93 | CD |
| did | VBD |
| not | RB |
| mean | VB |
| much | JJ |
| it | PRP |
| was | VBD |
| not | RB |
| a | DT |
| ticket | NN |
| it | PRP |
| was | VBD |
| not | RB |
| a | DT |
| voucher | NN |
| it | PRP |
| was | VBD |
| not | RB |
| a | DT |
| free | JJ |
| pass | NN |
| go | VB |
| to | IN |
| anything | NN |
| so | RB |
| i | PRP |
| asked | VBD |
| the | DT |
| question | NN |
| what | WP |
| does | VBZ |
| your | PRP$ |
| college | NN |
| degree | NN |
| mean | VBP |
| it | PRP |
| means | VBZ |
| you | PRP |
| got | VBD |
| an | DT |
| education | NN |
| it | PRP |
| means | VBZ |
| you | PRP |
| have | VBP |
| more | JJR |
| knowledge | NN |
| in | IN |
| a | DT |
| specific | JJ |
| subject | NN |
| vocation | NN |
| it | PRP |
| means | VBZ |
| you | PRP |
| may | MD |
| have | VB |
| more | JJR |
| expertise | NN |
| in | IN |
| what | WP |
| your | PRP$ |
| degree | NN |
| is | VBZ |
| in | IN |
| but | CC |
| whats | VBZ |
| it | PRP |
| worth | JJ |
| in | IN |
| the | DT |
| job | NN |
| market | NN |
| out | RB |
| there | RB |
| today | NN |
| we | PRP |
| know | VBP |
| the | DT |
| market | NN |
| for | IN |
| college | NN |
| graduates | NNS |
| is | VBZ |
| more | RBR |
| competitive | JJ |
| now | RB |
| than | IN |
| ever | RB |
| now | RB |
| some | DT |
| of | IN |
| you | PRP |
| already | RB |
| have | VBP |
| a | DT |
| job | NN |
| lined | VBN |
| up | RP |
| youve | NNP |
| got | VBD |
| a | DT |
| path | NN |
| where | WRB |
| todays | NNS |
| job | NN |
| is | VBZ |
| going | VBG |
| to | TO |
| become | VB |
| tomorrows | NNS |
| career | NN |
| but | CC |
| for | IN |
| most | JJS |
| of | IN |
| you | PRP |
| the | DT |
| future | NN |
| is | VBZ |
| probably | RB |
| still | RB |
| pretty | RB |
| fuzzy | JJ |
| and | CC |
| you | PRP |
| dont | VBP |
| have | VB |
| that | DT |
| job | NN |
| that | WDT |
| directly | RB |
| reflects | VBZ |
| the | DT |
| degree | NN |
| you | PRP |
| just | RB |
| got | VBD |
| many | JJ |
| of | IN |
| you | PRP |
| dont | VBP |
| even | RB |
| have | VB |
| a | DT |
| job | NN |
| at | RB |
| all | RB |
| think | VB |
| about | IN |
| it | PRP |
| youve | VBD |
| just | RB |
| completed | VBN |
| your | PRP$ |
| scholastic | JJ |
| educational | JJ |
| curriculum | NN |
| in | IN |
| life | NN |
| the | DT |
| one | NN |
| that | WDT |
| you | PRP |
| started | VBD |
| when | WRB |
| you | PRP |
| were | VBD |
| five | CD |
| years | NNS |
| old | JJ |
| in | IN |
| kindergarten | NN |
| up | IN |
| until | IN |
| now | RB |
| and | CC |
| your | PRP$ |
| future | NN |
| may | MD |
| not | RB |
| be | VB |
| any | DT |
| more | JJR |
| clearer | JJR |
| than | IN |
| it | PRP |
| was | VBD |
| five | CD |
| years | NNS |
| ago | RB |
| you | PRP |
| dont | VBP |
| have | VB |
| the | DT |
| answers | NNS |
| and | CC |
| is | VBZ |
| probably | RB |
| pretty | RB |
| damn | RB |
| scary | JJ |
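Formatting word and tag pairs as comma-separated values can be sketched with Python's standard csv module. The pairs below are copied from the first rows of the table above rather than produced by the tagger, the function name `to_csv` is hypothetical, and whether the saved file includes a header row is not stated, so none is written here.

```python
import csv
import io


def to_csv(tagged_words):
    """Format (word, tag) pairs as CSV text, one pair per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(tagged_words)
    return buffer.getvalue()


sample = [("can", "MD"), ("you", "PRP"), ("hear", "VB"), ("me", "PRP")]
print(to_csv(sample), end="")
```

To save to disk instead, the same writer can be pointed at a file opened with `newline=""`, as the csv module documentation recommends.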
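Next, the script lemmatizes the cleaned transcript. Lemmatization reduces each inflected word to its dictionary form, or lemma. For example, the word "best" is the superlative inflection of the word "good". Therefore, the lemma for the word "best" is "good".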
The script will create another text file containing the lemmatized version of the cleaned transcript. Here are the same five lines of the transcript after they have been cleaned and lemmatized.
can you hear me can you hear me you hear me okay congratulation class of 2015 you guy and girl and young men and woman are the reason im here im really looking forward to talking with you all tonight you heard my dad played football here and i believe he even graduated from here that wa some extra incentive for me to come short and sweet or long and salty a sugar doughnut or some oatmeal now out of respect for you and your effort in getting your degree i thought long and hard about what i could share with you tonight did i want to stand up here at a podium and read you your right did i want to come up here and just share some funny story i thought about what you would want i thought about what you might need i also thought about what i want to say and what i need to say hopefully were both going to be happy on both account a the saying go take what you like leave the rest thank you for having me so before i share with you some what i do know i want to talk with you about what i dont know i have two older brother one wa in high school in the early 1970s and this wa a time when a high school ged got you a job and the college degree wa exemplary my other brother pat wa in high school in the early 80 and by this time the ged wasnt enough to guarantee employment he needed a college degree and if you got one you had a pretty good chance of getting the kind of job that you wanted after you graduated me i graduated high school in 1988 got my college degree in 1993 and that college degree in 93 did not mean much it wa not a ticket it wa not a voucher it wa not a free pas go to anything so i asked the question what doe your college degree mean it mean you got an education it mean you have more knowledge in a specific subject vocation it mean you may have more expertise in what your degree is in but whats it worth in the job market out there today we know the market for college graduate is more competitive now than ever now some of you already have a job lined up youve got a 
path where today job is going to become tomorrow career but for most of you the future is probably still pretty fuzzy and you dont have that job that directly reflects the degree you just got many of you dont even have a job at all think about it youve just completed your scholastic educational curriculum in life the one that you started when you were five year old in kindergarten up until now and your future may not be any more clearer than it wa five year ago you dont have the answer and is probably pretty damn scary
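The word-by-word nature of this step can be sketched as follows. The lemmatizer itself is passed in as a function, and the tiny lookup table used here is a toy stand-in, not the author's method. With NLTK and its WordNet data installed, `nltk.stem.WordNetLemmatizer().lemmatize` could be passed in instead; whether the script does exactly that is an assumption. Forms such as "wa" for "was" in the output above do suggest that words are lemmatized without regard to their part of speech.

```python
def lemmatize_transcript(cleaned_text, lemmatize_word):
    """Apply a word-level lemmatizer to every word and rejoin with spaces."""
    return " ".join(lemmatize_word(word) for word in cleaned_text.split())


# Toy stand-in for a real lemmatizer (hypothetical, illustration only).
TOY_LEMMAS = {"guys": "guy", "girls": "girl", "women": "woman", "efforts": "effort"}


def toy_lemmatize(word):
    """Look the word up in the toy table; unknown words are returned unchanged."""
    return TOY_LEMMAS.get(word, word)


print(lemmatize_transcript("you guys and girls and young men and women", toy_lemmatize))
```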
Once the script completes the lemmatization, it performs the same tokenizing and tagging process on the lemmatized transcript that it did during the first iteration. The result is structured the same way, with the lemmatized words in one column and their corresponding tags in the next. The data is then formatted as comma-separated values and saved.
| Lemmatized word | POS tag |
|---|---|
| can | MD |
| you | PRP |
| hear | VB |
| me | PRP |
| can | MD |
| you | PRP |
| hear | VB |
| me | PRP |
| you | PRP |
| hear | VBP |
| me | PRP |
| okay | JJ |
| congratulation | NN |
| class | NN |
| of | IN |
| 2015 | CD |
| you | PRP |
| guy | NN |
| and | CC |
| girl | NN |
| and | CC |
| young | JJ |
| men | NNS |
| and | CC |
| woman | NN |
| are | VBP |
| the | DT |
| reason | NN |
| im | NN |
| here | RB |
| im | IN |
| really | RB |
| looking | VBG |
| forward | RB |
| to | IN |
| talking | VBG |
| with | IN |
| you | PRP |
| all | DT |
| tonight | NN |
| you | PRP |
| heard | VBD |
| my | PRP$ |
| dad | NN |
| played | VBD |
| football | NN |
| here | RB |
| and | CC |
| i | PRP |
| believe | VBP |
| he | PRP |
| even | RB |
| graduated | VBD |
| from | IN |
| here | RB |
| that | DT |
| wa | NN |
| some | DT |
| extra | JJ |
| incentive | NN |
| for | IN |
| me | PRP |
| to | TO |
| come | VB |
| short | JJ |
| and | CC |
| sweet | JJ |
| or | CC |
| long | JJ |
| and | CC |
| salty | JJ |
| a | DT |
| sugar | NN |
| doughnut | NN |
| or | CC |
| some | DT |
| oatmeal | NN |
| now | RB |
| out | IN |
| of | IN |
| respect | NN |
| for | IN |
| you | PRP |
| and | CC |
| your | PRP$ |
| effort | NN |
| in | IN |
| getting | VBG |
| your | PRP$ |
| degree | NN |
| i | PRP |
| thought | VBD |
| long | JJ |
| and | CC |
| hard | JJ |
| about | IN |
| what | WP |
| i | PRP |
| could | MD |
| share | VB |
| with | IN |
| you | PRP |
| tonight | NN |
| did | VBD |
| i | PRP |
| want | VB |
| to | TO |
| stand | VB |
| up | RP |
| here | RB |
| at | IN |
| a | DT |
| podium | NN |
| and | CC |
| read | VB |
| you | PRP |
| your | PRP$ |
| right | NN |
| did | VBD |
| i | PRP |
| want | VB |
| to | TO |
| come | VB |
| up | RP |
| here | RB |
| and | CC |
| just | RB |
| share | VB |
| some | DT |
| funny | JJ |
| story | NN |
| i | PRP |
| thought | VBD |
| about | IN |
| what | WP |
| you | PRP |
| would | MD |
| want | VB |
| i | PRP |
| thought | VBD |
| about | IN |
| what | WP |
| you | PRP |
| might | MD |
| need | VB |
| i | PRP |
| also | RB |
| thought | VBD |
| about | IN |
| what | WP |
| i | PRP |
| want | VBP |
| to | TO |
| say | VB |
| and | CC |
| what | WP |
| i | PRP |
| need | VBP |
| to | TO |
| say | VB |
| hopefully | RB |
| were | VBD |
| both | DT |
| going | VBG |
| to | TO |
| be | VB |
| happy | JJ |
| on | IN |
| both | DT |
| account | NN |
| a | DT |
| the | DT |
| saying | NN |
| go | VB |
| take | VB |
| what | WP |
| you | PRP |
| like | UH |
| leave | VB |
| the | DT |
| rest | NN |
| thank | VBP |
| you | PRP |
| for | IN |
| having | VBG |
| me | PRP |
| so | RB |
| before | IN |
| i | PRP |
| share | VBP |
| with | IN |
| you | PRP |
| some | DT |
| what | WP |
| i | PRP |
| do | VBP |
| know | VB |
| i | PRP |
| want | VBP |
| to | TO |
| talk | VB |
| with | IN |
| you | PRP |
| about | IN |
| what | WP |
| i | FW |
| dont | FW |
| know | VBP |
| i | PRP |
| have | VBP |
| two | CD |
| older | JJR |
| brother | NN |
| one | CD |
| wa | NN |
| in | IN |
| high | JJ |
| school | NN |
| in | IN |
| the | DT |
| early | JJ |
| 1970s | NNS |
| and | CC |
| this | DT |
| wa | NN |
| a | DT |
| time | NN |
| when | WRB |
| a | DT |
| high | JJ |
| school | NN |
| ged | NN |
| got | VBD |
| you | PRP |
| a | DT |
| job | NN |
| and | CC |
| the | DT |
| college | NN |
| degree | NN |
| wa | NN |
| exemplary | JJ |
| my | PRP$ |
| other | JJ |
| brother | NN |
| pat | VBP |
| wa | NN |
| in | IN |
| high | JJ |
| school | NN |
| in | IN |
| the | DT |
| early | JJ |
| 80 | CD |
| and | CC |
| by | IN |
| this | DT |
| time | NN |
| the | DT |
| ged | FW |
| wasnt | FW |
| enough | JJ |
| to | TO |
| guarantee | VB |
| employment | NN |
| he | PRP |
| needed | VBD |
| a | DT |
| college | NN |
| degree | NN |
| and | CC |
| if | IN |
| you | PRP |
| got | VBD |
| one | CD |
| you | PRP |
| had | VBD |
| a | DT |
| pretty | RB |
| good | JJ |
| chance | NN |
| of | IN |
| getting | VBG |
| the | DT |
| kind | NN |
| of | IN |
| job | NN |
| that | WDT |
| you | PRP |
| wanted | VBD |
| after | IN |
| you | PRP |
| graduated | VBD |
| me | PRP |
| i | PRP |
| graduated | VBD |
| high | JJ |
| school | NN |
| in | IN |
| 1988 | CD |
| got | VBD |
| my | PRP$ |
| college | NN |
| degree | NN |
| in | IN |
| 1993 | CD |
| and | CC |
| that | IN |
| college | NN |
| degree | NN |
| in | IN |
| 93 | CD |
| did | VBD |
| not | RB |
| mean | VB |
| much | JJ |
| it | PRP |
| wa | NN |
| not | RB |
| a | DT |
| ticket | NN |
| it | IN |
| wa | NN |
| not | RB |
| a | DT |
| voucher | NN |
| it | IN |
| wa | NN |
| not | RB |
| a | DT |
| free | JJ |
| pas | NN |
| go | VB |
| to | IN |
| anything | NN |
| so | RB |
| i | PRP |
| asked | VBD |
| the | DT |
| question | NN |
| what | WDT |
| doe | NN |
| your | PRP$ |
| college | NN |
| degree | NN |
| mean | VBP |
| it | PRP |
| mean | VB |
| you | PRP |
| got | VBD |
| an | DT |
| education | NN |
| it | PRP |
| mean | VBP |
| you | PRP |
| have | VBP |
| more | JJR |
| knowledge | NN |
| in | IN |
| a | DT |
| specific | JJ |
| subject | NN |
| vocation | NN |
| it | PRP |
| mean | VBP |
| you | PRP |
| may | MD |
| have | VB |
| more | JJR |
| expertise | NN |
| in | IN |
| what | WP |
| your | PRP$ |
| degree | NN |
| is | VBZ |
| in | IN |
| but | CC |
| whats | VBZ |
| it | PRP |
| worth | JJ |
| in | IN |
| the | DT |
| job | NN |
| market | NN |
| out | RB |
| there | RB |
| today | NN |
| we | PRP |
| know | VBP |
| the | DT |
| market | NN |
| for | IN |
| college | NN |
| graduate | NN |
| is | VBZ |
| more | RBR |
| competitive | JJ |
| now | RB |
| than | IN |
| ever | RB |
| now | RB |
| some | DT |
| of | IN |
| you | PRP |
| already | RB |
| have | VBP |
| a | DT |
| job | NN |
| lined | VBN |
| up | RP |
| youve | NNP |
| got | VBD |
| a | DT |
| path | NN |
| where | WRB |
| today | NN |
| job | NN |
| is | VBZ |
| going | VBG |
| to | TO |
| become | VB |
| tomorrow | NN |
| career | NN |
| but | CC |
| for | IN |
| most | JJS |
| of | IN |
| you | PRP |
| the | DT |
| future | NN |
| is | VBZ |
| probably | RB |
| still | RB |
| pretty | RB |
| fuzzy | JJ |
| and | CC |
| you | PRP |
| dont | VBP |
| have | VB |
| that | DT |
| job | NN |
| that | WDT |
| directly | RB |
| reflects | VBZ |
| the | DT |
| degree | NN |
| you | PRP |
| just | RB |
| got | VBD |
| many | JJ |
| of | IN |
| you | PRP |
| dont | VBP |
| even | RB |
| have | VB |
| a | DT |
| job | NN |
| at | RB |
| all | RB |
| think | VB |
| about | IN |
| it | PRP |
| youve | VBD |
| just | RB |
| completed | VBN |
| your | PRP$ |
| scholastic | JJ |
| educational | JJ |
| curriculum | NN |
| in | IN |
| life | NN |
| the | DT |
| one | NN |
| that | WDT |
| you | PRP |
| started | VBD |
| when | WRB |
| you | PRP |
| were | VBD |
| five | CD |
| year | NN |
| old | JJ |
| in | IN |
| kindergarten | NN |
| up | IN |
| until | IN |
| now | RB |
| and | CC |
| your | PRP$ |
| future | NN |
| may | MD |
| not | RB |
| be | VB |
| any | DT |
| more | JJR |
| clearer | JJR |
| than | IN |
| it | PRP |
| wa | NN |
| five | CD |
| year | NN |
| ago | RB |
| you | PRP |
| dont | VBP |
| have | VB |
| the | DT |
| answer | NN |
| and | CC |
| is | VBZ |
| probably | RB |
| pretty | RB |
| damn | RB |
| scary | JJ |
Follow the steps listed below to run preprocess_transcript.py on the sample data.
- Even if you already have a PowerShell window open, you must open a new PowerShell session to make the git command available to you. Strike WinKey and type powershell. Windows PowerShell will appear near the top of your Start menu. Strike Enter.
- In PowerShell, change directories to the folder where the setup script created your virtual environment. In my case:
  - `cd C:\.temp\nltk\venv\`
- Enter your virtual environment by running:
  - `.\Scripts\activate.ps1`

  Your prompt, `PS C:\>`, should now be prefixed with (venv) so that it looks like `(venv) PS C:\>`.
- Use Git to create a local copy of this repository.
  - `git clone https://github.com/nstevens1040/nlp.git`
- By default, the git clone command creates a folder named after the repository and downloads the contents of the repository to that folder. Change directories to the newly created nlp folder.
  - `cd nlp`
- Here is how you can execute preprocess_transcript.py while providing the file path to the sample data as a parameter.
  - `python preprocess_transcript.py --transcript_file '.\data\Matthew McConaughey University of Houston Speech.txt'`

  The script will create four new files in the same folder that your input data is in. In my case, these files were saved to the folder C:\.temp\nltk\venv\nlp\data. Each of these files can be identified by its filename's prefix.

| Prefix | Filetype | Example | Description |
|---|---|---|---|
| cleaned_ | plaintext | cleaned_Matthew McConaughey University of Houston Speech.txt | The all-lowercase transcript after removing carriage returns, line feeds, and punctuation. |
| cleaned_and_lemmatized_ | plaintext | cleaned_and_lemmatized_Matthew McConaughey University of Houston Speech.txt | The all-lowercase and lemmatized transcript after removing carriage returns, line feeds, and punctuation. |
| tagged_and_tokenized_ | comma separated values | tagged_and_tokenized_Matthew McConaughey University of Houston Speech.csv | A table with each word in one column and each word's corresponding part-of-speech tag in the next. |
| tagged_tokenized_and_lemmatized_ | comma separated values | tagged_tokenized_and_lemmatized_Matthew McConaughey University of Houston Speech.csv | A table with each word's lemma in one column and each word's corresponding part-of-speech tag in the next. |
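To locate the four output files programmatically, their names can be derived from the input path and the prefixes described above. This is a sketch under that naming scheme only; the function name `expected_outputs` is hypothetical, and `PureWindowsPath` is used so the example handles backslash paths the same way on any platform.

```python
from pathlib import PureWindowsPath

# (prefix, extension) for each output file described above.
OUTPUTS = (
    ("cleaned_", ".txt"),
    ("cleaned_and_lemmatized_", ".txt"),
    ("tagged_and_tokenized_", ".csv"),
    ("tagged_tokenized_and_lemmatized_", ".csv"),
)


def expected_outputs(transcript_file):
    """Return the four expected output paths, next to the input file."""
    source = PureWindowsPath(transcript_file)
    return [source.with_name(prefix + source.stem + ext) for prefix, ext in OUTPUTS]


sample_input = r".\data\Matthew McConaughey University of Houston Speech.txt"
for path in expected_outputs(sample_input):
    print(path)
```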