This week’s theme is Ethics, Limitations, and Challenges. The sources I looked at focused more on how to collect and analyze online language samples than on the content of those samples. What ethical questions do we need to consider when collecting linguistic data? What challenges are involved with data collection?
Ethics
Researching online language, like all research, requires us to consider how to make ethical research choices. One of the largest issues might be the blurred line between public and private online content. Although most people are aware that anything posted on the internet is no longer private, most users also expect a degree of privacy when it comes to their emails and texts. Furthermore, even if a person posts publicly to a social media account or website, they may not intend for their words to reach a large audience. But researchers might actually prefer this sort of data, since the users are not being influenced (consciously or not) by the idea that someone might be analyzing their writing (Hou 36). Informed consent and permission to use material are thus relevant concerns in research on digital language (Lewis 14), and so is privacy. Although it is unlikely that any single quoted tweet could be traced back to an English-speaking Twitter user, what if researchers are analyzing a smaller linguistic community?
Depending on the research question, it might be easiest for researchers to collect data from users who are clearly public figures or organizations (Hou 41). But if research is focusing specifically on informal language, slang, or other linguistic phenomena that are less likely to appear in a corporate or professional context, the usefulness of data from public figures might be low.
Challenges
Besides ethical considerations, there are multiple other challenges that may impact data collection and analysis. First of all, there is an extensive amount of data available to researchers (Crystal 10). This can of course be a good thing, but filtering that data into a usable sample may pose difficulties depending on the specific questions researchers are asking. The sheer amount of data may also push researchers toward the sites/platforms that are easiest to collect from (such as Twitter) and away from those that are more difficult to research. If you can get all your data from Twitter, why would you choose to look elsewhere? (This is of course a generalization, but researchers only have so much time, manpower, and funding.)
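To make that filtering step concrete, here is a minimal sketch of how a researcher might narrow a large collection of posts down to a reproducible sample. The record fields, keywords, and sample size are all illustrative assumptions of mine, not any platform’s actual data format or a method from the sources above.

```python
import random

# Hypothetical collection of post records; the field names ("lang", "text")
# are illustrative assumptions, not any real platform's data format.
posts = [
    {"id": 1, "lang": "en", "text": "omg that is so true lol"},
    {"id": 2, "lang": "es", "text": "jajaja claro que sí"},
    {"id": 3, "lang": "en", "text": "Quarterly earnings were strong."},
    {"id": 4, "lang": "en", "text": "lol i can't even"},
]

def build_sample(posts, lang="en", keywords=("lol", "omg"), size=2, seed=42):
    """Keep posts in the target language that contain a keyword of interest,
    then draw a fixed-size random subsample (seeded for reproducibility)."""
    filtered = [
        p for p in posts
        if p["lang"] == lang and any(k in p["text"].lower() for k in keywords)
    ]
    random.seed(seed)
    return random.sample(filtered, min(size, len(filtered)))

print(build_sample(posts))
```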
The second challenge is that some of this data may not stay relevant for long, given the rate of change online. Site updates may affect the type of data being produced (like Twitter raising the character limit for tweets), and site demographics may shift in just a few years’ time (Facebook at its inception was used primarily by college-aged young adults, but is now used more by adults in their 30s and older). This is also true of studies done on traditionally published works or offline communities, but traditional publishing is generally more conservative and less prone to abrupt, rapid change.
Anonymity can also pose issues when trying to determine demographic data for any given sample. Factors like age, geographic location, and gender may not be readily shared by a user, so any analysis that looks at differences within or between specific demographics will be missing some data (14). In some contexts anonymity might pose less of a problem; for example, if a researcher is looking at vlogs of individuals using ASL, this demographic information might be more apparent. But this type of data collection brings us back to the question of ethics and informed consent – the more information researchers have about any one person in the data set, the higher the possibility that that person’s privacy might be compromised (Hou 40).
Another challenge arises simply from the formatting of different sites. For example, how should retweets be treated if a researcher is analyzing language use on Twitter (Crystal 40)? Duplication in the data set could skew results, but removing retweets entirely could ignore how users are interacting with each other or how they are reacting to or interpreting certain linguistic structures. Similarly, users may compose tweets that contain incomplete utterances or whose meaning is difficult to work out without additional context (41, 45). Removing these tweets might be necessary in some cases, but it does mean that researchers will be losing some data.
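One way to keep both options open is to flag retweets and exact duplicates rather than delete them, so the same data set can support either analysis. Below is a minimal sketch under the assumption that each post is a simple record with a text field and an is_retweet flag (both hypothetical names of my own, not anything from the sources).

```python
# Flag duplicates instead of removing them, so analyses can include or
# exclude retweets as the research question requires.
# The record fields ("text", "is_retweet") are illustrative assumptions.
posts = [
    {"text": "new slang just dropped", "is_retweet": False},
    {"text": "new slang just dropped", "is_retweet": True},
    {"text": "idk what this means tbh", "is_retweet": False},
]

seen = set()
for post in posts:
    post["is_duplicate"] = post["text"] in seen  # exact-text duplicate check
    seen.add(post["text"])

# One view for studying original wording, one for studying interaction.
originals = [p for p in posts if not p["is_retweet"] and not p["is_duplicate"]]
everything = posts

print(len(originals), len(everything))
```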
The ability for posts to be edited or deleted can also cause issues. When citing data from someone else, including the data’s provenance – its line of history from you back to its original creator – allows someone to check for errors and, if there are errors, to pinpoint where and possibly how they occurred (Lewis 8, 13). What should be done if a researcher cites a particular post and that post no longer exists? Or what if that post has subsequently been changed?
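One practical safeguard is to store a small provenance record at collection time, including a timestamp and a fingerprint of the text, so that a later edit or deletion can at least be detected. The fields in the sketch below are my own assumptions about what such a record might minimally contain; they are not drawn from the sources.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import hashlib
import json

@dataclass
class ProvenanceRecord:
    source_url: str      # where the post lived when it was collected
    author_handle: str   # or a pseudonymized ID, to protect privacy
    retrieved_at: str    # UTC timestamp of collection
    text_sha256: str     # fingerprint used to detect later edits

def make_record(url: str, handle: str, text: str) -> ProvenanceRecord:
    return ProvenanceRecord(
        source_url=url,
        author_handle=handle,
        retrieved_at=datetime.now(timezone.utc).isoformat(),
        text_sha256=hashlib.sha256(text.encode("utf-8")).hexdigest(),
    )

record = make_record("https://example.com/post/123", "user_042", "example post text")
print(json.dumps(asdict(record), indent=2))
```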
Final Thoughts
As time goes on, more tools will be developed to help researchers filter data, and more people will bring forward ideas about how we can best overcome these challenges of collecting online linguistic data without compromising the privacy and informed consent of the users that data is collected from.
And, as we continue on this semester, we’ll have to consider how these challenges may have impacted the data we are looking at.
Citations:
Crystal, David. Internet Linguistics: A Student Guide. Taylor & Francis Group, 2011. ProQuest Ebook Central, https://ebookcentral.proquest.com/lib/smith/detail.action?docID=801579.
Hou, Lynn, et al. “Working with ASL Internet Data.” Sign Language Studies, vol. 21, no. 1, 2020, pp. 32–67, https://www.jstor.org/stable/26984276. Accessed 21 Sept. 2022.
Lewis, W., et al. “Linguistics in the Internet Age: Tools and Fair Use.” Proceedings of the EMELD ’06 Workshop on Digital Language Documentation: Tools and Standards: The State of the Art, Lansing, MI, 20–22 June 2006.