{"id":1798,"date":"2015-05-15T10:13:47","date_gmt":"2015-05-15T08:13:47","guid":{"rendered":"https:\/\/inlab.fib.upc.edu\/?p=1798"},"modified":"2015-05-20T07:57:35","modified_gmt":"2015-05-20T05:57:35","slug":"what-data-scientist","status":"publish","type":"post","link":"https:\/\/inlab.fib.upc.edu\/en\/uncategorized-ca\/what-data-scientist","title":{"rendered":"What is a Data Scientist?"},"content":{"rendered":"<p><span lang=\"en\">&#8220;Data science&#8221; was born of the scientific method and is the evolution of what has until now been known as data analysis. Unlike a data analyst, however, a data scientist explores and analyzes data from multiple sources, often huge (known as Big Data), which may have very different formats. A data scientist also needs a strong business vision in order to extract recommendations and communicate them to the business leaders of the company.<\/span><\/p>\n<p><!--more--><\/p>\n<p><span id=\"result_box\" lang=\"en\"><span class=\"hps\">A Data<\/span> <span class=\"hps\">Scientist<\/span> <span 
class=\"hps\">is an expert in<\/span> data science whose job is to extract knowledge from data in order to answer questions.<\/span><\/p>\n<h2><strong><span lang=\"en\">What is &#8220;data science&#8221;?<\/span><\/strong><\/h2>\n<p><span lang=\"en\">&#8220;Data science&#8221; was born of the scientific method and is the evolution of what has until now been known as data analysis. Unlike a data analyst, however, a data scientist explores and analyzes data from multiple sources, often huge (known as Big Data), which may have very different formats. A data scientist also needs a strong business vision in order to extract and communicate <span 
class=\"hps\">recommendations to<\/span> the business leaders of the company.<\/span><\/p>\n<p><span lang=\"en\">These data sets can come from all kinds of electronic devices (such as phones, sensors of every type and genome sequencers), social networks, medical data and web pages, and they significantly affect current research in many fields, such as the biological sciences, medical informatics and the social sciences.<\/span><\/p>\n<h2><strong><span lang=\"en\">What process does a data scientist follow?<\/span><\/strong><\/h2>\n<p><span lang=\"en\">The process a data scientist follows to answer a question can be summarized in five steps:<\/span><\/p>\n<ul>\n<li>Extract the data, regardless of its source (websites, CSV files, logs, APIs, etc.) and volume <span 
class=\"hps atn\">(<\/span>small or Big Data).<\/li>\n<li>Clean the data.<\/li>\n<li><span lang=\"en\">Process the data using different statistical methods (statistical inference, regression, hypothesis testing, etc.).<\/span><\/li>\n<li>Design new tests or experiments.<\/li>\n<li><span lang=\"en\">Visualize and present the data graphically.<\/span><\/li>\n<\/ul>\n<h2><strong>What is expected from a Data Scientist?<\/strong><\/h2>\n<p>A Data Scientist is expected not only to be able to approach a data exploitation problem from the point of view of analysis, but also to have the skills needed to cover the data management stage. 
The aim of this kind of profile is therefore to bring together two worlds, data management and data analysis, that until now had been kept separate. Given the new requirements of volume, variety and velocity (the three V&#8217;s of the standard definition of Big Data), it has become essential to carry out this exploitation through a combined profile.<\/p>\n<h2><strong><span id=\"result_box\" lang=\"en\">What profile must a Data Scientist have?<\/span><\/strong><\/h2>\n<p>The profile of a Data Scientist is like a magic potion. Its main ingredients are advanced skills in computer science, mathematics and statistics, and machine learning, the ability to handle large volumes of data, the ability to communicate the knowledge extracted from the information, business vision, etc.<\/p>\n<p><span lang=\"en\">Since data science is multidisciplinary,<\/span> it requires learning many things, and it is a demanding, advanced specialization that takes time. The combination is very powerful and hard to find, though, which may be <span id=\"result_box\" lang=\"en\">why the <a 
href=\"http:\/\/hbr.org\/2012\/10\/data-scientist-the-sexiest-job-of-the-21st-century\/\" target=\"_blank\" rel=\"noopener\">Harvard Business Review<\/a><\/span> magazine described data scientist as the sexiest job of the 21st century.<\/p>\n<p>The graph that heads this article, taken from <a href=\"http:\/\/www.zhaw.ch\/nc\/de\/zhaw\/die-zhaw\/publikationen\/publikationen-zhaw-angehoerige\/zhaw-publikation-detailanzeige.html?pi=206546\" target=\"_blank\" rel=\"noopener\">Applied Data Science in Europe<\/a>, published at the Zurich University of Applied Sciences, and from the <a href=\"http:\/\/blog.zhaw.ch\/datascience\/the-data-science-skill-set\/\" target=\"_blank\" rel=\"noopener\">blog of one of its authors, Thilo Stadelmann<\/a>, details the different skills that a data scientist should have.<\/p>\n<h2><strong>What challenges can we tackle?<\/strong><\/h2>\n<p><span id=\"result_box\" lang=\"en\">To give one example, one of the challenges for current Big Data and Data Science technologies is applying them to the analysis of the huge amount of genomic information now available, and using it to study diseases such as cancer.<\/span><\/p>\n<p><span id=\"result_box\" lang=\"en\" 
tabindex=\"-1\">Consider that the human genome, spread over 23 pairs of chromosomes, consists of about 3,200 million base pairs of DNA containing approximately 20,000-25,000 genes. Determining which combinations of these genes are significant for certain diseases opens the door to thinking that someday we will have personalized medicine.<\/span><\/p>\n<p><span lang=\"en\">There are currently many open data sources that we can analyze, for example <\/span><a href=\"http:\/\/opendata.bcn.cat\/opendata\/ca\" target=\"_blank\" rel=\"noopener\">the open data of Barcelona City Council<\/a><span lang=\"en\"> or <a href=\"http:\/\/www.pediatriccancergenomeproject.org\/site\/\" target=\"_blank\" rel=\"noopener\">the cancer genome data<\/a> from <span 
class=\"hps\">the<\/span> Pediatric Cancer Genome Project of St. Jude Children&#8217;s Research Hospital and Washington University.<\/span><\/p>\n<p>You can also take part in data science challenges, such as identifying signs of diabetic retinopathy in images of the eye. This and other challenges are published as <a href=\"http:\/\/www.kaggle.com\/competitions\" target=\"_blank\" rel=\"noopener\">Kaggle competitions<\/a>, where, if you are good, you can win good prizes.<\/p>\n<h2><strong><span id=\"result_box\" lang=\"en\" tabindex=\"-1\">How can I learn?<\/span><\/strong><\/h2>\n<p><span lang=\"en\" tabindex=\"-1\">A good way to learn Data Science is the specialization on the MOOC (online course) platform <a href=\"http:\/\/www.coursera.org\/specialization\/jhudatascience\" target=\"_blank\" rel=\"noopener\">Coursera<\/a>, whose nine courses are offered <span 
class=\"hps\">for free.<\/span><\/span><\/p>\n<p><span lang=\"en\" tabindex=\"-1\">At inLab FIB we have been working on data analysis for many years, in areas such as modeling, simulation, optimization and the analysis of learning (<\/span><a href=\"http:\/\/inlab.fib.upc.edu\/en\/learning-analytics\">Learning Analytics<\/a><span lang=\"en\" tabindex=\"-1\">). With the arrival of technologies for handling large volumes of data (Big Data), we now have powerful tools that complement this area.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>&#8220;Data science&#8221; was born of the scientific method and is the evolution of what has until now been known as data analysis. Unlike a data analyst, however, a data scientist explores and analyzes data from multiple sources, often huge (known as Big Data), which may have very different formats. 
A data scientist also needs a strong business vision to [&hellip;]<\/p>\n","protected":false},"author":8,"featured_media":1791,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[572,1],"tags":[],"experteses":[8],"class_list":["post-1798","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-blog","category-uncategorized-ca","experteses-datascienceandbigdata-en"],"acf":[],"_links":{"self":[{"href":"https:\/\/inlab.fib.upc.edu\/en\/wp-json\/wp\/v2\/posts\/1798","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/inlab.fib.upc.edu\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/inlab.fib.upc.edu\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/inlab.fib.upc.edu\/en\/wp-json\/wp\/v2\/users\/8"}],"replies":[{"embeddable":true,"href":"https:\/\/inlab.fib.upc.edu\/en\/wp-json\/wp\/v2\/comments?post=1798"}],"version-history":[{"count":0,"href":"https:\/\/inlab.fib.upc.edu\/en\/wp-json\/wp\/v2\/posts\/1798\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/inlab.fib.upc.edu\/en\/wp-json\/wp\/v2\/media\/1791"}],"wp:attachment":[{"href":"https:\/\/inlab.fib.upc.edu\/en\/wp-json\/wp\/v2\/media?parent=1798"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/inlab.fib.upc.edu\/en\/wp-json\/wp\/v2\/categories?post=1798"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/inlab.fib.upc.edu\/en\/wp-json\/wp\/v2\/tags?post=1798"},{"taxonomy":"experteses","embeddable":true,"href":"https:\/\/inlab.fib.upc.edu\/en\/wp-json\/wp\/v2\/experteses?post=1798"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}