Welcome everybody to lecture one of information retrieval in the winter semester. It looks like summer semester outside, but it's actually the winter semester, 22/23. I'm happy to see so many people in the room. It feels like a party, so I think that's the first time in years that it's so packed. It will become less, so don't worry if you have to stand now. Naturally it will become less in the room, more people online or via other means. But it's very nice to see so many of you here. It's also very nice to see 30 people on Zoom. So yeah, there are many ways to participate. So let's just start. So today's lecture, naturally it's the first lecture, so there will be a lot of organizational stuff. I will tell you a bit about how this all works: credits, exam info and so on, the course systems. There will also be a little bit of contents, but it will be very much on the light side, so it's just getting started in the first lecture. But the first exercise sheet will already be implementing a mini search engine, so as you might know, this lecture has a very practical aspect. I will talk more about this later, so mini search engine, movie search. Okay, now first the organizational part. So what's the, yeah, let me start with three demos. So this course is about search engines, so let me show you a few search engines. So I've prepared one here, that's DBLP. DBLP is a search engine for, I don't know if you know it, every researcher in computer science will know it. It just searches in articles on computer science. So here we can search something, for example, let me type information retrieval. It does search as you type. Now you get all kinds of publications on information retrieval, people who have done something in information retrieval, conferences relevant for information retrieval. And I wanted to show you, and I will show this here. Let me just clear these screens. This is actually a search engine powered by stuff we have done.
And I can show you what it looks like if you have a search engine that's actually used by a lot of people. So this is the search engine right now, and let's just check if we type something here. Let me just do this query again. If we find it here, did we find it? Here we find it. Information retrieval. Maybe we can, and you can see it's used by a, okay, let's just type something maybe which we recognize. A lot of O's here. Let's see if we see them. Yeah, there they are. So you see, you get, and let's also look, here's another interesting one which I prepared for you. That's just the countries where the requests are coming from, so that's also live, accumulated from when I started the command. So you see, it's very interesting to look at this: depending on the time when you look at it, you get a lot of requests from certain countries. So right now we are getting a lot of requests from China, United States, Hong Kong and so on. So that's what it looks like when you have such a search engine in action. So it's quite thrilling actually. And of course it's a challenge to build something like this so that it always works 24/7. And you will learn how to do this in this course. Here's another one, Wikidata Entity Search. Who knows Wikidata in the room? Who has heard of Wikidata? Still a few people. It's one of the fastest growing Wikimedia projects. It's like the sister project of Wikipedia. And we will talk about it several times in the lecture. Interesting, it takes a while to get there. Maybe, yeah, let's see. So just very briefly to explain what Wikidata is. I'm surprised at the speed here. So this is, you know, there's a Wikipedia article about Earth, this is the Wikidata article about Earth, and it's the facts about Earth.
So let me maybe just look for comparison at the Wikipedia article on Earth. No, not the German one. I don't want the German one. It's a bit slow, and I'm not sure why. Maybe the machine also has to get used to whatever. So this is the, do you have an idea, Frank, why it's so slow? Yeah, but it doesn't really matter. That's the Wikipedia, you all know Wikipedia articles, I don't have to ask. It's just a lot of text. But you also have these info boxes on the right, right? And you also get them in search results nowadays. And these info boxes are more of the type: you have a property here, like a verb in a sentence, a predicate and an object. So the aphelion of Earth is this distance, or the eccentricity of the orbit is this number and so on. And what Wikidata is, it's all this structured data, it's all this triple data, the Earth is a planet and so on. And we will see a search on this in a second, that's my third demo. And one thing, of course, this also takes a while. Let's say we are looking for something, maybe University of Freiburg. It's really slow. Yeah, the lag is incredible. I don't know why. But I don't think it has anything to do with Wikidata. It has to do with this machine or the internet. So I type something, so there will be an article about the University of Freiburg here. But just one problem. So let's look for the Ukrainian president, for example. I'm typing here Vladimir something, and then I think I will not find him because he's written with a W. So if I write him with a W, I will find him, I think. So now I will find him. But this is something which we, so this is the official Wikidata Entity Search. We will actually build something better in lectures six and seven. So halfway through the course you are already able to build something better than what is out there. I think that's quite exciting. And then searching the data you have just seen, that's also made in Freiburg. It's one of the biggest projects in our group at the moment.
How do you search this data, all this triple data, as it's called, 18 billion triples? And let me just tell you, each statement is called a triple. There's a search engine called QLever. And if I click on here, you can also try all of this online. It wasn't so slow when we tested it 20 minutes ago. So maybe it has something to do with Zoom running, or I don't know, there's Camtasia running, recording, Zoom running, maybe the machine has some problems with it. So let's try anyway, and let's look for all people with a certain first name, and I'm not explaining too much now how this works, you can just watch it and enjoy. Give me a first name that's maybe not so rare but also not super frequent. Any name, to show that it's live? Wolfgang. Wolfgang. Beautiful name, Wolfgang. Here we are, there's a Wolfgang, there we have it. Okay, and now let's first look at how many people Wikidata knows with the name Wolfgang. Things have IDs in Wikidata. Let me also add the names. So for Wolfgang, Wikidata knows about over 9,000 people. And Wikidata probably also knows where these people were born. So let's do that. Their place of birth, and we're not interested in the exact place but just in the coordinates on the map. Let's just do that and let's add this here to the query result, and now we have, okay, the location of the birthplace is known for over 4,000, and now we can see on a map where all the Wolfgangs in Wikidata were born. Again, takes a while, which is not the fault of... Okay, so we see there's a certain concentration in certain parts of the world. So it's pretty much... I'm sorry that it's so slow. I'm zooming in and out, okay. You can try it on your machine, it's super fast and everything, it's just, yeah. It's really slow. So it's mainly a German name, interesting. Okay, that's it for the demos, let me just show you this amazing number so that you also see it here.
So this is really almost 18 billion, not 18 million, yeah, it's just three more digits, 18 billion statements of the sort "so and so has this first name", whatever, about all of human knowledge. It's just enormous. Okay, so you will learn in this course how to build something like this. What are the research topics behind this? The door is still open in the back, right? I think it would be good to keep it open just for air. We can't keep these things open, it's too loud. How do you do it that you have such an enormous amount of data and you get fast queries? They were not as fast as they could have been just now because of latency issues with this machine, but this is about indexing, and we will start with this today. Ranking is important, the order in which you present things. You have to store all this data in some form, or the index. That's about compression. We will have a lecture about this. Error tolerance: you have seen it for the entity search. You type the wrong letter, you don't find it. It shouldn't be like this. Very important. Web app stuff: it ran in a web browser, communicating with something on the back end, that just belongs together with search engines very much, and you will learn all the basics in this course. There is quite a bit of machine learning, there are three lectures about this, for when just fixed rule-based stuff doesn't work. Knowledge graphs, that's the last thing we saw. Wikidata, that's a knowledge graph. And also evaluation: you build a system, you want to find out, okay, how good is this really? You type a few example queries, that's one thing. Doing it more systematically is another thing, and we will learn how to do that, like measuring, computing certain measures. Now a few more comments about how this course is organized, and you can of course ask any time. So the lectures, this is very important, and we will also write a post, we have a forum, I will talk about it later.
We want to have a little more time in case you have questions. So we want to start at five past two in the future. That's why it's in red and bold here. But there will always be a part, like ten minutes or so at the beginning, which is about organizational stuff, about the last exercise sheet and so on. So the actual contents of the course start at 14:15. The rest will be on the recording, so if you cannot make it at that time that's fine too, but try to make it at that time. So this will be the start, also a little bit longer, then we have one hour 45 minutes. I also want to have breaks in between, that's just more relaxed. That's just from experience from previous years, so please make a note of this. It will be in this room, or you participate via Zoom, how you like. One or two short breaks in between. 13 lectures, so we actually removed one lecture to de-stress you a bit, just to make it a little more relaxed. So there will be the Christmas break, there will be one lecture-free week in two weeks already, yeah, and there will be another one, we don't know yet. All lectures are recorded and live streamed. People on Zoom are enjoying it right now, 33 of them. The editing will be done by Alexander, who has been with us for some time now, does a great job. Everything is on our Wiki. We will see this later, and there's also a versioning system. I will also talk about that later. The exercise sheets, they are, as usual in my courses, in our courses, the most important part of the course. One sheet every lecture, you get one today. Twelve sheets, one less than in the last years, for de-stressing purposes. The deadline is always noon before the next lecture. And here's a serious comment. It's really serious, it only affects about 10 or 15 percent of people, but not 1 percent, which is why I have to say it. You are totally free and even encouraged to think about and discuss the sheets together.
Yeah, meet in groups, online, in presence, however you like. But in the end, you have to do it yourself. It's usually coding or sometimes also theory, then of course you shouldn't copy, but also the coding, 100%, 100%, not 90 or 50%. And the reason is simply that you don't learn anything. If you just sit there and watch somebody else do the code or you copy code, you don't learn it, you only learn it if you do it yourself. It's really important. And this pertains to everything, so you're not allowed to copy from the internet, from someone else, from the master solutions from previous years, which by the way you shouldn't even have. And yes, so this is really serious, most of you don't do this. So there are, maybe let me quickly show this at this point. Here's the wiki page. Let me just see if this works. This is the first exercise sheet. Yeah, it works. It says it here in red also, so I've now said it. It's on the slides, it's on the recording. There's a link here which goes to the wiki, where it's also rule number 10. It says it all again with a little more detail. So these rules, I mean they're written on the exercise sheet, you should absolutely read them once, the first time. It's just one page, because these are the rules for the rest of the course. And you also find it on the wiki. Yeah, so it's like ten times. So there's no way you can say you didn't know. That's just the point of this two minute ad section here. There's no way you can say you didn't know. The only exception is: we write code in the lecture, and everything we have written in the lecture we put in this repository, and that you can use. And it's called SVN public for a reason, from the course of this semester, not from some public one from 10 years ago where maybe you find relevant stuff. So please adhere to this, and we are very good at finding out if you do this anyway. And I come to your question in a second.
Some of you think, and I have to say it again, and it affects 10, 15 percent of people who still try it, so it's not few, it's too many, which is why we have to install these rules. And people try to change variable names, change things, write it a little bit, it doesn't help, we find it anyway. It's really, really hard to cover up plagiarism. It's really hard, in fact it's so hard, it's harder than doing the exercises yourself. So I would recommend instead of investing your energy there in covering up plagiarism, just do it yourself. We will find it. And there's a question. Yes, please. Can I use code that I for example wrote previously for some other purpose? Your own code, yes. Your own code, if it's your own, you can do it. And by the way, that's also a general thing at a university and in science, you should learn. Whenever you use something, even by yourself, just write it there, right? Just write it in the comments. I use this from there. I mean, that's how you should do it, citation. So even in this case just say, I use this from some other project of mine. And it's also good practice to write, I use this from the lecture. Okay, tutorials. There will be a weekly Q&A as long as people attend. So at some point people don't come anymore then we will just stop it or just one person comes. Time to be negotiated and let me just try one thing now. I've prepared a number of surveys and here's the first one and I will just let this run for some time and for that it would be good if also the people in the room just go to the zoom meeting take your time I will leave this up for 15 minutes now I'm just launching this now. Ah wait maybe I shouldn't I'm wondering when I'm launching it and then you are participating in the meeting after I launch it will you still see the query? And how do I end this? The poll is ended. Okay, stop sharing. Can I now start it again? Now I can't start it again, that's not true is it? Come on, I should be able to start the poll again. 
There's three dots here. Relaunch poll, okay, can you see it by the way? Yes, you can see it. Okay, so people in the room, just log into Zoom. On the Wiki you will find it. I hope you can still pay attention to what I say. And anyway, there will be a small break after this organization section, so you can still do it then. So just so that you can prepare mentally already, it's just about the Q&A, Friday this week. You don't have to come, you can just come if you have questions. There are just four time slots and we just want to know which one is most convenient for you, and it's a multiple choice question, and don't just pick your preference, pick every time that's possible for you. Yes please. It's on the wiki, and the wiki is on the exercise sheet, and the exercise sheet is on the wiki, so we have a little circular thing here, but I'm sure, so one thing is to Google, I could also post the link on Zoom, but this answer doesn't help. You see, it's not so easy. So if you do AD teaching wiki or something, you will find it. So one way to do it, I think, is if you just google AD wiki, I think that's a popular search request. If you search AD wiki you will probably get to our teaching Wiki. On the teaching Wiki, let's reserve some, it's really slow, here's information retrieval, that's our only lecture this winter semester, and there it's on the top, meeting link, it has the ID here. Just take your time, and then I will just launch the poll in, I don't know, in the break probably. There's a forum for all kinds of questions, I have a separate slide about that. You will receive personalized feedback for each of your sheets, usually pretty fast. The feedback is in a special file. We have a lot of tutors for this course. And if I can have your attention for a second: the assistant for this course, it's Natalie, she's here, maybe you can briefly stand up for a second. That's Natalie, she will help me a lot. So thank you very much in advance, Natalie.
And the style of the lectures. What will I do? I will provide motivation, definitions, examples, and we will code a lot together. I will not explain all the details to you. You have to figure these out yourself in the exercise sheets. You only learn stuff by working out the details yourself. There's theory, but it will always be clear what it's good for. There's no theory for the sake of theory, although that's also nice, but we do not do it in this course. I also have a question about this, but let me just wait for you. Maybe let's start with that survey to see how many people. I have a survey about theory. Let me try that. Here's one, theory versus practice. Bam. Here are some options. Let's just see. I will just let it run for a few minutes. So you should see a survey now with four options. Your attitude towards theory. And I hope the people on Zoom should also be able to, of course. There's one topic per lecture, self-contained. We provide all the materials you need, so you don't need any literature, but still, if you want to look at the literature, I mean, there's Google, there's Wikipedia, a lot of great Wikipedia articles about the stuff we are doing, but you don't need it, you can do it without. Okay, some of you are still busy with your devices. I will go a little bit slower. So that's really important. There is, okay, you have to turn off your sound when you log into Zoom. You should understand the concepts, so I will give you the intuition. Understanding it in depth is your task. The exam will also ask for depth, I mean there is superficial understanding, that's good, and there is understanding in depth. You should also be able to implement this stuff in practice. And I think some of you have to turn off your loudspeakers when you're logged into Zoom. So at university, sometimes you have an overemphasis on A, when you are somewhere else, in a company, an overemphasis on B. In this course, we want to do A and B. That's what's a bit special about this course.
There will be master solutions. They will be available, believe it or not, after the deadline for each sheet. These are strictly for your personal use only, now and in the future. I think it's clear, but let me just say it. Let me just state the obvious. Now and in the future they are just for your personal use, yeah, which means under no circumstances should you pass them on to others, except your future self, which does not count as others. So how much work is it? It's a six ECTS course, like most courses at our faculty here in the meantime. We standardized this a while ago. There's this usual calculation, it's 180 working hours, and here are three ways how you can do this. The recommended way is to spend about eight hours a week on average on the exercise sheet, and then you don't have to prepare a lot for the exam, because the exam will be about the stuff from the lectures, which you will exercise in the exercise sheets. You can do a little bit less, then you have to learn a little bit more for the exam. This one is not possible, because you have to reach a certain number of points in the exercise sheets. And let me just repeat it, doing the exercise sheets is the best way, I mean we spend a lot of effort on them, we invest a lot of time in them, it's the best way to understand everything and to prepare for the exam. So for the Studienleistung of the course you have to reach a reasonable number of points in the exercise sheets, half of the points. Just to make it clear, that's not a real hurdle, it's just to motivate you to work on the exercise sheets. Everybody, if you follow the course, you do the sheets, you will reach 50% of the points. So it's not a hurdle, it's motivation, help. In the end there will be 20 points if you participate in the evaluation. This will just replace your worst exercise sheet, so that will make it even easier to get the 50%.
You can also use this as a joker if you want to skip a sheet or if you are sick or something like this. 50% will not be a problem. There will be a written exam in the end, written because we are so many people, around 100 usually participate in the exam. The date will be fixed in the second half. There will be four tasks, 25 points, you just see here how, so we just make it very transparent in the beginning. Lots of exams from previous years are on our page, you don't have to go to the Fachschaft or anywhere else, they also have them probably, but we just have them online, you can access them, like ten at least from previous years. So now we have one poll here, and so let's see, theory. "I love math and I hope the course has a fair portion of it": 22 percent, you will not be disappointed. "Math is okay as long as I understand what it's good for": 50 percent, you will not be disappointed. "Math is tough for me", very honest answer. Of course, these surveys are anonymous. We will try our best to give you crash courses on things. And 16%, that's interesting, wouldn't mind if we throw out all the stupid math. Sorry, won't happen, but thank you for your honesty. Okay, and there's another survey which is maybe, let's do it now. No, this is the, I have to relaunch the Q&A one. And this is now, yes, interesting, this is now running, and since this is the end of the organization part, we're just having a four minute break or so. You're welcome to chat or anything, we will open the doors, you will hear a nice sound and then please come back, so it shouldn't take too long to come back, so four minutes and then we just continue. Same for the people on Zoom. Okay, so see you in four minutes, participate in the poll and I will set the timer. So there will be a Q&A on, and it's really Q&A, it's not like we, I mean the contents are in the lecture, we will just be there to answer questions.
Maybe it's questions about the setup in the first week, maybe it's questions about the sheet. Several people will be present, so if you have very individual problems we can also go into breakout rooms, and it will be via Zoom. I think that makes sense, and just one event. Okay, so that's the result of that one. Ah, by the way, I don't know, could the people on Zoom already see the previous one, or do you only see it now? You could also see the previous one, okay, great. So, back to the second half of the, is there any question about the organizational stuff? Yeah? So the Q&A is going to be between one and two. Between one and two, and there will be an announcement about this. And you can just come and ask your questions. Yes please. No, no, there will be no. That's the weekly Q&A where you can come and then things can be explained. But we do it in a you ask, we reply fashion, because you don't need more frontal teaching where you just sit and passively receive. Yes? Specific language? You mean a programming language? Yes, there will be a slide about this, it's Python, but actually, now that you ask, let's address it now. So the first rule is called programming language. We recommend Python, because I will do Python in the lecture, because it will just relieve us from all language specific stuff, Python is just easiest to use. You can also use Java. I know it's ridiculous. Or C++. I don't have a preference, no? But it will be more work. So in the past, we used to support all languages fully, in the sense that we provided templates and everything in all three languages, but it was too much work. And 99% of people used Python, so now whatever we give to you is in Python. You can use one of these two other languages, you are welcome to do it if you want, it will just be more work. But in the last years, everybody used Python, I think.
But for some of the efficiency sheets, another language might make sense. Okay, this is a little bit of contents for today, and the exercise sheet will be about this, and then there will again be a break and some practical stuff. So keyword search. So we have a collection of text... Interesting. That was not deliberate, but yeah, why not? Keeps us awake. Text documents, for example, the web. So for exercise sheet one, we have prepared for you 100,000 movie descriptions. Let's just look at them together. I think they are linked. So for most of the exercise sheets you will get a nice data set. So let's just look at this one, 23, and it's called, as I said, it's linked on the wiki, movies.tsv. We should probably delete the other one, right? And here it is. And it's just movies, 100,000 movies. You see, ranking is important. They are not in random order. They are in IMDb order, by number of users who voted for them. And what you have here, so if we just maybe take just the first line of this file, head -1, then we have the most popular movie on IMDb for some time now, the, oh wow, it's a long text, The Shawshank Redemption. You have a text here. You also have some additional columns. It's tab-separated values. So several columns with a tabulator in between. That's the number of IMDb users who voted for it. That's the score. 9.3 is really high on IMDb. And this is the number of Wikimedia articles about it. So basically we have text and titles. Let's just, I have another survey prepared on this. Let's just look at the first column, just the title, and the first ten. These are the top ten movies on IMDb, and I've prepared a very important survey on this, namely, stop sharing, which of these movies do you know? So that's of these 10. None of them? All of them? What's in between? Here they are. Okay, that's the list.
Just have a quick look, and I'm curious. Top 10 IMDb movies, let's see how much of a moviegoer you are, and whether there are some tens among you. So I'm a 10. Definitely a 10. And let's go. Yeah, you can also go to the file on the wiki if I'm now going away, if you want to see them, we'll just let it run. So now we have a keyword query, let's say astronaut and moon. And now we want to find text records which contain these two words. That's text search at its most basic. And for now, I mean, search engines, they may ignore some words, for now we just return documents that contain all the words. So all movies that contain astronaut and moon, for example. For the exercise sheet, just return three documents. The sheet also says something about the selection. In the next lecture we will talk about ranking. Of course, if you search matrix, you don't expect just any movie which has somewhere the word matrix in the description. You would expect the three Matrix movies, or four of them, at the top. Yeah, so this is just the start, the first exercise sheet. In the next lectures we will see a lot of refinements of this, which will get you closer and closer to building a real engine. Ordering, ranking is important, lecture two. How to do it fast, lecture three. How to save space if you have a lot of data, lecture four. Error-tolerant search, lecture five. Actually building a search engine, that's two lectures, because that's really a lot of stuff, but it's super important, you should know it. Synonyms, like other words meaning the same thing, and more stuff in the later lectures. Today is just an absolute minimum, but it's nice that in a single exercise sheet, which is not too hard, you can already build a mini search engine. And here's the solution, so I just tell you how it's done in principle, it's pretty simple.
This is not what we will do: you could just use grep on the command line. grep, given a file, just finds all records which contain certain words. I will not show it. It's actually not so bad. So for our movies file it would work. You can scan through a gigabyte in a fraction of a second, but for 100 gigabytes it would be too much, and if you have to search the whole web, which is 60 billion pages nowadays, it would be very slow to just go through all of them. Interesting fact on the side, the number of web pages, I mean it increased a lot, exponentially probably, for a number of years. It has stayed pretty constant for the last 10 years. So it's not the number of web pages around, but I mean the ones indexed, the ones you have in your search index, the ones you want to return to users, it's the meaningful ones, right? And so the amount of meaningful content on the web, this, I think one can say, does not really increase anymore. Think about it, Wikipedia has six million articles, six million, that's nothing, and it contains like all the knowledge of the world. Six million, sixty billion, yeah, it's not really about quantity. Let's see, let's look at the result of the survey, I'm really interested. There we have it, 21 people have seen all 10 of them. Great. The other ones, 5%, okay. Interesting. If you have an interesting story, we have some non-moviegoers and yeah, but most of you like movies and also good movies. I think the IMDb rating is a good one. It's not trashy movies on the top, good movies. Okay, and we will learn more about movies in the course of this lecture. Inverted index. So how do you do it? That's what you need for the exercise sheet. It's quite simple. What you do is, for each word that occurs somewhere in your text, the movie descriptions for this exercise sheet, say astronaut, you have a list of the records or documents that contain this word.
So text record number 13, so maybe it's the 13th line in our file, contains the word astronaut. 57 contains the word astronaut. It's like the index which you have at the end of a book. It just tells you: this word, it's contained on these pages. And you have that for all the words. And that's an inverted index. These lists are called inverted lists. Why are they called inverted? Because in a sense, when you have the document, so let's look at the document again, this tells you text record number one, which is the first line, contains these words. It contains the, it contains Shawshank, it contains redemption and so on and so on. So for each document it tells you the words, and this here, for each word, tells you the documents. That's why it's like inverting the data. And depending on what exactly you store here, it's actually lossless. Here we don't have the positions in the documents. If you would also store them, it would be the same data, just in a different representation. Actually, one remark for the first exercise sheet: a document may contain a word several times. You should see to it that it is contained in this list at most once. So even if document 23 contains the word moon three times, 23 is here only once. That's important, and there is a little pitfall here which is written on the exercise sheet. You have to pay attention that your running time is still okay. It's written on the exercise sheet. So you can also store pairs, like say 23 comma 4, saying moon is contained four times. This is something for lecture two. For lecture one it's really the most simple way to do it. How do you now process a query? Let's say your query is the word astronauts.
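Before moving on to queries, here is a minimal sketch of the inverted index as just described, in Python. It is not the master solution: the function name `build_inverted_index`, the word definition (lowercased runs of ASCII letters) and the three toy records are all assumptions made for illustration. Checking only the last entry of a list is one cheap way to keep each record ID at most once; whether that is exactly the running-time pitfall the exercise sheet means is a guess.

```python
import re
from collections import defaultdict


def build_inverted_index(records):
    """Map each word to the sorted list of record IDs that contain it.

    Record IDs are 1-based positions in `records` (think: line numbers).
    """
    inverted_lists = defaultdict(list)
    for record_id, text in enumerate(records, start=1):
        # Assumption: a word is a maximal run of ASCII letters, lowercased.
        for word in re.findall(r"[a-z]+", text.lower()):
            postings = inverted_lists[word]
            # IDs arrive in increasing order, so a repeated word within the
            # same record shows up as the last entry. Looking only there
            # avoids scanning the whole list for duplicates.
            if not postings or postings[-1] != record_id:
                postings.append(record_id)
    return dict(inverted_lists)


# Toy records, made up for this example (not taken from the movies file).
records = [
    "An astronaut travels to the moon",
    "The moon, the moon, the moon",
    "A film about an astronaut",
]
index = build_inverted_index(records)
print(index["moon"])       # [1, 2]
print(index["astronaut"])  # [1, 3]
```

Because records are processed in order, every inverted list comes out sorted without an extra sorting step, which is what the intersection of two lists relies on.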
You type astronauts and you want to find the documents containing astronauts. Well, you have pre-computed it: they are the documents in the inverted list, so you would just output 13, 57 and 61. Of course you don't output the IDs, you want the documents or the document titles, I think that's what it is for the sheet, so we just have a list where you can look up that 13 is movie blah and so on, and you just return that. It's pre-computed, so it's very fast. How do you do it for two keywords? Let me just show you the algorithm. It's our first algorithm, and it's really quite simple. I will show you the idea and then you can implement it. Let me show you the result list here. So you have two lists now, and you want to find the IDs which occur in both lists, which means you want to compute the intersection of two lists of sorted numbers. For the algorithm I'm showing you now it's important that they are sorted. So how do you do it? You start with two pointers. They're not really pointers, they're indices; let me call them i and j, and let me write them to the side. So this is a variable i, this is a variable j, both starting with zero. Now you compare the two elements. They are not equal, so you advance the pointer in the list where the element is smaller. So I'm advancing here. You can think about why this is a correct algorithm; we will not prove it now. So I've advanced in this list, and now I'm comparing 13 and 23. They are not equal, so I advance the pointer in the list where we have the smaller one, so it's now here. Now I'm comparing 57 and 23, not equal. I advance in the list with the smaller one; 23 is smaller than 57. Now I have a match. I think I have to write it a little bit lower, let me just write it here. So now 57 is in my result list.
Now I can remove this again. Now I can proceed in both lists because the 57 is there. 61 I process in this list. No match, I proceed here. 114, and so on. So that's a simple linear-time algorithm to compute the intersection of two sorted lists. Why is it linear? I think that's easy to see, because you are only going to the right. You are not going left and right, or starting again from the beginning. You have these pointers, and in every step you go one to the right in the one list or in the other list, so you are making progress in every step. That's why the running time is linear in the total size of the two lists. The same principle can be used for computing the union, for merging, if you just want the sorted order of the complete list. This is something we will do in the next lecture. For this lecture it's just intersection, and intersection means you get the documents which contain both words. Okay. Let me just see that I don't miss anything. Ah, I have one more survey, about programming skills; you can already mentally prepare. Now Natalie, I'm not sure: in the exercise sheet, is it just two keywords or more than two? Do we have an arbitrary number? We have an arbitrary number of keywords. Good. So it's not that trivial. How do you do it if you have more than two keywords? I don't have an example here, but now you have k inverted lists and you have to compute the intersection of them. How do you do that? Well, for the exercise sheet this is good enough: you just compute pairwise intersections. Let's say you have three of them: first intersect L1 and L2, then you get the intersection of these two, and then intersect that with the third one. And you can do that in any order, so it's actually simple. You just implement the pairwise one and then you can intersect an arbitrary number.
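The two-pointer intersection and the pairwise extension to several lists, as walked through above, could be sketched roughly like this. It is only a minimal sketch, not the reference solution for the exercise sheet: the function names `intersect` and `intersect_all` and the example lists are made up for illustration, and the code assumes every input list is sorted in increasing order.

```python
def intersect(list1, list2):
    """Intersect two sorted lists of record IDs in linear time."""
    result = []
    i, j = 0, 0  # One index per list, both starting at the first element.
    while i < len(list1) and j < len(list2):
        if list1[i] < list2[j]:
            i += 1  # Advance in the list with the smaller element.
        elif list1[i] > list2[j]:
            j += 1
        else:
            # Same ID in both lists: it belongs to the result.
            result.append(list1[i])
            i += 1
            j += 1
    return result


def intersect_all(lists):
    """Intersect any number of sorted lists, one pair at a time."""
    if not lists:
        return []
    result = lists[0]
    for other in lists[1:]:
        result = intersect(result, other)
    return result


# Hypothetical inverted lists, not the ones from the slide.
print(intersect([13, 57, 61, 114], [5, 23, 57, 80, 114]))
print(intersect_all([[13, 57, 61, 114], [5, 23, 57, 80, 114], [57, 90]]))
```

Each loop iteration moves at least one index to the right, which is why the running time is linear in the total length of the two lists.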
Here are some possible optimizations. What's a good intersection order? Think about it. It makes sense to start with the lists that are smaller, because if one of the lists is smaller you have less work to do. So if you start with the smallest ones, L1 and L2, the result will probably be very small, and then you continue with a small list, intersecting it with the rest. There's a more sophisticated algorithm: you can do a k-way intersection of all lists at the same time. You need a priority queue for this, and the running time is log k times the total size of the lists. If you want you can implement it, if you feel fancy; we will talk more about this in a later lecture. It's also not too hard if you know a little bit of algorithms and data structures. A priority queue is what you need. I can maybe very briefly hint at it. For two lists you have i and j, an index for every list, and at every point you compare these two numbers. If you have k lists you have k pointers, and at each point you need to find the smallest of k elements. This is what you need the priority queue for: once you have found the smallest one, you throw it out, you advance in that list, you put the next element in. You do that with a priority queue. It's not needed; just in case you want to do it, I have explained it. How do we break the text into words? Well, that's conceptually simple. Let's just go to our document. You could say, okay, everything is separated by spaces, that's one way to do it. We do it even simpler: we just take the regular Latin letters a to z, ignore all strange characters, and take maximal sequences of those letters. Which means all characters which are not a to z, capitalized or not, are just separators, and what remains are the words. You can do it as simple as that. In reality tokenization is actually a really hard problem. So here are some examples. That's a famous Japanese haiku.
Anyone sophisticated in Japanese script here? If you find it out, tell me. German or Finnish also have these funny words which are really long, and then maybe you want to split those into words too. And then there are all these funny characters, UTF-8. So this is the capital U umlaut, the German umlaut with two dots, and you often see it displayed as capital A tilde, minus. Why is that? We will have a lecture about this. UTF-8, Unicode, it's actually quite interesting to understand how that works. So in reality tokenization is not so easy. If you do a real search it's super important to take care of this, but for this exercise sheet we will ignore it for now. How do you construct an inverted index? I will just leave this here on the slide, and now we will just code something together. This is our coding part now, and afterwards we have a small break again. So I'm now in the right directory, I hope. Let's just do inverted_index.py. Let me start with a copyright notice, you should also do that. My Python skills, I haven't used them in a few months, so maybe I need some startup time. So this is a simple inverted index as explained in lecture one. And there is a typo. If you see any mistake while I'm doing it and it stays around for longer than three seconds, feel free to shout at me or write it in the chat. Our goal is to write the code and then it runs and it just works. So let's see whether we manage that. Okay, so we need a constructor; in Python that's done with this. How do we represent the inverted index? We represent the inverted index as a map from words to lists of IDs of text records. So we have a map. We start with an empty map. And you will understand this in a second. So, inverted lists.
So these are my inverted lists. For every word I have something, so it's a map, and in the beginning it's the empty map because I don't have any words yet. I think this will become clearer in the following. Let's build an inverted index from a file. Let's just have the file name here. Do we have it with an underscore or without an underscore? We have an underscore. So: build an inverted index from the given file. The file, and that's like the file we give you (we also have an example file), contains one text record per line with tab-separated values. That's called TSV, which means you have something, then a tabulator, then something else. The first column is the title, the second column is the text, the other columns you can ignore for now. I'm doing this live so that I'm also showing you some coding stuff, and it's actually good practice too: when you write a function, you first write down what you want to do and then you do it. And as I go along here, I think I should copy my example file. I have an internal directory here; you don't have it but I have it. Where am I here? Okay, I should go to public code, lecture one first, then one up, one up, and here it should be internal. You don't have internal; I only have it because it has the example file. So that's just an example file, simplified. You will also have to write tests. Here the title is just doc1, doc2, doc3. The text is very simple, and here we have these numbers which we can ignore. For some reason it's a bit slow; we will be faster with this later. By the way, let me explain a bit what you're seeing here. I'm logged into a machine which is over there in our rooms.
I'm using two windows. One window is for the console stuff, when I'm running a program. That's one of our machines, called Touha. We will use it for the rest of the semester. And here I have an editor. When I open two windows, you see the name of the file at the bottom. So here at the top I have opened example.tsv; the name is at the bottom, the rest is just the cursor position. Here I've opened the file inverted_index.py and the name is at the bottom. It makes sense to have them both here. And now let's just write this code together. So we want to open this file. Let's see how much Natalie will have to help me if I have gaps in my Python skills right now. I want to open the file, and now I want to iterate over all lines in the file, and you can also think about how you would do it. In Python that's pretty simple, I think: for line in file. So what do we do now? Now we want to get the title and the text. We ignore the rest, right? The title is column one and the text is column two. How do we do that? Let's just try something first: line.split. I'm not quite sure how this works; if you know it, just tell me. By what do we want to split? I think by a tabulator, and then I want at most two splits. I'm not quite sure what split gives me then, but I think it gives me three things. Let's maybe just run it. So let's do a main function to just play around. If we call this as the main program, then what do we do? Let me just put this up a little bit. So how do we use this? All your programs should look like this: in your main function you read the command line arguments, and if something is strange you print usage info showing how it should be used. The command line arguments, I think, are in sys.argv, it's just a list.
So if the length is not two, then let me tell you how it should be executed: python3 inverted_index.py and then it should get a file name. And then we exit, maybe with an error code, not important. And if we called it the right way, then our file name is the first argument. The zeroth argument is always the program itself. I think we have to import sys here. Right now we are just going over the lines, so let's just build from the example file. How do we do this? We create an inverted index and then we call build from file with the given file. So what do you think of this code so far? So far we are just starting to read the file. Do you think it will run? It's also our joint responsibility that it works. Let's see. That's good: I'm not calling it with the right number of arguments, so I should get the usage info. Okay, build from file takes... What did I do wrong? Here's the error message: self, very good. So sometimes I insert errors deliberately, but most of the errors are not deliberate. I could claim they are for didactic purposes, but it's not true. Okay, so it does something. It doesn't complain anymore. Let's just see if this works by printing out the title and the text. Let's just do it with a space in between, just to get used to some Python here. Now I'm just parsing title and text, ignoring the rest, and printing them out. Let's see how it works. That works, right? So I think we are fine. Now we have the title and the text. And now let's maybe index both the title and the text. I think that's also how it's done. So let's just add the title to the text, and let's add a space in between. And now let's tokenize this and iterate over all words in the text. Yeah, I also think so. Okay, so how do we split it into words? So for word in, and now we have to split it. I think we need regex, regular expressions.
It's a bit weird in Python that you need to call re, and I think it has to be imported up here. I mean, it should really be the string, then split, and then some regular expression, but I think this was a weird design choice. So now it should be re.split, and you have to say by what you split. By what do we want to split? By anything that is not one of these characters. It's a regular expression: any character that's not one of the upper or lower case Latin letters, and any sequence of them. And now we split the text. Okay, let's just print the word and see how it works, and leave a blank line here. Let's see what it does. That looks good. Now I've split it. There are some blank things here, probably because we have dots and stuff like this, so we have some empty strings resulting. I think we should first convert the word to lower case. How do you convert a word to lower case in Python? To lower? Like this, or with an underscore? No, all lowercase. Just lower. Just lower? Really? Yeah, looks good. Okay, so now we have all the words. If we have an empty word (we could also remove very short words, but let's not do that), so if the length is zero, then we continue with this loop. No colon here. And otherwise... This was all just preparation; now we have to fill our inverted index. Let's do it.
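The tokenization step just described (split at everything that is not a Latin letter, lowercase, skip empty strings) might look roughly like the following. This is only a sketch: the helper name `tokenize` and the sample sentence are made up for illustration, and in the lecture code the loop sits directly inside the index-building method rather than in a separate function.

```python
import re


def tokenize(text):
    """Split text into lowercase words made only of the letters a-z."""
    words = []
    # Every maximal run of non-letters acts as one separator.
    for word in re.split("[^A-Za-z]+", text):
        word = word.lower()
        # Separators at the start or end of the text yield empty strings.
        if len(word) == 0:
            continue
        words.append(word)
    return words


print(tokenize("The Shawshank Redemption. A film, from 1994!"))
# → ['the', 'shawshank', 'redemption', 'a', 'film', 'from']
```

Note that digits and non-ASCII letters such as umlauts count as separators here, which is exactly the simplification made for this exercise sheet.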
So if we see this word for the first time... Our inverted index is a map from words to lists. So: if not word in self.inverted_lists, so it's not in that map, then I would say, and you correct me if you have a different opinion, I insert it here. For every word I have a map entry pointing to the inverted list for that word, and initially, if I see the word for the first time, that list is empty. So after this, either I've seen the word before and there is already a list, or I've seen it for the first time and it's now the empty list. In both cases I have a list, and now I just append to it. What do I append? Now I need the record ID, right? I think we should have a variable for the record ID, which is 0 in the beginning. Whenever I have a new line I increase it by 1, which means for the first line in the file it will be 1. And now I just append the record ID. And note, this is very nice: because I process the text records in order of increasing IDs, the lists I get here are naturally sorted by increasing IDs. What comes later will have a larger record ID. So I don't have to sort them afterwards or anything, which is quite convenient. Let's just print them. I think we can just print the inverted lists and see what they look like, whether it makes sense. This is the file, and this is the inverted list. Does it look correct? We have lower case stuff because we're not interested in upper or lower case. Of course, in documents one and two; doc in all three because we included the title; movie: one, one, three. This you shouldn't do for the sheet, but for now I just keep it: it's just there twice in document one, so the one is also there twice. Film is there once, it's only in doc 2. Any questions about this so far? Okay, so I think that's actually what I wanted, and I have one more slide where I will show you something.
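Put together, the code written in this part of the lecture could look roughly like the sketch below. It is a reconstruction from the spoken description, not the actual file from the course repository: the names `InvertedIndex`, `inverted_lists` and `build_from_file` follow what is said in the lecture, while details such as the `encoding` argument and the use of `with` are assumptions. It also assumes every line has at least two tab-separated columns.

```python
import re


class InvertedIndex:
    """A simple inverted index, along the lines of the lecture code."""

    def __init__(self):
        # Map from each word to its inverted list (a list of record IDs).
        self.inverted_lists = {}

    def build_from_file(self, file_name):
        """Build the index from a TSV file: title, text, other columns."""
        record_id = 0
        with open(file_name, encoding="utf-8") as file:
            for line in file:
                record_id += 1  # The first line gets record ID 1.
                # Keep the first two columns and ignore the rest.
                title, text = line.rstrip("\n").split("\t", 2)[:2]
                for word in re.split("[^A-Za-z]+", title + " " + text):
                    word = word.lower()
                    if len(word) == 0:
                        continue
                    if word not in self.inverted_lists:
                        # First occurrence: start with an empty list.
                        self.inverted_lists[word] = []
                    # As in the lecture, a record ID is appended once per
                    # occurrence, so duplicates are kept. For the exercise
                    # sheet each ID should appear at most once per list.
                    self.inverted_lists[word].append(record_id)
```

Because record IDs are assigned in file order, every inverted list comes out sorted without an extra sorting step.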
Yeah, here it's written. So this is not how you should do it in the exercise sheet, but for what I want to show you on the next slide it's actually deliberate. This is actually more lossless, right? The information that movie occurs twice here is contained in these lists. For the exercise sheet you shouldn't do it, but for now I want to do it. Is there any question about this simple data structure, these inverted lists? This is what you will work with in the exercise sheet. Okay, no questions. Then one more thing, then we have a quick break again, and then the last part. Here's an interesting law, Zipf's law. I wanted to show you that; it's also a popular exam question, I can tell you. Let's look at how frequent the words are in this collection. And before I continue with this, let's just code this together. Let's output not the inverted lists themselves, but let me go through them with a for loop. Let me just iterate over self.inverted_lists, no, that's not what I want. It's a map and I want it as pairs. I think that works like this. Now I get, for all my inverted lists, the word and the inverted list, and what I want to output is the size of the inverted list. The size of the inverted list is how often the word occurs. I will come back to this in a second, think about it. And I want the word. Yeah, I think that's completely right, thank you. Very good. So here I want the length of the inverted list and here I want the word. And let's just see what this prints. So what I did now: I just went over the inverted lists. For doc, and we have seen it before, the inverted list was 1, 2, 3, and this is just its length. So what this tells me is: the length of the inverted list is how often this word occurs overall.
And because I didn't remove duplicates, I actually have a three for movie now, because the word movie occurs three times: in document one, again in document one, and in document three. So what I have here is, for each word, the number of occurrences. And now let me sort this with the Unix sort command. You can look it up in the recordings if you're interested. Now I have sorted it by how often a word occurs, the most frequent one first. And let me do that now with our real file. I should be able to call the same code on our real file, right? It will take a little bit longer because it's Python. What are the most frequent words? What's your guess? I don't have a survey for this. Let's look at the top 20 most frequent words. The, and, film, a, by; maybe not so surprising. And you see the nice thing about computer science, I mean that's the one great thing about computers: you write a program for an example file with three lines, and it also works for 100,000 lines. You don't have to change your program, the computer does it for you, right? That's basically the magic of computers. So we get these frequencies; now let's just look at the first column. Let's look at the sorted frequencies, and then I will go back to the slide. So this is just giving me the frequencies. It always takes a while, it's parsing. And the question is, if I plot this now, what will it look like? Let me write it into a file, and let me call that file movies-word-frequencies.txt. It's just what you have just seen, in a file. It's just the numbers. And now let me plot this; let's see if I have gnuplot on this machine.
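The frequency listing produced above (length of each inverted list, most frequent word first) can be sketched as follows. In the lecture the sorting is done outside of Python with the Unix sort command; sorting inside Python is swapped in here only to keep the example self-contained. The function name `word_frequencies`, the alphabetical tie-breaking and the tiny example map are assumptions for illustration.

```python
def word_frequencies(inverted_lists):
    """Return (count, word) pairs, most frequent word first."""
    # With duplicates kept in the lists, the length of an inverted list
    # is the total number of occurrences of the word.
    frequencies = [(len(ids), word) for word, ids in inverted_lists.items()]
    # Sort by descending count; break ties alphabetically (an assumption).
    frequencies.sort(key=lambda pair: (-pair[0], pair[1]))
    return frequencies


# Inverted lists as they come out for the small example file.
example = {"doc": [1, 2, 3], "movie": [1, 1, 3], "film": [2]}
for count, word in word_frequencies(example):
    print(f"{count}\t{word}")
```

Writing only the counts, one per line, to a text file gives the kind of input that is plotted in the next step.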
Let me plot, and let's maybe just show the first 10 of these, and you will see in a second what I want to show. movies-word-, I don't have completion inside gnuplot, frequencies.txt. Let's see if this works. Yeah, it works, and it's gone, because I need to tell it that it should wait. Let me wait for a mouse click, I think that's how it works. Okay. So just understand this picture, it's easy to understand. The plus here, I can't see it. The most frequent word has a frequency around here, I think it's 300,000 something, and we have seen this before. The second most frequent word has this frequency, the third most frequent word has this frequency, and so on. So what's this function here? It looks linear or something, but we just showed the first 10. That's what the 1 to 10 here says: just take the first 10 lines. Let's maybe take the first 100. Now it looks like this. What function is this, what do you think? It's not a straight line, right? The beginning looks straight, but then it goes like this. Let's maybe take a few more. If I take too many, I think it crashes. First 1000. So what you can see is that the frequency goes down pretty quickly and then it stays on a low level. And this is what's called Zipf's law, because the interesting thing is that you can take any text on anything, if it's halfway meaningful, and it will look like this. The picture always looks like this. You have some very frequent words, then the frequencies go down very quickly, and then you have a lot of these rarer words. Actually, if we go to the end of this file here, you will have a lot of 1s. And if I do it with minus n here, I will see the line numbers. So that's now the most frequent, the second most frequent, and so on. So actually we have a lot of words, 170-something thousand different words, and you see, for a lot of them the count is 1; you can go through. Oh my, this is also slow, everything is slow today.
So Zipf's law says how this behaves, and that it behaves like this function. It's a hyperbola, or a bit more general than that: 1 over n to the alpha. And the question is, how do you verify that? I mean, is it just a claim that this is so? Let's look again at the graph. Here it is. Is it really a hyperbola? Maybe there are other functions which look like this. Well, there's one way to check such things, and let me quickly show it to you. Let's say the function is like this: $f(n) = c \cdot n^{-\alpha}$. Let's take the log on both sides; it doesn't matter to which base. Then $\log f(n) = \log(c \cdot n^{-\alpha})$, and the log of a product is the sum of the logs. Let's take the $n^{-\alpha}$ first: for the log of n to the something, the something comes to the front as a multiplicative factor. So we have $\log f(n) = -\alpha \cdot \log n + \log c$. What does this mean? In our normal plot, what did we do? There n is just the rank: the most frequent, the second most frequent, and so on. People making noise, if it's too loud. So the x-axis was our n, one, two, three, and the y-axis was $f(n)$, the frequency. And in a log-log plot, that's what a log-log plot is, on the x-axis we don't plot n, we plot log n. And on the y-axis, there's a delay here, and it's probably the same reason why everything is so slow.
It's log of the frequency. And what should we see then? If we take this here, this is now our y and this is our x, then what we get, what's written here, is $y = -\alpha \cdot x + \log c$. And what kind of function is this? Linear. And how does it look? Linear in which way? This, this, this? Will it go up, down? So log c is where it will start on the y-axis, so it will start at something positive on the y-axis, and the slope is negative. So we expect it to start somewhere on the y-axis and then go down. And this doesn't mean anything by itself, it's just to verify that the function is really some hyperbolic kind of thing. Let's just try this. I'm not sure about the way to do it, I hope it works: log scale for my x and y. Yeah, so there's some wiggly motion in the beginning, probably some deep theory behind it, but it's pretty linear, so this is more or less proof that it holds. Okay, and we see that. So we have a last part, but before that we just make another break of four minutes or so, and then resume for the last part, which will not be very long. So four minutes break again, then I will meet you again. The last part will not be very long, and it's just to help you with some practical stuff. By the way, I have a CO2 measurement device here and it says schlecht. It has said so for an hour already. It's a nice German word, schlecht; it means bad. So all the opening... yeah, but I think we will survive. So it's just four more slides, and I will show a little bit, so that it's on the recording and you see how you should do it. It's about committing stuff and so on. So there's our course management system, Daphne; there's a link on the course wiki. It's actually not important whether you register via HISinOne, it's just important that you register with us, and it's written on the exercise sheet. It's easy.
You will have access to all the data then. It looks like this; let me just abmelden and anmelden again, that is, log out and log in. So you log in with your uni account. You don't have your own accounts with us, just the uni accounts, and we store your password of course. Just in case. And then you enter your password. So here I have a test user. This test user has participated in former courses. If you haven't participated in a course yet, you will be asked to enter some basic information, and then you will get a page like this with information about your exercise sheet points and so on. Is the sound level okay for you? If it's not okay, close the doors; if it's okay... it's just a few more minutes, so I think we should be fine. And how do you use this? There's a forum, that's very important; let me very briefly show you. Here's the link, for example; the link is in several places. This also takes a while, but this has something to do with this computer and not with the forum. There are no posts yet. You can ask all kinds of questions here; there's a subforum for each exercise sheet. It's really important that you subscribe to this official channel, because whatever is written there is official. So it's your responsibility to read these mails and not miss anything. Writing on the forum: we have some guidelines for it. They are also linked on the wiki. Let me very quickly go through them. There's some stuff which you just have to read in the beginning, and you should absolutely read it because it's important. It's very short, it's just these two. What you should do before asking: think a bit yourself for a few minutes, not too long, don't get frustrated. Google it; sometimes you can just paste the error message into Google and you will find the answer on Stack Overflow. Look in the forum, maybe somebody has asked the same question before. And then ask on the forum.
And this one is in red because it's important so let me also briefly say it. Whenever you have an error on your site and you don't find it and don't spend too much time, spend a few minutes but don't spend half an hour, maybe it's a stupid error and you are not very experienced and this can be super frustrating and we see it or somebody else in one second. Then you should just ask. It's important how you ask, it's written here, always proper information, copy and paste the error message, also the relevant code, not all your code of course, and always, always that's really important whenever you have a problem, make sure that the code causing the problem is in our repository, I will talk about our repository, the SVN in a second. But this does not mean that's equally important that you just put your code, whatever it is, 200 lines in your repository and you say it doesn't work, please look at it. That's not how it works. It's just a backup. Yeah, you always put your code there, then you ask a question with the error message and maybe an excerpt from your code and we try to help you or someone else from the course with that information. But as a backup we always have your code and we can look at it because sometimes it's actually easier to look at the code. But it's only the backup, it's not here's my code please find my error. It does not work like this neither on Stack Overflow nor with us. Yeah, so please read this, just one page and then, yeah. We will usually answer very quickly, which is important, otherwise it's also frustrating. If for some reason, and let me also say this right away, you need help, some people, maybe you feel like, oh, I understand so little, I don't even know how to ask a proper question, you have a tutor, you can ask them, can we meet, can you explain to me a few things? And of course, we have the Q&A sessions. So we don't have Git, we have subversion. 
There will be the usual question, and actually there's a post on the forum from past years where I explain this: yes, we also use Git for all our professional projects, but for the course SVN is actually easier. Git has quite a learning curve and quite a few things which are really complicated, and you don't need them. So it's a repository; you can just upload things. You have versioning, you have the history of everything you ever did. And what you basically need is: here's a new file, here's a new version of a file. So you add something, you commit something, or: give me the latest version of the files from the server, which for example is feedback from our tutors. So SVN is pretty easy and for those purposes just as good as Git, no reason to use Git there. And that's why we still stick with it. You have a complete history. It's written in the rules, which you should also read. You can also use this as a backup while you work on your exercise sheet. Any time you can just commit, whether it works or not, whether it's complete chaos or whatever. What counts is your last submission before the deadline. So feel free to use this also as a backup: now I have done something, let me just commit it to the server and go for lunch or a walk or whatever. That's completely fine. You can commit as often as you want. You find a short tutorial on the wiki. And let me just very briefly show you, so that you have it. So here I have my test user. Here, I think I've logged in. Here's the link to my... so that's what's currently on the server, and it's empty. I have nothing here. And now let me make a directory here, and let me pretend it's for the first exercise sheet. It should be called sheet-01, it's written on the exercise sheet: not Sheet with a capital S, not without the minus; you should name it exactly like this.
And let's now go to sheet 1, and now let's copy my files which I've written in the lecture here. And I've written, oh yeah, this Makefile, I will briefly explain it, and the inverted index. Let me just copy inverted_index.py, which we have written together, and we also have this example TSV here. And let me now add them. SVN add means, and if I add a whole directory it will add all files in the directory, here is something which I intend to upload for the first time. So after add it is not on the server yet, it went way too fast for that. It's like scheduling for upload. That's what add does in SVN. So it's scheduled for upload now, and if I actually want to upload it, I can do commit. I don't have to name the files again because these are already scheduled. So I now commit these files from lecture 1, under the wrong name sheet 1, it's actually lecture 1, with some meaningful comment. Now I upload them to the server and now you should see them here. Now they are here, now they are on our server, you can do that. Now if I make a change and I do commit, it will just commit the changes. Yes? In the commit window, how can the window be closed? Like the actual commit message? The editor, you're talking about the editor which pops up? How it can be closed? Actually, I think the problem is, ah, now I know. There's an environment variable VISUAL which tells SVN the default editor, and maybe on your system it's a strange editor, one where you don't know how to exit it. With this environment variable, if you set it to whatever you want, you can configure the editor that will be used. So for me it's vim right now, and vim you exit with :q or :wq. It depends on the editor that opens, and then you have to leave that editor with a write and quit.
So now I have the files here, they are uploaded now, my three files, and more than that, if I go here to my overview page of the test user, I will see it did something automatically. Let's look at what it did. So every time you commit something, and you can ignore that for intermediate commits, it will try to do something with your code. It will compile it. And that's actually what's written here in the Makefile, which I didn't show you so far. The Makefile just tells it how to compile, and you should always take the same Makefile for all exercise sheets in the course. Now Python doesn't really need compiling, it's just checking the syntax. It will run your unit tests. I don't have any unit tests here, we'll show that in the next lecture and you will see it in the exercise sheet. It will check the style, style errors. flake8 is the program for that, let me just run it, and here we see a problem. So I had two style errors. In line 38, I had trailing whitespace. That's of course a terrible sin. I have two spaces at the end. The style checker does not like this. Let me just remove it. It's still there. Oh yeah, I see, because I copied these files. So let me just very quickly go to the right place, no, no, I have to go to my test user. Test user, sheet 1, inverted_index.py, line 38. Here we are. But you see the purpose of this automatic build system. It will tell you there's something wrong. So now I remove this. Now it tells me it should be "not in". Interesting. So what I wrote is actually correct, but the style checker says it's better practice to write not "not x in y" but "x not in y", like this. Nice. So now I have changes, so if I do status now it will tell me M, which stands for modified. I can just commit them. Actually a short form for commit is just ci, and I just write what I did: fixed checkstyle errors. Now I upload them.
Now I go to my system. Now I have the latest files, and now it should build automatically, it should check everything automatically, and it does. So there is a current build running now, it says here, yeah, it's running right now, let's see if this works. Sometimes it needs some time. The most recent one is at the top, so you can see I already did a fair bit of testing here. Try it yourself, I think I won't show any more about this now. If you have problems with this, just come to the Q&A on Friday and we can help you with that. But essentially what it does is it will check it out on our server and just try to see if everything compiles, if the tests work, if everything is all right. This is what I just showed you. Let me just see, we are almost done. So there we are. So now it did check for syntax errors, ran it, checked the style, everything is fine. This is important because just because it works for you doesn't mean that it works for us. Maybe you forgot to commit a file or something. So if you have a green arrow here, then you know that what you uploaded also works on our side. Before we leave, very quickly, and then a short opportunity to ask questions: the exercise sheet. Just give me one more minute. Everything is a bit slow here. We have seen it already, and please do ask a question if you have one. It's basically implementing this very simple search engine here. Deadline is noon next week. Register on Daphne and so on. It tells you exactly how to commit and everything. Oh, I have a final poll to make, and while the poll is running, last opportunity for asking questions. Yeah? What are the rules about importing packages? So for example, NumPy or pandas. Yeah, that's a very good question. The question is what are the rules for using stuff from other libraries, importing packages. The rule is: later we will use NumPy and stuff. For now you shouldn't.
For the first lectures just ask, you don't really need anything else except for something like sys or regular expressions. So if you intend to bring in some monster library, I don't know, an information retrieval library or search engine stuff, you probably shouldn't, or you should ask us. Actually it's written on the exercise sheet for intersecting two lists: you should not use a library which intersects two lists. You should write it yourself. That's the point of the exercise. So when in doubt, ask. In the later lectures, you can use some stuff like NumPy. Any other questions? So here we have some awesome programmers, one fifth of the audience are awesome programmers, okay? And some others are not. And we have 20% of people who said, I don't mind the math, but we have no one who says, leave out the coding, just do pure theory. So you are all eager to do some coding, that's nice. Any other questions before we close for today? Okay. Oh, there's one more. Yes. I wanted to know if and when the recordings are provided. If and when the recordings are provided. We try to do this as quickly as possible, which means right after this, our cutter, our semi-professional editor, will start his work, but it takes a few hours because we do some post-processing, we add time stamps. It's usually on the same day but in the evening, maybe late evening. But we try to have them ready on the same day, so pretty quickly. Any other questions? So thank you, that's it for today, see you next week.
Welcome everybody to lecture two, information retrieval in the summer semester 22-23. Due to global warming there are now two summer semesters, one in the summer and one in the winter, in case you didn't know. So I have prepared another exciting lecture for you today. We will first say something about your experiences with the first exercise sheet, which was about the inverted index.
There's another Q&A. Because there's a holiday next week, you have two weeks for the next exercise sheet, and so the Q&A will be not this Friday, but Friday next week, so please come if you have any questions or problems. I'm just realizing that I have trouble figuring out where to look: at the camera, at you, at this screen, or at that screen. So right now I have to commit to where I look. And then we will talk about ranking today and about evaluation, and the exercise sheet will be something I will talk about as we go along. So first your experiences. Most of you found it an interesting exercise, a good start, quite doable, a welcome opportunity to refresh your rusty programming skills, and there are a few with more problems, and I have a slide for those too. Here are some quotes from your experiences. Thank you for giving us feedback. It's always very valuable to have. Good start, hands-on experience. Yes, that's what the whole lecture is about. Coding abilities more rusty than my bicycle chain. Yes, I think many of you had that problem, but many of you said, yeah, it's a little bit rusty, but that's a great opportunity to refresh. Another comment in this, the two cool alarm sounds indeed. And it should be mentioned that Python is a prerequisite. I wouldn't say it's a prerequisite. I mean, you don't have to use Python, but it certainly helps, and maybe you can use this as an opportunity to learn it. You should certainly know some programming language at this point. I have an important slide for some among you. There are 10% or so, and that's just normal because everything in life is like a normal distribution, who have more problems than usual, and we can read it from the comments. So there are about 10% for whom already this first, relatively simple exercise sheet was a lot of work, and I have a slide for you and some messages, because I think it's important. We completely understand it's not pleasant to be among those 10% who have it hardest.
But yeah, that's just life that there are these 10%. I want to emphasize that we are very happy to help and support everybody, including those who have more problems. And here's the but. I mean, this is really important: if you want to stay, and you're welcome to stay, I think you just have to accept that it will be more work. The amount of work and the difficulty are just adjusted to the 80% or so in the middle. So it will be more work for you, but you can accept that and say, yeah, I have more problems due to whatever reasons, personal circumstances, maybe I'm missing something from earlier lectures which I now have to learn. That's okay, but then you should just plan for twice the work or twice the time or something. I think that's really, really important, and it's also important because there's a strong correlation, I see it in the feedback you give and in the quality of the exercise sheets, between the difficulties people have and when they start committing something to SVN. The latecomers are often those who have more difficulty. So if you have trouble, what was that? Ah, the windows, okay. You should start on time. The more difficulties you have, the earlier you should start, and you should reserve enough time. And we welcome your feedback, but please don't blame us, but tell us how you are feeling and how we can help you. Please don't tell us what you think we are doing wrong. I think especially if you have difficulties, that's the wrong approach and not very helpful. So I hope that helps, and now let's go on with the feedback on the exercise sheet. So it was about a first, very simple search engine, and the master solution is online now. So here it is. It's just one sheet for now, the first one, it's there. Here's the master solution. Everything is perfect. The tests go through, checkstyle errors, none. And let's just, I have tried this already, try it out on the movies dataset.
So that was what the first exercise sheet was about. You start it and now you can ask queries. So I don't know, maybe we are looking for the Matrix movies. I type matrix and now I get three hits, with the title and the description. We see the first one is The Matrix, the second one is not The Matrix, but it's a movie which mentions Keanu Reeves, who also played in The Matrix. So that's something which you frequently encountered, and it was also part of the exercise sheet. Play around. When you talk it's very loud up here, so I think you should not, otherwise it's too loud. Thank you. So here are some examples. The task on the sheet was to figure out what works well, what does not work well, and why. So this is an example of a query that works very well. Why does it work very well? Because it's two words which are extremely specific. I mean, Shawshank probably occurs in no other movie description except in that of the movie, and it's the most popular movie. So if it matches, it will be number one. Nolan: if a description mentions Nolan, it's Christopher Nolan, because it's not an English word otherwise. And it's such a unique name. So these things worked well. You figured that out. Pokémon works super well because of the accent aigu, and it's super specific. Lord of the Rings, that's interesting, why does that work? And it was a really good exercise because there were obvious things, but also subtle things. So let's try Lord of the Rings. And note: "of" is super frequent, "the" is super frequent, "lord" is also not so specific, "rings" is also not so specific. You have lots of movies with rings, yet the first three are exactly the Lord of the Rings movies. And why is that? It's not because the words occur so frequently, which you can see here in the highlighting, but it's just because the movies are so popular in the ranking, right? That's why you get them first.
So in this case the words are super unspecific, but because these are among the most popular movies, you get them first. And here are examples of other movies. 2001, you don't get it because we didn't index any numbers. And in practice, when you use other search engines and you get angry, it's usually because they're making these stupid mistakes, like not indexing certain words, so you don't find it. Titanic. Titanic is not as popular as other movies which also mention the word. Harry Potter has the same problem. We could try it out. Here's an example of a word where you forget the accent aigu and you don't get the hits, and so on. So you figured all that out, and as I said, obvious things but also many subtle things. Okay, this is just one general comment. One little twist compared to the lecture was this: in the lecture, when a word occurs in a document multiple times, you just have the record ID in the inverted list multiple times. And I had that to show you Zipf's law. In the exercise sheet you were not supposed to do that. That is, if a word occurs multiple times, the record ID should appear only once. And the obvious way to do it is this code. So if it's not already in the list, append it. If it's in the list, don't append it. What's the problem with this code? Who can tell me? I mean, innocent-looking code, natural-looking code. What's the problem with it? It's linear time? The not in? Yes, thank you, that's exactly it. So let me repeat it for everybody and also for the Zoom audience. So the if condition, not in, that's searching in whatever you already have as inverted list, which might already be very long.
Not in does not know that it only has to look at the end to see whether the record ID is already there, but it will search the whole list. And you are inside a loop, which means as you go along, for every addition to the list, you will search basically the whole inverted list that's already there, and that gives quadratic behavior. So hidden quadratic behavior, it happens so often. And yeah, that's what's written here. So here's a super important piece of advice, because there were already questions about this. Can I use this function? Can I use that function? There are two aspects here. One is, of course you shouldn't use built-in functions or a library which solves the whole exercise sheet for you. That's number one. But number two is, when you use functions from a library you should always ask yourself: what am I doing here, what's the time complexity? And this is an example of a natural use which is just terrible. And you see that a lot in code, also in code from products, companies and so on. So it's a very, very frequent problem. So let's look at the chat. We have some questions there. By the way, you're very welcome to use the chat, also people in the room, for asking questions in between. Maybe Natalie answers them, or somebody else wants to answer them, or I answer them. So people on Zoom use the chat, everybody in the room can use the chat. We can just do it in parallel. So that was the introductory part, about 10 minutes as planned. Now we go to the first part of today's lecture, which is about ranking. So far we completely disregarded ranking. I mean, what we did is, when there are many hits, you just output those which come first in the input file, which was already not that bad because the input file was sorted by popularity, but we have already seen there are problems with that. So the motivation is obvious: you have a lot of results, millions if you search on the web, and you want the most relevant ones first.
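The hidden quadratic behavior of the not in check can be avoided by looking only at the last element of the inverted list. Here is a minimal sketch in Python, under the assumption that records are processed in increasing ID order, so a duplicate can only sit at the end of the list. The function name build_inverted_index and the simple tokenization are illustrative choices, not the master solution.

```python
import re
from collections import defaultdict


def build_inverted_index(records):
    """Map each word to the sorted list of record IDs that contain it.

    Records are processed in increasing ID order, so if the current ID is
    already in a word's list, it can only be the last element. Checking
    that one element takes constant time, whereas `record_id not in lst`
    scans the whole list.
    """
    inverted_lists = defaultdict(list)
    for record_id, text in enumerate(records, start=1):
        for word in re.findall(r"[a-z]+", text.lower()):
            postings = inverted_lists[word]
            if not postings or postings[-1] != record_id:
                postings.append(record_id)
    return dict(inverted_lists)


index = build_inverted_index(["the matrix the sequel", "the lord of the rings"])
print(index["the"])     # [1, 2]
print(index["matrix"])  # [1]
```

Each word occurrence now costs constant time, so building the index is linear in the total number of word occurrences.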
And the question is, how do you measure what's relevant? Relevant, if you think about it, is a very ill-defined and also subjective notion. What's relevant for my query? My search engine has to guess what I want. So here's the basic idea. These are our inverted lists, and here I have two different ones, university and Freiburg. This is what we had so far, the blue things, just the IDs of the documents or text records containing that word. And now we also have a score, something between 0 and 1, it doesn't have to be, but let's do it like that for now, which somehow says something, we will see what in a second. And when I merge these lists, or intersect, today we will merge, I will just sum up these scores. So for example let's look at 53. 53 occurs here and there, with 0.2 and 0.1, so it's 0.3. So by aggregating. I deliberately wrote aggregation, but here it's the sum. It could also be something else, it doesn't have to be the sum, it could also be the average or the maximum, you somehow aggregate them. If a document only occurs once, like 17, which does not occur in the Freiburg list, it's still in the result list, but then just with the one score. And then we sort. Now we just sort by the scores. So now 127 is first. Why? Because it has a large score in both lists. And now one important thing to note, let me just write this here: a document containing only one of the two words, and this is something we didn't have last time, can be ranked pretty high. What's the first document in the sorted list that contains only one of the two words? 34. 34, yes. So a document like number 34 can be ranked higher than documents containing both words, and that's very important. And I'm writing that down so that you have some time to think about it, because it's really important.
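The merge with score aggregation described above can be sketched as follows. This is a hedged illustration, not the exercise solution: the function name merge and the concrete scores are made up for the example, except that 53 has 0.2 and 0.1 as in the lecture, 17 occurs only in the university list, and 34 occurs in only one list.

```python
def merge(list1, list2):
    """Merge two postings lists sorted by doc ID, summing scores of shared IDs.

    Unlike intersection, postings that occur in only one list are kept.
    """
    result = []
    i, j = 0, 0
    while i < len(list1) and j < len(list2):
        id1, score1 = list1[i]
        id2, score2 = list2[j]
        if id1 == id2:
            result.append((id1, score1 + score2))
            i += 1
            j += 1
        elif id1 < id2:
            result.append((id1, score1))
            i += 1
        else:
            result.append((id2, score2))
            j += 1
    # One list is exhausted; copy whatever remains of the other.
    result.extend(list1[i:])
    result.extend(list2[j:])
    return result


# Hypothetical postings (doc ID, score), sorted by doc ID.
university = [(17, 0.5), (53, 0.2), (127, 0.8)]
freiburg = [(34, 0.9), (53, 0.1), (127, 0.7)]

merged = merge(university, freiburg)
ranked = sorted(merged, key=lambda posting: posting[1], reverse=True)
print([doc_id for doc_id, _ in ranked])  # [127, 34, 17, 53]
```

With these made-up scores, document 34 contains only one of the two words but still ranks above document 53, which contains both.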
I have to write these words again because something went wrong. And it's something which for example Google didn't do until about ten years ago. So you might think that's obvious, but as soon as you do that it becomes risky. Now you're omitting some words, you have to omit the right words, you get a lot more hits. So in the first ten or twenty years of web search, depending on when you start counting, search engines would not do that because it was too risky. But here, when we merge, and we do that from now on, we do it. So these entries here are now more complex, it's a doc ID and a score. We call them postings, for whatever reason, and they can also contain more information. So this is now a whole record. For example, it could also contain the information where the word occurs in the document, at which position, and it can be more than one position. But for now, and I think for most of the rest of this course, it will be just the document ID and some score. Okay. That's just a small remark. So this is merging. This is basically the same algorithm as last time, which we used for intersect. With a very small twist, you can turn it into a merging algorithm. You just also write out the things which only occur in one list. It will be part of the next exercise sheet to implement it, but it's really only a change of one or two lines in the code. So you get this merged list. How do you get the sorted list? Well, you sort it. So that takes n log n with a typical sorting algorithm. Typically, you only want to display the top k hits, like the top 3 or so. So then you can do a partial sort. And a partial sort is faster than a full sort. If you only want the top k in sorted order, that works in n plus k times log n. For example, one popular sorting algorithm is heapsort, where you put everything in a heap, in a binary heap, and that can be done in linear time if you do it carefully. Watch the algorithms and data structures lectures, I explain it there, or Fabian also explains it there.
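The heap-based partial sort can be sketched with Python's standard heapq module. This is an illustration under the assumption that postings are (doc ID, score) pairs; the function name top_k and the scores are made up for the example.

```python
import heapq


def top_k(postings, k):
    """Return the k postings with the highest scores, best first.

    heapify builds a binary heap in linear time, and each of the k pops
    costs O(log n), which gives O(n + k log n) overall.
    """
    # heapq implements a min-heap, so negate the scores to pop the
    # largest score first. Ties are broken by the smaller doc ID.
    heap = [(-score, doc_id) for doc_id, score in postings]
    heapq.heapify(heap)
    result = []
    for _ in range(min(k, len(heap))):
        neg_score, doc_id = heapq.heappop(heap)
        result.append((doc_id, -neg_score))
    return result


# Hypothetical merged postings (doc ID, score).
merged_postings = [(17, 0.5), (34, 0.9), (53, 0.3), (127, 1.5)]
print(top_k(merged_postings, 3))  # [(127, 1.5), (34, 0.9), (17, 0.5)]
```

The standard library also offers heapq.nlargest, which returns the same kind of result with a different internal strategy.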
And then you just pop from the heap, give me the smallest, give me the smallest, and a heap operation costs log n time, and you do it k times, so you get this running time. And you can also modify quicksort, so this recursive divide and conquer thing, to get that complexity. So it's linear plus k times log n, which is for small k much better than n log n. In particular, if you have constant k, it's linear and not slightly super-linear. You don't have to do this, but maybe you have fun doing this, then you can implement it a little faster for exercise sheet two. Just so you know. Now where do these scores come from? Here I just wrote some scores, and that's what the rest of this first part will be about. Where do these scores come from? What are good scores? Why does it say 0.8 here and 0.1 here? Well, there we have them again. What the score should somehow measure is how relevant the occurrence of university in document 17 is. So how relevant is it? And think about what I showed you earlier, for example Matrix. That was a good example. So this is The Matrix, it's the first movie, and every mention of Matrix here is very relevant for what this document is about, right? This document is about the movie and this talks about the movie. If you take the second movie, it also mentions Matrix here, but the word Matrix is not really relevant for this document. It's mentioned here in passing because Keanu Reeves is in that movie and he also played in The Matrix. So this occurrence of Matrix is much less relevant in this document than here. It should get a lower score here compared to there. And here you already see a hint of how you could do it, which is not always true, but that's one heuristic to do it. Matrix here occurs a lot of times, right? And here it occurs only once, which is one indicator. And this will indeed be our first heuristic.
Just look at how often the word occurs. If it occurs really often, then probably the document is about it. Okay, so term frequency, that's the simplest heuristic and the first thing we do. And here's the problem with that. So let's do it like that. Here we have term frequency counts, so what it means now: in document 57, whatever it is, the word university occurs five times. The word Freiburg occurs three times. The word of, which doesn't mean anything, occurs 14 times. In document 123, it occurs 23 times for whatever reason. It's just a preposition, right? You have a lot of "of" and "the". Let's just see that again so that it's super clear. The Lord of the Rings. Yeah, I mean, everything is highlighted here. You have a lot of "the" here, of course, when it mentions the Lord of the Rings, but also "the", "of", "the". These are just frequent words, right, they just occur. It doesn't really mean anything how often they occur. But now let's just do what we did earlier with these scores. Let's just sum them up. So for document 57 we just take these as scores: 5 plus 14 is 19, plus 3, 22. And for document 123: 2 plus 23, 25, plus 1, 26. So now document 123 gets ranked before document 57, right? But if you look at it, 57 contains the meaningful words, university and Freiburg, much more often, and 123 is only ahead because of the meaningless word of. So actually it looks like 57 should be first, right? And not 123 just because of "of". That's the point of this slide. And this is what we solve now. What we now do is, for each word, we also try to figure out: is this a word which means a lot, or just a fill word like "of" or "the" which occurs everywhere? There's actually an easy way to do it. You just count, for every word, in how many documents it occurs overall. For a reason that will become apparent in a second, I'm taking powers of two here. It doesn't have to be powers of two, it's just so that my example calculations give a nice result.
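The problem with raw term frequency can be reproduced in a few lines, using the counts from the example above. The dictionary layout and variable names are illustrative choices.

```python
# Term frequencies from the example: tf[doc_id][word].
tf = {
    57: {"university": 5, "of": 14, "freiburg": 3},
    123: {"university": 2, "of": 23, "freiburg": 1},
}

# Score of a document = sum of the raw term frequencies of the query words.
scores = {doc_id: sum(counts.values()) for doc_id, counts in tf.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

print(scores)   # {57: 22, 123: 26}
print(ranking)  # [123, 57]
```

Document 123 comes out first only because of its 23 occurrences of the word of.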
16,384, every computer science student should know this number by heart. It's two to the which power? 14 maybe. Other opinions? So let's start with 1024. 1024 is also a number every computer science student should know. It's 2 to the? Yeah, it's 2 to the 10, which is very useful if you want to convert between powers of 2 and 1 million or so. One million is approximately 2 to the 20, because 1000 is approximately 2 to the 10. So this is 2 to the 10 and this is just 2 to the 14. How about this one? You can also deduce it. By the way, doing these calculations in your head is great for math. 19, yeah, because 1 million is two to the twenty, so you can come from that side. Very good. So we have two to the nineteen here, and we will use that in a second. So Freiburg is super specific in my example here, university is mentioned more often, right, because there are a lot of universities, and "of", yeah, it just occurs a lot. Let's say we have a million documents, two to the twenty. And now what we compute is the inverse document frequency. Inverse document frequency is just: if a word is very frequent, I want it to be small. Let's just look at the IDF of "of". Let's take this one first. "Of" occurs in two to the 19 documents, and here we have two to the 20. So that's log, if we take the formula which I've written there. Log... Do we have a question, or is that just somebody? I think somebody should mute their microphone, or can you mute them? I think you can find out who it is. So the formula is log 2 of the number of documents, which is 2 to the 20, divided by the number of documents in which this word occurs, 2 to the 19. 2 to the 20 divided by 2 to the 19 is 2, log 2 of 2 is 1. So it's really small. Let's look at Freiburg. So this will be log 2 of the total number of documents, which is 2 to the 20, divided by the frequency, which in this case is 2 to the 10.
So it's log 2 of 2 to the 10, which is 10. And university, it occurs in 2 to the 14 documents, so somewhere in between: 2 to the 20 divided by 2 to the 14 is 2 to the 6, and log 2 of that is 6. So you see, if you just look at these numbers again, just as numbers, six, one and ten, then what this says is: this is a super frequent word, ignore it, so a very low IDF. This is a super specific word, Freiburg, so it gets a relatively large number, and university is somewhere in the middle, so it gets some in-the-middle number. And why do we use the log 2? That's an important thing which we will see again on a later slide. If you would use these numbers directly, the differences are just too large, right? I mean, here you have half a million and here you have one thousand. The difference is just too big. You somehow want to dampen the difference, and when you want to do that you often use the log. There are also theoretical reasons for that, but we will come back to that later, not now. If you look here, the numbers are 6, 1, 10. Still a difference, but not so big. Over there the difference is a factor of 500, that's too big. Now let's just use these scores, and let me just write down the IDF scores again. So the IDF of, what was the first one, university, was? Let's play a little memory. Six, very good. The IDF of "of" was one. Very good. The IDF of Freiburg was ten. Perfect. So those were the numbers. And now let's just use them here. So this was our TF example, exactly, and let's also write here how we got it. So this is now 5 times 6, right, 30 is 5 times 6. That's just the TF score times the IDF of university. And this is just 14 times 1. So you see, we have this high number from the frequency here, but now it gets multiplied only by 1. You see what's happening. Here we have 3 times 10. So Freiburg occurred only 3 times, but it's a very important or meaningful word. So this is three times ten, and here we have two times six.
Twenty-three was a large frequency, but now it's multiplied with a small IDF. And here it only occurs once, but in a very meaningful way. One times ten. And now we sum up the scores: 30 plus 14 plus 30, this is 74, and 12 plus 23 plus 10, this is 45. And now, as it should be, 57 will be ranked first. And that's the whole point of TF-IDF. And I spent quite a bit of time on this because TF-IDF and its variations is probably one of the most important scoring formulas. You do not only find it in the search engine context, but whenever you are in a context where you want to rank stuff you will use something like TF-IDF. It's such a simple idea, but a very fundamental idea. You take the frequency on the one hand, and then something like the inverse document frequency to tell apart features which are more important from features which are less important. So now comes a refinement. Here are some problems when you use that in practice. The IDF part, let's just say for now, is fine. This can also be refined. The TF part has some problems. Here are two problems. Now assume a document is longer, let's say twice as long. If a document is twice as long, it will contain every word twice as often on average, right? So that's kind of unfair. Maybe the word matrix occurs in some document twice as many times just because the document is longer. You have short documents, you have long documents. You should somehow take that into account, especially if you have documents of varying length. And now there's another thing, and that's similar to these large numbers which we have seen for the inverse document frequency. If two documents have the same length and a word occurs twice as many times in one of them, yeah, maybe then the word is a little bit more relevant for that document, but maybe not twice as relevant. Maybe that's too much. Maybe just 1.2 times. That's very similar to what we have seen here.
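The TF-IDF calculation from this example can be checked with a short sketch. The document frequencies, the collection size of 2 to the 20 and the term frequencies are the numbers from the lecture; the variable names are illustrative.

```python
import math

N = 2 ** 20  # total number of documents in the example

# Number of documents each word occurs in.
df = {"university": 2 ** 14, "of": 2 ** 19, "freiburg": 2 ** 10}

# Inverse document frequency: log2(N / df).
idf = {word: math.log2(N / count) for word, count in df.items()}
print(idf)  # {'university': 6.0, 'of': 1.0, 'freiburg': 10.0}

# Term frequencies per document, as in the earlier TF example.
tf = {
    57: {"university": 5, "of": 14, "freiburg": 3},
    123: {"university": 2, "of": 23, "freiburg": 1},
}

# TF-IDF score of a document: sum of tf * idf over the query words.
scores = {
    doc_id: sum(count * idf[word] for word, count in counts.items())
    for doc_id, counts in tf.items()
}
print(scores)  # {57: 74.0, 123: 45.0}
print(sorted(scores, key=scores.get, reverse=True))  # [57, 123]
```

With the IDF weights, the 23 occurrences of the word of contribute only 23 to the score of document 123, and document 57 is ranked first.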
So yes, that's the same rationale as here. "Of" is 500 times more frequent, but it's not 500 times less important because of that, right? We should dampen this somehow. We did it with a log 2 here, and that's a similar thing we want to do now. And here's a formula which does that, which I will explain now in the next three slides. This is a variation of the TF-IDF formula. I just replaced the TF by a TF star, to be precise by this here, and we will now understand this funny formula, and you will also use it in the exercise sheet. So it's now TF like before, but multiplied with something. We don't have to understand it right now. There's a k here, there's an alpha here, and we will now understand why we do that. So this DL is the document length, the average document length plays a role, and b and k are just parameters. Then there are some magic settings. If you don't know how to set them, set them like this, and you will play around with them for the exercise sheet. Let's just look at two special cases to understand the formula a bit better. So if you set k equal to zero and b equal to zero. First, if you set b equal to zero, then this alpha value will be what? What's this alpha here, whatever it is, if you set b to zero? I'm confused, there's something wrong with this. Yes, I want to keep my ink annotations. This is not supposed to be there, right? Yeah, this was a copy and paste error because of last-minute changes. And Frank, I don't know if you are listening, but the machine here is still super slow. So alpha is what for b equal to zero? One, yeah, that's correct. That's equal to one. Okay. And then TF star is just TF times, let's write it like this as a fraction, k plus 1 divided by k plus TF. So if we have k equal to 0, what we get is TF star is equal to TF times 0 plus 1 divided by 0 plus TF, which is just TF divided by TF, it's 1. And this is what we did in the first lecture, right?
Whenever the word occurs, we just write, we just count it as 1. We don't differentiate between different frequencies. So let's now look what happens if we let k go to infinity. So that's t f times k plus one divided by k plus t f. Now how would you do we do that? Well, now we have k goes the denominator, numerator and the denominator go to infinity. Let's just write at the bottom that's always a little bit harder. Let's just divide both of them by k. So here we have 1 plus 1 over k. I divide the numerator and the denominator by k. And here I have 1 plus tf divided by no, tf divided by k. Okay. And now you can tell me what's the, if k goes to infinity, 1 over k goes to 0, tf over k goes to 0, and this will be just tf. Yeah? So this tf star, the larger you choose k, the more it will be the normal tf. Any questions about this for now? Okay, and now let's... why this formula? Now I want to explain why this strange formula, where does it come from here? And I will do it this way. I will now name three properties which I think are pretty natural to have. And then I claim that's just the simplest formula which has these properties. So what's a natural property? You want to replace tf by something. If tf is zero, then your modified thing should also be zero. And let's just verify this on the... So let me write my formula here again. So that's Tf times k plus one divided by k times alpha plus Tf. So if let's just, if this is zero, k let's look at this part here. This part is not zero. I mean this is not zero, this is not zero for, at least for the non-binary case, when k is greater than zero. So this whole thing can only be zero if the first factor is zero. So that's pretty easy. Okay now you want to modify it in a way we already said this you somehow want to dampen the effect of if if you have twice as many occurrences then you want this modified thing not to be twice as large, just a little bit larger. But it should still increase, yeah? 
It shouldn't be worse if the word is mentioned more often. So let's just look at that. Why does TF star increase as TF increases? Let's just verify this for this formula, and let me maybe write it down here so that it's not so crowded. Let me put a marker here and argue here. So how can we see this, and I can tell you these are popular exam questions. So TF star, let's just write it again, it's TF times k plus one divided by k times alpha plus TF. And now we are wondering, if TF increases, what happens to this? Now TF occurs here in the numerator and in the denominator, and that somehow makes it hard. So let's just divide both by TF and then it becomes a little bit easier to see. So we divide by the TF; the TF is up here in the numerator, so only k plus one remains there, and now we divide by TF at the bottom, so what we get is k times alpha divided by TF plus one. And now we have it only once in that formula. And so, yeah, now it's in the denominator, and there it's again in the denominator. So if TF increases... And please do ask if it's not clear or if there are any doubts. Yes? If it's only in the denominator, shouldn't it get smaller if the number gets bigger? Yes, but it's in the denominator of the denominator. Yeah, it's divided by, maybe I should have written that more clearly. So it's something like, yeah, let me just write it here in a simpler form. You have something like 1 over 1 plus 1 over x. So now if x increases the whole thing increases. It's in a denominator, but that denominator is again in a denominator. And there's still a problem with system resources on this machine, which is why the writing is slow. But we will cope with that. So now let's look at the third property. If the term frequency goes to infinity, becomes larger and larger, our TF star shouldn't become larger and larger, it should hit a fixed limit. Let's just verify that it does that and figure out what the limit is. So now we want to find the limit for a term that occurs super frequently, and same formula here, TF times k plus one.
I'm sorry for the delay, it's also a little bit annoying for me but I can't help it. Now we have the same problem: we have the TF in the numerator and in the denominator, and it goes to infinity, but we can do the same thing. Let me just write it again. TF goes to infinity, and let me just use the same formula here as above, and then you can see what happens. K plus one divided by K times alpha divided by TF plus one. Well, and now you can see what happens. This goes to zero because it's divided by TF, and then you have k plus 1 divided by 1, so it's k plus 1. So this formula which we have seen before goes to k plus 1, so that's essentially the meaning of the k parameter. It essentially determines the limit. So let's just recall these three properties again. Whatever you replace the TF with, it should be zero if the TF is zero, and only then, very natural. It should increase for a word that occurs more frequently, and if the word occurs super often it shouldn't go beyond a certain limit which you set, which is a parameter. So here we have the standard setting 1.75, that says however frequently a word occurs, the TF star will be at most 2.75, that's what it says. And yeah, now you want a formula with these properties and that's as simple as you get. I mean that's just the simplest formula. You tell me if you come up with a simpler one. This is not a proof now that this is the simplest formula, but yeah, it's a simple formula, and if you want to have these three properties, I don't think you can do it with a simpler formula. And we have an alpha here. What does that mean? Well, I can have an alpha here, and let's just see how we can make use of that. So I can put any number alpha here and I still have all these three properties. So we have a free parameter which we can also put to good use, that's what the second slide is about. How do we use that parameter? Well, let's rewrite this formula again a little bit.
By just doing it like this, by dividing by alpha again, both numerator and denominator. Then we get TF divided by alpha here. And here we have this divided by alpha, which is K, and TF divided by alpha. So what we have is, it's like replacing TF by TF over alpha. So it's like normalizing the TF. If alpha is two, just divide the term frequency by two. That's the effect of the alpha, dividing the term frequency by something. So what is done by this BM25 formula? You normalize by the document size. You compare the size of the document to the average size of a document. So if a document is twice as long as, is that myself or somebody else making these noises? So if the document is twice as long as the average document, then this thing here would be two. So you could just say, okay, it's twice as long, so I divide the term frequency by two, but that's a bit too extreme. So you do this thing which is some in-between thing. If you set the B to one it would be exactly this extreme case. If you set the B to zero, which we have seen earlier, alpha is one. Let me maybe also write it on the formula in case, yeah, it's not here. So for B equal to zero you don't normalize, and for B equal to one you do the full normalization, then you have DL over average DL, and you typically do something in between. We have seen an example B. So these are two parameters to play around with, K and B. That's the formula. It's amazing, if you go to the literature, how much literature you find about the theory behind the BM25 formula. I don't find it very convincing. It fills a lot of pages. I think these two pages, you don't find that anywhere. I haven't seen that in a textbook so far. For me, that sums it up why this formula is so successful. Why did I pick that formula and why is it called BM25?
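The TF star formula and its three properties from the last two slides can be checked numerically. The following is only a sketch: the lecture states the two end points of alpha (1 for b = 0, DL / AVDL for b = 1), and the linear interpolation between them used here is the usual BM25 choice, taken as an assumption. The default b = 0.75 is likewise an assumed value; only k = 1.75 is mentioned in the lecture. The function name tf_star is made up for this example.

```python
def tf_star(tf, k=1.75, b=0.75, dl=1.0, avdl=1.0):
    """Dampened, length-normalized term frequency: tf * (k + 1) / (k * alpha + tf)."""
    # Assumption: alpha interpolates linearly between 1 (b = 0) and dl / avdl (b = 1).
    alpha = (1 - b) + b * dl / avdl
    if tf == 0:
        return 0.0  # property 1: zero stays zero (also avoids 0 / 0 when k == 0)
    return tf * (k + 1) / (k * alpha + tf)


if __name__ == "__main__":
    # Special case k = 0: binary scoring, every occurring word counts as 1.
    print(tf_star(5, k=0))
    # Special case k very large: tf_star approaches the plain tf.
    print(tf_star(5, k=1e9))
    # Properties 2 and 3: increasing in tf, but never beyond k + 1 = 2.75.
    for tf in (1, 2, 4, 8, 1000):
        print(tf, round(tf_star(tf), 3))
```

With b = 1 and a document twice as long as the average, alpha is 2, so tf_star(4, b=1, dl=2, avdl=1) gives the same value as tf_star(2, b=1, dl=1, avdl=1): the term frequency is effectively divided by alpha.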
It's because these people or other groups working on it, they tried out a million things and then one of these formulas, they had all these different names for this formula, this formula was called best match and they probably had versions of it like BM1, BM2, it's like the programming language C, right, it's the successor of B. And there's also D, so yeah, that's how these names come to pass. So some class of formulas and then the 25th iteration somehow worked and this became famous. At the time, pretty long time ago now, all the competitions, the search engine competitions many years in a row, always this formula won, which is why it gained, gathered some fame. And if you ask me, these two slides which I've shown you here, they explain why this formula, yeah, what's the, you don't find it in textbooks, but you find it here, why it works. Okay, so that concludes almost the first part of the lecture. You're very welcome to ask questions. Just two, some more minor remarks. Now that's not minor, you should implement this. So how do you implement, yeah, so you should, let's go back to the slide with the score. Oh my, this always takes, Yeah, here's the formula. So you should now use scores and you should use these scores here. Now how do you do that? So every document should now get a score and you should compute with these scores. How do you compute these scores while you build your inverted index? And that's what this slide is about, which I will go to now. Well, think about it. Let's first ignore the IDF. Computing the TF scores is something you can do while you build the inverted index. You have done the first exercise sheet, right? You are adding, you are going through a particular document and now you see the word the first time in that document, you see it the second time, the third time. What you did for the exercise sheet, you just did, oh, I've seen it already, I don't append it again to the inverted list. 
Now you can just have a counter: oh, I've seen it already once, now I've seen it a second time, increase the counter to two, now I see it a third time, increase the counter to three. So this you can just do while you construct the inverted list. Very small modification in your code, but now you only have the TF. So now you did what you did for exercise sheet 1, and you have the TF scores. The nice thing is the TF-IDF scores are just some function of the TF scores. And while you're doing that you can also compute document length and average document length, that's easy: you see a document, you just know how long it is. If, yeah, maybe, let's see, if it grows louder we have to close the door, but maybe not for now. That's also easy. Now the IDF scores, how do you do that? Well, for IDF you need the DF, the document frequency. Document frequency is just in how many documents a word occurs. That's just the length of the inverted list of that word. Let's just go through an inverted list again. Here we have it. Yeah, let's say that's the complete inverted list of university for whatever document collection that is. So that means university is contained in document 17, 53, 97 and 127, which means there are four documents containing university. So it's just the length of the inverted list. Inverted list length. So very easy, you have computed your inverted index and now the length of each list just gives you, you want to close the door? Yeah, it's a bit annoying. Thank you. So that's it. When you code it, it will become clear. So in the first pass you compute the TF scores, in the second pass you easily get the DF, and if you have the TF, the DF and the document length, you can easily compute that formula. That's how you do it. This is not needed for the exercise sheet, you can do it but you don't have to.
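The two passes just described can be sketched as follows. This is not the master solution, only a minimal illustration under assumptions: documents are plain strings split on whitespace, IDF is taken as log2(N / DF), alpha is the usual linear interpolation between 1 and DL / AVDL, and b = 0.75 is an assumed default. The name build_bm25_index is made up for this example.

```python
import math
from collections import defaultdict


def build_bm25_index(docs, k=1.75, b=0.75):
    """Return {word: [(doc_id, score), ...]} with BM25-style scores."""
    # First pass: count term frequencies and document lengths.
    tfs = defaultdict(dict)  # word -> {doc_id: tf}
    doc_lengths = {}
    for doc_id, text in enumerate(docs, start=1):
        words = text.lower().split()
        doc_lengths[doc_id] = len(words)
        for word in words:
            # Seen before in this document: increase the counter instead of appending again.
            tfs[word][doc_id] = tfs[word].get(doc_id, 0) + 1

    n = len(docs)
    avdl = sum(doc_lengths.values()) / n

    # Second pass: DF is the length of each inverted list.
    index = {}
    for word, postings in tfs.items():
        df = len(postings)
        idf = math.log2(n / df)  # assumption: plain log2(N / DF)
        scored = []
        for doc_id, tf in postings.items():
            alpha = (1 - b) + b * doc_lengths[doc_id] / avdl
            tf_star = tf * (k + 1) / (k * alpha + tf)
            scored.append((doc_id, tf_star * idf))
        index[word] = scored
    return index


if __name__ == "__main__":
    docs = ["the matrix", "the matrix the matrix reloaded", "the university"]
    for word, postings in build_bm25_index(docs).items():
        print(word, postings)
```

A word that occurs in every document, like "the" in this toy collection, gets IDF zero and therefore score zero in every posting.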
So what you should do for the exercise sheet is play around with the parameters, k and b, that's what the sheet is about and we will look at this more in the second part. You can also do other fun stuff like if a document contains all the query words, they don't have to now, maybe you want to boost the results, say yeah, doesn't have to contain all words, but if it does, then it's even more relevant. I give it some boost. But only boost it, you have seen it for exercise sheet one, when you restrict it you say all words must occur, you're making it too hard, some relevant results will just fall out. Take the popularity into account, the BM25 formula we've seen is purely about term frequencies, how often terms occur or not occur, maybe you should want to take popularity into account. So we give you a lot of numbers, they are sorted by popularity, we give you number of votes, IMDB rating, Wikimedia links and so on. Note something here, when you have, let's just look at this again when I go to the data set here. Let's just look at it. Okay, I have to go to the end I think. I have to go to the end a lot. Maybe let's just cut out the, cut F1, 3, let me just cut out the abstract. 3, 2, 4, no 3, 4, 5 that's what I wanted right? I'm just cutting out the, hmm, what did I do wrong? Minus, Thank you. Now I just have the titles and the scores. So this was number of IMDB votes, this is IMDB rating, and this is a number of Wikipedia links. The rating is like a good score, you could take it immediately as a score because it's between one and ten, like school grades, right? This is a number of the kind which we have seen earlier. Number of votes, some movies have millions of votes, other have ten votes. That's too big a difference. You need to dampen it somehow if you want to use this. So a typical way to do this is to use log in some form, as we have seen earlier. You don't have to do this, but you may want to do it, you may want to somehow refine your formula. 
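One possible way to take a number like the vote count into account is sketched below. The additive combination and the weight are assumptions made up for this illustration, not something prescribed in the lecture; only the idea of dampening with a logarithm is from the lecture.

```python
import math


def combined_score(bm25_score, num_votes, weight=0.1):
    """Add a log-dampened popularity bonus to a BM25 score (hypothetical combination)."""
    # log2(1 + votes) keeps zero votes at zero and compresses huge vote counts.
    return bm25_score + weight * math.log2(1 + num_votes)


if __name__ == "__main__":
    # Ten votes versus a million votes: a factor of 100,000 in the raw numbers,
    # but only a small factor after the logarithm.
    print(combined_score(0.0, 10, weight=1.0))
    print(combined_score(0.0, 1_000_000, weight=1.0))
```

The weight is one more parameter to play around with, next to k and b.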
Anything else that comes to your mind. So we have a little competition, maybe that's the right time to show this. We can just go here and go to the exercise sheet. Maybe that's now the right time to show it before the break. Timing is perfect I think until now. So the exercise sheet is just redo exercise sheet one. And I should say redo means take your own code and pimp it up to add the BM25 scores, but you can also just use the master solution and proceed from there. So that's the first exercise sheet, extend your inverted index by BM25 score. This is what the second part of the lecture will be about. You have to evaluate it and now you should play around with your scoring function and try to be as good as possible and the second part of the lecture will be about some measures, measuring how good is what you have done and you should try, yeah, everybody will post a row here and you can just try to get the best results. So it's like a little leaderboard here, a very simple leaderboard. There are totally other methods which we don't talk about here which are out of reach also for the second exercise sheet. So one thing that's of course super important if you have a real search engine is analyze what users actually query, which queries are popular and what they click on. I mean we can't do that here because we don't have that data. Obviously important and another thing that's also a big topic you can use machine learning to learn good rankings. I mean that should be obvious that it plays a role here if we just look at the formula. This formula has some funny factors here like K and B and yeah how do you set k and b? I mean this is this could be learned I mean that's the that's one thing you could learn these factors but you can also learn much more not use a fixed formula but learn the whole ranking all together. That's called learning to rank that's a big topic we will not deal with that at this point. So let's make a break here for five minutes or so. 
Before we go into the break, any questions from your side right now? You can think about questions in the break. Yes, please. Yeah. Okay, this is just about the base of the logarithm. So the question was, does it depend on the collection whether log 2 is good or some other log is better. This can be answered as follows: whatever base you take, the log of X to the base B is just the natural logarithm, or any logarithm, of X divided by that logarithm of B, which means if you choose a different base, then it's just a constant factor for everything. It's like all the scores are multiplied by two, which doesn't make any difference. Now all the scores are twice as large, but because it affects all the scores equally, the ranking will not be affected. Which is actually an important point, which I will come back to later. The absolute values of the scores don't really matter. It's just the relative order they imply. So which means the base of the logarithm is immaterial. It's just two for convenience. Any other questions right now? Yes? So if the keyword is in the title, for example, shouldn't it count more than just having it in the description? If the keyword is what, for example? Is present in the title of the movie. Yeah, that's another idea which was not on the slide. Exactly, that's like the boosting idea, it can be very valuable. Like say it's in the title, that's more important. Then you immediately get into the area of spamming, but we don't have spamming. If you do that on the web, the early web engines did this. And then of course, all the porn sites put all kinds of words in the title so that they would be first and so on. But we don't play that game here. So that's the problem with boosting functions. You can immediately abuse it to get your documents higher. But yes, for the exercise sheet, you're welcome to try these things. How do you measure the quality?
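The answer about the base of the logarithm can be checked with a few lines. The document names, term frequencies and document frequencies below are made up for this illustration, and the score is plain TF times IDF with IDF assumed to be log(N / DF).

```python
import math


def rank_docs(doc_tfs, dfs, n_docs, base):
    """Score each document by the sum of tf * log_base(N / df) and return ranking and scores."""
    scores = {
        doc: sum(tf * math.log(n_docs / dfs[word], base) for word, tf in tfs.items())
        for doc, tfs in doc_tfs.items()
    }
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores


if __name__ == "__main__":
    dfs = {"matrix": 10, "movies": 500}  # hypothetical document frequencies
    doc_tfs = {                           # hypothetical term frequencies
        "A": {"matrix": 1, "movies": 1},
        "B": {"matrix": 3, "movies": 0},
        "C": {"matrix": 0, "movies": 23},
    }
    for base in (2, math.e, 10):
        ranking, scores = rank_docs(doc_tfs, dfs, n_docs=1000, base=base)
        print(base, ranking, {d: round(s, 2) for d, s in scores.items()})
```

The absolute scores change with the base, all by the same constant factor, and the order of the documents stays the same.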
That's what the whole second part is about. Exactly, very good question. Okay, let's have a break now. If you have more questions you can ask them later. Five minutes. So let's continue with the second part. And there was already the question. Yeah, but that's what the first slide is about. Is it? Yeah, maybe, yeah, I mean, the question is, what's the motivation for the, I hear you talking, what's the motivation for the second part? How you're playing around with these parameters, how do you find out for what's actually how good, I mean one way to do it, how people often do it, you, yeah, let's just go back to, you have your program, maybe let's start with that. You have matrix, it works okay, two matrix movies are there, the second one should not be there. So what do you do? You play around with your method, you try your query again, you see oh now I have the three matrix movies, okay perfect. But now it's only for one query so maybe you try it on three more queries. So that's how you would do it, right? You have some queries, you try it out on these. And this is essentially what we are doing now more systematically. So that's what evaluation is about. And it's of course super important for anything really, but in particular for search engine work. So we need the ground truth where we have a number of queries and what we want to have for these queries, the relevant document. For example, the query is matrix movies. Let's assume it has this word here, which doesn't carry any meaning if you have a collection only about movies, but still user typed it. And there are four matrix movies. Is it still four, is it five in the meantime? I'm not sure. And let's say they have these IDs, so line numbers in our collection. And that's what we have in our file. Let me just show you this. So we have prepared, and that's actually, Natalie has prepared quite a lot of work. And very important work. It's also on the slide. So we have prepared 11 queries. 
So films directed by Steven Spielberg. I should use another. Yeah, this is better. So films directed by Steven Spielberg, here they are. So that's the complete list of movies directed by Steven Spielberg. No, no, it goes on here. So there are more of these. James Bond movies with Daniel Craig. There are five in our collection and so on. So these are, they are also, yeah, films about Batman. So somebody might just query Batman films, right? And you will see it on the exercise sheet. You actually don't have to take that as query; part of what you can do is to pick the query. Maybe you don't want to query films about Batman, because "about" doesn't mean much; you just want to query Batman, for example, or Batman movies or whatever. Anyway, this is the description of the query. Here are the relevant documents. So that's the ground truth we give you. So we did that for you because it's a lot of work. And we have built a ground truth for 22 queries. I've shown you only half of them. That's also important. There will be a later slide on this. There are the ones for training and the ones for testing. Here are very similar ones, and we intentionally did them in a parallel way. So here you have films by Steven Spielberg and here in the testing one you have films by Stanley Kubrick. So they are parallel in that way. Why is that important? Let me tell it now and again later because it's so important. When you do all your tuning and trying out stuff, just look at the training data. You're not really doing training in the machine learning sense, but you're doing training in a human sense. Like you are developing and tuning your system based on something we give you. You play around, you try to be as good as you possibly can on these queries here.
Then in the end you should not evaluate on what you were tuning to for hours, you should evaluate on something else so that what you did is actually meaningful. I have a slide about that, a separate one because it's so important. But that's why we have two ground truth sets, one for training, one for testing. So and now the question is how do you evaluate, how do you use this for evaluation? Yeah, and this is called a benchmark. If you have something like this: here are some queries, or a basis for queries, and here is what should be the ideal result. So now you need a measure: somehow your program runs on these queries, does something, and now you want a number which says how good did it do. The simplest one here is precision, and let's just do it together. Precision at one, at two, let's just make a start. So here we have a result list, here you see the P@, how do I do this? So P@2 is, so this one is relevant, this one is not, this one, 17, is not relevant, this one is relevant, this one is relevant and this one is not relevant. So let me do this in red here. So not relevant, not relevant and not relevant. And you will compute a bit with these numbers. So your search engine, you gave it matrix movies, it has computed this result list where the first one is relevant, the fourth one and the fifth one, and the other ones not. Now precision at one is: just look at the first position, so you just look at the ranking until here and you wonder which percentage until here is relevant, so that's 100%, and now you just tell me based on what you see on the slides what the precision is. So here it's 100 percent. You just look at the first one and they are all relevant, or one. Precision at two, you look at the first two, how many are relevant, what's the percentage? Fifty percent exactly. Or 0.5, it's the same thing. Precision at 3? 33 point blah blah, yeah exactly. 0.33 and so on. Precision at 4? 50%, yes. And that's one of the most common measures really. This is very simple.
Precision at five? 60 percent. Three out of... No, what is it? Precision at five? No it's 60. Yeah, I misspoke. 60 percent. It's three out of five, it's 0.6. And then there's one particular precision measure, it's just precision at the position R, where R is the number of relevant documents, which is equal to 3 here. Yeah, no, I don't have to show an example. I mean it's just the precision at, yeah, you just plug in that R and then you get that number. So if there are only three relevant documents, then P at 3 would be the precision at R, it would be 33%. So that's kind of the simplest number you can compute. So average precision, let's do the same thing again. So we have the, okay, so this is, yeah let's do this. Now we have a slightly different result list, so the first one is relevant, which other ones are relevant? I think it's very similar to the one we had before. This one is relevant and that's at position, and let's write down the position below here. So this is the first document, this is the second document, here is the third one, the fourth one, and the fifth one, and let's just say this is the 40th position. Let's just pretend that it is. So yeah, one relevant document, it comes very far back in your list. So now when you want to compute average precision you are asking for precision at certain positions, namely at the positions where you find your relevant documents. So you want precision at 1, precision at 4, precision at 5, precision at 40. So you tell me what's precision at 1? 100 percent, yes. What's the precision at 4? 50 percent, yeah, we have four documents, two of them are relevant, 50 percent. What's the precision at 5, the position of the third relevant document? 60%. Yes. And now, what's the precision at 40, which is the position of the fourth relevant document? 10%. Yes. Because it's 40 and there are four relevant documents, 4 over 40, 10%. So you compute these precisions and now you just take the average precision, which is just the average of these four.
And this is something which you will compute for the exercise sheet. And you can tell me, so it's just the average of these numbers. 100% plus 50% plus 60% plus 10%, divided by 4. What's the sum of the thing in parentheses? 220 divided by 4 is? 55, very good. 55%. So that's the average precision here. It's just the average of these, and you see this last one really hurt us, right? 10 percent. So it hurts when you have a relevant document that comes very far back. Maybe you did well for the first ones, but to have a good average precision you must take care of all the relevant documents. The question is what happens if a document is not in the result list at all? Maybe, yeah, maybe your search engine, it does, right? The search engines we wrote so far, they don't return all the documents but just a subset, namely, for the second exercise sheet, those which contain at least one of the query words. Maybe for some reason, because you don't index numbers, your relevant document, a movie, does not occur at all. And then we just do as if it would occur at position infinity, right? You already see it here, far back the number goes towards zero, so we just set it to zero, which is the same as saying it occurs at position one billion. So that's the border case. Then there's mean average precision, that's easy. These were numbers for a particular query, and notice this is already an average, so that's the average over certain positions, so it's called average precision, and now you do this for many queries, like for 11 queries, and now you want one number in the end and not 11 numbers in the end, so you compute this number for each of the 11 queries, which is already an average, and you average that over all queries, and you don't call that average average precision but mean average precision. But mean is really just a synonym for average. And you can do that with all the other measures as well. You just take the average over all queries except that you don't call it average, you call it mean. It's not mean, it's the mean.
And you will find these a lot, and variations of these, in information retrieval but also in other research; very natural measures. And let's just look at the table on the wiki. Yeah, you see here you have an M, so precision at three, how many of the first three documents are relevant, and that's something which is very important for web search, just the top hits, how good are they. We absolutely want those to be relevant, and then you just average that over all queries. P@R and also MAP. MAP is a very hard measure to optimize because average precision is hard, because now all the relevant documents have to be ranked well. It's hard to get that high. Here are some more measures which are now a little more refined. Those were the simple measures and that's what the rest of the lecture is about. So just five more slides, that's already the last part. So let's look at those. Also very popular exam questions. And actually also in our example, now what's a good, let me just, yeah I don't know, maybe, I don't know if we have Batman for example. So you have a query which says movies about Batman. Now what's a movie about Batman? That's maybe ill-defined. There are movies which are really about the Batman, only about the Batman, so these are super relevant documents for the Batman query, but then there are movies where maybe Batman is only one of several characters, right? You have many such queries where you have shades of relevance. Super relevant, like right on target, somewhat relevant, not relevant. And you can have more shades. So that's really very frequent. You have shades of relevance. And now you certainly want a bonus for having the very relevant ones before the somewhat relevant ones. So the ideal is the very relevant ones first, then the somewhat relevant ones and then the not relevant ones. And here's a formula which does that and here's an even better formula which does that.
So now you sum up, so this is just summing up the relevance, so this is like zero if it's not relevant and otherwise it's this score, so a two for very relevant. This does not take into account the positions, right? If the very relevant one is first or tenth would not matter here because I'm summing this up, which is why I now give this discount here. I divide by the logarithm of the position and you will see in the example on the next slide why that is a good idea. Now it helps if the very relevant ones come first. We will see it on the next slide, and that's why it's called discounted, because you now give a discount to the scores if they come later in the list. Yes, and this will be, let's first go to the example, then we understand the rest of the formula. So let's do the, so we have a relevant one here. So now we, how do we start? Maybe we should, how did I call the formula on the previous slide? DCG at 5 was just the sum from 1 to 5 of this relevance information, which is 1, 2 or 0, divided by log 2 of i plus 1. So it's a sum of 5 values now and you tell me what the five values are for this list. So the search engine returned five hits in that order. The order is important. First one was relevant, third one was very relevant, fourth one was relevant, the other ones not. So I now want a sum with five things. What's the first thing here? One, that's correct. What's the second thing? Zero, yes. The third one? Two over, hmm, two over four? It's log 2 of, we have to plug in this. 2 over 2. So the positions are, maybe I should write the position, it's not zero based, so this is the first hit, no it's already written there, i is. I mean, it's just 1, 2, 3, 4, 5. So what's the denominator? 2? Yeah, it's log 2 of 4. That's correct. It's log, yeah it's 2, let me just write it, log 2 of 4 which is 2, yes. Next one is, so now we have a 1 over log 2 of 5, that's correct, and then we have a zero again.
So this is one plus one plus one over, what's one over log two of five? It's log of 2 over log of 5, I think, independent of the base. Oh my, everything is so slow here. 0.43, hope you can confirm, tell me if it's wrong, so this is 0.43 approximately. So the whole thing is 2.43, yes? So now we have that number, but the problem is, I mean, is 2.43 good or is it bad? We should somehow, and that was the point we haven't talked about yet, the other part of the previous slide, we somehow have to compute the ideal ranking. So the ideal ranking would be, let's just put it here in a separate, so, in the ideal one we would have the two first and then the one and then the one and then zero and then zero. We first have the very relevant one, there is only one, then the two somewhat relevant ones. And now let's do, a little faster, the same sum for the ideal value. So what do we have here? First one? Two. Second one? One over log two, right? Oh yeah, it's three. You are right, I'm wrong. Log two of three, another funny number. Third one is? Yeah, one over log two of four, right? Plus zero, plus zero. Okay, so we have another funny number here, which is 2 plus one over log 2 of 3 plus, and this is 1 over log 2 of 4, which is one half, right? So it's 2.5 plus this here, it's 3.13. Natalie, did I make a mistake or does it look correct to you? Looks correct, see, it's 3.13. Okay, let's assume it's correct, anyway the idea was correct. So now we have, this is the ideal and this is what we got. And so, yeah, what we can do now is just divide this by the ideal. And that's how you compute the, yes, just a second, let me compute the final number here and then the question. So let's just compute that, 2.43 divided by 3.13, and now we get a percentage again, which is always nice, right. This is like 77.6%.
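Precision at k as computed in this example can be written in a few lines. This is a sketch; the document ids are made up (apart from the 17 in third position, which is mentioned above), with relevant documents at positions 1, 4 and 5 as in the result list on the slide.

```python
def precision_at_k(result_ids, relevant_ids, k):
    """Fraction of the first k results that are relevant."""
    top_k = result_ids[:k]
    # Always divide by k, even if fewer than k results were returned.
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k


if __name__ == "__main__":
    results = [10, 20, 17, 30, 40, 50]  # hypothetical result list
    relevant = {10, 30, 40}             # hypothetical ground truth
    for k in range(1, 6):
        print(f"P@{k} = {precision_at_k(results, relevant, k):.2f}")
    # Precision at R, where R is the number of relevant documents.
    print(f"P@R = {precision_at_k(results, relevant, len(relevant)):.2f}")
```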
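Average precision and mean average precision as just described can be sketched like this. The function names and the toy data are made up for this illustration; the example reproduces the result list from the slide with relevant documents at positions 1, 4, 5 and 40, and treats relevant documents missing from the result list as contributing zero.

```python
def average_precision(result_ids, relevant_ids):
    """Average of P@i over the positions i of all relevant documents."""
    hits = 0
    total = 0.0
    for position, doc_id in enumerate(result_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / position  # precision at this position
    # Relevant documents that never show up contribute 0 (position "infinity").
    return total / len(relevant_ids)


def mean_average_precision(runs):
    """Mean of the average precisions over (result_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(res, rel) for res, rel in runs) / len(runs)


if __name__ == "__main__":
    results = list(range(1, 41))  # 40 hits with ids 1..40, so id equals position
    relevant = {1, 4, 5, 40}
    print(average_precision(results, relevant))       # (1 + 0.5 + 0.6 + 0.1) / 4
    print(average_precision(results, {1, 4, 5, 999}))  # fourth relevant document missing
    print(mean_average_precision([(results, relevant), (results, {1})]))
```

Both functions assume that every query has at least one relevant document in the ground truth.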
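The example just computed can be reproduced with a short sketch. It uses the discount log2(i + 1) from the slide and, as in the example, takes the ideal ranking to be the returned relevance values sorted in decreasing order; with a full ground truth one would build the ideal list from all relevant documents instead. The function names are made up for this illustration.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: sum of rel_i / log2(i + 1) over positions i = 1, 2, ..."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG divided by the DCG of the ideal ordering of the same relevance values."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    hits = [1, 0, 2, 1, 0]  # relevant, not relevant, very relevant, relevant, not relevant
    print(round(dcg(hits), 2))                        # what we got
    print(round(dcg(sorted(hits, reverse=True)), 2))  # the ideal
    print(round(100 * ndcg(hits), 1))                 # percentage of the ideal
```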
Yes please the question. Because it will be a zero and we can implement more hits and there's a lot of problems. Yeah, that's a very good question and I come back, I think, on slide 10. What about, so right now I'm returning a list and what about all the documents which are not relevant at the back of my list. And web search does that, right? You never go to page 200 of your web search result. If you go further down, or some people maybe do, then it becomes quite irrelevant at some point. And the question is, that was your question, I think, should that be punished? And these formulas don't punish it. But it's a reasonable question, and I have something to say about the next slide. I mean, what's behind that question is shouldn't the search engine only give me those results which are relevant and then cut off at some point and say no more results now. I don't show you anything else. And that's just too hard, which is why it's not done. You always do this rank thing and you don't care what comes later, as long as the relevant ones are at the top. But it's a very reasonable question. Are there any other questions at this point? So this was normalized discounted gain. And you see the idea here, you compute these numbers. You normalize them by the ideal and the ideal score depends on the query so you have to compute it for every query but you can and then you again get a percentage. And this is frequent. And here's another one and I think that's the, yeah sure. regarding the ground truth. When we look back at the examples for the matrix, so the ground truth for matrix would not include the John Wick movie, although it mentions matrix in the description, but only the actual matrix in the description, right? Yeah, that's a very good, another very good question. Ground truth construction, so what is ground truth? This is, and this is true also for all the machine learning benchmark, whatever problem they solve, the question is what is a relevant document? 
This is something which the experts who do this have to decide, and they don't necessarily agree, right? So let's go maybe to our benchmark here, which I showed you earlier. Yeah. So here are some which are kind of very objective, right? Films directed by Steven Spielberg, I think that's objective, so there's a perfect ground truth. But yeah, exactly, this one is already debatable, what should be in here. So this is indeed a big issue: there can be mistakes in the ground truth, and there can be disagreements between different evaluators. For the exercise sheet we don't have shades of relevance, it's just relevant or not, so we don't have this discounted thing. And so actually we have to ask Natalie: how did you decide? You did that query. How did you decide what to include? Because actually what we did is we looked at a larger set and then we looked at each movie and said, yeah, this is about Batman. So what did you do? I took those movies that were about Batman. There was one case, I think it was in the test set, for Sherlock Holmes, where someone was posing as Sherlock Holmes. I can't say it's about him, so I did not want to include it, because it's not the actual person that's meant by Sherlock Holmes. Yeah, so let me repeat it for the Zoom people. You heard a longer story, which somehow conveys that it's not so easy. For some movies, you really have to think about it. So, films about Sherlock Holmes. There was a movie where somebody was not playing Sherlock Holmes, but posing as Sherlock Holmes. Now, is this a movie about Sherlock Holmes? Debatable. Natalie did not include it. But yeah, these are very hard questions, and it's a problem for a lot of benchmarks that there is disagreement between evaluators and that there are even mistakes. So it's a very valid question and a big topic, not only for search engine benchmarks. Okay, let's move on to our last measure and then there's time for more questions.
I will again go through to the exercise sheet. That's important not for our scenario here, also a typical exam question, but for a scenario where you have a competition and it's again what we have here. We have now done this, Natalie has done this before. That's quite a lot of work. And now we did this for our movies collection which was like 150, 100,000 movies or so. I mean, you can't look at all of them but you can kind of do exhaustive searches there. Imagine you have some much bigger collection with millions of documents. There's just no way you can find all relevant documents in that collection, right? Even for the one in a thousand, there's already no way. You have to use some heuristics. You have a web corpus and there's a query and now I want to find all relevant documents for a query. Impossible. And so in these competitions, how do you do that? And so you can actually run a competition where you don't have any relevant judgments, but you just take what the people submit, everybody submits something and you just say, okay, the top three of these engines, let me just declare them relevant. Or maybe judges look at some documents and not at others. So you have some relevant judgments, but for some you just don't know whether they're relevant. So in the end, and let me jump to the example and then jump back to the formula, you have something like this. So your search engine returns something like this and for some positions you just don't know what you returned, whether it's relevant or not because the judges are telling from the other engines, we do not know whether it's relevant. So you have this situation, not relevant, relevant, I don't know. And the question is how do you evaluate it? So you can forget about the motivation part if you just want it as a formal problem, it's you now also have I don't know for the reasons I just explained. And that's actually not so easy. And here, so these formulas, they always look very frightening when you look. 
BM25 also looks frightening and you wonder wow why is that 1 over R sum RR 1 minus so but if you look at this in detail with an example then it becomes clear and we will do that on the next slide. I will not explain this here this is for reference I will explain it by example and then if you want to understand it at home you can look at this yourself. So here's the example and what you need to know is you need the example and you also need to know how many relevant documents are there overall so in this case there's one more, so three overall. And how many non-relevance are there overall? So for all the others we don't know, but we know that there are, I've just not listed them yet, so there are 13 not relevant ones, three relevant ones. And now we want to compute these numbers which were defined on the slide before and RR what was RR actually yeah okay that's the and let's try to understand the formula before you have seen it. So this is simple this is very simple it's just the search engine has returned this list and of two documents in that list we know that they're relevant. So that's just this number, it's just two. R is just the total number of relevant documents, that's also easy. So that's just a three. There are three relevant documents in total. I have found two. N is just the total number of not relevant documents. So 13 are not relevant. That's 13. And now come these numbers. They are the interesting ones. So this now says at number at number three, let me take the mouse so that you see a pointer, what score do I give to that position? Now I can't use precision at three. Precision at three would have to say okay among the top three how many are relevant? I don't know whether the first is relevant. I don't know whether the first is relevant. So what this measure here does is it looks at one minus, let me just briefly check it with the previous slide, it looks at how many non-relevant ones come before the relevant ones here. 
So I want to punish if non-relevant ones come before the relevant ones. And the denominator is, in this case, the minimum of R and N. I will talk about that in a second, let's just take it for granted now. The minimum of 3 and 13 is 3. So there is just one non-relevant one before, and this is divided by 3: there are 3 relevant ones and 13 non-relevant ones, take the smaller one. So this is one minus one over three, which is two over three. So 66 percent. And try to understand that number now, before we do the next one. If there were no non-relevant one before number three, I would get the highest score here. So I get punished if non-relevant ones come before my relevant ones. If there are only question marks, they might be relevant, I don't know, so I'm not getting punished. Number eight, what do I have? Here you see the maximum punishment. I have a relevant one here, and three non-relevant ones come before me, and there are only three relevant documents overall. So here I get one minus ... how many non-relevant ones come before me? Three. There are three relevant ones in total, so it's one minus three over three, which is zero. And zero in percent is zero percent. So why do I get punished so much? Think about it: I have only three relevant documents, I don't know about the others. If three not relevant ones are before me, well, in the ideal ranking the three relevant ones would be at the top, right? And so I'm coming at a position after the ones where I should have all the relevant documents. So it's kind of taken as the worst case; here I already get a zero. No, we can't. Yeah, thank you for that question, because that's exactly it: why do we have this min of R and N? So the N is clear. What I put in the numerator here is the number of not relevant documents that come before, and the largest number for this is obviously this 13 here. So by dividing by N I ensure that this is a number between 0 and 1.
If I would just divide by N here and not by the min of R and N ... I'm a bit confused now. Yeah, I was just, yeah, yeah, you're right. Okay, NR at 3, it must be 1, and RR is just 2. Say it again. What's written on the slide is a bit confusing: NR at 3 should actually just be 1, and NR should not be the one minus thing? Yeah, I'm sorry, I confused that. Okay, the NR is not the one minus thing. Okay, but I can fix that. Oh yeah, okay, I took the NR for the whole thing. Okay, we will come to that. So this is not actually what I wrote here, but I think we can fix it this way. I'm sorry. But making mistakes is great, because you usually learn something. Tell me if this is now right: what I actually computed here was 1 minus NR at number three divided by the min. Is that correct, Natalie? I think so. So what I wrote was not the NR thing itself. And now to your question. You are absolutely right: there could be five non-relevant documents here, and then I would have five over three, and then it could become negative. Now let's look at the subtle detail, what it says here. Aha, I'm just looking at the R top-ranked non-relevant ones before. So I'm just cutting it off. I'm just looking at the three top-ranked not relevant ones. You see there are a lot of subtleties in that formula. So if I had five non-relevant ones here before that relevant one, I would just take three of them. That's just how it's defined, and it's defined that way obviously so that this is a number between zero and one. Okay, so to summarize: if you look at it, there are simple reasons for each of these components of the formula, but simple things in combination very quickly give something complex to understand.
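The pieces of the formula discussed so far can be put together in a short Python sketch. This is an illustration under stated assumptions, not the slide's code: the function name `bpref` and the `'R'` / `'N'` / `'?'` encoding are hypothetical, and the exact layout of the example list is assumed (only the facts mentioned in the lecture are fixed: relevant hits at positions 3 and 8, one judged non-relevant hit before the first, three before the second, 3 relevant and 13 non-relevant documents overall).

```python
def bpref(judgments, num_relevant, num_non_relevant):
    """Sketch of bpref for a ranked list with incomplete judgments.

    judgments: ranked list of 'R' (relevant), 'N' (not relevant),
               '?' (unjudged, ignored)
    num_relevant (R), num_non_relevant (N): totals over the whole benchmark.
    """
    if num_relevant == 0:
        return 0.0
    denominator = min(num_relevant, num_non_relevant)
    non_relevant_seen = 0
    total = 0.0
    for judgment in judgments:
        if judgment == "N":
            non_relevant_seen += 1
        elif judgment == "R":
            # Only the first R judged non-relevant hits count, so the
            # per-document score stays between 0 and 1.
            capped = min(non_relevant_seen, num_relevant)
            penalty = capped / denominator if denominator > 0 else 0.0
            total += 1 - penalty
    # Dividing by R means relevant documents missing from the list count as 0.
    return total / num_relevant


# Assumed layout of the example list (positions 1 to 8).
example = ["?", "N", "R", "?", "N", "?", "N", "R"]
print(f"bpref = {bpref(example, num_relevant=3, num_non_relevant=13):.3f}")
```

Unjudged hits neither help nor hurt in this sketch, which is the point of the measure: only judged non-relevant documents ranked above a relevant one are punished.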
Let me just summarize: the reason for the N is that it's just the natural normalization here, and the reason for the R is to punish more, right? If you have just three relevant documents, you already get a zero here if you have just three not relevant ones before. You wouldn't need it technically, but it just makes it harder to get a good score. So you get a zero pretty quickly. And then the bpref is just the average of these. But it's now not the average over all, I think just over the two. Oh no, you're dividing by R, so for the other one you get a zero. So it's a really hard score to ... let me write that. So what I'm now doing is: I have three relevant ones. Here I get two thirds, plus here I get a zero: it is in my list, but so many non-relevant ones come before it that I get a zero. And for the third one I don't really get a zero, it's not in my list, but I just write it here as a zero, because I divide by three. Which means it gives two over nine. And how much percent is two over nine? What's one over nine? Doing fractions in your head is also a good exercise against anything. One over nine is what? Eleven, yeah, 11.11, so it's 22 percent, exactly. So you see, pretty bad, because the first one was okay, but this one already gave me zero, and this one was not even in my list. So it's hard to get a good bpref. Okay, so we are coming to the end. Overfitting, yeah, that's important, let me just repeat what I said earlier because it's so important, and it's also written on the exercise sheet. Oh my, this is so slow on this machine. So of course, if you play around with the benchmark, it's trivial to achieve a perfect result for that benchmark: you could just write a program that says, if the query is equal to this, then output this. That's not a search engine, that's extreme overfitting. You just fix the output for the queries in your benchmark. You shouldn't do that.
So what we have here is this training and this test set and let me just go to the exercise sheet and please read it carefully. There are subtleties here so you get the training benchmark. This you can look at as much as you want and tune around with your engine. And you can also modify the queries. Just take what's written there as a description. So you don't have to, you can change the query, right? You can just say Stanley Kubrick director or directed or whatever you like. You can choose the query words. That's just here the description of the query. Do all that with a training benchmark. And only when you are done, when you have fixed your parameters, I will take this K, I will take this B, I will boost things in the title, I will do this and that and whatever. Now, and also on your test benchmark, this is the test benchmark, you can also think about which, how will I pick my query like I just said. But you have to pick it and now it's fixed. And now whatever you have now, now you run it on your test data set and now you enter it in this cable and then you are done. So you're not allowed, of course we can, I mean it's just a game, it's a fun competition, but then you shouldn't go back and say, ah, I think I can still do better, let me tune this a little bit. Then you are starting to overfit. Yes? Okay, so you want to redefine words. Can you give an example? directed by with the white space in between. Okay, what you want to do is like phrase search. You want to take positional information into account, which is very important. Like you have directed by, you are asking that, and you only consider it a match if the words are nearby in your document, yeah. And you could do that in several ways by including positional information in your index and just giving higher score if the values are close together or even nearby or by merging these words. You are free to do whatever you like to. I mean it's more work of course, but yeah. 
And if you think about it further, there are millions of ideas gradually becoming more work to implement. But yes. And there's this, here it says refinements. So yeah, if you want, you can write a, oh sorry, you can write a long entry here if you try out a lot of things. But yes, so let your fantasy run wild. You're very welcome to. And here's some, so we are almost done. Yeah, that corresponds well to an earlier question. So this is almost a philosophical question. It's so great to close what some want to ask. If you think about it, it's a bit strange, right? We are giving you the ground truth as sets, or as, no not as, sorry. Yeah, it's a set. We are not saying in which order they should, I mean we are treating them as a set. Here it's a sequence, but we are not saying 57 should be a first or something. We are just saying this is the set of relevant documents. All our measures consider this as a set. So sets of relevant documents, but your engine returns a ranked list. You're comparing a ranked list against the set and this immediately leads to the question, okay, shouldn't my result be a set or shouldn't my ground truth be a ranked list or ranked set or whatever. And this is just some food for thought, we could discuss this forever but but well, search engines, why not let them output the set? Yes, you could try to, but the simple answer is just that it's super hard. If you would choose a cut off, you're always in danger of cutting off too much, and it's just, yeah. It doesn't hurt if you have things further down in your list which are irrelevant, because people don't look further down in the list anyway. But for those few who want, you better show it in case you have a relevant one at position 100 so that you can find it if you're really up to scrolling down the list. So that's the reason why search engines do it. Why isn't our ground truth also ranked list? Well, that can, I mean, you could give, yeah, here you couldn't do it for example. 
Films directed by Stanley Kubrick, these are just the films. They don't have an order, right? Now you would have to say which films are better or something, you would need popularity measures, that's a different query then. I mean, they don't have a natural order. Sometimes you have these shades of relevance, we talked about it, but you don't really have a full order, right? And then the third question, which you might also wonder wonder why don't we give the thing scores and you compare actual scores so you say this should get a 0.8. Well we also had that earlier because you are asking such great questions the absolute scores don't really mean that much right the scores and these engines which is also why they are not usually displayed. They are just vehicles to get a ranking, right? You could multiply all scores by a factor of two, you get the same ranking. The absolute values of the scores don't mean much. In the early days, Google would show their page rank, their Google's internal score, but they don't do that anymore for a long time. So that's it for this lecture. Quickly go to the exercise sheet, I think I explained everything. So a lot to play around with, there's a minimal version. If I'm correct you have two weeks Natalie, is that correct? Yes, you have two weeks, which means you shouldn't, don't start in two weeks. You should start now. Absolutely, if you have problems or if you have no problems, at least start now that you have started, that you know what it's about. Maybe continue later, but always take the motivation and inspiration from the lecture. Start immediately, then make a break, finish later, that's okay. Don't make the mistake of only starting in one and a half week. The Q&A session is not on this Friday, but Friday in a week, just for all the accumulated problems. Is there any question right now left over? now, left over. Does not seem to be the case, so have fun working on the sheet and see you again in two weeks. 
Bye bye.

Welcome everybody to lecture three, information retrieval in the winter semester 22/23. The weather got warmer again, I was very surprised when I left the house this morning. So, can you please listen, thank you. I will say something about your experiences with exercise sheet 2, you had two weeks to work on that, and then today's lecture is about efficiency, and we will see all about that. Here's the summary. The exercise sheet will be a mathematical one; in particular you will learn something about a technique called Lagrange multipliers, and you have to apply that to a new algorithm which you will learn today. But let's first start with the organizational stuff, experiences with the last exercise sheet. Most of you found it interesting, you learned a lot, it was quite time consuming, but you had two weeks for it. There were many small problems, as usual I think, because it was a bit more coding, and some of you had difficulties understanding the doctests. Here are a few quotes. Some of you really liked it: I had a great time, I've lost track of time. Too many little problems. Mixed experience, first fun, then trouble. Yeah, the tests, can you please listen, it was not so easy to pass the tests, right, because they fail if you make even the tiniest of mistakes, but that's what unit tests are for. I think you had an important experience, namely: when you have even a medium-sized project, I would call it a small project, what you did for exercise sheet 2, and you don't have unit tests, then your code is not correct, I mean that's almost 100% certain. Was there anybody here where you wrote your program, you ran it for the first time and then all the tests passed, 100%, without any fixes? Yes, there is someone. Perfect. OK, so there are a few exceptions, but usually that is not the case, and when you don't have tests, it means: you think it works fine, but there are small mistakes. OK, very interesting and challenging.
More hints on how to use the functions. So, Natalie and I will talk about that for the next sheets. The next sheet is mathematical. I think what you are saying is that some of you had trouble understanding from the doctests exactly what the functions were supposed to return, and that is maybe due to not fully understanding how Python doctests work, so we will try to give more explanations there. And that was a similar comment. Very productive. And somebody doesn't watch movies, likes books, okay. So, the results. It was about tuning, and I think it was a very important experience: you have the theory and now you try to make it work for an actual application and search engine. Let's maybe also look at the table. So here is the table of what you did. You tried all kinds of things, playing around with the parameters, you had a training and a test set, trying all kinds of methods. One thing you experienced is that you change a little bit and it makes a rather big difference. Another experience is that you have an idea, it sounds good, but it actually makes things worse. So these are typical and valuable experiences. Most of you, I think, found that making B, which is responsible for the weighting of the document length, smaller or even zero gave better results. This was what we started with, the baseline from Natalie; of course there was room for improvement. These were the best results. What you also see, also interesting: mean average precision, that's a mean value, and it's very hard to get it good, you are very far from 100 percent, and this is typical. 61 percent is even a pretty good value, so it's really high. You don't get perfect numbers, but that's just how it is, that's realistic. It's really hard to get a perfect ranking, even for such a toy collection and small benchmark. So yeah, super important but also super hard, and it's hard to understand what's going on.
Some of you took it as a learning experience, some of you found it a bit annoying but that's also part of this kind of research, the same when you do anything in machine learning. It's a lot of tweaking, trying, trying to understand what's happening. It's of course very hard to understand and also to predict what's happening. Some of your favorite movies here, I will just show it, it's on the slides. If you want inspiration for watching new movies, especially movies which are maybe not so well known but still very good. So here the highest down on the list, at rank 24,276 in anime film. Here you have a list of recommendations. Okay, I think that's it for the introductory part, we have a full program today, so let's just continue unless you have some questions about exercise sheet 2 or about the organizational part. You can also write something in the chat anytime, also the people in the room can write something in the chat and the people at Zoom can also just talk and ask questions. Okay, then let's start first recap and motivation for this lecture. So we talked about list intersection already in lecture one, and then in lecture two we have merged the lists. When you talk I can hear you. I think you have to be, it's too loud otherwise, thank you. So intersection means you type keywords and then all the keywords must occur in your matching documents. We have found in the second lecture that it's a good idea to do merging and use scores, which means not all scores, not all keywords have to occur. So both modes are relevant and used in real search engines. And it's also basic operation in database engines. That's called a join. 
Most of you, I guess, have heard some database lecture and when you join two tables on some column or even several columns, these columns are usually sorted or you can sort them and then the join operation joining two tables into one, it's also exactly a list intersect or merge depending on whether you want an inner join, only the parts of the table where the columns match or an outer join, also things where only one table matches. And we will actually come back to that in lecture 12. And I'm just writing that here to say that this is not a random routine we are looking at here, and today we have a whole lecture about it, but it's really a very important and central routine, intersecting or merging sorted lists of integers. It's central for efficiency of every search engine or database engine, so very fundamental. And today we will go back to intersection, so not merging, just for the sake of focus, and it's about efficient list intersection. We will start with some coding to give you some motivation and also because it's quite fascinating I think what you will see. So stay buckle up. We will run code and measure it so I just wanted to say something because you will do this a lot in your future career but also in the course of this lecture. Time measurement is actually not so easy because there is variation. You run it, you run it again and it takes twice as long what's happening. I mean there are other jobs running on your machine, you have stuff like the garbage collector in Python, also in Java it's run, it doesn't run, it changes your runtime. Which part of the data is in which cache? We will not talk about that much today, but it's a big topic, there are all these caches, data read from disk can be cached in main memory, there is this L1 cache which is very fast access to a very small memory for things you are using all the time. TLB is something about virtual memory management. This also makes a big difference. 
We will do something very simple today, we will just repeat each measurement three times so that we at least see if there is a certain variation. So we are not computing any averages, just repeating three times. Of course interestingly if you repeat something three times that can also distort the truth because you are doing the same thing three times in a row, then by doing it the first time now all kinds of things are in the cache and now the second and third time might be faster because of that, but unrealistically faster. So by trying to avoid problems and repeating it, you are again creating problems. But yeah, this is just a side note, so time measurement if you want to do it scientifically correct and so that it actually says something, it's quite hard. Most of the lecture today will be about a new algorithm which you haven't seen so far. So far we have had the zipper algorithm, which is this very simple linear time algorithm zipper, because it's like a zipper, you go through the two lists in an interleaving fashion, so if they are very similar it will actually interleave, so like a zipper. And before we go to the more complicated algorithm, let's just look at, you take one basic algorithm, even a simple one, and now you try all kinds of tricks. And what kind of tricks will we try? All kinds of variations in the implementation and the algorithm, you will see in a second. But it's still all zipper. What difference does it make? How do you store the list? So in Python there is list, array, numpy, arrays and so on. In other programming languages you also have different data structures for how to store arrays or lists. Just implementing, not changing the implementation, but just minor tweaks in the implementation. Let's try different compilers. There is who knows PyPy in the room. PyPy, do you know PyPy instead of Python? You only know it a little bit, which is why you show me. Okay, you also know it a little bit. 
Okay, we will also look at other programming languages. So we will just look at a lot of variation for one and the same basic algorithm. And let's just see which range of running times we get. So we will see all kinds of code doing the same thing with the same algorithm. What do you think the range of running times will be? What's the largest difference? Like, the fastest program we have in the end and the slowest one, what will be the factor? 2, 5, 10, what do you think? You can also write it in the chat, just a guess. So everybody writes code that implements the same thing, and then we look at the slowest code and the fastest code. What do you think the factor between them will be? A thousand? That's a lot, right? Let's see, somebody said a thousand. A thousand is a lot, it's maybe a bit unrealistic. So let's start with small variations in the running time. Because we also have other stuff to do, I've prepared something, namely the code which does the list creation and the timing, and you see it here. Let's just go through it very quickly. It's in Python, let's start with Python. It creates two lists of a given size. So this is how I'm intending to use it: intersect timing.py, and let's start with two lists of different sizes. So maybe one is of size 1000 and the other is of size 1,000,000. And then I have a fourth or third parameter, depending on how you count, which just says which variation I will use. I will call my algorithms intersect one, intersect two, intersect three. So this I guess will not work, because I have not implemented anything yet. And what will it do? It will just create two random lists. The numbers are not from an infinite range; they are integers from a range which is just the sum of the sizes of the two lists, so that I have quite a bit of intersection but also numbers which are different. The first list is of size n1, the second is of size n2, so here it's 10,000 and 1 million.
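The setup described here (two sorted random lists with values drawn from a range equal to the sum of the two sizes, each measurement repeated three times) could look roughly like the following Python sketch. It is not the lecturer's script: the names `random_sorted_list`, `time_intersect` and `set_intersect` are made up, the fixed seed is an assumption, and the set-based function is only a stand-in so that the sketch runs on its own.

```python
import random
import time


def random_sorted_list(size, max_value, rng):
    """Sorted list of `size` random integers from range(max_value)."""
    return sorted(rng.randrange(max_value) for _ in range(size))


def time_intersect(intersect, n1, n2, repeats=3, seed=42):
    """Run intersect(a, b) `repeats` times; return (milliseconds, result size) pairs."""
    rng = random.Random(seed)
    # Value range = n1 + n2: a good amount of overlap, but not only overlap.
    a = random_sorted_list(n1, n1 + n2, rng)
    b = random_sorted_list(n2, n1 + n2, rng)
    measurements = []
    for _ in range(repeats):
        start = time.perf_counter()
        result = intersect(a, b)
        elapsed_ms = (time.perf_counter() - start) * 1000
        measurements.append((elapsed_ms, len(result)))
    return measurements


def set_intersect(a, b):
    """Stand-in only (set-based, drops duplicates); not the zipper algorithm."""
    return sorted(set(a) & set(b))


for elapsed_ms, size in time_intersect(set_intersect, 10_000, 1_000_000):
    print(f"{elapsed_ms:.1f} ms, result size {size}")
```

Printing the result size next to each time is a cheap sanity check: all variants run on the same input should report the same size.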
And I just repeat it three times and time it and I output the time in milliseconds and also the size of the list just as a very simple check whether they are actually doing the right thing, we should always get the same result size. And let's just to warm up start by implementing the standard zipper in some obvious way maybe. So we had these two, this is basically what you did for exercise sheet one, just intersect and you tell me if I'm doing something wrong as usual, so ideally when I'm finished with your comments it runs. So what do we do? We run this because it intersects, we run when we are at the end as long as neither of the two lists is at its end. So we have this, right? As long as this list is not at its end and this list is not at its end. Okay, and now we check if the first list, so they are in a sorted order, ascending, so smallest number first. We are now at one position in list A and another position in list B. We are comparing the two elements and we are checking which one is smaller. Let's say we are smaller in list A and then we just proceed in list A and we don't output anything and if we are maybe now we are at the end of list A and then let's just break. Okay and now let's do the same thing for B. So if bj is less than ai, that's the other direction, then we proceed in this list and if we now happen to be at the end of this list, then we break. And now let's check if they are actually equal, if ai is equal to bj and then we actually, what is it? Append, I always forget, is it append for python? Yes, thank you. Which one do we append? Intersect either of them because they are equal and then we proceed in both lists. And little detail here and then we return R. What are we actually doing when we have an element multiple times here? I want to ignore that detail basically. So what this does is, if we have it, I hear you talking, thank you. 
If I have an element several times in both lists, like 3 3 3 3 in one list and 3 3 3 in the other, what this code does is output it three times, so the minimum of the number of occurrences. That's just the semantics we are implementing here; it's unclear what we want, so let's just do it that way. Is this a correct implementation of zipper, what do you think? That's our first one. Yeah? No? I personally don't have the equality checks with break and it works. So I think this could break something. Why would we need this? I mean, we need this, right? If we don't do this? It cannot be equal to that because, sorry, if the condition is given that i is equal to that k. Can you give me a line number? 28? It can't go inside that because you can write this i less than length of a. i is less than, here. Now I'm confused, why can't this happen? So this condition can be true, then i is increased, now i may be at the end. What am I getting wrong? I think the confusion comes from not using elif. Yeah, we will use that in our second iteration. The first question is, is this wrong or is it just clumsy? What's your claim? Yeah, that's okay. Okay, that's deliberate. I wanted to start with, you just write your first version and, okay. Is anybody claiming that it's wrong, or do you see a mistake? Okay, let's just try it for now. I mean there is a doctest here which just checks it for a simple list. Let's just run the doctest on intersect timing.py. Okay, it works. Doctest with nine elements, it works. And let's see the timing now: okay, 200 milliseconds for intersecting 10,000 with 1 million. Okay, so I heard talk about elif, let's just copy this now, and we will do this four times. So this is now intersect two, we have to take care that we also call intersect two, so this is with fewer ifs and more elifs.
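The first version described above, with three separate if blocks and explicit end-of-list checks, might look roughly like this. It is a reconstruction from the spoken description, not the code on the screen; the name `intersect1` follows the naming mentioned in the lecture, everything else is an assumption.

```python
def intersect1(a, b):
    """Zipper intersection of two sorted lists, first (clumsy) version."""
    i, j = 0, 0
    r = []
    while i < len(a) and j < len(b):
        # a is smaller: advance in a, stop if we ran off its end.
        if a[i] < b[j]:
            i += 1
            if i == len(a):
                break
        # b is smaller: advance in b, stop if we ran off its end.
        if b[j] < a[i]:
            j += 1
            if j == len(b):
                break
        # Equal: output the element once and advance in both lists.
        if a[i] == b[j]:
            r.append(a[i])
            i += 1
            j += 1
    return r
```

An element that occurs several times in both lists is output as often as the minimum of its two occurrence counts.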
Ok, that's what some of you suggested. So let's do that quickly. Actually, maybe I shouldn't delete this. We do this: if a is not smaller, now we know b is equal or larger, so let's just do this, then we don't have to do this. And otherwise we know that we are in the equal case, right? So either a is smaller, then we increase i, or b is smaller, then we increase j, or neither this nor that, then it's the equal case and then we do this. Is this correct? I mean that's now a little less clumsy, I think that's what you were getting at. Let's maybe run the doctest. We have to make sure that we write two. If we write one here, then in the doctest for two we just check one again. OK, let's see the difference in timing. One, two. Oh, OK, surprise. So what did we do wrong? I'm surprised. Actually I was expecting it to be faster, but maybe you see a problem and we just continue. Say it again. So where does it break? So what do you suggest to make it faster? Yeah, but it's interesting, right? You all said it's clumsy, we made a change, and now it's slower. Maybe we could change the ands to ors and ask the other way around. No, but with or we are having more iterations. Well, we could try that, but let's maybe start with this variant: one idea was that this might be true a number of times in a row, especially if you have lists of different sizes, I think that's what I heard. Maybe we can also try your idea later. So let's do a while loop here, something like this: while we are smaller in this list, let's just proceed in this list. But if we do that, then I think we have to introduce this again, right? And then we have to break, and then we have to do the same thing here.
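The if/elif/else variant discussed above can be sketched as follows. Again this is a reconstruction under assumptions, not the lecture's code; only the name `intersect2` is taken from the lecture.

```python
def intersect2(a, b):
    """Zipper intersection of two sorted lists using if / elif / else."""
    i, j = 0, 0
    r = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1          # a is smaller: advance in a
        elif b[j] < a[i]:
            j += 1          # b is smaller: advance in b
        else:
            r.append(a[i])  # neither is smaller: the elements are equal
            i += 1
            j += 1
    return r
```

Because exactly one branch runs per iteration, the loop condition alone guards against running off the end of either list.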
So basically we are back to the first one, except that we now have a while loop here. If j be and then, but now we again need this condition here. Absolutely right, we also need, so the code becomes more complex now, it probably will be smaller right? Because now we are going in this list more than once, so we also have to check, thank you very much for this. So we have to len p and this, we go here and then we might be at the end, we should step, stop. Now we know we are at the end of neither and we are here and now this is intersect variant with while loops to benefit from the case where we can skip larger segments in one list. Ok, let's try that. Let's first do the doc tests. Where was my doc tests? Here we are, ok, it still works. Let's do it again, 1, 200, this was, now it's faster, ok, so lesson, important lesson, you just wait and run it again, then it's faster, probably something was happening on the machine, still faster, interesting, quite a bit faster, 3 times faster, and also interesting here, we made the code more complicated, right, this was much simpler, now we made it more complicated again, but actually faster. Okay, and interesting. Let's just, we could try out a lot, but maybe first, so I hear you talking, would be nice, either you have to talk very...so, avoid...let me do one very simple thing, maybe in a language like C++ you won't need it, but let me just...I'm checking a lot against the length of A here and I'm not sure whether Python will figure out that A actually does not change at all. I mean a more strict language like Java, C++ and so on will know this, maybe it says here this is const, it will not change and it knows that these len A are always the same number. Python might not be able to figure this out. So I think that's a very reasonable thing to try, to just give this a name here and then we just... What do you think? Can Python figure this out or not? Will it make a difference? What's your guess? 
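The variant with inner while loops, which skips a whole run of smaller elements in one go, might look like this. It is a sketch reconstructed from the description; `intersect3` is an assumed name for it.

```python
def intersect3(a, b):
    """Zipper intersection with inner while loops to skip runs quickly."""
    i, j = 0, 0
    r = []
    while i < len(a) and j < len(b):
        # Skip all elements of a that are smaller than the current b[j].
        while i < len(a) and a[i] < b[j]:
            i += 1
        if i == len(a):
            break
        # Skip all elements of b that are smaller than the current a[i].
        while j < len(b) and b[j] < a[i]:
            j += 1
        if j == len(b):
            break
        # Here we are at the end of neither list.
        if a[i] == b[j]:
            r.append(a[i])
            i += 1
            j += 1
    return r
```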
Yeah, it can look it up, exactly, so that's also a question. Maybe it can't figure it out, but it's so fast anyway that it doesn't matter. Yes, that's very important, otherwise I'm not actually checking this. So: avoid repeated evaluation of len a and len b, which don't change. So another super simple change. Let's first check whether what we did is correct; yes, seems to be correct. Ok, let's do it again: that was our first one, 200 milliseconds, then 120, we are down, ok, a little bit of variation, still faster. Ok. Now here is a fifth, fancy idea, and there are lots of ideas to have. Let's try it, and that's it for trying, then I will try some other variations. What's one other annoying thing? I mean we have dealt with this in all the variations here: you always have to check, am I at the end of one list, am I at the end of the other list. You have to do it here in the while loop, you have to do it here before we proceed, you have to do it again and again, you have to do it four times. One simple trick to avoid this is to put special elements at the end. So: avoid the repeated check whether we are at the end of one of the lists by adding a sentinel. That's called a sentinel, so someone who's paying attention. Let's just append infinity to the end of both lists. What the sentinel is depends on the application and on the algorithm, it doesn't have to be infinity; it's just a special element which you insert, for example at the end, to avoid these checks. And here we should change this to 5. Actually, let's just keep it for this while loop. And now what I'm doing, I'm just appending infinity to a here, and to b. So what do I have now? At the end of both lists I now have infinity, but after the last element, so I will not actually go there. And now I am claiming that I can do this.
So I don't have to check it in this while loop, because eventually, so I am still, I am in this loop which means none of the two lists have reached infinity which is after the last element. And if one of the lists runs out, then one of these elements will be infinity and the other will not be, which means this loop will break, will stop eventually without overflowing because there is infinity at the end and I think I can do the same thing here. We still need this because after the while loop now we may be at the end of one of the lists and so we should but we don't have to do it in the while loop so it's rather small change. Now we have modified and I think now we have to remove it again. So it's a bit ugly because we are now in other languages where you would say, OK, A and B do not change, they are const actually, now they change. But just a little bit. So just appending inf, again the code becomes more complicated, but in a way that doesn't cost runtime, appending an element once and then removing it again. Is it correct? What do you think? Let's prove correctness for running our doc test. So will it make a difference? OK, this was wrong. That was our first one, 200 milliseconds, our second variant not quite twice faster, the third one three times faster, the fourth one four times faster, still a little faster. It helped, it helped. So we tried a lot of different things, more things to try, small things didn't change the basic zipper, and we are almost five times faster than what we started with, just in Python. And here are some things on the slide, I think you can read it yourself. Now of course the question is why was it faster? Of course if you, yeah I said it while we were trying these things, but it's of course one thing to have a hypothesis, the other thing is whether that's really why it's faster. You would have to look at the actual machine code that's executed for understanding what's exactly the reason. 
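A sketch of the sentinel variant, combined with the cached list lengths from the previous step, is given below. It assumes the lists contain only finite numbers, so that `math.inf` compares larger than every real element; the name `intersect5` and the use of `try`/`finally` to remove the sentinels again are assumptions.

```python
import math


def intersect5(a, b):
    """Zipper intersection with sentinels at the end of both lists."""
    n_a, n_b = len(a), len(b)  # evaluate the lengths only once
    a.append(math.inf)         # sentinel: larger than any real element
    b.append(math.inf)
    r = []
    try:
        i, j = 0, 0
        while i < n_a and j < n_b:
            # No end check needed: the sentinel stops the inner loops.
            while a[i] < b[j]:
                i += 1
            if i == n_a:
                break
            while b[j] < a[i]:
                j += 1
            if j == n_b:
                break
            if a[i] == b[j]:
                r.append(a[i])
                i += 1
                j += 1
    finally:
        a.pop()                # remove the sentinels again
        b.pop()
    return r
```

The two lists are modified temporarily, which is why the sentinels are removed before returning.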
But I think it's fair to assume that we removed code here, so in some cases less code was running, at least in the parts that are repeated. Next: variations in how lists are stored. So Python has built-in lists which can hold elements of any type, so it's a pretty wasteful data structure, and it's not even necessarily an array, because you can insert something in the middle. Actually I don't know 100% how lists are implemented. I'm assuming, I'm pretty sure, that Python has an optimization: if you have a list which can hold anything, but it actually contains only integers and you never add something, then it stores it in a more efficient fashion. I'm pretty sure it does that. But Python also has array, and I've implemented this already, so I will just run it for you. It's quite simple: you import array, and if you look at this line, I'm just creating an array from my list and I'm saying it's an array of integers. So I'm just telling Python: look, I'm only dealing with integers, please store them more compactly. And there's also NumPy, which everybody uses for all kinds of linear algebra. Let's just try it. Let's maybe not try all variations but just the first one. Ok, so this was the original one and this is the array one: if we use the more efficient data structure, it's slower. Let's also try NumPy: even slower. Ok, so you use array, the efficient built-in one, and it's slower; you use NumPy, and it's even slower. Interesting, right? So why is that? I think it's because list is optimized, it does something special when everything is of the same type, but array shouldn't be slower, so that's a bit surprising.
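Creating a typed array from a list with the standard `array` module could look like the following sketch. The type code `"q"` (signed 64-bit integer) is an assumption; the lecture only says that an array of integers was used.

```python
from array import array

a_list = [1, 3, 5, 7]

# Tell Python that all elements are integers so that they are stored
# compactly. The type code "q" is an assumption for this sketch.
a_arr = array("q", a_list)

# Indexing and len work as for a list, so the same intersection code
# can run on it unchanged.
third = a_arr[2]
size = len(a_arr)
```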
And NumPy is really made for linear algebra and special operations: you have an array, give me the mean of all elements, or even do matrix multiplication on 2D arrays. These operations are implemented in C and are very fast. But if you use a NumPy array and then just loop over it like we do here, that's a bad idea. You shouldn't do that. But it's something you have to know, right? So that's why I'm showing it. You might think: oh, NumPy is cool, everybody is using it, I'm also using it to get some performance benefit. No, you are getting a performance penalty. Let's use PyPy. Who knows PyPy in the room? PyPy is just a drop-in replacement for Python. Let's just do it. So here is our original program, and now I just use PyPy: instead of writing python3, I'm writing pypy3. A little bit of history: there are all kinds of projects, I think a dozen at least, which try to make Python faster. So there is Cython, where you can write C code in your Python program, and which tries to compile Python to C and then run the C code. There are ways to connect Python to C, integrate C in Python, integrate Python in C, with all kinds of native interfaces between the two. It's quite messy, you always have to interface between two languages, writing one part in one language and one in the other. PyPy is just: you install it, you write pypy instead of python, and then you hope that it's faster. And it's quite a bit faster. It's quite amazing. How does it do that? And I get funny effects here, which already give a hint at what it does. And now the array is also fast when I run it with PyPy, interesting. One could talk a lot about PyPy. It does just-in-time compilation: it does not compile the whole thing and then run it, but while it's running, it sees, okay, let me compile this loop body into machine code, and then it runs that.
So that's why we have such effects. When it ran the loop body for the first time, it did some compilation, and when running it the second time, it just uses the already compiled body. And the great thing is, and I wonder why not everybody always tries PyPy: you have code, you just run it with PyPy, and if it works, maybe it's much faster. It doesn't always work. It's a big project, of all the projects I just named maybe the biggest one, and their goal is to be almost fully compatible, which is very hard because Python has so many features. So some features in the latest Python versions don't work, but a lot actually works. For example Django, who knows Django, this web framework written in Python? Some of you. Django you can just run with PyPy, it works. So even big frameworks which use a lot of features and a lot of code work with PyPy. PyPy is written in RPython, restricted Python, and the code in which PyPy is written is interpreted by code written in Python, and that is compiled to C. Let me look at the PyPy logo: yeah, that's the PyPy logo, it's a snake eating itself, because it's kind of recursive: written in RPython, and the interpreter for the language in which PyPy is written is itself interpreted by Python. And the improvement is enormous, right, it's a factor of almost 100 just by, yeah, and you might not even notice. So now let's do the real thing, let's do C++. Why is C++ still a great language? In case you doubt that, we use a lot of C++ at our chair. I hear you talking. C++ allows programming in a higher-level language, with type safety and nice abstractions, but the goal of C++ is to do it in a way, unlike Java, that you don't lose any performance, zero performance, compared to writing code in C or even a lower-level language. And it does that pretty well, so you can always write a very efficient C++ program.
C++ syntax and concepts are a bit weird at times, for the simple reason that if something is around for a very long time and you have an enormous amount of software written in C++, you have to be backwards compatible, not fully but at least in important respects. Which means if you made a mistake in your language design 20 years ago, you are kind of stuck with it; you still have to support this stupid construct where you now realize, oh, I should have done this differently 20 years ago. That's always the problem with languages which are around for a long time. C++ still does a pretty good job. You have this slightly clumsy syntax at times; newer languages like Rust try to do that better, but of course they have the problem that they don't have this rich ecosystem. But, and I've written this for you so we can just look at it together, for most code it's actually not bad. This is the code in C++, I mean it looks very similar, right? Now you have parentheses, unlike Python, and you don't have a colon but curly braces for the loop body. Otherwise, this is the if condition, it looks exactly the same. Let's just try it, and then we are done with this part and we make a break. I think first we have to compile it. Let's compile it with nothing special here, so just standard optimization, and the executable binary will be called intersect timing. Let's compile it and now let's run it, so instead of calling an interpreter we just run it. Ok, it's too easy for C++, we have to go a little bit higher. Let's maybe start with the Python program, python3, and let's just add a zero here. Ok, now it's intersecting 10,000 with 10 million with our slowest algorithm. Okay, it takes two seconds, and now let's do the same thing with... So what do you think? And now let's take variant 5 with the sentinels: 2 milliseconds. It's because the random number generator is different.
And also note, it's a different random number generator. I mean it's very hard to get the same random numbers for both lists, I would have had to do some special magic. This is just for consistency between the methods. Let's also compare it with PyPy again. And there are lots of interesting things here, so if you're interested in this, look at it again and try to savour all the little details. It's also pretty fast, let's maybe also look at it. Do you notice how PyPy takes a pretty long time in the beginning? Even though it's much faster when doing the actual intersection, this beginning part is where it computes the random numbers, a lot of them, 10 million of them, and it's not fast with that. So PyPy does not make everything faster. And the C++ code was actually also much faster in this initial part. So PyPy is not a secret sauce for everything. So we see PyPy gets you a long way, but C++ is still faster by a factor of 5. And actually you were right, maybe you were already peeking at the slides: if we look at our slowest Python program, which wasn't so dumb, two seconds, and our fastest C++ program, 2 milliseconds, it's a factor of 1000, and I think this is a really important message. This is definitely happening a lot: you have web frameworks out there which people use even for a lot of data, our campus management system I guess is an example of that, and they are doing things 1000 times slower, or even slower than that, than would be possible with very simple changes. We are not even changing the algorithm here, right? This is all basic zipper; the following part will be about more sophisticated stuff. So one thousand times faster with a simple algorithm, just algorithm engineering. I think that's super interesting. So there is huge potential for performance gains, even if you don't change the algorithm, but you really have to understand what you are doing.
There were so many effects here: why was this slower, why was it slower the first time? There is a lot you need to understand about the machine, about the programming language, about how the language maps to machine code, and so on. So if you work with us, this is something we deeply care about, because we really want fast programs. That was just a glimpse, and with that I think we go into a five-minute break now, see you in five minutes, and then we continue with a better algorithm. So that was about engineering a basic algorithm, with differences of a factor of a thousand. Now we talk about other and better algorithms. So the rest of the lecture is more theoretical, and for a change the exercise sheet will also be theoretical. I will talk more about that in a second. So just some preliminaries, just terminology. We always have two lists, we are just looking at the two-list case. And let's always call the smaller list A and the larger list B. They are both sorted. Let's call the number of elements in the smaller list k and the other n. So k is always less than or equal to n. Let's maybe write that here. Of course they could have equal size. Intersection is commutative, it doesn't matter whether it's A intersect B or B intersect A, so we can always just swap so that A is the smaller list. And if they have equal size it does not matter which one is which. We denote the elements like so, in array notation, one-based because that is just simpler or clearer for explanation. So these are the elements of A, and for the following always assume it has few elements. When I draw a picture I will just draw it like that: in A there are maybe four or five elements, and then we have B, this huge list. And they are denoted like this. And both lists are sorted, because otherwise you can't do things very efficiently. But remember where they come from, inverted lists: they are sorted in the precomputation.
So we actually have these sorted lists. So the general idea of all the following algorithms, now we will have a progression of more and more complicated, not too complicated algorithm. You have the small list and you try to, let me maybe draw a picture here and already show something. So I have my list A here and it's just some elements, maybe three of them. So this here is my A of one and this is my A of two. And by the way, in case you are noting that everything goes very smoothly here with the writing Frank who is not here right now he did a lot of work to, he changed the whole setup basically. It's a new machine because it turned out that the old machine which wasn't so old was just too slow for running Zoom, Camtasia and PowerPoint and stuff so this apparently took quite a heavy toll on the machine and now it's just a new machine with more power, now we don't have these lagging problems. So what you want is, so you have this, and here you have the larger list, B, which I maybe just draw as a line because it's so many elements. Now what I want to find out, and both are sorted, I want to find out where does this fit in the list B. So here this would be index J1 and what it exactly means is written here. So this element A1 somewhere fits in the sorted order of B between two elements of B and the place where it fits just has a particular index J1. And of course the position of where they fit will be in ascending order because both lists are sorted. So this might fit something here, here, so the second element of A might fit in here and the third element of A might fit in here. So that's basically the picture you should have. 
What we are trying to do is locate the elements of A in B. When we have located them, we know that the element in B just preceding this position is smaller, and the one afterwards is larger or maybe equal. And now if we want to intersect, we can just check: is it equal to the one before or after, and if yes, we output it. The details are not important now, you can read them later on the slide; they are not important for what I am going to explain. I just say it again: we want to see where the, in this case three, elements of A fit into B, and if you know where they fit in the sorted order, then it's easy to check whether they are actually equal to this element of B or not, so to compute the intersection. Is there any question about this? Except about the details which I glossed over, which are also not important; that's the basic idea. Okay, so the first algorithm: you just do binary search. Let me draw the picture again, I won't draw it too many times, but maybe this one time. So I have my list A with three elements and I have my list B with a lot of elements, so here I have k, in this case three, and here I have n, some large number, maybe a million. And now I just want to see where this first one is in B, and I just do a binary search on B. How expensive is a binary search for one element on B? Log n. And I have to do it three times, so k times, so the running time is k times log n. It's a very simple algorithm, and not too bad actually if k is small: if A is just one element, it's log n; three elements, three times log n, still log n. So it's great for small k. If k becomes large, if k is on the order of n, then this becomes n log n eventually, right? So even if it's n half or n over four or something on the order of n, this is not a good algorithm. It's actually slower than zipper. But it's a good algorithm for small k. So let's elaborate on this idea. One obvious idea is the following, it's pretty obvious.
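The "k binary searches" idea can be sketched with Python's standard `bisect` module. The function name is hypothetical, and the sketch assumes that the elements within each list are distinct, which sidesteps the question of duplicates.

```python
from bisect import bisect_left


def intersect_binary_search(a, b):
    """Intersect sorted lists a (small) and b (large) in O(k log n) time."""
    r = []
    for x in a:
        # Locate x in b: first position j with b[j] >= x.
        j = bisect_left(b, x)
        # x is in the intersection if the element at that position equals it.
        if j < len(b) and b[j] == x:
            r.append(x)
    return r
```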
If you think about it again, three elements, let's just do three for the sake of example. So our k is equal to three and our B is some long list. Now let's assume we have already located this one here. So we know it's here. So this is the first element, this is the second element. Since these lists are sorted, it suffices to binary search in this range: binary search A2 in this range. Right, we don't have to search before A1 anymore: the element before A1 was smaller, and everything before that will be even smaller. I guess this is clear, right? I mean you don't have to binary search the whole of B again, and of course it's good if at some point we find an element rather at the back of B, then the remaining range will be smaller and smaller. So it's a small optimization of the algorithm. What's the time complexity in the best case? We have k elements and I want something with k and n. Think about it for a little bit. Yeah, that's true, we need even just k, that's the very best case, that's true, it's not even the one I had in mind. Yeah, somehow by luck the binary search just hits the element immediately. Ah, okay, but then you are in the middle and then you have to search the others in a, yeah, okay. Actually it's not so easy to see what the very best case is. The somewhat best case, I would argue, is when for the first element you do a full binary search once and then you are already at the end of B. I mean that's also your case, I'm not sure, because if the first binary search is successful immediately, then you are still in the middle of B and you still have to search the second half for the second element.
But you could argue that just after a few halvings, so maybe it's here in this end. So it's not that easy to say what exactly the very best case is, but some quite good case is certainly when with one binary search, which takes log n, you are already at the end. Maybe we can just draw that here: my B is like so, and, it's just to get intuition anyway, let's say A1 is here, A1 is already at the end and nothing comes afterwards. Then I don't have to search anything for A2, A3 anymore, because I know that they won't occur in B. So it's basically one binary search. But maybe it could be even faster. The point is, in the best case this is very fast. So what's the worst-case complexity? This is supposed to be an improvement of the previous one, which does k binary searches. What's the worst case, yes? k log n, that's true. When does it happen? Exactly, if all of them occur at the beginning of the list. Then it doesn't help you much that you search in the remainder, because A1 and then A2 and A3 are all here, and the remainder of the list is always almost the whole list. So this is k log n, like in the one with k binary searches. Then there is kind of the typical case, and the typical case looks like this. Let me maybe draw it here. In the typical case, typical also in quotes here, they are evenly distributed. Let's even say they are perfectly evenly distributed. So we have this here, and this here, and this here. And what's the running time then, if they are perfectly evenly distributed? Is it better or not, what do you think? I mean we could write it down, it's actually not so easy, right?
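Restricting each binary search to the remainder of B can be sketched by passing the previous position as the lower bound to `bisect_left`. The function name is hypothetical; the early exit corresponds to the good case described above, where an element is located at the very end of B.

```python
from bisect import bisect_left


def intersect_binary_search_remainder(a, b):
    """Like k binary searches, but each search starts where the last ended."""
    r = []
    lo = 0  # everything in b before position lo is smaller than the current x
    for x in a:
        # Search only in b[lo:], since a is sorted as well.
        lo = bisect_left(b, x, lo)
        if lo == len(b):
            break  # x and all later elements of a are larger than all of b
        if b[lo] == x:
            r.append(x)
    return r
```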
Let's start: the first one is log n. Now the second one is log of n minus n over k, so it's searching in a little bit less, and so on. But the first k over 2 of these are still searching in a range of more than half, right? If you just take the first half of A, they are still searching in more than n over two of B. It's a bit sloppy formulation. So yes, it becomes smaller and smaller, but for the first half you still search in at least half, so it's at least k over 2 times log of n over 2, and that's still on the order of k times log n. So asymptotically it's actually not better. And that's quite typical: if you have something with log, it doesn't really help if it becomes smaller, but not rapidly smaller. So even in the typical case... Do you agree or do you disagree? Yeah? Okay, so the question is how would you argue that the latter is better if the asymptotic complexity is the same? Yeah, but that's this fundamental point that asymptotic complexity ignores constant factors, right? If you take the first part of the lecture, it was always the same linear-time algorithm. Everything we saw was theta of n, just the size of the two lists, but there were runtime differences of a factor of 1000. So in practice this of course matters a lot; it's just two worlds. There are the asymptotic differences, where an algorithm can be asymptotically better, which is worth something, and then there are these hidden constant factors, which you ignore in the asymptotic analysis, but which are also very important in practice. So it's just two aspects and both are important. And now we are in the theoretical part, we are trying to make it asymptotically better. So in practice this would be better, but asymptotically it's not better. Okay, so now let's try to make it asymptotically better.
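The argument for the perfectly evenly distributed case can be written out as follows. This is a sketch under simplifying assumptions: k is even, logarithms are base 2, rounding is ignored, and the i-th binary search runs on a range of size n - (i-1)n/k.

```latex
\begin{align*}
T(n,k) &= \sum_{i=1}^{k} \log_2\!\Big(n - (i-1)\tfrac{n}{k}\Big)
        \;\le\; k \log_2 n, \\
T(n,k) &\ge \sum_{i=1}^{k/2} \log_2\!\Big(n - (i-1)\tfrac{n}{k}\Big)
        \;\ge\; \tfrac{k}{2}\,\log_2\tfrac{n}{2}
        \;=\; \tfrac{k}{2}\,\big(\log_2 n - 1\big).
\end{align*}
```

For i at most k/2 the search range still has more than n/2 elements, which gives the second line. Together the two bounds give T(n,k) = \Theta(k \log n), the same order as k independent binary searches.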
Let's try to do that and here is an algorithm and that will be the algorithm which you analyze for the exercise sheet. And it's doing something called galloping search. And it's a sequence of exponential and binary search. And let me just show it to you, here it is on the slide. And I will not explain it to you in complete detail because that is anyway something you only understand by yourself when you do it for yourself. But I will show you the basic idea which is actually very simple. So let's again assume we have, doesn't really matter how many, three elements here and now we want to locate them in the large list and now, so I need a little more space here, here is my large list and let's assume I have already located my, let me draw the picture, I don't know, maybe I have already located my A1 here, it's here, J1, and now I know that my J2 is somewhere in the remaining part. So A2 will now be somewhere to the right of A1. So this is where A1 is located and I want to find it but I don't want to do a binary search in the whole part. So I first do an exponential search. So what do I do? First, let me draw it like this. So this here is now position j1 plus 1. And this is position, this is a jump of 2, so this is now position j1 plus 2. And this is now 4, so hence exponential, so this is now position J1 plus 4 and this is now not exponential, but let's just ignore this for now, J1 plus 8 maybe let's go a little further here. Yeah, so I'm just probing B at positions which increase exponentially. So here it's written formally, so I'm starting from the position of the previous element I've located. I know my next element will be located on the right and now I'm checking is it located one to the right, is it located two to the right, four to the right, eight to the right. And when do I stop? I stop when, and that's written here, so my, and I'm locating A2 right, I'm trying to find A2. Actually I was using orange, yeah orange and this does not work at all. 
So my question is where is A2 located here and so what I am probing at every position here, so here I find maybe that B, let me write it like this, that B at position J1 plus 1, I should write a first maybe, so I'm comparing the element a2 and so let's say a2 is still smaller than b at this first position. And I'm trying to find the first position where it's actually larger. So here a2 is also smaller than this one, b at position j1 plus 2. And it's also still smaller here, a2 is still, am I doing it the right way around? No, this is where, what do I want to find? Do I want to find the first element where B is, Is it wrong here or is it wrong here? I think it's j1 plus 4, so if I have, so I am going the elements of B are becoming larger and larger, right, so at some point my A is staying the same, so it should be, I did it wrong here. It's trivial but it's easy to get confused about it when you're sitting right in front of it. So it's a, let's just for the sake of example, just, it's always good to have a little example. Just a second, let's just assume we have 70 here, we have 82 here and 183 here, right? And now we are searching 82, so here it's element 17 and now we are seeing here, we are seeing maybe 21, 35, 49, and here we are seeing 115 in B B right? So at some point, so my element in A, my 82 is larger. Just a second, I'm coming back to here, it's larger here, it's larger here, it's larger here, and at some point it will be actually smaller. And yes, there was a question, please, or comment. Yes, that's the way I... Yeah, it's actually, you can do it both ways, but it's inconsistent with what I wrote above. It doesn't matter which way you do it, the analysis works either way, because the sum of the powers of two is also essentially a power of two, but you are completely right. So I, yeah. Just a second, so I'm just wondering how I'm... Let me just draw it differently. I mean it should be consistent with what I did above. 
So let me just draw it like this and then like this. And then like this, and then like this, and then like this, and like this. What was your comment? So like if the A is greater than B, then we move on, but if it is equal, then we stop. I agree, you are completely right. These are also important details, also for your proof. If it's greater, if it's equal, I can already stop. So as long as it's greater, I still go on. So I'm searching for 82 and I'm finding all these elements in B which are still smaller, smaller and then some point here I find. And now it's also consistent with what I wrote above. So for example, let me just write examples here, so I'm searching for the 82, right? 82, and maybe here I'm finding, and here I have number 17 at position j1, maybe here I'm finding 21, here I'm finding 41, maybe here I'm finding 45, and here I'm finding 129 or something like this. So here for the first time I'm actually larger in B than the element I'm searching. And what do I know now? Now you can binary search in this range. Making mistakes, it's also good to understand something. Anyway, I think the idea becomes clear, so you're just probing at these selected positions which increase, which are at an increasing exponentially increasing distance from the location of your previous element. At some point you find, ok, I know that my element I am looking for is to the right of this, but it is to the left of this, and in this last range, and of course there is this border case where you are already passed B now then you are just searching the rest of the list and now you are just doing a binary search here. So that's the basic algorithm and now you are doing that for A2 and when you have located A2 you do the same for A3. Now you start from the position A2, you do the same thing, you jump in these exponentially increasing steps. So that's the basic algorithm. Is there any question about the basic algorithm? 
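The basic idea just described can be sketched in a few lines of Python. This is only a minimal sketch under some assumptions: both lists are sorted and contain no duplicates, and the names gallop_locate and intersect_galloping are made up for this illustration, they are not from the slides. The binary search phase uses bisect_left from the Python standard library.

```python
from bisect import bisect_left


def gallop_locate(b, x, start):
    """Return the smallest index j >= start with b[j] >= x, or len(b) if none.

    Hypothetical helper: exponential search from `start`, then binary search
    in the last range that was jumped over.
    """
    n = len(b)
    if start >= n:
        return n
    if b[start] >= x:
        return start
    # Exponential phase: probe start+1, start+2, start+4, ... as long as
    # the probed element of b is still smaller than x.
    lo = start  # invariant: b[lo] < x
    step = 1
    while start + step < n and b[start + step] < x:
        lo = start + step
        step *= 2
    hi = min(start + step, n)  # either b[hi] >= x, or hi is past the end of b
    # Binary phase: the answer lies in the index range lo+1 .. hi.
    return bisect_left(b, x, lo + 1, hi)


def intersect_galloping(a, b):
    """Intersect a short sorted list a with a long sorted list b."""
    result = []
    j = 0
    for x in a:
        j = gallop_locate(b, x, j)  # continue from where the previous element was located
        if j == len(b):
            break  # all remaining elements of a are larger than everything in b
        if b[j] == x:
            result.append(x)
    return result


if __name__ == "__main__":
    a = [17, 82, 183]
    b = [3, 17, 21, 41, 45, 82, 129, 183, 200]
    print(intersect_galloping(a, b))
```

Note that each search starts from the position found for the previous element, so nothing to the left of it is ever looked at again.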
And all the details you have to look at them yourself anyway because you have to prove something about this. And I haven't said it, but I hope it became clear by the, just by talking about it, I mean what's the point of all this? The point is that you want to avoid searching the remainder, the whole remainder of the list. Let's assume that we would have been successful here already at J1 plus 4, right? Then I just have to search this range and I don't have to search everything that's to the right of this. And I'm overshooting the ideal position a little bit, of course ideally I want to find the right position right away, so I'm overshooting it just a little bit, but I'm overshooting it just by a factor of two at most. And this is something you can prove in the analysis. So that's the idea. Yes? So there are doubts whether this is asymptotically faster and it's good that you have these doubts because the proof is not at all trivial that it's asymptotically faster. The question is, is this faster? We have seen these progressions on the previous slide, where we tried things and they were not faster, so maybe this is also not faster. So you're completely right to ask this question, but it's a very hard question to answer and it's not easy to answer it in any direction. And we will see it on the following slides, so it's actually quite complicated, which is what the rest of the lecture will be about. So it's a very reasonable question to ask, is this even faster asymptotically. So this is one claim so what we will define, we will do the following, let me again draw this little picture here, so we have the elements of B. And now we have the positions in B where the elements of A are located. So here we have, let's say J1, here is J2 and here is J3. These can be in any position, that's important now, we are not dealing with typical worst best case, they can all be at the front, they can be evenly distributed all at the end, any position. 
And we are defining these d_i, and this is just the gap to where the first element is located, d1, this is the gap from the first to the second, d2, and this is the gap from the second to the third, d3. So these are crucial numbers for the following analysis. And what I'm claiming, and you have to prove it for the exercise sheet, but maybe you can get an intuition now, is the following. Let me just go to the pointer again. Let's say I'm starting from here and I'm trying to find j3. I don't know where it is, could be very close, could be very far, and what I claim is that the previous algorithm finds this in time log of this d3. If d3 is zero, then you don't have to do anything, right? So I'm assuming that the d_i are not zero here. So finding this from here takes log d_i. And why is that reasonable? Let's just say that you do a binary search in exactly this interval of length d3, that would take log d3, right? Let's just say you know that this third element is somewhere in this range, happens to be exactly at the right end of it, then this would take log d3. Now assume you're searching in a range that's just twice as large, so that's log of 2 times d3, and that's just log d3 plus 1, which is also Theta of log d3, right? Let's just take the log to the base 2, in asymptotic notation the base doesn't matter because it's just a constant factor difference. So even if you are searching in a range that's twice as large as this d3, then it's still log of the size of this range. And this is actually what the exponential search does: the real location of the element I'm looking for is somewhere in here, and this algorithm will overshoot it by at most a factor of two. So actually that's why you will be able to prove this.
And that's something you should do for the exercise sheet. I could now do it all here, but experience shows that if I also work out all the details for you, then you just don't understand it. You have to understand the basic idea, I think that became clear now, and now you have to try to prove this yourself, guided by the intuition I just gave. Was there another question? And of course you need to sit down and think a little about this. Yeah? The y or the d_i you mean, yeah? Of n, so the question is does this d3 depend on the whole of n? So the d3 is just a fixed number which depends on your input. The j1, j2, j3 in my pictures are what you actually want in the end, but you don't know them yet. So for a fixed input, j1 may be 17, which means the first element is located at the 17th position of B, j2 might be 85 and j3 might be 215. And then these d's are just the differences between them. So these are actual numbers determined by your input, but you don't know them. And this will be a topic on the next slide. Which means your running time depends on some numbers which depend on the input, but you don't know what they are. But still you can say the running time is log of some number, you don't know what it is, but it's a well-defined number. So that's a reasonable question. But we will see you can say something about this d1, d2, d3. Namely, what's the total time complexity? For every one of these searches you have log of d_i, and you have to do something constant for each element also, just the setup of the search, so you have k plus the sum of these log d_i. And that's what I just explained, I mean it's not zero: even if the d_i is one, you have to do something constant for each element, that's why you need the k plus here. And now this addresses exactly your question, this is an unsatisfactory running time, right?
It's k plus something depending on some numbers, and I don't know what they are. These are numbers which are well defined, but I don't know them, and in particular, is this good or bad? I don't know, right? It's k plus the sum of log of these d_i, is this now better or worse than what we have seen before? We don't know. And you can prove, and you will also do the proof for the exercise sheet, using a technique which I will explain in the last part of the lecture, that this sum of the log d_i is maximized, it's the largest, which means this algorithm has the worst running time, if the elements are perfectly evenly distributed over B, which means all the d_i are n over k. So that's the worst case. Which means this running time is certainly at most k times log of n over k, and actually you need the one plus here, because otherwise this log of zero is not defined. So that's just a technical detail: one plus, because log 0 is not defined. There are all these little details here, but you really have to work this out yourself in the proof. The whole exercise sheet is just about understanding the algorithm and proving this yourself. These are not long proofs, but you have to understand something. So here we have this running time depending on these d_i, and I'm claiming that the algorithm is at its worst when they are evenly distributed, and then I get this running time, and that's actually not a bad running time. That's now k times log of something which is smaller than n, right? Before we had k times log n, now it's k times log n over k.
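To get a feeling for these numbers before doing the proof, one can simply plug in some gaps. The following is only a numerical illustration, not a proof: it assumes the cost model k plus the sum of log2(1 + d_i), with the one plus inside the log as mentioned above, and the function names and the two example gap sequences are made up for this sketch.

```python
import math


def galloping_cost(gaps):
    """k + sum of log2(1 + d_i): assumed cost model, ignoring constant factors."""
    return len(gaps) + sum(math.log2(1 + d) for d in gaps)


def even_gap_bound(n, k):
    """k + k * log2(1 + n/k): the same cost when every gap equals n/k."""
    return k + k * math.log2(1 + n / k)


n, k = 1000, 10
even = [n // k] * k                     # all gaps equal to n/k
skewed = [1] * (k - 1) + [n - (k - 1)]  # nine tiny gaps and one huge gap, same sum

for name, gaps in [("even", even), ("skewed", skewed)]:
    print(f"{name:7s} gaps: cost = {galloping_cost(gaps):.1f}")
print(f"k + k*log2(1 + n/k) = {even_gap_bound(n, k):.1f}")
print(f"k * log2(n)         = {k * math.log2(n):.1f}")
print(f"k + n               = {k + n}")
```

Both gap sequences sum to n, and the evenly spaced one gives the larger cost, which is what the claim about the worst case says. Trying other values of n and k shows how the bound compares to k times log n and to the linear k plus n.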
And actually there's also an exercise, I think I will skip these, these are just for, I think, let's just go to the exercise sheet before I explain to you, before another break how you actually can prove this. So the exercises are split up a bit, but it's essentially the questions which were on the slide. So the first exercise is just proof that one step of this exponential binary search is exactly log of the gap, which you don't know, but it's still log of the gap, then you prove that this is maximized for when the DI, when the elements are evenly spaced, and then the last exercise is also important, it's, and I'm putting it all in the exercise, even this bound, is this even better than zipper, is this better than linear, and is it better than K log N? This is the last exercise to show this. This is actually a very good bound, it's always better than linear, which is not obvious, but you prove it here, so it's better than k plus n, and sometimes even much smaller. I mean let's just look for, let's maybe this one for node for constant k, for constant k, what is it for constant k? Then we have a constant here, and we have n divided by some constant, log n divided by constant is still log n the bound is... oh now it's very quiet the bound is O of log n so, yeah, it's actually, yeah, if you plug in k equals to one, you just have log n here, k equals to two is two times log n over two, it's also log n. So it's actually a good bound, the question is what happens if k becomes larger, this is sheaf, but that's how it is, If K becomes larger it never becomes worse than linear. This is the third exercise. So all the details of the analysis you are doing yourself, none of these are very big proofs or anything, but you have to understand what you are doing and work out the details yourself. Please also, this is a part here. 
If you have great handwriting, and this excludes most computer science students, I'm sorry to say, maybe it has changed in recent years, the climate is changing, maybe other things are changing too. I don't think so. You can also submit a high-quality scan of high-quality handwriting. But if you are in doubt whether your handwriting is high quality, then the answer is no. Then you should just typeset it using TeX, which you have to do anyway at some later point in your studies. But if you have nice handwriting or you want to practice your handwriting, you can also do that. But two things: the handwriting has to be good and the scan has to be good. The worst thing is terrible handwriting and then a terrible grey scan where you hardly see anything, your tutor will have a terrible time. He or she will probably just reject it. So yeah, the default I think for most of you is just to typeset this. Do you have a question or were you just stretching? Ok, so here is just some additional information, I will not talk about it, just for your curiosity. And the last part of the lecture, and we will have another break, and then we will go there, is about this thing here. We have this funny time complexity of log of some values which we don't know, and I claim that the worst case is when the elements are evenly distributed, and to prove something like this you need a technique called Lagrange multipliers. Who knows Lagrange multipliers, have you seen them somewhere else? A few people have seen them. It's a very important technique. We will use it again in another lecture, which is why I will explain it to you today by an example. And we will do that in the last part after another 5 minute break. So see you again in 5 minutes, fresh for a final proof. We come to the fun part. The most fun, all the parts have been fun, but this is now the greatest fun because it's mathematics, which you all love, I know it. Deep inside you, some maybe are not fully conscious of it yet.
So Lagrange multipliers, there for the following kinds of problems you might not realize that this is the problem you have for one of the tasks which I've described earlier but you will see it for the exercise sheet. It's finding the maximum value of a function under a certain constraint and maybe I should if it's too loud we have to close it but let's try it for a second to give you a hint of why this is relevant. Here we wanted to find the maximum value of this log Di and what's the constraint? The Di, we don't know what the Di are, but the constraint, they have a constraint, namely their sum is less or equal to N, right? These are the gaps between how the elements are located in D1, D2, D3, but they certainly sum to something less or equal to N. It can be less than N because the last element is not necessarily located at the end, so it's less or equal N. So you're asking yourself a question of the kind, this sum with these numbers, which are well defined but which I don't know, what's the largest possible value of this under a certain constraint on the numbers. That's exactly the kind of problem which Lagrange multipliers solve. And now here I have an equality not an inequality. I have a slide about that. So you're asking the question some function depending on several variables, so it was these gaps d1, dk, here it's just any variables, x1 to xn from the real numbers, I also have a slide, what about if they are just from a part of the real numbers, under a certain constraint which is an equality. And I will just explain this method by an example now and then you can practice it on the exercise sheet, we will have another application in the next lecture. There are some assumptions, I will go into the details here. These should be nice functions, which means when you compute the derivative, there are nice functions too. 
Partial derivatives, if you are afraid of partial derivatives, don't be afraid, it's just you take one variable and treat everything else as constant. So it's like the normal derivative. So here are some conditions, I will not go into details. And what's the algorithm? So you want to find like where is this function the largest? You are constructing a function which is called the Lagrangian after this famous mathematician, Lagrange, where you just take the f function and the constraint function and you put them together. So it's just f plus or minus some lambda times the constraint function. That's what you do, you see it in a second. And now you compute all the partial derivatives of this function. So you derive by x1, x2, xn. You set it to zero. Now you get the systems of equations and you solve it. And now you know that the maximum you're looking for is at one of these solutions. So you are basically reducing the problem to solving a system of equations, namely computing these n plus one partial derivatives, setting them to zero, finding the solutions. And then you just check at all these solutions, what's the value of F and you pick the largest one. There are all these little details, they are mentioned on the slides and you will encounter them when you do the exercise. That's like with a normal curve, if you only have one solution you know it's a local optimum, but is it a minimum or a maximum? Well if you only have one, then the simple way to check what it is, is to just compute some other point, right? And then if that other point is smaller, then you know that you have a maximum. If that other point is larger, the value, then you know you have a minimum. But it will be a maximum here and for the exercise sheet. Okay, and our best way to explain such things because it's really an algorithm is by example. So assume we have a cuboid, this will I think take 5 to 10 minutes or so, so a little over time but I think it's good to have this on the recording. 
So I have something like this, a box, let me just draw a box here, and I have a side length, this is maybe x, this is y, and this is z, so they don't have to be equal. What's the volume of the box? x times y times z. What's the surface of the box? Well, it has x, y twice here and at the bottom, so it's 2xy if I just want the surface. It has this one, this area here twice, which is yz. And it has another one, which is, is, no no this is not yz, yes it is yz and this is xz at the top. Is this correct what I did? Yes that's correct. Okay and both of these I have twice, you don't see the ones opposite, so that's the surface. It doesn't really matter, it's just so that you have something too which you can imagine. And now I'm wondering, let's just say I have some amount of material, that's a typical toy example, some amount of cloth or something, six square meters, and now I want, and I can choose y, x and z with a fixed surface and I want to maximize the volume. How do I maximize the volume? So it's now a Lagrangian task, I want to find the maximum value for x, y, z when the surface has a fixed size. And note here this little transformation that I did. So my surface, I said it's 6 square meters. Let me just drop there. So it's 2xy plus 2xz plus 2yz is equal to 6. Well I can divide it by 2xy plus xz plus yz is equal to 3 and I just can make it into something equals to 0 by just bringing the 3 over here. Because for Lagrange you need a function which is 0 at the...so my...and that's just how I define my g here, right? And I can of course always do that. That's just how I define my g x, y, z. So that way I now have it in the form of a Lagrange multiplier exercise, so I want to maximize x, y, z, the volume under this constraint, that this function g is zero, which means the surface is exactly 6 square meters. Check the assumptions, we will not do it here, you can check it yourself if you are interested. 
If I just take the derivatives, these are all very smooth functions. The partial derivative of f by x, so x is treated as the variable, y and z as constants, is yz, these are the other partial derivatives, this is continuous, fine. If I compute the partial derivatives of g, I get this function, which is never zero, at least not where the constraint is fulfilled, you can check it. This is also something you have to check for the exercise sheet. It's just that the assumptions which I have written earlier on the slide are fulfilled. They are fulfilled here. And now let's do the actual thing. This is also what you have to do, for a different function of course, for the exercise sheet. We have this Lagrangian, which is just the volume minus lambda times the side constraint, and now let's just compute the partial derivatives. So first by x, which means my x is now the variable and y and z are constants. You could imagine numbers here, like x times 7 times 5, right? And if I derive that by x it would be 7 times 5, 35. So the first term gives yz. Then minus, and now I derive lambda xy by x, so here I get lambda y. Then I have minus lambda xz, I derive by x, it's lambda z. And if I derive lambda yz by x, it's like a constant, it disappears, and the three disappears. Yes. So I get yz minus lambda y minus lambda z, and this should be zero. So I set up this system of equations. And you check, please, whether I'm making any mistakes. Now by y: this is xz minus lambda x minus lambda z. This should also be zero. Let me compute the derivative with respect to z in the same way. And then I also have to compute the derivative with respect to lambda. And if I compute the derivative with respect to lambda, this xyz disappears, there is no lambda in it. Lambda is now the variable and what remains is what's in the parentheses, which is a constant, which means I get xy plus xz and so on, and this is no coincidence, it's always like this, I get exactly the side constraint. So setting the derivative by lambda to zero just gives me the side constraint.
It's just by the way how this L was constructed. Okay, and now we try to solve the system of equations. Maybe I do it a little bit quicker. And this is now a typical mathematics exercise. How do you solve it? It's actually a non-linear system of equations, you have products of the variables, so how do I solve this? Let's just put one and two together. I hear you talking, you can ask a question if you like. You are absolutely right, but of course it's the same thing because I'm setting it to zero, but as I wrote it here it's wrong. Yes, you are completely right, so I should write it like this. And then of course I can get rid of the minus, but you are right, this is not the derivative, it's minus the derivative. Let's just take these two together, one and two, what do I have? If I just subtract them from each other, let's take two minus one, then I think I get (x - y) times z minus (x - y) times lambda, is this correct? I think it is. Yeah, if I just subtract them from each other, this lambda z disappears, it is in both of them. So if I combine these two, I get this, and what does this mean? And I mean, this is just solving a system of equations by some creative idea. When does this happen? When is (x - y) times z equal to (x - y) times lambda? It either happens when x is equal to y, right? Then it's zero on both sides, it doesn't matter what z or lambda is. Or, if they are not equal, z is equal to lambda. And I think similarly you get two more constraints of this kind, for example x is equal to lambda. And I don't think I will finish this in all details now, because you can do it as an exercise. There is only one solution here. It is a bit more fiddling around but not much more. The only solution this has is x equals y equals z. I mean, there is already a lot of equality here. If you think about it a little more, there is only one solution.
The only way these equations can all be zero is when x, y and z are all equal. There is a lot of symmetry here, so it is not too surprising. If x, y and z are all equal, then what is lambda? We can derive this here. The first equation is then y squared minus 2 lambda y equals zero, so lambda is y over 2, which will turn out to be one half. But the value of lambda is actually not important at all. It's just a variable here in the mix. What you are interested in is x, y, z. And that's the only solution. So in this case we have only one solution. There could be several solutions. So what was our f? We were computing the volume x times y times z, and our g was this, and I think we didn't really compute the values yet, but we can do it here. I mean, if they are all the same, and this constraint has to be satisfied, what are actually the values of x, y and z? This we can do because it's easy. You know that they are all equal and the constraint g is satisfied. Then what's the value of x, y and z? There is only one answer. One, yeah. Because g of x, y, z has to be zero, that's 3 x squared equals 3, so x, y and z are each equal to one. Okay, and then f of x, y, z is one. So the answer is, which is maybe not so surprising, you could probably have guessed it: if you have a fixed surface and you want to maximize the volume, then a cube is the optimal solution. And this is the way to prove it. It's actually not so easy, you need this technique when you want to maximize something under a certain constraint. So the cube is optimal if you want to maximize the volume. It makes sense, because the maximum either happens at the cube, where everything is equal, or it happens at the border cases, where it's a very long box or something, and when you have a very long box, then one of the edges goes to zero and the volume goes to zero. So maybe not so surprising. Okay, this is just, let me mention it very quickly and then we are done.
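For reference, the whole box example can be written down compactly as follows. This is a sketch that follows the conventions used above (volume minus lambda times the side constraint, surface 6, so g = xy + xz + yz - 3). The short argument for why x = y = z is the only solution was not carried out in the lecture and is filled in here as one possible way to do it.

```latex
\begin{align*}
  f(x,y,z) &= xyz, \qquad g(x,y,z) = xy + xz + yz - 3 = 0 \\
  L(x,y,z,\lambda) &= xyz - \lambda\,(xy + xz + yz - 3) \\[4pt]
  \frac{\partial L}{\partial x} &= yz - \lambda\,(y + z) = 0 \tag{1} \\
  \frac{\partial L}{\partial y} &= xz - \lambda\,(x + z) = 0 \tag{2} \\
  \frac{\partial L}{\partial z} &= xy - \lambda\,(x + y) = 0 \tag{3} \\
  \frac{\partial L}{\partial \lambda} &= -(xy + xz + yz - 3) = 0 \tag{4} \\[4pt]
  (2) - (1):\quad & (x - y)\,z - \lambda\,(x - y) = (x - y)(z - \lambda) = 0
\end{align*}
% Case z = \lambda: then (1) becomes yz - z(y + z) = -z^2 = 0, so z = \lambda = 0.
% But then (3) gives xy = 0, while (4) gives xy = 3, a contradiction. Hence x = y.
% By the same argument applied to (3) - (2): y = z.
\begin{align*}
  x = y = z = t:\quad & 3t^2 = 3 \;\Rightarrow\; t = 1 \quad (\text{taking } t > 0), \\
  & t^2 - 2\lambda t = 0 \;\Rightarrow\; \lambda = \tfrac{t}{2} = \tfrac12, \\
  & f(1,1,1) = 1 .
\end{align*}
```

Without the restriction to positive edge lengths, t = -1 also satisfies the equations, with volume -1, so it is not the maximum.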
Lagrange actually wants the side constraint to be an equality, but what if we say it's an inequality, so I'm saying maximize the volume under the constraint that the surface is at most six. And you can't directly apply Lagrange, but that's a very simple trick. You just say okay, so it's something not six, but something that's less than six, just give it a name, S, then do Lagrange with this equation, and then just look at the end, okay when does this whole thing get maximal and of course to maximize the volume you should also maximize the surface. So first you get a solution in terms of S and then you wonder when is this maximal, well it's maximal when S is 6 in this case. So yeah, the same trick also works for it, it almost works. And there is just a little detail in case you stumble across it. So in my definition I assume that all the numbers were from all the real numbers. This also includes negative values in this case. So my edge length could have been negative. I mean this I can still compute. It's not a volume I mean this I can still compute, it's not a volume then but I can still compute it. But in this case it didn't harm because the maximum occurred at a positive values. For the exercise sheet it's a little more complicated because your numbers, this di must not be negative because you have the logarithm of them and you can just ignore it. You can just ignore that everything is, that you can still do Lagrange even though you are restricting your numbers to be non-negative. So that's it for today. The exercise sheet I explained it to you. Is there any question right now? If you have trouble please start working on it early. So it's just these math tasks which are basically one bigger math task. If you have a question ask on the forum. Try to ask in a way that we can help you. Is there any question right now? 
So thank you for your patience, have fun with the sheets, see you next week.
Welcome everybody to lecture four, information retrieval in the winter semester 22/23. We've prepared another fascinating lecture for you. I hear you, thank you. I say it every time because it's true. We will talk about the third exercise sheet, which was efficient list intersection, which was a theoretical sheet. And today we will talk about compression. You will learn a lot about compression. It's again a mathematical lecture with a practical application. The sheet will again be mathematical. I will show you a lot of things, but it's nice mathematics. I will have a survey at some point, and if you want to participate you should be logged into Zoom, so maybe in the background while listening to me, if you have a device with you, just log into Zoom so that you can take part in the survey. The link is on the wiki, right? Of course you should set loudspeaker, microphone and everything to off. Here's the meeting link. It will take a while before we get there, but so you can already prepare. Okay, so experiences with the third exercise sheet. Some of you love math, maybe one third or so, some of you struggle with it, a few more struggle than love. Okay, and here are some quotes. "I really like that the sheet was more mathematical, actually I enjoy this more than coding." So just that you know that such people also exist. "It was a different experience than the last two sheets, I hope to not experience this ever again." So there was a certain spectrum. "Need to brush up on my math skills", a lot of people wrote that. And I always try to be representative, there were several comments in this vein. "No major issues", several people wrote that, "but LaTeX for the first time". You need it anyway later, so why not start now. "I noticed the hint on the", I hear you talking, "I noticed the hint on the exercise sheet was not as exciting as usual."
I don't know, did you look closely because it was in this very special handwriting font which actually took more time than usual to print on my machine so yes also it was more work than usual. We have several people who want to become teachers or study something that goes towards becoming teachers and in the spatula poly, poly because you study two subjects, apparently there is no math in the curriculum, which is, some of you pointed out a design error and if you really had no math lecture so far, then doing proofs is a little hard. Sorry for that. Handwriting, some comments. Computer scientists are used to run algorithms as time efficient as possible, writing as an algorithm. Very matter of fact. Several of you wrote something in that vein. It's not required to be a computer scientist. I don't care enough to improve it. Okay. Interesting, sexism in 2022 is still very real, I agree. Girls get peer pressured into having to write nicely, boys get shit for having girly handwriting. Interesting. Computer scientists have bad handwriting because they are not doctors. Doctors have worse handwriting. There is a Dijkstra at mesmerizing handwriting as far as I know it's true. Let me see if this link works. Edsger Dijkstra, one of the few European Turing Award winners, he made a point of writing. This is not a font, it's his handwriting and he was very proud of it. He was a very special personality, so very strong personality, very opinionated also, also a bit obsessive. So some, and there is some other guy, some Dutch guy I think also and they had kind of a strange relationship, I think he scolded him at first for his bad handwriting, then he went on to, so this is his PhD thesis, which, so, yeah we are thinking of introducing similar standards for this faculty, so I think it's a nice idea to write your PhD thesis like this, it's amazing right? 
But he didn't start out with a nice handwriting, but then Dijkstra gave him shit and then he improved, and he also had a difficult personality and there's quite a story behind it, if you want to read the story behind this guy. He eventually committed suicide unfortunately, because it was a very strange story, a very special personality, these two, Dijkstra and this other guy. So help with mathematics, now I hope many of you have logged in, then I can ask my survey. So some of you have what I like to call math phobia, and yeah, maybe some of you want help with it, maybe some of you don't, so we would like to know how we can help you, it's an ongoing topic for computer science students. So there is a question on the exercise sheet, you can also answer it anonymously if you want, but as you like. Here's one point and then I'm going to ask my survey and you have enough time to answer it. Math is easier than you think. I think a lot of it, and that's why I like to call it phobia, is about barriers in the head which somehow originated at some point in your life, and as it is with barriers, they appear very large, but I don't know, they started out maybe small and then they stay there, and if something stays for a long time, sometimes people accept it as a given, but actually it might be possible to just overcome it. And here's an example which I like to give, which I think is a great example for the point I'm trying to make here. There are many aspects but I think that's one important aspect. Assume you have mastered the four basic arithmetic operations. And I think most of you in this room have, like three plus five, eight, you have no problems with that. Four minus 7, minus 3, okay, negative numbers, a little bit more difficult, multiplying numbers, dividing numbers, and then you can put things into parentheses. So it's just 4 operations and you can put things in parentheses, so give a different priority.
And now you get this expression, and now you are the master of arithmetic expressions, of arithmetic operations. And I'm asking you, if you as the masters of arithmetic operations, does this formula frighten you? You have to solve this now. Maybe your life depends on it, or you get a million dollars if you solve it, but you have as much time as you like. Does this formula frighten you? Point is not really, right? You look at it and you say, yeah, that's work, I have to be careful not to make a mistake, I have to double check three times, but just because it's complex and you need some time doesn't mean it's hard, right? You realize oh it's just 12 minus 3 is 9, I have to multiply with 7, 63 and so on, so you just do it one after the other. And my claim is that a lot of mathematics, maybe even all of mathematics, and certainly all of mathematics and computer science is just of this kind. And I think what happens with many people is they see formulas and there is sub i and index and then maybe there is a superscript and then there is li and di and an epsilon and then you get this, wow. And it's like, but if you know the basics, like the basic operations and there are very few operations, so actually mathematics is something for lazy people, but you don't have to learn so much, then it's just as not frightening as this formula. So I think that's an important point. So if it frightens you, maybe it helps you to realize that what's behind this is just a very few things which you have to understand. Once you understand them, even complex things, you can break them down. And they're actually not so hard. It looks hard because of complexity but not... So, oh no, you are locked in? Ah, that's strange. I tried the polling and now I can't do it anymore, but I will try something else. I will try to give... That's really strange because I... Let me try to start it from this device, maybe I can start it from... Now, I can start it from here. Do you see the poll now? 
It's about mathematics. You see it? Great. Take your time. Just do it as a background job and we'll come back to that later. Okay, so the ten introductory minutes are over and we just start with our topic today. And while I do, yeah, maybe you can reserve 30% for listening and the rest for... So we are talking about compression, thank you Frank. Motivation for compression. Inverted lists can become very large. So what's the length of an inverted list? That's how we started. It's the total number of occurrences of that word in the collection, depending on how exactly you do it. A word occurs in the document, you have an entry in the inverted list. So in a web-scale collection, and here are some numbers actually about the number of websites indexed by Google. So they were proudly presented on their website in the beginning, at some point they stopped. Actually, I said that in the first lecture, it's not so much increasing anymore, it's more about quality than about quantity, but initially it was a challenge to increase this. So very long lists, so if you type a single keyword query on Google, maybe let's just do that together. Google actually gives you a hint at the length of the, no I didn't want algorithm mousse, something to eat, I want algorithm. So algorithm, 851 million results. So that's kind of the length of the inverted list Google has for algorithm. And it's interesting because when I did this two days ago it was this number, so apparently it depends on the machine and there's a lot of personalization going on. That's really interesting. Okay, here are some others. So these are some real numbers, I tried it two days ago. Also let's try and. With and I get 25,270 million. So 25,270,000,000. Let's try the word the. Okay, 25 billion, 270 million. Looks suspiciously similar, so apparently Google is not giving you the exact size there. It's probably just the number of things they have indexed divided by two or something like that.
I guess that's what they are doing, kind of using Zipf's law to approximate these numbers. Because maybe they don't have the size of the whole list or they don't want to give it to you. The point of this slide is just to get started and to tell you that these lists for real data can be really large. So you absolutely want to talk about compression. Now of course you have to store these things, it's a list of integers, so you want to save space, but the great thing about compression is that it potentially not only saves space but also time, and let me explain that to you on the next slide by an example. Let's assume the index is on a hard disk. Hard disks are these rotating disks from ancient times but you still have them for a reason. You will see, and they are still used, we still use them for our very large data projects. Here's an example. Let's assume you have such a hard disk, they are very cheap, that's why they are good and in use. It transfers, let's say, 50 megabytes per second: if you just read from it one byte after the other, you can read with 50 megabytes per second. Let's assume you have an algorithm which compresses stuff by a factor of 5, like 100 gigabytes become 20 gigabytes. Let's assume this algorithm can decompress at a speed of 33 megabytes per second, meaning if you have 33 megabytes compressed, then you can decompress it in one second. So that's the size of the compressed data. And now let's say we have a 50 megabyte inverted list, it's just ints, takes 50 megabytes, and we want to read it from this disk. So if we read it uncompressed it takes one second, right? This is our disk transfer time, it's 50 megabytes on disk, we read it, one second. Let's say it's compressed on disk, which means by a factor of five, it's just ten megabytes on disk. So instead of in one second I can read this compressed data in a fifth of the time, 0.2 seconds. And now these ten megabytes, I have to decompress them.
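To make the arithmetic of this example explicit, here is a minimal sketch using the numbers from the slide (50 MB list, 50 MB/s disk, compression factor 5, decompression at 33 MB of compressed data per second). The function name and parameters are made up for illustration; this is a back-of-the-envelope model, not a benchmark.

```python
def read_time_seconds(size_mb, disk_mb_per_s,
                      compression_factor=1.0, decompress_mb_per_s=None):
    """Estimated time to get a list from disk into usable form.

    decompress_mb_per_s is measured in megabytes of COMPRESSED input per
    second, as in the lecture example. None means no decompression step.
    """
    compressed_mb = size_mb / compression_factor
    transfer = compressed_mb / disk_mb_per_s
    if decompress_mb_per_s is None:
        return transfer
    return transfer + compressed_mb / decompress_mb_per_s


# Uncompressed: 50 MB at 50 MB/s.
uncompressed = read_time_seconds(50, 50)

# Compressed by a factor of 5: 10 MB to read, then 10 MB to decompress.
compressed = read_time_seconds(50, 50, compression_factor=5,
                               decompress_mb_per_s=33)

print(f"uncompressed: {uncompressed:.2f} s, compressed: {compressed:.2f} s")
```

With these assumed numbers the compressed path is 0.2 seconds of transfer plus 10/33 seconds of decompression, which is less than the one second needed for the uncompressed list.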
This takes another 0.3 seconds, which is not so much, also because it's compressed. So together, reading it compressed is much faster, I need a bit of time for decompressing: half a second. So it's actually faster. It's not just that I saved a factor of five on disk, it also took me only half the time to read it. And that's a major factor when you... This is often the bottleneck in search engines. We can say that from experience, getting the data from disk if you have very large data. And then you wonder, we have seen these algorithms which skip stuff, and yeah, you can also skip, I would say one can skip compressed parts without decompressing them. Maybe you have some pointers in there which tell you, this whole sequence, you don't even have to look at it, just jump over it and you don't even have to decompress it. What about solid state drives and RAM, so memory? So here the transfer rates, how fast you can read, are much larger. So for hard disk, like rotating disks, it's 50 to 100, for SSD... These are just some ranges, approximately right. For RAM it's even faster, you can read with 4 to 20 gigabytes per second. Then this trick does not work so well anymore, because, I mean, what's the typical decompression speed? Because yes, now you can read much faster if it's compressed, but now you have to decompress it and that also takes time. So I tried it yesterday, decompression speed of gzip for example is around 50 megabytes per second. So if you have this faster media it will not be faster if you combine the two. But just to state the obvious, if you compress the data, then you can fit things into RAM or on your solid state drive which otherwise you couldn't, so it still pays, right? So you have it on this faster medium, whereas if you don't compress it, it's too large. And why don't you just buy more RAM or more solid state drive without compressing? Well, here are some prices.
Hard disk, one terabyte, costs about 25 euro, so dirt cheap, SSD four times as expensive, so this is quality disks, not some. And a terabyte of RAM is incredibly expensive, 5000 is probably not even enough. So that's why people still use hard disks. Okay. Okay, I will go to the survey at the end of this part. So how do we make use of this? Here's one observation. This is inverted lists, how we had them. So integers, and I omitted other stuff here like scores or positions or what not. So integers in increasing order, so here's an obvious trick. So if you store it like we did it so far, everything is an int, which uses four, typically eight bytes. But you use eight bytes to represent the number three, right? That's a bit wasteful. So here's an obvious trick. You just store the differences. From the beginning to 3 it's plus 3, from 3 to 17 it's plus 14, and you can imagine if this is a large inverted list and everything is in sorted order, these gaps will be small numbers. So you have a lot of small numbers, a few larger numbers. This is called gap encoding, a very obvious and simple idea. When does it work? Well, it works as long as you go through the list from left to right, and then it doesn't matter whether you have it in this format or in that format, right? If you go through here you can just sum it up, you get the real numbers, and that's what one does. What doesn't work is if you want random access, so if you want to know, give me the element, the ID at position 1 million, then here it wouldn't work. You would have to sum up these numbers here. But very often you just go through from left to right. All our algorithms did something like this. With jumping you need additional tricks.
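Gap encoding as just described can be sketched in a few lines. The function names are made up for illustration, and the example list only reuses the 3 and 17 from the slide; the third value is an assumption chosen to match the gap of 4 used later in the lecture.

```python
def gap_encode(doc_ids):
    """Turn a strictly increasing list of ids into a list of gaps."""
    gaps = []
    previous = 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def gap_decode(gaps):
    """Recover the ids by summing up the gaps from left to right."""
    doc_ids = []
    total = 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


ids = [3, 17, 21]          # 21 is an assumed value, not taken from the slide
gaps = gap_encode(ids)     # differences: +3, +14, +4
print(gaps, gap_decode(gaps))
```

Note that `gap_decode` has to walk the whole prefix to reach an element, which is exactly why random access does not work on the gap-encoded list without additional tricks.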
So if you do this simple trick, now you have a sequence of mostly, not always small numbers, you have a lot of small numbers, some large numbers, how do you store numbers where you know something about the distribution, like many small ones, a few large ones, how do you store them efficiently? That's the topic of the lecture. And let's just look at why isn't it obvious? Well, let's just look at the, let's just verify here. Maybe the, here's some binary representation of the, we always start with one. So, yeah, so the integers, the binary representations, let's just check because we need that number later, that it's really log two, typical exam question by the way, so if you prepare for the exam, what's the exact length of the binary representation of number x, you have to figure out, okay, it's probably log something, log to which base, yeah it's log 2 because it's binary representation, do I round up, do I round down, is it plus one? A simple way to do it, I mean there are not so many possibilities, you round up or down then you have to add one or not. You can just try it out and see which one works. So let's just do it here. So if I do log two of, that's the number one, log two of one rounded down, log two of one is zero, rounded down is zero, plus one is 1. So that's correct. Let's check this here. The logarithm of 2, now we are at number 2. So it's 2. Log 2 of 2 is 1 plus 1 is 2. That's also correct. You need 2 bits to represent the number 2. And now you see why rounding down is correct. Now you are not yet at the next power of two. So it's log two of three rounding down. So it's the same, it's one. Only if you get to the next power of two, four, it would be two. And it's true, if you go from two to three until you hit the next power of two, you don't need more bits. So it makes sense. That's when you always have the rounding down and so on. So just that you have, so it's something log2 of x and this is the correct formula. 
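The "just try it out" approach for the formula can also be done mechanically. The following small check, with a made-up helper name, compares floor(log2 x) + 1 against the actual length of the binary representation for a range of small numbers.

```python
import math


def binary_length(x):
    """Length of the binary representation of x >= 1, without leading zeros."""
    return math.floor(math.log2(x)) + 1


# Compare the formula against Python's own binary conversion.
for x in range(1, 1025):
    written_out = bin(x)[2:]          # e.g. bin(10) == '0b1010'
    assert binary_length(x) == len(written_out) == x.bit_length()

print([binary_length(x) for x in (1, 2, 3, 4, 10)])
```

For very large integers the floating point logarithm may become inexact, so in real code `x.bit_length()` is the safer choice; the formula is what matters for the analysis.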
So this encoding is optimal in a sense, that's an interesting question, so can we do better, are there shorter codes, I mean this is one way to encode it, we could give any codes, the whole rest of the lecture will be about this. But there is another problem, so let's just use this representation, it's kind of the shortest binary representation of a number, it always starts with a 1 because you don't write the leading zeros. And now let's just use this for gap encoding. So here, plus 3, it's 1, 1, plus 14, plus 4. So these are the binary representations, I concatenate them, perfect. If this would work, then we would be finished for today. Why doesn't it work? What's the problem? Do you see a problem with this approach? Yeah? We don't know where the numbers start. Exactly. Yeah, we don't know. I mean I could now draw boundaries here with my pen, but on the computer you can't draw boundaries, you just have a bit sequence, you don't know, okay, is my first thing a one, a three, a seven, and actually here is an example on the next slide, so if we just take this exact sequence, it could be this, it could be this, depends on, yeah. And when you think about it, what's the problem here? Why is it not, why is it ambiguous if you go from left to right? There are even, there are a lot of possibilities, interesting exercise, how many possibilities are there. Why is it ambiguous if you go from left to right? It's ambiguous exactly for one reason because you have codes that are prefixes of other codes that's exactly the reason. So there is a code 11 which is the number 3 and 111 which is the number 7 and this is the prefix of this. So if you encounter 11 you just can't know is it 3 or will this continue as 7. And that's so, and this is something to think about, I will not prove it today, it's actually an interesting exercise or maybe who knows an exam question. 
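To see the ambiguity concretely, one can count in how many ways a bit sequence can be cut into binary representations without leading zeros, that is, into pieces that each start with a 1. This is only a sketch with a made-up function name, and it gives away part of the exercise mentioned above, so you may want to think about it yourself first.

```python
def count_parses(bits):
    """Number of ways to split `bits` into pieces that each start with '1'."""
    n = len(bits)
    # ways[i] = number of valid splits of the suffix bits[i:]
    ways = [0] * (n + 1)
    ways[n] = 1  # the empty suffix has exactly one (empty) split
    for i in range(n - 1, -1, -1):
        if bits[i] == "1":
            # A piece starting at i may end just before any later position j.
            ways[i] = sum(ways[j] for j in range(i + 1, n + 1))
    return ways[0]


# Gaps +3, +14, +4 written in binary and simply concatenated.
sequence = "11" + "1110" + "100"
print(count_parses(sequence))
```

Any count larger than one means that decoding from left to right cannot work without extra information about the boundaries.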
So in a prefix-free code, and this is not a prefix-free code, no code is a prefix of another, and this is equivalent to saying that decoding from left to right is unambiguous. So if you want, just sit down afterwards or anytime when you prepare for the exam, and try to prove that this is exactly the same thing: unambiguous decoding from left to right and prefix-freeness. But here is an example. Okay. And I think now it's time to show the result of the poll, I'm curious myself. And poll, share results, do I also get the results? No, I don't see them, interesting, I have to look. I love mathematics, 5%, hmm, really, interesting. It's okay, ah, we don't see it here on the, but ah, you don't see it there, that's a pity. And I don't see it because I'm the... Who sees it on their device now in the room? OK, those of you with the device. OK, I'm sorry, that's because I can't launch it here, but because I'm host on this device I can't see the result. I will just tell you what the result is. I love mathematics and I had no problems with ES3, 5%. It's okay, it needs some time but I get by without help, 40%. I struggle a bit but with enough time and effort I can do it, 33%. I struggle quite a bit and would like help to improve, 18 percent. I don't really care, 5 percent. Okay, love and I don't care, 5 percent each. So we have kind of a normal distribution. I think it's a very interesting result, thank you. And what's most interesting for us, okay, there's a comment that not all the options were covered, I agree, a survey is always discretizing in unfair ways. That's why we have a question at the end of exercise sheet 4 where you can give us more information, a free text answer. This was just to get an idea. So most interesting for us, interesting for what we can do to help, is of course the category I struggle quite a bit and would like help.
So especially for you, please tell us when you give your feedback for the fourth exercise sheet how we can help you. Because that seems to me the question, how can we help you? If you have problems and you want help. If you have problems and you don't care, we can't help you, I think that's obvious, but 20% want help and we want to help. Okay, so now the rest of the lecture is about codes, prefix-free codes, because this code, which is great, it's the best code, but it's not prefix-free so we can't use it. And let's start, yes please. Okay, by prefix I mean, it's on the slide but maybe it was a little bit fast. You look at all the codes, so this here, if you continue this list, one is one code, it's the code for the number one, one zero is the code for the number two. And the question is, is one of these codes a prefix of another? Do you have a pair of codes where one is a prefix of another? And yes, here you have a lot of pairs. One is a prefix of 1 0 0 0. So this is not a prefix-free code. And we will see right now an example of a code where this is not the case and then it will become even clearer. You're welcome. So here is Elias Gamma. A code, an old code from..., actually not so hard, but you just have to start early enough with inventions and it's always easy. So let's just look at the first few numbers, and it's written here how this code works. 1, 2, 3, 4 and let's maybe also do the number 10. And here's a simple idea. The question is how do you construct prefix-free codes. And we will see a lot of them in the following, a lot of codes. So what you do is first you write log2 x rounded down zeros. And then you write the number in binary like we have seen it before. So what's log2 of x here? It's a 1, 0. So log2 of 2 rounded down, no it's no 0, I'm sorry. Log2 of 1 is 0. You don't actually write any zero. Let me maybe start with the number two, where I write one zero.
And now I write the number two in binary. Number two in binary, maybe I write it in another color, maybe in wonderful green, one zero. This is now the Elias gamma code for the number 2: 0, 1, 0. When you store it, you store it without colors because there's no way to store a colored bit, just in case you wanted to. It's just for illustration purposes. So 3, log 2 of 3 rounded down, it's also one zero. And now you write three in binary. Let's do the four. Now the four is, you have two zeros, and now you write four in binary, one zero zero. And now it's your turn, 10. First tell me how many zeros for number 10, how many orange zeros? 3, yeah that's true, because 8, and then from 8 to 15 it's 3. So you see these numbers don't become large very quickly. And what's the binary representation of 10? So it's a, oh no, that's the wrong color, then it's not Elias gamma if you write it in blue. How many bits is it? Four bits, so it's 1, 0, 1, 0, yeah, it's 1, 2, 4, 8, so 2 and 8, and there you have it. And what's the Elias gamma code for the number 1? We left this one for the end. And in what color? Green, yeah, that's correct, because you don't have any zeros. And now let's verify that this is prefix-free, at least for these examples. Is there any code here that's a prefix of another code? And think about why that is the case. And it's actually harder to understand why no code is a prefix of another, it's easier to understand why it is unambiguous. If I ask you the question why is it unambiguous if I go from left to right, why do I know when the code ends, what would be your explanation? If I now have a sequence of Elias gamma codes, I start somewhere, I know now comes an Elias gamma code, how do I know where it ends without boundaries drawn in my... Because the same pattern is not repeated in any other... Okay, the same pattern is not repeated. Can you try to make it more precise? You are the algorithm now and you are now seeing a bit sequence and you know now comes an Elias gamma code, how do you know when to stop? Yeah?
Yeah, so after log2 zeros you just get a number of log2 plus one bits and then you know that number is there. Yeah, that's correct. So you just read how many zeros until the first one. The binary representation always starts with a one because it's the shortest binary representation. So you just read, so even if they are not orange you know how many orange ones there are because you just read until the first one. Now you know three, and now the rest is just three plus one bits, because x in binary has just this number plus one bits. So that's why it's unambiguous. So very simple idea and it works. Now, oh oh, what have I done? I have, yeah, this is the code length, you can easily figure this out. I didn't realize there was a, let me just do some PowerPoint magic here. And here is, I used to have a slide on this, but I just dropped it because maybe you can imagine it. Now you can do something called bootstrapping. You can do it for fun if you like afterwards, which means applying something to itself. If you take this orange part, what you do with the orange zeros is you're encoding a number in unary, right? This is encoding the number three by just writing three zeros followed by a one. And if you just take this whole thing, the three zeros followed by a one, this is now the number three which you're encoding here, and now you don't encode it like this, but you use Elias gamma to encode it. So you take this prefix part, this is the number four, and now you write the Elias gamma code for number 4 here. And if you do that, then you get shorter codes. The length will be log plus log log and so on. And this is called Elias delta. And this you can iterate and iterate, at some point you get Elias omega, but that's more for theoretical fun. So then you can get close to the optimal log2. So you don't want a factor of two here. But it's a rather theoretical scheme so I'm just mentioning it on this slide. Here's another one, Golomb, here's Golomb.
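Before going on, the Elias gamma scheme described above can be sketched as follows. The function names are made up, and bit sequences are represented as Python strings of '0' and '1' purely for readability; a real implementation would pack the bits. The decoder assumes a well-formed sequence of complete codes.

```python
def elias_gamma_encode(x):
    """floor(log2 x) zeros, followed by x in binary (x >= 1)."""
    if x < 1:
        raise ValueError("Elias gamma is only defined for x >= 1")
    binary = bin(x)[2:]                  # shortest binary form, starts with '1'
    return "0" * (len(binary) - 1) + binary


def elias_gamma_decode(bits):
    """Decode a concatenation of Elias gamma codes from left to right."""
    numbers = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":            # count zeros up to the first one
            zeros += 1
            i += 1
        # The binary part has exactly zeros + 1 bits and starts at the '1'.
        numbers.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return numbers


for x in (1, 2, 3, 4, 10):
    print(x, elias_gamma_encode(x))

encoded = "".join(elias_gamma_encode(g) for g in [3, 14, 4])
print(encoded, elias_gamma_decode(encoded))
```

The decoder never needs boundaries drawn in: the number of leading zeros tells it exactly how many more bits belong to the current code.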
And that's the one important for the exercise sheet. It comes with a parameter M called modulus. So that's one way to do it. There are so many coding schemes. It's a lot of fun to think about one yourself and I think there's still room for coming up with another one. I have another slide on what coding schemes there are around. So let's look at this one, it's also quite famous. And let's do it by an example. Let me check how much space I have here. Let's take an example here. Let's take a number, let's say our modulus is 16 and you will see in a second what that means, and our number is, let's take 42, that's a nice number. And now what's the code? And now let's first write, let's first, we also have some q in unary with zeros, and let me maybe use, how many, ok, first I need to do some, I haven't explained it, but I will explain it now, I will compute it for this example. We are computing the remainder modulo 16 and how often 16 fits inside this number, let's just compute it. Q is, Q is, so div, you have this operation on machines, but it's just dividing and then rounding down. So that's 42, I'm sorry, 42 divided by 16 rounded down. So how often does 16 fit into 42 and what's the answer? 3, yes. Thank you for paying attention, it's 2 exactly. And what's the remainder? Yeah, the remainder is 10, so that's 42 modulo 16 and that's 10. So what you do now, you write the Q as leading zeros, so you have 00, and let's write while we are doing it how many these are. These are exactly how many bits? It's exactly x over m, rounded down, bits. Now we write a single one. Let's do this in wonderful lilac, that's one bit. And you will see in a second why we do that. So we need a 1 here as a delimiter so that we know when the zeros end. And now let's write the remainder in binary, we use the color green for this. And this is now fixed length, so we know the modulus is 16, so we can just always do it in log2 m bits. So modulo m, it goes from 0 to M minus 1. How many bits?
So this is not an integer number. Do I have to round it up or round it down if I want to fit the numbers from 0 to M minus 1? So here it's exactly a power of 2. So if M is 17, is it log 2 M rounded up or rounded down? Do I need 4 or 5 bits? Up or down? Up, yeah, you need up exactly, you need to round up, not down, because then maybe you are wasting some bits, but otherwise you don't have enough bits. So that's a fixed size, so here it's 4 bits, and what's 10 in binary? We have already seen that. It's 1, 0, 1, 0. And now, note that this is fixed width, right? So this part, let me just go to the pointer mode. This is always 4 bits, so you could also have for the number 0, 0, 0, 0, 0 here, or for the number 3, 0, 0, 1, 1. This is now a fixed length representation, which doesn't always start with a 1, which is why you need the 1 here, because the unary representation only makes sense if you know where it ends. So that's the code, so the total code length, we can write that up here, code length for the code for number x, and think about whether you have any questions. So the code length, and that's important for the exercise sheet, for a code for number x with modulus m, and this modulus is just a parameter. It is, yeah, it's just this: x divided by m rounded down, this is the number of zeros, plus one, plus log2 of m rounded up, this is just how many bits you need for the remainder. Is there any question about Golomb encoding? Yes? The one? Why we need the one? Yeah, the green part is fixed length and it can be anything. So here you can have all 16 combinations, from 0000 to 1111, yes. Because the remainder can be anything. Just take all the numbers from, if you have a 2 here, which means we have 32, yeah, from 32 to 47, then you have 001 and then you will have all the combinations here, so you need the one to figure it out. Any other questions here or on the chat about Golomb encoding? Okay, we have one more, there are many more, oh yeah, there's another question, yes?
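The Golomb variant from the slide (quotient in unary as zeros, a single 1 as delimiter, remainder in a fixed width of log2 m rounded up bits) can be sketched like this. The function names are made up, bit sequences are Python strings for readability, and the sketch assumes m is at least 2 and that the input to the decoder consists of complete codes.

```python
def remainder_width(m):
    """ceil(log2 m): bits needed for the remainders 0 .. m - 1 (m >= 2)."""
    return (m - 1).bit_length()


def golomb_encode(x, m):
    """x div m zeros, then a single '1', then x mod m in fixed width."""
    q, r = divmod(x, m)
    return "0" * q + "1" + format(r, "b").zfill(remainder_width(m))


def golomb_decode(bits, m):
    """Decode a concatenation of Golomb codes with modulus m."""
    width = remainder_width(m)
    numbers = []
    i = 0
    while i < len(bits):
        q = 0
        while bits[i] == "0":        # unary part ends at the delimiter '1'
            q += 1
            i += 1
        i += 1                       # skip the delimiter
        r = int(bits[i:i + width], 2)
        i += width
        numbers.append(q * m + r)
    return numbers


code = golomb_encode(42, 16)
print(code, len(code))               # length is 42 // 16 + 1 + 4
print(golomb_decode(code + golomb_encode(3, 16), 16))
```

The code length printed here matches the formula from the slide: number of zeros, plus one, plus the fixed width of the remainder.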
I have a question, the m is a parameter, it's something that we choose ourselves? Yeah, it's a fixed parameter, so there are many Golomb encodings and this is the Golomb encoding for m equal to 16. It's not something which you can vary, you say okay, I want the code for m equal to 16. Is there a reason why you would choose one over the other? Oh, that's a very good question, I would say that's what the whole exercise sheet is about. It's exactly the question why in the second part of the lecture we will have some theory, because the question is what do I choose, do I choose Elias? Do I choose Golomb? Which m? Exactly, these are the questions. So which one do I choose? Which parameter? And here's another one, it's called variable byte encoding, and this is actually quite important in practice if you think about these schemes, and let me go to one of the slides where I had a bit sequence. I'm not, I'm just, sorry, only a matter of hours. Here I have a bit sequence, or here, and now I want to decode this, and it could be this for example, and let's say I could figure out that it's this. Now I mean, if I want to go from here, let's say I know that it's this representation, to here, I need to extract two bits, turn them into the number three, then extract the next four bits and so on, and things are stored as ints, so I have to extract bits from these numbers, shift them, convert them back. And these could also go over byte boundaries, right? It's just a sequence of bits, so a code could start in one byte and span several bytes and end at another byte. This is kind of expensive, because when you're doing decompression you want to have it really really fast and you want to avoid having codes going over byte boundaries. So here's a very simple scheme which is actually also used for UTF-8, which we will have half a lecture about, so it's actually practically very relevant. And here each code has an integer number of bytes.
And here is an example. So let's just take a number. I will just explain it by example. So for example, I don't have any more text on this slide, no. For example, let's take X, let's take 537 in binary. We need, and this is something which contains, it's 512, right? Okay, so 512 is contained, let me write the binary digits in once, so that's the largest power that's contained. 1 here, so what have we left now if we subtract, I mean there are many ways to convert numbers to binary, I'm not sure that's the most efficient way to do it. 256 is not contained, 128 is also not contained. 64 is not contained. 32 is not contained, I have 25 left, and now comes 16. Okay, 16 is contained. I subtract 16 from 25, what remains is I think 9. So now I have 8. I subtract it, 1 remains, 4 is not contained, 2 is not contained, 1 is contained. So is this correct? Let's just check it. So that adds up to 25, plus 512, 537. Ok, so this is 1, 0, 0, 0, 0, 1, 1, 0, 0, 1. Typical exam questions or oral exam questions. You have to pay attention that you are in binary. And now the question is how do we write that? We want an integral number of bytes, how many bytes do we need? Is one byte enough? I don't think so. One byte is not enough, so two bytes are probably enough. So how would variable byte encoding do it? It would do it like this. Now I have to draw two bytes. That's how bytes look like in a computer. So if you look at it with a microscope, it will look like this, it's not a joke. You think I'm joking but it's true. That's exactly how they look like, that's why RAM is so expensive. So what you do, you have a particular bit at the beginning of the byte which just tells you is this a... so we have a two byte sequence now and in the rest we write the green number, and here we just say, okay, a zero here says that it's the last byte of the sequence. So you just use a particular number of bits in the beginning to say whether it's the... And one means it's not the last byte of the sequence.
So if you have several of them, and now you just put your binary number here, so let's just fill it up from the back, 0011001, so you just have 7 bits for the information here, and now you continue here, we have the remaining bits, zero, zero, one, and we can always write leading zeros there. So that's now the... And this is... Yeah, so you have this signaling bit in the beginning, so you're wasting one bit for every byte for just saying, okay, is the code still continuing or does it end here? You have an integral number of bytes, and if you want to get what the number actually means you have to extract these bits, but you don't have to do bit fiddling. No, you still have to do bit fiddling, but the next code will now start at a byte boundary. And we will see this more in lecture 7. There's a question or comment? Can you say it again? The question is whether it's this way around or that way around? I think it's up to you, I don't think it's defined in a way. For UTF-8 the signaling will be a little more complicated, it will be sequences of bits in the beginning. Actually for UTF-8 we will see that we have a sequence like 1 1 1 0 in the first byte which tells you, ok, the whole thing will be 3 bytes long or something. So this is just one way of signaling using the first bit and it's just a convention for your coding scheme. But it's not that there is the variable byte encoding, there are several ways to do it and that's just one way to do it. Any questions about this scheme? So we have now seen, oh there's another, yeah? So the question is can it be combined with the previous ones? I mean you could combine it by just taking this number here and encoding it with Elias Gamma and writing it here, but why would you do that? I mean the binary representation is the most compact way to write a number, right? It's the shortest way to write a number.
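The variable byte scheme as drawn on the slide can be sketched as follows. This uses one possible convention, the one described above: the first bit of each byte is the signaling bit, 1 means more bytes follow and 0 means this is the last byte, and the most significant 7-bit group comes first. As said, other conventions exist; the function names are made up for illustration.

```python
def vbyte_encode(x):
    """Encode x >= 0 with 7 payload bits per byte, high-order group first."""
    groups = [x & 0x7F]                    # last byte: signaling bit 0
    x >>= 7
    while x > 0:
        groups.append((x & 0x7F) | 0x80)   # earlier bytes: signaling bit 1
        x >>= 7
    return bytes(reversed(groups))


def vbyte_decode(data):
    """Decode a concatenation of variable byte codes."""
    numbers = []
    current = 0
    for byte in data:
        current = (current << 7) | (byte & 0x7F)
        if byte & 0x80 == 0:               # signaling bit 0: code ends here
            numbers.append(current)
            current = 0
    return numbers


encoded = vbyte_encode(537)
print([format(b, "08b") for b in encoded])
print(vbyte_decode(vbyte_encode(3) + encoded + vbyte_encode(14)))
```

Every code starts and ends at a byte boundary, so the decoder only ever masks and shifts whole bytes instead of extracting bit ranges that straddle bytes.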
It has length about log2 of x. If you take any of the other schemes, they are longer. The problem with binary is that it's not prefix-free. If you just concatenate binary numbers and go from left to right, you can't decode. So this is another way of making it unambiguous, by introducing these bits in the beginning. Yes? Why don't we use an unambiguous bit sequence which says this is the beginning, this is the end, and which you then don't use in any of the codes? And if it is part of a code, then we have to use bit stuffing and add additional bits. Yeah, so the suggestion is, why not have a pattern for the beginning and the end, and then you avoid this pattern in your codes somehow. It can be done, and I'm certain there are codes which do that. The question is which one is the best, right? So it's another idea to create a code, and there are so many ideas to create codes, there are no limits on creating codes. And the whole remaining half of the lecture is about which one is best for which application. So there's this huge variety, and you always have to ask, I mean, there are these three dimensions of a code. First, how much does it actually reduce the size? I mean, you can come up with codes and they are extremely wasteful. So I think with these patterns you have to pay attention not to waste too many bits. Then there is the speed, how fast does it compress? And then there is, how fast does it decompress? So for example, all of you probably know gzip and bzip2. Gzip is pretty fast in both directions. Bzip2 compresses a little bit more but takes super long to compress. But it depends on the application.
Maybe you have an application where it doesn't matter if it takes 10 hours to compress your file, because you're just providing it for download. You don't care, you just want it to be as small as possible, and then it should be reasonably fast to decompress. So there is this whole spectrum here, different trade-offs. Of course you can't have all of them optimal, you have to make some trade-off. If you are interested in beautiful mathematical schemes, here is just one example from the huge variety. That's a particularly nice one, I would need a whole lecture to explain it. What it does is, it takes a whole message, not just the individual symbols, and encodes the whole message in one big number somehow, and it does it with mathematics. And the nice thing about this one: you know, the nicest mathematics is always where the mathematics behind it is pretty complicated, but when you use it in practice it's a super easy algorithm. So if you don't want to understand it, just use it. It's actually pretty easy. And it's used in practice to this day. So for example Facebook's compression library uses it, I think that's also where it was invented. And it's a relatively recent invention, which is also interesting. Compression has been researched since 1940 or something, and still in 2014 somebody comes up with a new scheme, and I think you can still come up with new schemes. It's just a very rich world and very fascinating. So there's a Wikipedia article on this and it's very nice, though not so easy to understand. Okay, so the second half of the lecture is about the question, when do we use which scheme, and before that we will make a five-minute break. So let's continue with the second half, it's eleven slides, and that's it. So what's the motivation? I already mentioned it. Which code do we take? We saw several, some even had parameters, each parameter gives you a different code, and there are so many other ideas.
As usual, the answer to such a question is: it depends. The question is, on what does it depend? Well, if you think about it, it depends on your distribution of symbols, right? So for example, you want to encode natural language, and an E is much more frequent than a Z. Let's say you want to encode individual letters, then you want to give E a shorter code than Z. But if all the letters are equally likely, you will choose a different code. And by the way, if everything is equally likely, then a binary representation of fixed length is the optimal scheme. Think about it. So now we are trying to make this more precise, and this is actually what gave rise to coding theory 80 years back or so. And we will see some very, very beautiful and fascinating theory. We need a little bit of mathematics, that's kind of our arithmetic operations from my earlier analogy. So we need to define the entropy. You have some m different things, always think of the natural numbers, one, two, three, four, and so on. And now we have a random number from this range, from the first m integers, and each number i has a certain probability p_i to be picked. And now we can define the entropy of this distribution: it is minus the sum over all i of p_i times log2 of p_i. The entropy of the distribution is a measure for how uniform or non-uniform it is. Let's actually use the uniform distribution as an example: let's say each of the m symbols is equally likely, and let's just compute the entropy in that case, let's do it up here. The question is of course, why is it the sum over p_i times log2 of p_i? At the end of the lecture it will become clearer, because this will be a recurring scheme. At first it looks a bit arbitrary, but you will see that there is a good reason for this. So let's just do it.
If the p_i are all 1 over m, then we have the sum from 1 to m, and then p_i is 1 over m, times log2 of 1 over m. And there has to be a minus here, which was missing. A minus is important in mathematics. It's just a small line, but if you think of your bank account, it makes a big difference whether there is a minus in front of it or not. So yeah, sorry for adding this only now. And so what is this? It doesn't depend on i, so it's m times 1 over m times something, so it's just minus log2 of 1 over m. And log2 of 1 over something is just the negative of log2 of that something, so the minus goes away. It's log2 of m, which is what's claimed down here. And this is actually where the entropy has its maximum. If you go away from the uniform distribution, then the entropy becomes smaller. So this is like chaos, everything can happen, and if you have more structure, one symbol is more frequent than another, then the entropy, the chaos, becomes smaller. And for the uniform case you just take a fixed-length encoding. You take log2 of m bits, you have to round it up if m is not a power of 2, and you just encode everything with a binary encoding and you pad it with zeros. If you have a fixed length, then unambiguous decoding is also not a problem, because the length is just fixed. Okay, so here is a very beautiful theorem from 1948 which gave rise to so much theory and many fields. It's by Claude Shannon, you can read up on him, a very interesting guy with a very interesting life story. And here is the theorem, and it's very beautiful. You have a random variable with finite range, think of some symbols, numbers from 1 to m. You want to encode them, and now you have a code, and this code has certain code lengths: it uses the code length L_1 for number 1, L_2 for number 2, and so on. That's just what our L's are. For the rest of the lecture they are the code lengths. And now, why did we define the entropy? It looked arbitrary.
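The uniform-distribution computation can also be checked numerically. The following is a small sketch in Python; the function name `entropy` and the two example distributions are chosen just for this illustration.

```python
import math


def entropy(probs):
    """Entropy in bits: minus the sum of p_i * log2(p_i), skipping zero probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


if __name__ == "__main__":
    m = 8
    uniform = [1 / m] * m                  # every symbol equally likely
    skewed = [0.5, 0.25, 0.125, 0.125]     # more structure, less "chaos"
    print(entropy(uniform), math.log2(m))  # both are 3.0
    print(entropy(skewed), math.log2(4))   # 1.75 is smaller than log2(4) = 2.0
```

For the uniform distribution the entropy equals log2 of m, and for the skewed distribution over four symbols it is below log2 of 4, in line with the claim that the uniform distribution is where the entropy has its maximum.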
Shannon's theorem says: you can think about it as long as you want, no code can do better than this if you take the expected code length. Expected means it's kind of the average: you have all your symbols, you always encode the A with this code, the B with this code, the C with this code, and now you get an average code length. And some symbols appear more often than others, that's why it's an expectation, we will see that in a second. And this average cannot get better than the entropy. The entropy of the distribution is a lower bound for how well you can encode. That's why entropy is such an essential measure. So that's the one side of the theorem. And now the other question is, okay, I can't get better than that, but can I get as good as that? And yes. These are always the most beautiful theorems, which tell you that you can't be better than this, but you can always achieve it. And you can always achieve it up to plus one, which looks strange. Why plus one? Can we also do it without the plus one? There's actually a very nice reason where this plus one comes from, we will see it in a few slides, and you have to look at the proof. So in words: you can't do better than the entropy, and if you have a given distribution which you want to encode, then you can always achieve entropy plus one with the right code. And we are now going to prove both directions, not doing the full proof. Some of the proof will be delegated to the exercises, which is a great way to exercise some math and also to understand this whole thing. If I'm just doing the whole proofs for you, it's just entertainment. To learn something, you have to do it yourself. So here's a central lemma, and it says: I have code lengths, and these code lengths satisfy this property, and you will see why this property is a reasonable property.
No, no, I'm sorry, there are two directions of this lemma. If you have a prefix-free code, then its code lengths will satisfy this property. And if code lengths satisfy this property, then you can find a prefix-free code with these code lengths. So this property is kind of central, and it's known as Kraft's inequality. So let's try to understand this. What does it mean? I take 2 to the minus L_i, and if I sum this up over all i, this has to be less than or equal to 1. Here is an example. Let's assume I have three code lengths and they are all 1, and maybe more code lengths. Then I have 2 to the minus 1 plus 2 to the minus 1 plus 2 to the minus 1. So it's one half plus one half plus one half, and I'm already larger than one. So this cannot happen. I can't have three codes of length 1. And think about it, what does it mean? I have three symbols and I want to encode each of them with just one bit. This cannot work, right? If I have two symbols, I can give one the bit 0 and the other one the bit 1, and now I'm done if I want a prefix-free code, because I can't have any other codes, 0 and 1 are given away. And it makes sense: now I have two code lengths, 1 and 1, one half plus one half is 1, so I can't have any more codes. This is the deeper reason for this inequality. Okay, now let's try to prove these two directions, partly, and understand a little more. So let's assume we have code lengths like this and now we want to construct a code with these code lengths. No, it's the other direction, I'm sorry, I'm confusing you. Just forget what I just said. This is the first direction: I have a code, the other direction will be constructing one. If it's prefix-free, then this inequality holds, so for example not all three lengths can be one. Here's a nice mathematical argument, it's just one slide, so focus for a second. How do we prove this? We are given some code, we know it's prefix-free, and then this inequality is supposed to hold. Why is that?
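The examples for Kraft's inequality are easy to check by machine. Here is a small sketch in Python that evaluates the sum of 2 to the minus L_i with exact fractions; the function names `kraft_sum` and `satisfies_kraft` are made up for this illustration.

```python
from fractions import Fraction


def kraft_sum(lengths):
    """Sum of 2^(-L_i) over the given code lengths, computed exactly."""
    return sum(Fraction(1, 2 ** length) for length in lengths)


def satisfies_kraft(lengths):
    """True if the code lengths satisfy Kraft's inequality."""
    return kraft_sum(lengths) <= 1


if __name__ == "__main__":
    print(kraft_sum([1, 1, 1]), satisfies_kraft([1, 1, 1]))        # 3/2 False
    print(kraft_sum([1, 1]), satisfies_kraft([1, 1]))              # 1 True
    print(kraft_sum([1, 2, 3, 3]), satisfies_kraft([1, 2, 3, 3]))  # 1 True
```

Three codes of length 1 give a sum of 3/2 and are ruled out, while two codes of length 1 use up the budget of 1 exactly, so no further code can be added.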
So let's assume we generate a bit sequence at random. Pick a zero, a one, a zero, and so on. So we consider this algorithm, I'm just doing the following: 0, 1, 1, 0, 0, 1. This is my algorithm, and I'm picking each bit at random. And now when do I stop? I stop when I have a valid code, or when there is no code left that starts with what I've already generated. So either at some point this is a code in my encoding, or there can be no more code. And this is well defined for a prefix-free code, right? It would not be well defined for a code that is not prefix-free, because then I wouldn't know whether I can stop here or whether I should continue, because maybe this is already a code, but there is also a longer code starting with this. So I should mark this here: this is a well-defined procedure when the code is prefix-free, that's probably a better way to say it. Then at any point I either have a code, or I am the prefix of a code that can still come, or I can say there is no code with that prefix. So now let's denote the event that by this procedure I generate code number i, and let's call that C_i. By the way this procedure goes, I will generate either this code or that code, so the C_i are independent, they can't both happen at the same time. Which means this equality here holds: if they can't happen at the same time, then the probability that one of them happens is just the sum of the individual probabilities. And now, what's the probability of generating a particular code? Let's say I'm picking each bit at random, what's the probability of generating this code up here? Let's say that is a code, I'm generating each bit at random and I stop when I have the code.
What's the probability of generating this code? Yeah? One half? Yeah, one half multiplied six times, that's exactly right. You have to get the first bit right, you have to get the second bit right, and so on, and you have to do that L_i times, as many times as you have bits in the code. And this is just 2 to the minus L_i, and that's where this number comes from. It's just the probability of seeing that code if you generate bits at random. And if I plug this in here, then I have that the sum of the 2 to the minus L_i is less than or equal to 1, because it is a probability. Okay, I think you have to look at this yourself to really understand the proof. I think the main takeaway from that slide is how this number comes about, this 2 to the minus L_i. So it has a natural origin. Here's something that's easier to understand, it's the other direction. Yeah, you have a question, of course. Yes, yes. It's a very good comment, a question which I also had. That's just how the word code is used, and actually on this very slide I am using both usages. There is the whole encoding, like I have a whole scheme, like Golomb encoding with modulus 16, which assigns this code to number one, this code to number two, and so on. And then there is a particular code, the code for a symbol i, whatever my symbols are. So you could call the whole thing an encoding scheme. Usually it's clear from the context. When I say prefix-free code, I mean the whole encoding scheme. If I say i-th code, I mean the code for number i. That's also how it's used in the literature, but maybe it would be cleaner to separate the two. But does that clear it up? No? I still appreciate it. Yeah, go on. What about the C_1? There we also speak about C_1, I think those are the single bits of one word.
Or am I wrong now? I understand it like this: we start assigning random single bits and look if it's okay, and if it's a valid word which we don't have in our codebook yet, we add it to our whole encoding scheme. Oh no, no. Okay, I think I see where the misunderstanding is. This is just an algorithm for proving something, it's just a thought experiment, that's the whole thing. This is just a vehicle for doing this proof. It's not an encoding scheme or something. Just accept that we are doing this algorithm. We are generating one bit at a time. It's a random experiment we are doing, and then at some point you will see a code, or you know that there is no code which starts like this. And the purpose of the random experiment is that we can now set up this inequality here, and we just use it to prove this. That's the whole point of this. It's not an algorithm which you use for encoding or something. It's just a thought experiment, a random experiment to prove this property here. C_1 is an event, that's the event that something happens in my random experiment. Let me maybe say this again: I'm making a random experiment, and this random experiment has a number of outcomes, or events. One event that can happen is that I generate the code for number one. Another event is that I generate the code for number 17. Another event is that I don't generate any code at all. This is why it's less than or equal to one here and not equal to one, that can also happen. There are some bit sequences arising that don't lead to any code at all. I think the proof is a bit tricky, because it's using a trick to prove something, and that maybe confuses you. Okay, I see the confusion. This is not the probability with which the symbol occurs in the data, it's just the probability in this random experiment.
I'm just setting up a random experiment to prove something, to make an argument. It's just a vehicle for the proof, it's not about how frequently this symbol occurs in my data or anything. It's like saying, let's play this game and let's see where this leads us. So we're just playing the game, pick one bit after the other, each with probability one half, and let's see what property this game has. And then we can use this to prove the inequality. It doesn't have anything to do with how frequent these things are in my data. Okay, but I understand the confusion, I will consider this for when I explain this the next time. And maybe we can continue this offline if there are still doubts. Yes? I hope that it's not probabilistic independence that is meant, because otherwise the proof would be wrong. I mean, for probabilistic independence you would have the same formula with multiplication. And actually C_1 and C_2 are not probabilistically independent, because if you know that C_1 happens, you know that C_2 can't happen anymore. And that's true, right? Yeah, that's true, but that's not probabilistic independence. Yeah, I thought about this before. What we need here is not independence. There is mathematical independence, which means that the probability of x conditional on y is just the probability of x. These are certainly disjoint events, if you think of them as subsets, and for disjoint events the probabilities add up. Now that you say it, I think you are right, but we should clear this up afterwards, because I thought about it and I thought there is a relation to independence. So sorry if this is not really independence. Or just mutually exclusive. OK, let's maybe leave it at that and clear it up afterwards. I have a feeling that you're right, but I thought about it and then decided to write this, maybe it was just too late in the night. So let's continue. This is the more involved but easier direction.
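To see the thought experiment in numbers, here is a small sketch in Python. Instead of drawing random bits, it enumerates all bit strings of the maximum code length, each equally likely, and records which code (if any) the experiment would have stopped at. The example code {0, 11, 100} and the function name `outcome_probabilities` are assumptions made for this illustration.

```python
from fractions import Fraction
from itertools import product


def outcome_probabilities(code):
    """Exact probabilities of the events in the random-bit experiment.

    `code` is a list of codewords of a prefix-free code, given as bit strings.
    Returns a dict mapping each codeword to the probability of generating it,
    plus the key None for the event that no codeword is generated.
    """
    max_len = max(len(word) for word in code)
    probs = {word: Fraction(0) for word in code}
    probs[None] = Fraction(0)
    for bits in product("01", repeat=max_len):
        sequence = "".join(bits)
        # For a prefix-free code, at most one codeword is a prefix of the sequence.
        hit = next((word for word in code if sequence.startswith(word)), None)
        probs[hit] += Fraction(1, 2 ** max_len)
    return probs


if __name__ == "__main__":
    probs = outcome_probabilities(["0", "11", "100"])
    for word, p in probs.items():
        print(word, p)   # 0 -> 1/2, 11 -> 1/4, 100 -> 1/8, None -> 1/8
```

Each codeword of length L comes out with probability 2 to the minus L, the events are disjoint, and the leftover probability of 1/8 belongs to the event that no code is generated, which is why the sum is only less than or equal to one.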
And thank you very much for paying attention and asking these questions. Now the other direction. Before it said: if we have a code, then the code lengths have a certain property. Now it's: I give you code lengths with that property, and I can always construct a code with these lengths. That's interesting. And let's do it with an example. Let's pick these code lengths, and what the lemma states is that it can always be done. So I say I want a code where one code has length one, another code has length two, and then there are two codes which both have length three. And now we first have to verify that Kraft's inequality is satisfied. No, it's not L_i here, it's 2 to the minus L_i. Let's just verify it, and I think I can write it as fractions, then it's easier to see. 2 to the minus 1 is one half, 2 to the minus 2 is one fourth, plus one over eight, plus one over eight, and this is one. So now the lemma says, okay, 1, 2, 3, 3 satisfies this inequality, so there is a code with these code lengths. How do we construct it? I give you a construction scheme which does this. Here is the math, and I will just show it to you by example as usual, I think that's the best way to understand it. So there's one thing that's called M, let's write it here. M is the maximum code length over all i, and my maximum code length here is three. So I have three bits, and what I will draw is a binary tree, a complete binary tree. Let me try my best to draw a complete binary tree of depth 3. A nice one, I hope. So that's a complete binary tree of depth 3, I think it's not bad. It could be more symmetrical, but I think it's okay. And now each left edge gets a zero, and each right edge gets a one.
Zero, one, zero, one, zero, one. And the point is that each path from the root, this is my root here, to one of the leaves, here I have the leaves, gives me a code. Or I can stop somewhere in the middle, then I also get a code. And now consider the code lengths, I gave them to you already in sorted order, smallest first. So I start with code length 1. Now I want a code of length 1. So I pick a subtree, and let's give this our first color, let's start with red. For red, 2 to the 3 minus 1 is 4. So I will pick a subtree with 4 leaves. The whole tree is still free now, I could also go in the right direction, but let me go left here, and I stop when my subtree has this size. And these are always powers of two, so it always works. So now I have a subtree of size 4, and I will not draw all the subtrees like this. And now this will be my code: the code is just the path until I hit my subtree. So the code is just 0. And this part of the tree is now gone, I can't use it anymore. And now you can maybe already imagine how it goes on. What's our next color? Let's continue with green. I mean, this is not a proof now, but maybe you can imagine how it works. Green has code length 2, so 2 to the M minus 2, that is 2 to the 3 minus 2, is 2. Now I want to pick a subtree of size 2 from what's still left. And if you sum up these numbers 2 to the M minus L_i, which is what I want to pick for each code, then by Kraft's inequality the sum is less than or equal to 2 to the M, which is exactly the number of leaves in my tree. Here it has eight leaves, 2 to the 3, by construction. That's why it works, that's why I always have something left.
And now, just for the fun of it, I don't have to go left, I can also go right here. And this is now my subtree of size 2. And my code is the edges I went along here, the edges are 1, 1. And you also see I'm getting a prefix-free code, because I'm always going into a part of the tree I haven't used before. And see how it all works out: I still need two codes of length three and I still have two paths left. Purple, it's a beautiful color. So now we have L_3 equal to 3. We want to pick a subtree of size 2 to the 3 minus 3, which is 1, so just a leaf. We can pick one of the remaining paths here, let's maybe pick that one, so that's this subtree, and it corresponds to the code 1 0 0. And I hope it's clear that the way these paths are chosen, because they lead to disjoint subtrees, is just equivalent to being prefix-free. If you take any two of them, because the subtrees are disjoint, neither code is a prefix of the other. A code being a prefix of another would mean here that one subtree is contained in the other. And now we need a fourth color. That's the biggest problem of this encoding, I think, picking four different colors. I think I will try orange. I don't know, orange? Do we have a better color here? Maybe this nice blue here. Yeah, this is okay. So another one with length three: we need one more path which leads to a subtree of size 1, and there is just one left, it's this one, and it has the code 1 0 1. And this whole algorithm of constructing a scheme which we just did, this is called Huffman encoding. Writing down here is hard. And the hardest part about Huffman encoding is how to spell Huffman, with how many F's and how many N's. That's also a popular exam question. I'm sorry, it's Huffman, there's a guy called Huffman. It's hard to write at the bottom. Any questions about this scheme?
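The tree construction can be implemented without drawing a tree. The following Python sketch numbers the leaves of the complete binary tree of depth M from 0 to 2^M - 1 and, for each code length in sorted order, takes the leftmost block of 2^(M - L_i) leaves that is still free. The function names are made up for this illustration, and because the sketch always goes left, the codes it produces for the lengths 1, 2, 3, 3 differ from the ones on the blackboard (where the second step went right), but the code lengths are the same.

```python
def prefix_code_from_lengths(lengths):
    """Construct a prefix-free code with the given code lengths.

    Raises ValueError if the lengths violate Kraft's inequality.
    Returns the codewords as bit strings, in the order of `lengths`.
    """
    max_len = max(lengths)
    # Kraft's inequality, multiplied by 2^M: the requested subtrees must fit into the tree.
    if sum(2 ** (max_len - length) for length in lengths) > 2 ** max_len:
        raise ValueError("code lengths violate Kraft's inequality")
    codes = [None] * len(lengths)
    next_free_leaf = 0
    # Process the lengths smallest first, so subtree sizes only shrink
    # and every subtree starts at a properly aligned leaf.
    for i in sorted(range(len(lengths)), key=lambda j: lengths[j]):
        length = lengths[i]
        # The path to the subtree is given by the first `length` bits of its leftmost leaf.
        codes[i] = format(next_free_leaf >> (max_len - length), "0{}b".format(length))
        next_free_leaf += 2 ** (max_len - length)   # this part of the tree is now used
    return codes


def is_prefix_free(codes):
    """True if no codeword is a prefix of a different codeword."""
    return not any(
        a != b and b.startswith(a) for a in codes for b in codes
    )


if __name__ == "__main__":
    codes = prefix_code_from_lengths([1, 2, 3, 3])
    print(codes)                  # ['0', '10', '110', '111']
    print(is_prefix_free(codes))  # True
```

Sorting the lengths first is what guarantees that each block of leaves lines up with a real subtree of the tree.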
It's a beautiful scheme and it works, and I have to underline this above. And you can also imagine implementing this, right? You don't have to draw a tree for that, and that would also be a nice exercise. So you're given these four lengths, and you can do it for any lengths. And you also get more intuition now why this inequality makes sense. This inequality is exactly the reason why you can pick one subtree after the other and there is still enough of the whole binary tree left. Any questions? Now we are kind of at the peak of complexity, from now on it becomes easier. We have proven the central lemma, now we can prove the source coding theorem. First we want to show: for any encoding, the expected code length can't get better than the entropy. So what does expected code length mean? Our symbols have a certain distribution now. The probability of symbol i is p_i, and the p_i sum to one. Let me just write that here. So it's some probability distribution, the p_i always sum to one. And by Kraft's inequality we have shown that for any prefix-free encoding we have this property on the code lengths. And now for this thing here, the expected code length, the sum of p_i times L_i, you want to show it's at least something. This is again Lagrange multipliers, and it's exercise one of the fourth exercise sheet to show that this becomes smallest if the code lengths L_i are equal to log2 of 1 over p_i. That's how it becomes smallest. This is the exercise, and we can just plug it in now. Actually I already did it for you here, so no writing needed. So if each L_i is equal to log2 of 1 over p_i, then this becomes smallest, which is why you have this here, because of the Lagrange argument. So this is the smallest possible value, the sum of p_i times log2 of 1 over p_i, and that's just the definition of the entropy. And here you get intuition of why the entropy is defined that way, because it just comes out here as the minimum value of the expected code length.
So we actually get it for free from what we have. Well, not for free, you have to show this with Lagrange. So this is the one direction. For the other direction there is a little bit more to show. You can always achieve entropy plus one, and now we will see why the plus one, it's a very simple reason. Just to clarify, now the p_i are given and we want to find a good code, so that it's clear in which direction we are going. We are given a certain distribution, which determines the entropy, and now we want to find a good code for that distribution. And now I'm just saying, well, we have seen this log2 of 1 over p_i, let's just pick our code lengths like this. Now we have to decide, do we round up or round down? I claim we have to round up. Why? Well, we have to satisfy Kraft's inequality, and let's just do it here. Sum over i, I don't specify the range here, I just leave it open like this, of 2 to the minus L_i, and if we plug this in, that's 2 to the minus log2 of 1 over p_i rounded up. Now x rounded up is greater than or equal to x, of course, and it's here with a minus, so this is less than or equal to what we get if we just drop the rounding up: the sum over i of 2 to the minus log2 of 1 over p_i. And now I'm getting close to the border. Oh my, do I have, yeah, I have some space here. Minus log2 of 1 over p_i is log2 of p_i, so this is equal to the sum over i of 2 to the log2 of p_i. Do I still have space here? Yes, I do. And 2 to the log2 of p_i is just p_i, so it's the sum over i of the p_i, and this is just 1. And this only works if I round up, right? Oh, my mouse disappeared. If I didn't round up, I wouldn't have less than or equal here, I wouldn't satisfy Kraft's inequality. So rounding down does not work, I have to round up here, that's important. And then I get at most one. So, yeah, that's written here, what I have just proven.
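Written out as one chain, the blackboard computation above reads as follows. This is a sketch of the same argument in formula form, assuming all p_i are positive and sum to one, and with L_i chosen as log2 of 1 over p_i rounded up.

```latex
\sum_i 2^{-L_i}
  = \sum_i 2^{-\lceil \log_2 (1/p_i) \rceil}
  \le \sum_i 2^{-\log_2 (1/p_i)}
  = \sum_i 2^{\log_2 p_i}
  = \sum_i p_i
  = 1
```

The only inequality in the chain is the step that drops the rounding up, and it goes in the right direction only because the rounded-up exponent appears with a minus sign.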
So now I've just said, let's try these code lengths, and there is an encoding with these code lengths, we have just seen how to construct something like this. And now, by definition, the expectation is the sum of p_i times L_i, and let's look at what the expectation is for these particular code lengths. I hope there's nothing more written here, that's correct. Sum over i, so it's the expected code length: symbol i occurs with probability p_i and it has this code length, log2 of 1 over p_i rounded up. I had to round up, otherwise Kraft's inequality would not hold. Now I want to upper bound it, I want to say this is not more than something. If I round something up, it can increase by at most one compared to the not rounded up number, right? Not even a full one, it's even strictly less than one. So this is less than or equal to the sum over i of p_i times, in parentheses, log2 of 1 over p_i plus 1. If I multiply this out, I have the sum of p_i times log2 of 1 over p_i, without the rounding up, and I have just the sum of the p_i again. Well, and the nice thing is that the first sum is exactly the entropy, that's how we defined entropy, and the second sum is just one. And this is where you have the plus one. The plus one comes from the fact that you have an ideal code length according to the formula, but the ideal is not an integer, so you have to round it somehow, and you have to round it up, otherwise Kraft's inequality doesn't hold, and by the rounding up you lose this plus one here. But it's nice that it's just a plus one and not plus the number of symbols, because what you sum up here are the probabilities. So to really understand it you have to do it yourself, of course, but I think you got a lot of intuition here. And now just some hints for the exercise sheet and then we are done. One of the exercises will be to do this Lagrange argument, we have already seen it.
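Both bounds can be checked numerically for a concrete distribution. The sketch below, in Python, picks the code lengths as log2 of 1 over p_i rounded up and compares the expected code length with the entropy. The distribution 0.4, 0.3, 0.2, 0.1 and the function names are assumptions made for this illustration.

```python
import math


def entropy(probs):
    """Entropy in bits: the sum of p_i * log2(1 / p_i)."""
    return sum(p * math.log2(1 / p) for p in probs)


def rounded_up_lengths(probs):
    """Code lengths L_i = log2(1 / p_i), rounded up to the next integer."""
    return [math.ceil(math.log2(1 / p)) for p in probs]


def expected_length(probs, lengths):
    """Expected code length: the sum of p_i * L_i."""
    return sum(p * length for p, length in zip(probs, lengths))


if __name__ == "__main__":
    probs = [0.4, 0.3, 0.2, 0.1]
    lengths = rounded_up_lengths(probs)
    print(lengths)                            # [2, 2, 3, 4]
    print(sum(2 ** -l for l in lengths))      # Kraft sum 0.6875, at most 1
    print(entropy(probs))                     # about 1.85
    print(expected_length(probs, lengths))    # about 2.4, between H and H + 1
```

For this distribution the expected code length lies between the entropy and the entropy plus one, and the Kraft sum stays below 1, so a prefix-free code with these lengths exists.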
The other will be... Can we close the door? It's just a few more minutes, then we are done. The last slide. So we have seen that if you are given a distribution, then the best code length is log2 of 1 over p_i, for the reasons we have seen. And now we turn it around: we are given a certain code and we want to know for which distribution this is a good code. And it's a good code if this inequality here holds, L_i less than or equal to log2 of 1 over p_i plus 1, because then the expected code length is at most the entropy plus one, we have shown that. So if you have symbols with this distribution and your code has these code lengths, then it's a good code for that distribution, by what we have shown. To say that again: if a code is entropy-optimal for a distribution in this sense, then the expected code length is optimal up to the plus one. Okay, so you have to prove that. I will show it for one example, and you have to show it for another encoding for the exercise sheet. So you have to prove this inequality here with a plus one. For the exercise sheet we make it a bit simpler, you can prove plus five or plus whatever makes it easiest for you. In computer science we always allow ourselves constants to make things easier. So let's ask this question: Elias gamma, for which probability distribution is it a good encoding? It's not always a good encoding, but there is always a probability distribution, several, but at least one, for which it's a great encoding. Well, how do we answer that question? This is the code length of Elias gamma. We have seen that you encode number i and you get something like 2 log2 of i. If you remember it from the beginning, there is the prefix which says how many bits, and then comes the binary representation of the number, which has about log2 of i bits, and the prefix also has about log2 of i bits, and then you have a plus one because that's what came out of it. So now we want to find the distribution such that, well, this is what was written on the previous slide, this L_i is less than or equal to log2 of 1 over p_i plus 1.
So that's the question you are asking here: I want to find the p_i such that this holds. And actually for the exercise sheet it's easier, we're already giving you the distribution and you just have to prove it, but here I'm showing you how you would do it if you don't know the distribution already. Now if you look at it, do you have an idea which p_i you would choose such that this inequality holds? 2 log2 of i is less or equal than log2 of 1 over p_i. You probably need a bit of experience with computing with logarithms and so on. So if you just take equality here, what would come out for p_i? Ignore the rounding and take equality here. What would be a good p_i? You can also write it in the chat. Let me just write it here, let's forget about the rounding and say we want to make it equal. What's p_i so that this holds? Yeah? One over i squared, that's correct. And now, so that it's an inequality, the right side has to be greater or equal; let's just do the math here. So if p_i is at most 1 over i squared, and let's just ignore i equals 1 for a second, then 2 times the rounded down log2 of i is less or equal than 2 times log2 of i if you drop the rounding down, because rounding down only makes it smaller; 2 times log2 of i is log2 of i squared, and that's at most log2 of 1 over p_i. So 1 over i squared is the probability distribution. And the inequality also holds for i equal to one, where you can choose p_1 as you like, because for i equal to one, log2 of i is zero. So the left side will always be smaller, no matter which p_1 you pick here. Okay, so we are just taking p_i equal to 1 over i squared, and we can choose p_1 as we like. I could talk about it for a while, it's a very nice proof to show what the sum of the 1 over i squared is.
It actually converges. The infinite sum of 1 over i does not converge, the sum of 1 over i squared does converge, and strangely enough, you have the number pi here, pi squared over 6. Why does pi squared occur when you sum up the reciprocals of the squares? It's a really nice proof, I could give it to you, but no time unfortunately. But you see that it's a number larger than 1 but not larger than 2, which means if you remove the first term, which is 1, you get 0.6449. So you just define p_1 as 1 minus 0.6449, and then you get a probability distribution, and you can say that Elias gamma is entropy-optimal for that probability distribution. And now it comes back to the beginning of the lecture, so this is what you will do for the exercise sheet. If you do a gap encoding, then you get smaller numbers and larger numbers, but the smaller numbers are much more frequent, and actually the numbers will have this distribution here: 1 minus p, to the power of i minus 1, times p. So that's the distribution you get. And for that kind of distribution with a certain p, where p depends on the length of the inverted list, Golomb encoding is the best one. So that's a really nice result and you will prove it for the exercise sheet. So you have gap encoding, and you assume that the IDs in the list are somehow randomly distributed, which they are not, but it's a reasonable assumption. Then the best encoding is Golomb encoding, and you can even say exactly for which modulus. So the mathematics here not only tells you the right encoding but also the right parameter, and this is this exercise.
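The claim about Elias gamma and the distribution 1 over i squared can be checked numerically. The following is a small sketch, not part of the lecture material: it uses the closed form pi squared over 6 for the sum, sets p_1 to one minus the rest as described above, and tests the inequality L_i <= log2(1/p_i) + 1 for a range of i. All names are made up for this example, and the small tolerance only guards against floating-point rounding.

```python
import math

# Sum of 1/i^2 over i >= 2, i.e. pi^2/6 minus the first term 1 (about 0.6449).
TAIL_SUM = math.pi ** 2 / 6 - 1


def probability(i: int) -> float:
    """p_1 takes the leftover mass, p_i = 1/i^2 for i >= 2."""
    return 1 - TAIL_SUM if i == 1 else 1 / (i * i)


def gamma_length(i: int) -> int:
    """Elias gamma code length 2 * floor(log2 i) + 1."""
    return 2 * (i.bit_length() - 1) + 1


def within_bound(i: int) -> bool:
    """Check L_i <= log2(1/p_i) + 1, with a tiny tolerance for rounding."""
    return gamma_length(i) <= math.log2(1 / probability(i)) + 1 + 1e-9


if __name__ == "__main__":
    print(f"p_1 = {probability(1):.4f}")
    print("bound holds for i < 10000:", all(within_bound(i) for i in range(1, 10000)))
```

A finite check like this is of course no proof, but it is a quick way to catch a wrong guess for the distribution before trying to prove anything.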
Let me just mention that in practice you would probably use a variant of Golomb encoding, because Golomb encoding is one of these schemes where the codes can go across byte boundaries. In practice you would always choose codes which pay attention to byte boundaries, but you can always make variations of codes which do that, so it's still useful theory. Any question about this last part, entropy optimality? That's what the exercise sheet will be about, and also another Lagrange. I think it's easier than the last sheet. The next sheet will be practical again, but use the occasion to brush up your math. And there will be a Q&A. Maybe Q&A session on Friday is a misnomer, because it will be a session where you can also get explanations again, or anything can happen. If you have suggestions for how this session should be organized, please let us know. And if you are one of the 20% who said you want help, you should definitely come to this session on Friday because there are several people there. We have breakout rooms, so you can also go to a room with someone and get explanations just for you if there are individual problems. So we can accommodate that. Any questions for now? Okay, so that's it. Have fun with the sheet. See you in one week. Bye bye.
Welcome everybody to lecture five, information retrieval in the winter semester 22/23. The weather is not holding up, there is snow on the mountains, and I can still hear you talking, thank you. So I will say something about your experiences with the last exercise sheet, which was about compression codes, entropy and beautiful mathematics. And today we will talk about fuzzy search. We have prepared a fascinating lecture for you with a lot of new material. This time there will be quite a bit of math, basic but beautiful mathematics in the lecture. But then there will be an algorithm and your task will be again to implement something, so I guess several of you will be happy with that.
And we have also compiled a beautiful data set for you and we will talk about that in a second. So, the feedback about the last exercise sheet. The first exercise was about Lagrange again, so a second time to practice it. Many of you liked that. You had more problems with exercise two. And as usual there's mixed feedback, so it was a mathematical sheet. Mathematics is cool because it's so hard in interesting ways. It's a matter of willingness, more time is helpful, very complicated tasks, spent four days trying to solve it. So several people said that they spent a lot of time on exercise two, which was about proving this inequality, which is why I have an extra slide where I want to show you something. Needed a lot of time to show this part; that's actually what my slide will be about. I hope there will be more coding sheets and less math. Then I made this comparison with the arithmetic expressions. That's of course correct: if you have this arithmetic expression as I had it on the last sheet, you basically evaluate it. Even if it's complex, you evaluate a part and it becomes less complex, so it's strictly monotonic in that sense, whereas a math problem is a little more of a search. But I still argue that it's pretty mechanical, that's one of the points I want to make in two slides. And somebody said the exercise was more about math skills than about compression. That is true, but since many of you are still having trouble with the math skills, that's hard to avoid. This is already a simpler version of a simpler version of an exercise which we had several years ago. So yeah, you have to start wherever you are. Then there was this question about your relationship with mathematics and you gave super interesting answers. Many of you said something like this, that you actually like it a lot. Basically everybody said that, but you want to make more progress and you wonder how you do that. More examples would be helpful.
It was amazingly interesting what you wrote, so thank you very much for that. Many of you mentioned time as an important factor, so just needing more time for the exercises, for learning. Somebody even said time is really the only thing, no amount of explanation from me helps, you just need time. Several of you wrote that you miss background, and I agree. Only few good teachers teach mathematics intuitively, or taught it that way in the past. My love for math increases. It got and gets better with practicing. Problems with logarithm rules, these were important for exercise two. They feel so unintuitive. So for me a very interesting bottom line was that there's not so much of an "I don't like math, please skip it". It's more like "I really like it", but wishing to have a better understanding and more time, time, time, time. Let me not do the whole exercise two, I would love to show it to you, but time is limited. But I want to show you one thing which I feel might help some of you. So the task was to prove this. And of course the one who said that is right, you could just forget everything about compression and just prove a mathematical inequality, but you can say that about everything. Of course it was implicitly part of the exercise to understand why you are proving that. I mean, this whole thing is showing that Golomb encoding is entropy-optimal for a particular distribution, the one from gap encoding, but that was kind of given to you and then you had to show the math behind it. So it was like an expression and you have to plug in stuff, and it looks a little bit ugly but simple, I mean it's something divided by something else.
M was given, and then you have something rounded down in the denominator, you have something rounded up and so on. I already wrote that in the forum, and I think several of you found it helpful, that you start by writing it down and doing the obvious simplifications. Like, you have a ceiling or a floor, so rounding up or rounding down, and you want less or equal, so you can just drop the rounding down. I don't want to show that to you now, but I want to show something else, because several of you mentioned it. Let me see whether that pen works. I didn't test it before, but I'm hopeful that it works. So after the simple simplifications, the simple transformations, you end up with something like this. I have to write a little bit smaller. Where do I write it? Maybe I write it here. You have to show something like: i minus 1, times p over ln 2, is less or equal than, and I think it was log2, log2 of 1 over 1 minus p, to the power of i minus 1. And if you did these simple things, there was still this hint, like a joker in Who Wants to Be a Millionaire: for all x, 1 plus x is less or equal than e to the x. And many of you said, yeah, all fine, what I wrote on the forum, I got there, I got to something like this, but I actually spent all my time here. And so the question is, how do you prove something like this? You see this and you say no way. I mean, what does the left side have to do with the right side? There's a p here and here's the p and the logarithm, it just doesn't look right. And I want to show you two things: a bit of the thinking process, and then also a little bit about how to do this technically. So let's first start on the left side. The first thing is, let's get from here to something we know. So maybe I write it again but a little bit transformed.
So I have this i minus 1 here, times p over ln 2. And what's very important now: I'm just doing transformations and I'm not showing that something implies something else. So what I'm showing here is not a proof. I don't know if I can write this here. So: not a proof yet. This is just derivations to get some intuition. And this is I think how you should do that. And the right hand side then will be the proof. And I think it's important to differentiate between the two. So one thing you can do: okay, there is the i minus 1 in the exponent inside the logarithm. Many of you rightly said basic, basic rules. You have to know that if you have an exponent inside the logarithm, then you can just put it in the front, right? So if you have something like log of x to the y, it's y times log x. Of course, if you don't know that, then I agree it's hard, but that's what I would call basic arithmetic, especially in computer science mathematics, right? These laws of logarithms, when you have logarithms, you need them all the time. So that already gives you the i minus 1 in the front. And another thing is, if you have log to the base b of a, then it's just log of a divided by log of b. And again, I would call that basic arithmetic operations, like plus, minus, divide, you should just know that. If you don't know these things, I agree, then it's hard. So what do we have now? Now the i minus 1 comes to the front here, let's just take the ln because we also have the ln on the left, and then we have ln of 1 over 1 minus p, divided by ln 2. So by doing, I would say, rather obvious things, I mean there is the log2 and you have an ln there, so let's apply that rule, and just pulling the exponent to the front, you have this here. So now you have i minus 1 here and i minus 1 here, and okay, let's drop it, it's the same, that's fine. ln 2 here, ln 2 here, let's drop it too.
So let's just continue. What we have left is: p is less or equal than ln of 1 over 1 minus p. And very important, I'm not proving something here, I'm just transforming on the left side to get somewhere, to something I know, and to see whether it works out that way. Many of you are confusing getting this idea for the proof with proving. I will come back to it at the end. Let's just do another transformation. Take e to the power of everything, so this gives me e to the p less or equal than 1 over 1 minus p. Now let's take the reciprocals, because you know you have that hint, and yeah, bingo I would say. You're already there. The hint says 1 plus x is less or equal than e to the x, here it says 1 minus p is less or equal than e to the minus p, and that's exactly the hint when you plug in minus p for x. So you arrived at something which you know, so it's like bingo. And I would say these are rather simple and also obvious transformations. And now the other thing which I wanted to show. I see that a lot in what many of you submit, but also in exams. You write something like this, and maybe I write it and then I delete it again. Then sometimes you are writing this or maybe this or this or something. Don't do that, this is not a proof, this is just getting the idea. I mean, you would have to be very careful about whether this implies this or the other way around. Actually you need the other direction, right? Because you are coming from a hint and you want to prove that. So when you do something like that, and that's very typical in this kind of mathematics, you have this here on the left hand side to get some intuition. And now that you have it, now you can do the proof. And I do that on the right hand side. So now I know, aha, I should start with this. So let's start with this: 1 minus p is less or equal than e to the minus p.
And this is because of the hint. So we start with this; we know it for all x, so it also holds for x equal to minus p. And now let me just turn this the other way around, so e to the p is less or equal than 1 over 1 minus p. This is just taking the reciprocals on both sides and then switching sides. If I take the reciprocal, this becomes 1 over 1 minus p, and here the sign in the exponent changes. And then I switch the sides, so I do two things in one here. And this is now an implication, right? I'm coming from here to here. Now I apply the logarithm, so this is applying ln now. Let me not write it but just say it: I'm applying ln to both sides, can I do that? Yes, I can do that. So this implies this, this is now a proof. Now let me multiply by i minus 1 on both sides, so this is i minus 1 times p, and let me do two things at the same time here: if I put the i minus 1 into the log, it becomes the exponent. So it's just multiplying both sides by i minus 1, I can always do that. And now let me divide by ln 2, and this is again an equivalent transformation: now it's ln of this divided by ln 2, which is the log2, so that is also bingo. And now I have even proved it. And this was like the hard part of the proof. And it's interesting, so two things here. One, how do you get to something like this here, something where you wonder, does it really hold? Try some transformations. And I'm arguing you are applying here like a box, a stack of basic tricks, maybe 20 tricks, among them the typical logarithm rules. And yes, it's a little bit of search, but I would argue it's not too much of a search. There's really not so much you can do here.
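Written down in the proof direction, with the arrows that are so often missing, the chain looks like this. It is a sketch of the right-hand side of the board as described above, assuming 0 < p < 1 (so that 1 - p is positive and the reciprocal and the logarithm are allowed) and i >= 1.

```latex
\begin{align*}
 & 1 - p \;\le\; e^{-p}
   && \text{hint } 1 + x \le e^{x} \text{ with } x = -p \\
 \Rightarrow\quad & e^{p} \;\le\; \frac{1}{1-p}
   && \text{reciprocals of positive numbers, sides switched} \\
 \Rightarrow\quad & p \;\le\; \ln \frac{1}{1-p}
   && \ln \text{ is monotonically increasing} \\
 \Rightarrow\quad & (i-1)\, p \;\le\; \ln \left( \frac{1}{1-p} \right)^{i-1}
   && \text{multiply by } i - 1 \ge 0,\ \ y \ln x = \ln x^{y} \\
 \Rightarrow\quad & \frac{(i-1)\, p}{\ln 2} \;\le\; \log_2 \left( \frac{1}{1-p} \right)^{i-1}
   && \text{divide by } \ln 2 > 0,\ \ \log_2 a = \frac{\ln a}{\ln 2}
\end{align*}
```

Each line follows from the one before it, and the side conditions on the right are exactly the places where one has to check that the step is allowed.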
Apply these tricks until you get to something you know. But then, let me say that again because it's so important and I see it so often, this is not a proof yet, right? This is just trying to get the idea for the proof, and once you have this, you can start from what you know and check whether it's really going in the right direction. Is this an implication? Maybe you are dividing by zero somewhere. This on the right is the proof. This on the left is just for getting the idea of the proof. Many, many of you, I see that a lot, are confusing the two. If you write this, it's not yet the proof. It's just getting the idea. Arrows are missing here and so on. Though very often the proof is just this thing here in the other direction. This took a little bit of time but I think it's super important. One more slide and then maybe an opportunity for you to ask questions. So my feeling is, and I think it concurs with what many of you wrote, that it would help very many of you to develop these super basic practical skills. These logarithm laws should be just as natural as multiplying or dividing or adding numbers. And let me say it again because it's so important, it's not that there are 1000 of these rules or tricks. I think it's maybe, I don't know, 20 or 50, it's a finite number and a small finite number. And my feeling is that in the math lectures this is just taken for granted, and that's a problem. It's assumed that you know that, but many of you have missed this somehow, and then you never have time to learn these very basic things. And probably we would need a lecture in the curriculum which just does that: practice this absolute basic stuff, because that's what many of you I think are missing, the basic stuff and also the practice with it. And actually I also contend that math skills and programming skills are really very, very similar. Programming is also not so easy. You have a task, how do I do that?
You have your basic toolbox and now you are trying out a number of things, right? It's always a little bit of search, but not a huge search space. I mean, you have your five tricks and you try them out. And in math it's also like this. You have to prove this, you have your ten tricks, just apply them until one of them works. Not so different from programming actually. And you also know when it works, you also have that feature. If you arrive at something which you know, in this case the hint, then it's like your program compiles. And of course with more experience, and if you are super clever, you can do it faster and more elegantly. That's also like in programming: if you're very experienced, you can very quickly write down a program that works. But the important point is, in programming as in math, with just trying out and enough time you can always do it somehow. So, corollary: liking programming and liking math, I think that's really the same thing, and your feedback kind of confirmed that. Okay, so that was a little more time than usual but I think it was important. Is there any question or comment from your side? Yes please. ... and I'm starting to learn what they are through these exercises, but how do I go from that to actually learning them? Like, do you suggest any places where you can start practicing these skills, or is that something you would have to google around and figure out yourself? It's a very good question. The question is, where do you practice these? You're saying you're learning this as part of these exercises. I don't have one specific pointer right now, but I think there are enough sources or videos or I don't know what, where you can practice proving these little things. It's really this basic, it's like doing calculations in your head.
I think there are probably many places where you can do that. The limiting factor is probably time, right? Everybody is so stressed, you have to do your exercise sheet, so it would take a decision like: okay, five hours per week or so, or one hour per week, I just do this and I catch up with what I missed. Because actually you should have learned this long ago in the first semester, but you didn't, so I think many of you just need to catch up, and catching up is hard because daily life is so stressful. So what I learned from this, and what I already tried to do a little bit in the past, is to sneak some basic math stuff into my lectures. Some of you then complain about it, you say it's not really information retrieval, and it's true, but I think many of you just need it. So just sneaking in basic math exercises for you to prove. I think that's a good way. Maybe it would be great to have a lecture, a whole course just about this in the curriculum, but there's no such course. Any other question or comment at this point? Yes please. The proof strategy for constant equal to one? Yeah, so there was this optional exercise to get the constant to one. That's harder. What's hard about mathematics? It's a very good question. So the question was, there was the optional exercise, prove this with plus one. And it's not linear, right? It's not just plus five versus plus one, it can become arbitrarily harder if you make these little changes. And what is hardness in mathematics? I think with this example I can explain it really well. This I would say is not too hard, because the search space is small. You are given this and there are like ten obvious tricks which you can apply. You can basically try them one after the other in a few different combinations and then it will just work.
And when it's harder, it means that the search space is larger and you have to use something maybe unexpected. That's what makes it hard and that's where experience comes in. You have maybe 100 tricks which you could apply in different orders, and then combinatorially it's a lot. And then it becomes about knowing: oh, in this situation this often helps, let me try this or that. So that's why the plus one thing is not easy and why it was really optional. It's also still basic mathematics, and I don't have it in my cache right now, but it's not as easy as this. So if you are interested, I can post it somewhere. I think many years ago it was even the task to prove that, with more hints. Any other question or comment about this very important topic? I would love to give just a lecture about basic math stuff. And it's nice to hear that you like it and just want to learn it better. Okay, so let's get on with the lecture. Probably we'll need a little longer today. And you can already think about this: the last question on this exercise sheet is how stressed you are feeling. We are considering having no lecture on December 6th, that's two weeks from now. So you would have a little more time to practice your math and also two weeks to work on the exercise sheet from the week before. Let us know what you think, whether you think that's a good idea or a bad idea. So maybe we take a little longer today, and then we have a week of fuzzy search. It's a super nice lecture today, more mathematics in the lecture, more practical in the sheet. I hope you like that. So the problem is, you're given... and let me show it, we will also do maybe a little coding in the end, maybe not. We have this beautiful data set which is linked on the... I've shown you Wikidata in the very first lecture. Information retrieval in the winter semester.
This one, data sets, Wikidata. Here's the data set. There is a lot of information here, so let me just cut out the first column and look at that. It's just entities from Wikidata. It starts with the countries, but it's basically everything, not everything, but many things you have a Wikipedia article about or a Wikidata entity about. So here you have some dates, you have mobile phones, Scandinavia, some countries. We will often call these words in the following, but it's just strings, right? So space is also a character. It's not important for now that this is two words. It's just a string and space is one character. So consider each of these as one thing, as one word, even if it's two words. And we have quite a lot of these. Let's just look at how many by doing this with line numbers: it's 2.6 million names of things. And what we want to have today is that you type something and then you find matching names, and you want to be able to make mistakes, because that's very important. So the simplest form is prefix search: you type something, Frei, we will ignore case, and you want to find Freiburg. Then maybe you are mistyping, Breifurg, and you want to find Freiburg; very often you mistype, you don't know how exactly it's spelled. And the ultimate thing, and that will be the exercise sheet, is that you just type a prefix and you mistype in the prefix, so fuzzy prefix search. And that's what we will do today. It's a very cool feature which you basically want everywhere where you select something from a long list. You type Brei and you get Freiburg, that's what we want today. And you want it from this list of entities. So you're looking for, I don't know, Sapporo Maruyama Zoo, and you type it with typos, just Sapporo with one p, you don't type everything, and you find this. Okay.
So we will solve these two, fuzzy search and fuzzy prefix search; they will be very similar, and prefix search is just a special case: if you can find something with typos, you also find the things without typos. There will be two challenges, similar to what we had in the beginning for keyword search. One is, what does it mean that something is a good match? We have to define a similarity measure, Breifurg similar to Freiburg. And then we have to do this fast somehow. And where do these dictionaries come from? One source is, you have your particular application, you want to select an entity from Wikidata, then it comes from that list. Another possible origin: you have a big text collection and you extract common phrases from there. And of course, what people search. You have a search engine with a query log, you know what people search, and you suggest from that. That's basically how Google does it, right? They know what people search and when you type something you get completions from that. And this is what we have here, a given list of entities. What's a simple solution? When you do more complicated stuff, it's always important to understand what the trivial solution is, maybe it's good enough. Well, the trivial solution is, yeah, we can actually do it. I can do a grep here, with minus i, which is case-insensitive, for Freiburg. And let me maybe take Breifurg and say I'm allowed to make two mistakes. Okay, now NFS, teaching, information retrieval. Oh no, I just want the first column. There is more information in these files which you don't need for this exercise sheet. I will come back to that, so I really just want the first column. And also for this exercise sheet you just want the first column. Let me just pipe this into agrep and see what I get. Okay, I have to install agrep.
Agrep is approximate grep. So grep just finds you all lines matching a pattern, and agrep finds all lines approximately matching it. Okay, I find Bleiburg but no Freiburg, interesting. So Freiburg is not an entity in there, probably because it's called Freiburg im Breisgau or something like that. Let's try that. Yeah, now I find Freiburg im Breisgau. You see, it's actually not that bad. Agrep is pretty fast, and it's not a huge list, but if you have a larger list then this can be a problem. So agrep is a tool. All matching lines will be output; this option means match the whole word, that's the minus x, and this one means up to edit distance 2. So what's the time complexity? You're going through all the lines and you have to do something for each line. You have to do a check, and you have to do the check with an algorithm which I will show you next. It will take about one microsecond I would say, depending on how exactly you do it, to check whether two words are similar up to two typos or something like this. So here is just a back-of-the-envelope calculation: something like this is fine if you have a smaller set, and if you have a larger set you need to do something more fancy. The comparison is a bit unfair here, because agrep is of course a program written in C. For the exercise sheet you will do it in Python. If you implemented this in Python, it would take 10 seconds or 1 minute, I think. We already saw that huge performance difference. So this basic approach, just going through all the elements and checking for each of them whether it is similar, would take very long in Python. So it would not be practical. And that's what the whole lecture today is about, how to do this in a more clever way. It's a very fundamental algorithm. First we have to talk about similarity between two words, and after this part we will have our break. And here's a very fundamental measure, introduced by Levenshtein and also others already in 1965.
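The back-of-the-envelope calculation can be redone in a few lines. This is only a rough sketch: the list size and the one microsecond per check are the rough figures mentioned above, the 10 seconds and 1 minute are the guessed Python scan times, and the function names are made up. A real agrep run may well be faster or slower than this estimate.

```python
def scan_seconds(num_strings: int, seconds_per_check: float) -> float:
    """Time for one query if every string in the list is checked once."""
    return num_strings * seconds_per_check


def implied_seconds_per_check(total_seconds: float, num_strings: int) -> float:
    """Per-string cost implied by a given total scan time."""
    return total_seconds / num_strings


NUM_NAMES = 2_600_000      # number of names in the list (figure from the lecture)
COMPILED_CHECK = 1e-6      # "about one microsecond" per check, a rough figure

if __name__ == "__main__":
    print(f"compiled scan: {scan_seconds(NUM_NAMES, COMPILED_CHECK):.1f} s per query")
    for total in (10, 60):  # guessed total time for a pure-Python scan
        per_check = implied_seconds_per_check(total, NUM_NAMES)
        print(f"{total} s per query means {per_check * 1e6:.1f} microseconds per check")
```

The point of the calculation is the linear dependence on the list size: whatever the per-check cost is, a ten times larger list makes every single query ten times slower.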
There's always one name associated with these things, but actually, I mean, anyone could have come up with that at the time, and there's always somebody who is called the inventor. That's just how humans like it. They always want to associate an individual with things, although the opposite is true, right? Everything we do is pretty collective. So the edit distance between two strings is about getting from one word to the other with a sequence of the following three operations. We have a whole lecture about this in algorithms and data structures. Here we'll just do this very quickly. So let's say I want to go from the word doof to the word blöd, which I write as bloed. They are semantically similar, syntactically not so similar. And now you have a sequence of operations: you can replace a letter, you can insert a letter, you can delete a letter. So let's just do a sequence here. Let's replace the first letter, so that would be a replace operation here. I replace the first character by a B. I don't have to write down what I replace. And I left too little space here, I'm sorry. We want to get to bloed, I will write it down again in the end. I just need a little more space. So now I'm replacing the second character, and that's replacing it by an L. Now I need the E, and there are several ways to do it when you think about it. It's not a unique sequence. Now I'm inserting something at position 4. Insert. What am I inserting? An E. And now I'm replacing the last character, the fifth character, by a D, and then I get where I wanted to get, to bloed. And that's how the edit distance, no, it's not yet how it's defined. So now I have a sequence of four operations here.
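The sequence of operations can be replayed in code. This sketch assumes the two example words are spelled doof and bloed (with ö written as oe, which is what makes the fifth position exist) and uses 1-based positions as on the slide; the function names and the tuple format for the steps are made up for this example.

```python
def replace(word: str, pos: int, char: str) -> str:
    """Replace the character at 1-based position pos by char."""
    return word[:pos - 1] + char + word[pos:]


def insert(word: str, pos: int, char: str) -> str:
    """Insert char so that it ends up at 1-based position pos."""
    return word[:pos - 1] + char + word[pos - 1:]


def delete(word: str, pos: int) -> str:
    """Delete the character at 1-based position pos."""
    return word[:pos - 1] + word[pos:]


def apply_steps(word: str, steps: list) -> list:
    """Apply the operations one after the other and return all intermediate words."""
    trace = [word]
    for operation, pos, *char in steps:
        if operation == "replace":
            word = replace(word, pos, char[0])
        elif operation == "insert":
            word = insert(word, pos, char[0])
        elif operation == "delete":
            word = delete(word, pos)
        else:
            raise ValueError(f"unknown operation: {operation}")
        trace.append(word)
    return trace


# The four operations from the example, in the order they were written down.
STEPS = [("replace", 1, "b"), ("replace", 2, "l"), ("insert", 4, "e"), ("replace", 5, "d")]

if __name__ == "__main__":
    print(" -> ".join(apply_steps("doof", STEPS)))
```

Replaying the steps only confirms that four operations are enough; it says nothing about whether three would also do.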
And now, important, just so that we are clear: this does not yet prove that the edit distance is 4, this proves that the edit distance is less than or equal to 4. Maybe there's a sequence of three operations, right? I've just given you one with four, so the edit distance is four or less. We haven't proven more, and it's actually not so easy to prove; that's what the next slides are about. This only shows ED(x, y) ≤ 4. How do you prove that you can't do it better? So actually in this case the edit distance is 4, and that's not so easy to see, that's why I like that example. And it's also not so easy to prove. As I said, in the basic Algorithms and Data Structures course there is a whole lecture just about this and the theory behind it. For today we just take that for granted, but I want to show it to you quickly because we need it, so you should understand the basics. A basic notation: when you talk about strings you want to have something for the empty word, because if you just write nothing then it's hard to tell whether nothing is written there or whether you mean nothing. So you have a symbol for it, epsilon, the empty word. Substrings are denoted by this notation here, square brackets and then from here to there. We're starting indices with one, so it's not zero-based but one-based, just more intuitive, not how you would program it. Here are some simple properties, also popular exam questions, so when you prepare for the exam think about this. And for practicing math, proving something like this is also great. I always have a lot of these on the slides, little properties to prove. So, proving formally that the edit distance is symmetric. Why is it symmetric? Let me ask you that. Why is the edit distance from x to y the same as the edit distance from y to x? Yes? Because if you can transform one to another with some number of edits, you can always invert it by... Yes.
The answer was: if you have a sequence going from one word to the other, you can invert the sequence, and every operation has an inverse. Replace this character by that has an inverse, insertion becomes deletion, deletion becomes insertion. That's the intuition for the proof. Now it's nice practice to turn this into a proof. We are not doing that here. So behind each of these is a small little proof, a great way to practice mathematics. I'm not asking you here in the interest of time. The edit distance to the empty word is always the length of the word, because what can you do better than inserting one character after the other? Nothing. If you have two words and they have different sizes, one has size 5, one has size 8, you need at least 3 operations, because each operation can increase the string length by at most one. Here is something more complicated, we don't look at the details right now. Here is a recursive formula, and the whole lecture I mentioned in Algorithms and Data Structures is about this. We are not going to look at this formula, it's just on the slides for reference. And the proof is also not trivial, it's super interesting, a whole lecture about it. Let's just do it together so that you see it, and that's something you should understand. For this lecture you don't have to know how to prove this but you should be able to compute it. And let me do that right now. So we want to get from doof to bloed, and we do that by putting it into a table like this. You will understand in a second why. So we have this table here, and now I write some numbers here and I will explain them to you. Here in the first row and the first column I just write this. This was also written on the previous slide, and just to understand what it means, let me maybe use wonderful green here. What we see in this table is edit distances between prefixes of the words.
So this here is my X and this is my Y. And this is the edit distance between the empty prefix of doof and a prefix of bloed. It's the edit distance from epsilon because I'm in the first row: this row stands for the empty word, this row stands for d, this row stands for do, this row stands for doo. So if I go down here I'm getting all the prefixes of doof; there are five of them including the empty word. The word itself is also a prefix, it's just the whole word. And for bloed I have six. So this here is the edit distance between the empty word and blo. And this is why you have this trivial sequence of numbers: empty word to something is just the length of the word, right? The edit distance is just three, because what can you do apart from inserting one character after the other? It's obviously the best you can do. And now, applying the formula on the previous slide which I didn't explain, let's just do it together here. You fill in this table one entry after the other: you look at the three neighboring entries and you basically take the smallest one plus one. That's basically what the recursive formula on the previous slide said. These are the entries just above, to the left, and on the diagonal; you don't have to understand the details. And there's a little twist if the two characters at the current position are equal, and we will come to that in a second. So let's just do it together; we just apply it here and while applying I will explain it to you. Did I pick orange or red? That's the right color, okay, so it's orange. So what I do here: I take these three entries, I take the smallest one plus one, and I check whether these two letters are different. D and B are different. So here it's one, and let's just do it for a while and then check an entry to see if it's correct.
Here I have D and L; I take the smallest one, and the smallest one is one, plus one is two. The smallest one of these three is two, plus one is three. The smallest one of these is three, plus one is four. Now here I have a D and here I have a D, and in that case I don't have to add plus one to the diagonal. That's just how it works. So here I take this one, four, and I don't have to add plus one. That's the fourth case on the slide before. So actually here it's a four. And let's just check whether that value is correct. This value stands for the edit distance between d and bloed, the whole string because it's in the last column. And is it four? Yes, it's four, because one character is already there and I just insert the other four. So let's quickly fill the rest of the table. Let me just say it once more: you're always looking at these three, up, left, diagonal, you take the smallest one plus one, and if these characters are the same, then for the diagonal you don't have to add plus one. That's the simple rule, not so easy to prove, easy to apply. Here one is the smallest, so it's 2; here one is the smallest, plus 1 is 2. Here o and o, they are the same, so I can take the diagonal without the plus one, and that's 2. o and e: the smallest one is 2, so it's 3. o and d: 4. And you tell me whether I'm doing something wrong. Maybe I now do it a little bit quicker. Here o and o, the same character, and there's a 2 on the diagonal. So I can put 2 here, then 3 and 4. And now, okay, here I have four, four, here it's three, here it's three, and now it's four. Okay. Actually you don't want all these entries in the table, you just want the last one, right? It's the edit distance between the whole words. And what the recursive formula does is just define this via the edit distances between shorter strings, where at least one of the two is shorter.
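The table-filling rule just applied can be sketched in a few lines of Python. This is only an illustration under the assumption that the two words are doof and bloed; the function names edit_distance_table and edit_distance are chosen for this sketch and are not the code handed out with the exercise sheet.

```python
def edit_distance_table(x, y):
    """Return the full table; entry [i][j] is ED(x[:i], y[:j])."""
    rows, cols = len(x) + 1, len(y) + 1
    table = [[0] * cols for _ in range(rows)]
    # First column and first row: distance to the empty word is the length.
    for i in range(rows):
        table[i][0] = i
    for j in range(cols):
        table[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            # Diagonal: no plus one if the two current characters are equal.
            diag = table[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else 1)
            up = table[i - 1][j] + 1
            left = table[i][j - 1] + 1
            table[i][j] = min(up, left, diag)
    return table


def edit_distance(x, y):
    """The edit distance is the bottom-right entry of the table."""
    return edit_distance_table(x, y)[-1][-1]


for row in edit_distance_table("doof", "bloed"):
    print(row)
print(edit_distance("doof", "bloed"))  # 4
```

The two nested loops make the running time proportional to the product of the two string lengths, and keeping the whole table needs the same amount of space.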
And what this also gives you, we don't need this here, is the sequence of operations, because you always know where that minimum came from. It even gives you multiple answers if there are several. Let me just move this a little bit down because it's more beautiful that way. And is there any question about this? Yes please. So is edit distance defined as the minimum number of operations required to go from one to the other, or just any number? It's the minimum number to go from one to the other, because you can also do it with a sequence of 100 transformations, replace here, replace back, and so on. It's the minimum number. Yes? On the last slide, in the third one, you had to choose three and then add one, and then you wrote three. For this entry here? The next one. This one? Yeah. It's two plus one. If I take these three, I take the smallest one of these three plus one. So because of the two, I'm getting the three here. Okay. Yeah, it's tricky, it was a little bit fast. What you also see here is the time and space complexity: you have to fill out this table, so it takes quadratic time, or rather the product of the lengths of the two strings. So it's not a trivial algorithm. Any other question about this? Ah, in the chat, thank you. It's highly related to what people do in bioinformatics, that's true. In the Algorithms and Data Structures lecture I actually explain it via alignment between two strings, where you want to align things so that the similar parts are below each other. So yes, it's very similar, you're right. And it's a very fundamental concept; you need it in bioinformatics for similarity between gene sequences. Prefix edit distance, that's what we are doing today. Let me explain it by example, and then we see how we compute it.
First, what's the edit distance between these two words? Not prefix edit distance but what we have seen before. What's the edit distance between uni and university? Seven, I hear seven several times, and that's correct. And why is it seven? Well, the first three letters are already fine, but then this one is just seven letters longer. And what can I do better than inserting these seven letters? So it would even be simple to prove. So these words, from the perspective of edit distance, are actually not similar at all. They have a huge edit distance. From the perspective of I type something and I want to find something, they are a perfect match because one is a perfect prefix of the other. And what prefix edit distance does: it takes all the prefixes of university, so U, UN, UNI, UNIV, UNIVE and so on. And now I'm probably getting into a bit of trouble. And then for each of these prefixes I check what the edit distance to uni is. And the smallest one is the prefix edit distance. Let me maybe just write it here. So I'm computing ED(uni, u), I'm computing ED(uni, un), I'm computing ED(uni, uni), and so on. And I take the smallest of these. So what's the prefix edit distance between uni and university? I hear zero, and that's correct. It's zero because there is a prefix here, namely uni, where the edit distance is zero. So prefix edit distance zero means it's a perfect prefix. Here's another example: somebody types uniwer, so they use a W, and there's university. First, what's the edit distance between these two? What do you think? Five, that's correct. We have four more letters and you also have to replace the W by a V, and it looks like one cannot do better. So you see, computing the edit distance for simple words you don't have to do the whole schema, you kind of see it. What's the prefix edit distance? One, okay, I see you got it. Very good.
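Taken literally, the definition just used (try every prefix of the second word and keep the smallest edit distance) can be written down directly. The following Python sketch does exactly that; it is deliberately naive, the name prefix_edit_distance_naive is made up for this illustration, and the small edit_distance helper only keeps two rows of the table.

```python
def edit_distance(x, y):
    """Edit distance via the table rule, keeping only two rows."""
    prev = list(range(len(y) + 1))          # row for the empty prefix of x
    for i, cx in enumerate(x, start=1):
        cur = [i]                           # first column: distance to empty word
        for j, cy in enumerate(y, start=1):
            cur.append(min(prev[j] + 1,                 # up
                           cur[j - 1] + 1,              # left
                           prev[j - 1] + (cx != cy)))   # diagonal
        prev = cur
    return prev[-1]


def prefix_edit_distance_naive(x, y):
    """PED(x, y): smallest ED between x and any prefix of y."""
    return min(edit_distance(x, y[:k]) for k in range(len(y) + 1))


print(edit_distance("uni", "university"))                  # 7
print(prefix_edit_distance_naive("uni", "university"))     # 0
print(edit_distance("uniwer", "university"))               # 5
print(prefix_edit_distance_naive("uniwer", "university"))  # 1
```

This recomputes a full edit distance for every prefix, so it is only meant to mirror the definition, not to be efficient.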
So you just take all the prefixes of this one, and maybe let me just remove this and draw the prefix which gives you the minimum. So in this case you take the edit distance for this prefix, and here you take the edit distance for this prefix, and then you get it. So yeah, this wasn't always a feature; I think it started about ten years ago or so that you had it, because it's not so easy to compute, but nowadays you don't want to miss it. And of course, very important, note that this is not symmetric, right? The prefix edit distance from the other side is not the same. So note, let me just write it like this: in general, PED(x, y) is not equal to PED(y, x). They can be equal of course, but in general they are not. Unlike the edit distance, which is symmetric, because this is about prefixes, and being a prefix is not symmetric. When uni is a prefix of university, it doesn't mean that university is a prefix of uni. Why don't we start with the shorter word, that we say the shorter word goes in front? Because I think that's maybe the more intuitive way. No, I don't agree. It's a good question; what was asked is why we don't always take the shorter one first. But the thing is, there is an asymmetry between what you type and what you want to find. Maybe you are typing university and there is uni in the data set. OK, you could argue that you then also want to find uni, so that was a confusing example, but then you want to find it for another reason. There is what you type and there is what you want to find. If you are typing Vladimir Len, then you don't want to find things starting with V. I mean, you've already typed a lot; you want to find things which are longer than that. So it's just not symmetric in that way, right?
If you have typed this far, or if you have typed Nelson already, you don't want to find everything that's starting with N, right? So it's really asymmetric. Do you agree? I'm just thinking about it with the trifor and the trifor, it would be, in the list, it would only be trifor, and I would type in trifor. Yeah, but that's where the confusion came from. In this case it would be okay because you have two synonyms for a word, which is actually a good thing to mention: we're actually giving you more information here. head -2, let me just show you. So in the first column, and that's good enough for the exercise sheet, you just have the names, but here you also have all kinds of synonyms of these names, like the USA, America, the States, so these are also good matches. So in the Freiburg line you would probably have Freiburg im Breisgau and Freiburg. But in general, not every prefix of a word is also a synonym, right? For example, united just by itself is not a synonym. So it's really not symmetric. When you type a prefix, you want to find continuations, but when you type something longer, you don't want to match all its prefixes. So, how do you compute the prefix edit distance? Here's an example: let's say you type fibu, and maybe you want to find Freiburg. Let's say you want to compute the prefix edit distance between the two. How do you do it? Well, let's do our schema. So let's write fibu here and let's write freiburg here, and let me put more spacing there so that it's less crowded when I put the numbers. And what we do is we use exactly the same schema, but we don't have to look at all the entries. So this first row and column are always like this, for the reasons already mentioned. So it's 1, 2, 3, 4, 5, 6, 7, 8. And now let's just look at the last row. So here we have fibu to the empty word, that's four. fibu to f, what's the right entry here?
So just from the definition and without knowing what the previous rows are. Three, yes: from fibu to f it's just the edit distance, not prefix edit distance, and it's three. So just to be clear before I ask more questions: this is the edit distance, as in the picture before, between fibu and f. Right, so it's just edit distance between prefixes. What's the entry here, between fibu and fr? Also three, yes. And what's the number here? Also three, that's correct. Boring. What's the number here? I think so, yes. What's the number here? 4? It's between fibu and freib. Okay, I'm believing you. What's the number here? Maybe we should write it down so that we see it. So this is now the edit distance between fibu and freibu. Yeah, it's just two, right? Are you really sure? Then the other one, where we have four, that is between fibu and freib. I think it's three here. I think it's also three, right? You see, even for short strings it's not so easy. And this one is actually two, right? Which is also interesting. So by adding more letters the number can become smaller. And so, I don't have to write this in red, I do it in blue: the PED is just the minimum of the last row, right? Because the last row gives you all the prefixes of y. PED(x, y) is the minimum of the last row. And actually there's one other trick, which is written here on the slide. This is now a straight line, and I'm arguing that there is no need to look to the right of it. If you just wonder whether the prefix edit distance is two or less, then you don't have to look to the right of this line anymore. Let me just write it down.
No need to look to the right of this if you don't want the exact number but just want to know whether PED(x, y) is less than or equal to two. So that's a simple trick, I just wanted to mention it. If you're just wondering whether the distance from fibu to some prefix here is less than or equal to two: once the prefix is more than two letters longer than what you typed, you don't have to search anymore. Because here it will for sure be three or more, here it will be four or more, just because it's three letters longer, four letters longer. So if you just want to know whether it is similar up to a certain degree, where the similarity threshold is this delta, you don't have to look beyond a certain point. So that's the prefix edit distance. Is there any question about prefix edit distance? Yes please. So if we were to implement an algorithm, would it be able to calculate the last row without calculating the rows above? That's a good question. No, an algorithm cannot; we just did it by visual inspection, but you actually have to compute the whole table until here, though you can omit the right part. So yes, you have to fill in the table. And this brings me to the last slide and then we have a break. It would be a lovely exercise to implement this, but there's more work, the second part of the lecture is about that, so we give it to you. Of course if you have fun with it you can implement it yourself. And we actually provide two versions, one implemented in Python, one implemented in C. If you have problems with the one, use the other. The C one is faster, but you have to install something. You can also run the Python one with PyPy 3. I mentioned PyPy earlier. Then the Python one is about as fast as the C one. But maybe just pick whichever one of the three works best for you. The one with pure Python is quite slow; then you see that edit distance is hard to compute. This one is fast, and this one with PyPy is reasonably fast.
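For those who want to try it themselves anyway, here is one possible Python sketch of the two ideas from this part: the PED is the minimum of the last row of the table, and for a threshold delta the columns more than delta to the right of the length of the typed word can be skipped. This is not the implementation provided with the exercise sheet; the function name prefix_edit_distance, its signature and the convention of returning delta + 1 when the threshold is exceeded are assumptions made for this illustration, as is taking the typed word to be fibu.

```python
def prefix_edit_distance(x, y, delta):
    """Return PED(x, y) if it is <= delta, otherwise delta + 1.

    Only the first len(x) + delta characters of y are considered:
    any longer prefix of y has edit distance > delta to x anyway.
    """
    y = y[:len(x) + delta]                  # omit the right part of the table
    prev = list(range(len(y) + 1))          # row for the empty prefix of x
    for i, cx in enumerate(x, start=1):
        cur = [i]
        for j, cy in enumerate(y, start=1):
            cur.append(min(prev[j] + 1,                 # up
                           cur[j - 1] + 1,              # left
                           prev[j - 1] + (cx != cy)))   # diagonal
        prev = cur
    # prev is now the last row: ED(x, y[:j]) for every remaining prefix of y.
    return min(min(prev), delta + 1)


print(prefix_edit_distance("fibu", "freiburg", 2))       # 2
print(prefix_edit_distance("uniwer", "university", 2))   # 1
```

All rows still have to be filled in, as said above; the saving comes only from cutting off the columns on the right.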
I thought there was a question, but I just saw optical illusions, okay. Any questions about this? So to continue now, you can for now, not forever, forget about the details, but you have to understand edit distance and prefix edit distance. We will just work with these concepts in the following. You don't have to understand for now how it's proven or how exactly it's computed. You should just be able to compute it in your head. So that's what we need for the second part. So let's just make a five minute break here and then we continue with the second part. Five minute break. So let's continue. The second part is really exciting. So, nice mathematics. The mathematics will be on the slides, the sheet will be practical, so don't worry. Actually it's easy to understand intuitively, I think, and it's very beautiful. Let's start. So what's the motivation for the second part? Let's maybe go to this agrep thing again, what we had here, just so that it's clear what the baseline is. Yeah, I have my 2.6 million words. I type something, I go through the whole list, 2.6 million strings, and for each of them I check whether the edit distance is smaller than whatever I want. Maybe I say just give me up to two mistakes. Edit distance or prefix edit distance, you have to compute it 2.6 million times. That's wasteful and it's also sometimes just too slow. So we want to do this faster, and here's a very nice way to do this, and the idea is to do it with q-grams. And what are q-grams? First the intuitive idea and two examples. It's an intuitive question, where small means maybe one or two: is the edit distance between Freiburg and Stuttgart small? What do you think? Is it small? It's not small. Why is it not small? How would you formulate it, without even computing it? Yeah? The number of operations to go from one to the other is...
Okay, that's the number of operations, but why? You look at the strings and your first intuition is that it's not small. Why would you say so? They are so different. Yeah, they are so different, right? Everything is different. Freiburg, Stuttgart, they are very different. Now, without computing something, Freiburg and Breifurg: is the edit distance small in some sense? Yeah, it's the opposite now, because here it's rei, here it's rei, urg, urg; they have similar parts, so the strings look similar, so probably the edit distance is small. What exactly is the edit distance? Not so easy, even for strings of length 8, because they have some letters in common, r, r, and so on. But just whether it's small or not. And this is what we're trying to make more precise now and turn into an efficient, very beautiful algorithm. And it's often like this: the mathematics is not so trivial but also not complicated, and the algorithm is then relatively simple. So a q-gram, let's start with an example, Freiburg. The q is actually a natural number, so here we are talking about 3-grams; it's the index here in that definition, and it's just all substrings of length 3. So the first three letters, F-R-E, are a q-gram, just a substring of length three. And then the next ones: rei, eib, ibu, bur, urg. So these are the 3-grams of Freiburg. How many of them are there? One, two, three, four, five, six. And it's an eight-letter word. So I have six 3-grams for an eight-letter word. Important detail, you will wonder about that in your implementation: is it a set or a multiset? We want it to be a multiset. So if you have something like ababa, here you have three 3-grams, I hope that's correct, yes I think it is. So you have aba and then bab and then aba again.
So now are these two or one? We want to count it twice, so it's a multiset: if you have the same element several times, you record that you have it several times. That's important when you implement the algorithm. And you don't actually implement it as a set, so it will come naturally that you do it that way, just in case you wonder. What's the number of q-grams of a string x in general? Here are some examples. It obviously depends on the length of the string and on q; what's the formula? Look at the examples and think about it. So here, the string length is eight, and you have six 3-grams. So it's a formula involving the length of the string x. What is it? Length minus q plus one. I think that's correct, yes. So it's the length of the string minus q plus one. And this is also something one would have to prove, but this is also a matter of mathematical experience, right? You know that it's linear in the size of the string, it's also linear in q, and if it's linear it's enough to do it for one example: you just have to figure out, oh, it's |x| minus q, you do it for one example, and then you know whether it's plus one, nothing, minus one and so on. Okay, popular exam question by the way. So we have established that. So, similar words have many q-grams in common. And here's a lemma, and before we prove the lemma let's look at an example. Oh, I'm sorry for that, wake up. Let's take slightly longer strings: what's the edit distance between Freiburgerin and Breifurgerin? What was it? Yeah, it's two, right? It's just two letters. I mean, they are swapped, but it doesn't matter. The F is replaced by a B and the B by an F. So the edit distance is two. And right now we're just talking about edit distance. Prefix edit distance will come later and it will actually be very similar. But for simplicity, for now, edit distance.
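The definition of q-grams and the counting formula from this part can be checked with a very small Python sketch. The function name qgrams is chosen for this illustration; returning a list (rather than a set) keeps repeated q-grams, which is the multiset behaviour asked for above, and collections.Counter is used only to show the counts.

```python
from collections import Counter


def qgrams(x, q):
    """All substrings of length q of x, in order, with repetitions kept."""
    return [x[i:i + q] for i in range(len(x) - q + 1)]


print(qgrams("freiburg", 3))        # ['fre', 'rei', 'eib', 'ibu', 'bur', 'urg']
print(len(qgrams("freiburg", 3)))   # 8 - 3 + 1 = 6
print(Counter(qgrams("ababa", 3)))  # 'aba' is counted twice, 'bab' once
```

For a word shorter than q the range is empty, so the function simply returns an empty list.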
Here, let's just look at 2-grams for a change. It's called q-grams because the q is a parameter. You can take 3-grams, 2-grams, 5-grams. These are all the 2-grams of Freiburgerin, and you tell me whether I made a mistake. And these are all the 2-grams of Breifurgerin. So it's a little bit longer. So how many of them are there? Let's just check. How large is this set? Yes, please. If q is larger than the word, is the set empty? Yeah, then it's the empty set, but it will not happen, for a reason we will see. Yes? What is it? Ten? I think I also opt for 11, but yeah, off-by-one mistake, a classical mistake in coding. So 11 2-grams for each; they are the same length. Most of them are actually equal, right? So let's see: equal, equal, ib and if are not equal, bu and fu are not equal, equal, equal, so we have a lot of equal ones, right? Let's underline the different ones: this one does not occur here, and these are different, ib and if, bu and fu. And now something very interesting here, which is important for the intuition later. I've changed something here, the B into an F, and it affects two 2-grams, right? It affects two, not just one. Why two? Because it's like you have a window of q letters and you slide it over the word; with 3-grams, one change would affect three 3-grams. So if you change one letter, it affects q q-grams. So in this case, one change of a letter affects two 2-grams. Why does this change here affect only one 2-gram and not two? Who can tell me that? Yeah? Yeah, there is nothing before it, it's the first letter, so it's kind of special, right? And we will exploit that later, or we will do something about it. So a change doesn't have to affect two, it can also affect fewer if it's at the beginning or at the end.
And now, we didn't actually look at what the lemma says. What is set difference here? I take the set at the top and I remove everything that's also in the set at the bottom. So it's set difference. What remains? What remains are just the different ones, right? So what remains is fr, ib and bu. This is what set difference is about. And so the number of these is just three. So that's why I'm taking set difference here. Maybe just a diagram: I have two sets here, A and B. This is my set A and this is my set B. This here is then the intersection, right? And this part here, this is A without B. So I'm removing the intersection. And you can also write it like that: the size of A without B is the size of A minus the size of the intersection. We will need that later and I will show it again; I'm just already writing it here so that you can get used to it. So it's just taking away the intersection. The intersection here is pretty big because they have a lot of 2-grams in common. I'm removing all those which also occur in Y here. And what I'm left with is just the ones which are different, which were changed. And what this lemma says is: if two words are very similar, like these two long words are, then this difference set is pretty small, at most q times the edit distance. So here that's two times two, because q is 2 and the edit distance is 2, which gives four, and the difference set has size three, so it's indeed what the lemma says. Writing at the bottom is always a bit harder, I'm sorry. This is what the lemma says, and we already got an intuition why it's three and not four: the one change in the beginning actually affected only one 2-gram, so that's why it's less. Yes please. Yeah.
It's great, let me just translate what you said, and it's great that you're thinking about it, because that way you get intuition. Let me just make an example out of what you said. You said something like this. So I have Freiburgerin, and the second string, I don't think you are allowed to use that in Scrabble, but let's just take it: Urgerinfreib. So you just took the two parts of the word and swapped them, that's what you said. And now, what's the edit distance between these two? It's huge, right? But they have a lot of 2-grams in common, because they still have similar parts. But now look at this lemma. That's fine, right? This here will still be small, but this is large now, so the lemma still holds. It's an inequality. I changed the way around. Yeah, in your head you changed it around, but it's a very good question to ask because it points at something: these 2-grams are not a perfect indicator, right? The lemma just says if the edit distance is small, if the words are similar, then this is small, so the two sets are very similar. It's not the other way around. Here I have a lot of q-grams in common, but the words are very different in edit distance. And that's a very important intuition. Yes, please. What if we use the symmetric set difference? So the question is about a variant where we take the symmetric set difference. Actually we will exploit the symmetry in a second, so maybe just wait for it, and then if you still have a question, ask it again. Because the question here is that we could also take Y without X; why did we choose this direction and not the other direction? It's a very valid question, we will come back to it. First we have to prove this, then we will come back to it. Oh, there's a question, just to be clear: set difference on multisets, that's also a very good detail question.
Actually for set difference on multisets, the counts are important. So if you have it three times here and you subtract it two times, one occurrence remains. So that would be the answer to that. You just subtract the counts. If I have bu bu bu and I subtract bu bu, then I have one bu left. But actually it will come naturally in your implementation. Just pay attention that you do it the right way. Now I've given you the intuition for this lemma, so let's prove it, and let's go through the proof rather quickly because it's not part of the exercise, but of course eventually you should understand it and be able to prove it yourself. So how does one prove it? Well, we have given the main intuition already. Let's say the edit distance is one, and here's the explanation. Let's say I have Freiburg and q is equal to three, so I have 3-grams; this is my string X, and my string Y is Freixurg. So now I've changed one letter, and how many 3-grams does it affect? We have already seen that it affects three 3-grams: this one, this one and this one. These are now different. So eib becomes eix, ibu becomes ixu and bur becomes xur. So if I do one edit distance operation, and that's also true for insert or delete, it affects at most q q-grams; it can be fewer, that's why it's less than or equal. And that's why we have this less than or equal to q. And now if you have k operations, it kind of multiplies by the number of operations, which looks natural but is actually not trivial to prove. Let me just give you the idea. If you have a sequence of k operations, it means you can get from X to Y in k hops, where each of them is just one operation, so the base case. So what you have now is that this lemma holds for each hop. Now you have indices here, which looks a little intimidating but is actually simple: you just have all these intermediate strings. Think of the doof bloed example: doof, boof, blof, bloef, bloed.
For each of them the set difference from one to the other is just at most q. And now you need to prove this beautiful theorem here. And let me just prove it to you. Let's prove it in the abstract and then you can figure out yourself how it can be used to prove this lemma up there. This is another nice example for now. Let's just forget everything above. Just look at this and let's try to prove it. That's again another nice way to practice mathematics. I have a sequence any sets now. And now I want to know the difference. I subtract the last one from the first one and I compare it when I take these differences first to second, second to third, third to fourth and so on and I take the union of these. Let me just prove this very quickly in the abstract without understanding what it has to do with the above. So if I want to prove that something is a subset of something else, then I have to prove, take any element from here, and it also has to be in here. It's the standard proof technique here. And since this is a union I have to show I take any element from here, it has to be in one of these sets, then I'm fine. Let's just do that. So let's start with an X from here. So it's in here and now let's look at this sequence of sets and let me put this here. without, is it correct like this? A0, A1, A1, A2, A2, A3 and so on. And I'm confused myself. No, no, that was wrong. I'm sorry, made these sets like this. A1, A2 and so on. I'm going to write A3 and AK, it's actually simpler. And now I'm wondering my X, in which of these is this? So my X is, I'm starting with A0, it's in A0, not in AK, but it's in A0, so it's in here. And now I'm looking for the first time that it's in this set, it's not in that set. So I know that it's not in here, right? My X is from, I'm sorry, it's from a0 and not in ak, so it's in here and it's not in here. So at some point I will have for the first time that it's in here and not in here. That's just a nice mathematical trick. 
So this means by the picture on the left that there will be an i where x is in ai but x is not or in ai minus one to pay attention to the indices but x is not in the next one. This has to happen, right? It's in the first and not in the last, so somewhere here along this sequence it has to be in this one, but not in the next one. Just by any combination of arrows and x's I can put here. Which means x is in A-1 without Ai, which means x is, so it's in one of these here, which means it's in the union. Ai-1 without Ai, which is what I wanted to prove. Okay that's just, if you didn't fully follow it, it doesn't matter, it's just a very nice little proof, you can try it yourself at home, but I think the intuition is still easy enough, it's, yeah, if you have one change in a string, affects at most Q, Q grams, and if you have K changes, it affects at most Q times K, Q grams. I think that's the very simple intuition of the proof. It's harder than expected to make it mathematically correct. And maybe one meta comment which I think is important. It's important not to confuse intuition with mathematical proof, right? What I said first, you can easily take this as a mathematical proof, like, okay, one quogram effects, so I have this word effect, I put it in quotes, what exactly does it mean? You would have to prove it, and that this here is also, okay, now I do this K times in a row, so it's kind of K times Q and not Q. Maybe it's right, maybe it's not, you only find out when you actually do the math, because maybe it's something you missed, some some small detail and I can tell you from experience with having done this like a thousand times, there always, if you do the proof then you find oh it's actually not true for this border case about which you didn't think. Because when you think intuitively you always have an example in your mind or something. 
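For reference, the abstract statement and its proof can be written out compactly. The names $A_0, \dots, A_k$ follow the blackboard; reading $A_i$ as the q-gram set $Q(x_i)$ of the $i$-th intermediate string is my reading of how the statement plugs back into the lemma, and the multiset version (counts instead of membership) was not spelled out in the lecture.

```latex
\textbf{Claim.} For any sets $A_0, A_1, \dots, A_k$:
\[
  A_0 \setminus A_k \;\subseteq\; \bigcup_{i=1}^{k} \left( A_{i-1} \setminus A_i \right).
\]

\textbf{Proof.} Take any $x \in A_0 \setminus A_k$, so $x \in A_0$ and $x \notin A_k$.
Let $i$ be the smallest index with $x \notin A_i$. Such an index exists because
$x \notin A_k$, and $i \ge 1$ because $x \in A_0$. By minimality of $i$ we have
$x \in A_{i-1}$, hence $x \in A_{i-1} \setminus A_i$, which is one of the sets in
the union. \hfill $\square$

\textbf{Use for the lemma.} If $x = x_0, x_1, \dots, x_k = y$ where each step is a
single edit operation, then with $A_i = Q(x_i)$:
\[
  |Q(x) \setminus Q(y)|
  \;\le\; \sum_{i=1}^{k} |Q(x_{i-1}) \setminus Q(x_i)|
  \;\le\; k \cdot q .
\]
```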
And then, yeah, so what we see down here is not dispensable. It's not like, well, why isn't that the proof, this intuitive thing which I said initially. You have to do the proof to see if it really holds. Again the similarity to coding, and I contend, and I think it's just a fact: you have a complex piece of code, you don't write a unit test, it's not correct. For sure it's not correct. And you see that when you start writing unit tests, there is some input which you didn't think about and it fails on this. So this is like the equivalent of mathematical proving. With the intuition you just think about the general case. Okay, so we have proven this lemma, and now forget about the proof again. It was just the simple intuition that is written up here: similar words have many q-grams in common. Now it says in common, but it talks about the set difference. Let's actually talk about in common, and that's what the next slide is about. So now let's talk about the intersection between the two. That's more like in common, right? So this is the q-grams of x, this is the q-grams of y, and this is the set they have in common. And here's an example, so if we take these two words from two slides ago, Freiburgeren and Breifurgeren, then let's go here, slide 20, that was two slides ago. And what this here says now, it's greater or equal to 7. And now you have this formula. Now all you have to find out is how the intersection relates to the set difference, and I've already shown it to you, I just show it again.
I have two sets, so that's now Meggenlehrer set theory, you have set A and B. This here is the intersection between the two and you can, it's visual proof and this here, this thing here is now A without B. You can see from there that the size of A without B is you take all elements of A and you remove the elements in the intersection. All these, if you remove them or not, it doesn't matter, right, the ones which are not in A makes no difference. The ones you are actually removing is the ones from the intersection. And if you plug this in the previous lemma, and let's not spend too much time on this now, let's just take this formula for granted. But that connects to the comment asked earlier, if you take this lemma here, you can just swap x and y, right? If this is less or equal q times ed of xy, then the same is true if you put y here and x here, because ed xy and ed yx is the same thing. So you can actually apply the lemma twice. And if you apply the lemma twice, then once, so here I'm doing this, so this is in this direction and this is in this direction. Both are true and for both I can use this formula and by that I get two inequalities and both of them hold which is why I get the maximum here. Let's not go into the details now, it's just by symmetry you get actually two inequalities from the lemma and it gives you a slightly stronger inequality now. So it helps your algorithm. Let's not go into the detail, let's just continue so that we see the algorithm. And now we have talked about the edit distance all the time. Let's also understand what's the difference for the prefix edit distance. Well the difference for the prefix edit distance is, so that's what the lemma says for the prefix edit distance, here's an example. And let's also not bother too much with the detail. Now I can apply the lemma only in one direction and I can apply it only to a prefix of y. I don't know which one. Now it's not symmetric, right? That's important. Let me just write that here. 
Oh no, it's written here already. In general the prefix edit distance is not symmetric, which is why here in this formula I have x and y and I can take the maximum of the two, and here I just have x. Apart from that it's very similar. And let's also not go into the details here. But it's a formula which you want to use for your algorithm. Back to the intuition: similar words have many q-grams in common, for edit distance and for prefix edit distance. Let's turn that into an algorithm. Going into the details was not so easy, but now the algorithm is actually pretty easy again. So what do I do? I've typed something, this is my x, and I have my big dictionary. Now conceptually, and it's not what I will actually do, but conceptually for each of these words y I compute how many q-grams x and y have in common, and I check whether it is above this threshold here which we have seen on the previous slide. And if it's not, so if they don't have a lot of q-grams in common, then I know that the edit distance can't be small, right? There are very few q-grams in common, so I know that the edit distance cannot be small. And you have to check for yourself that this is actually the right direction. If they have many q-grams in common, I cannot be sure that the edit distance is small. This is the example we have seen, from the very good comment earlier. Let me see, where was it on the slides? Here it was, this Freiburgarin and Urgarin-Brey: many 2-grams in common, but they are actually not similar at all. So in this case I have to check. So it's a one-sided test. I can exclude a word if there are few q-grams in common, then I don't have to compute the prefix edit distance. But if there are many in common, then I actually have to compute it. So I'm doing this to save edit distance computations. And the algorithm for PED is just the same, I just use a different formula here on the right.
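The exact slide formulas are not captured in the recording, so the following is only what follows from the lemma as stated, written in terms of the q-gram multisets $Q(x)$ and $Q(y)$. The slides may phrase the same bounds in terms of string lengths and padding; $\delta$ is my name for the allowed distance.

```latex
% Lemma:            |Q(x) \setminus Q(y)| \le q \cdot \mathrm{ED}(x, y)
% Set identity:     |A \setminus B| = |A| - |A \cap B|

\[
  |Q(x) \cap Q(y)| \;=\; |Q(x)| - |Q(x) \setminus Q(y)|
  \;\ge\; |Q(x)| - q \cdot \mathrm{ED}(x, y).
\]

% Swapping x and y (ED is symmetric) gives the same bound with |Q(y)|, hence:
\[
  |Q(x) \cap Q(y)| \;\ge\; \max\bigl(|Q(x)|, |Q(y)|\bigr) - q \cdot \mathrm{ED}(x, y).
\]

% Prefix edit distance: PED(x, y) = ED(x, y') for some prefix y' of y, and
% Q(y') \subseteq Q(y) as long as there is no padding on the right. Only x
% appears on the right-hand side, because PED is not symmetric:
\[
  |Q(x) \cap Q(y)| \;\ge\; |Q(x) \cap Q(y')|
  \;\ge\; |Q(x)| - q \cdot \mathrm{PED}(x, y).
\]

% Filter: if fewer than |Q(x)| - q \cdot \delta q-grams are shared,
% then PED(x, y) > \delta and y can be skipped without computing PED.
```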
So that's the algorithm. I just use this bound, I compute the number of common q-grams, and then if it's very few, I don't have to check. If it's not few, I have to compute the prefix edit distance. This is also important for you for the exercise sheet. You will actually pad the strings. So if we have 3-grams, we will add dollar dollar to both sides. And this gives the same result, because if I'm now doing edit distance with these strings where I have dollar dollar on both sides, it's the same thing. And for the prefix edit distance I do the same thing, but I'm just adding dollars to one side. I can't add them to both sides, because otherwise the prefix edit distance is not the same, because I'm taking prefixes of the words, so the dollars go on the left side only. So you will add this padding. So how do you actually implement it? That's the last slides now. And I need to take a sip. And we are a little over time already. We should get to an end because my voice is failing, I'm sorry. And it's cold in here, at least I'm freezing. I should use sign language. We have now reduced the problem of computing the edit distance for 2.6 million strings to computing common q-grams on 2.6 million strings. This doesn't really look simpler, but actually it is, by the following trick. How do we do that efficiently? Now we can use something which we already have, and this is a very beautiful connection to what we have already seen. We compute the following index, and it's called a q-gram index. That's like an inverted index, but the things which matter now are not words and documents but q-grams and words. So here I have an inverted list for a q-gram, and I have incorporated the padding here, so that's the 3-gram $fr, and what's in the list is all the words matching that q-gram. Because there's a dollar, it's all the words starting with fr, and here I've just taken words which are names of towns.
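A q-gram index along these lines could be sketched as follows. This is not the code from the exercise sheet: the function names, the lowercasing, and the town "Frankfurt" are my own additions for illustration. Padding is on the left only, which is the variant described for prefix edit distance.

```python
from collections import defaultdict


def padded_qgrams(word, q, both_sides=False):
    """q-grams of word after padding with q - 1 dollar signs.

    Left padding only is the prefix-edit-distance variant; both_sides=True
    gives the variant described for plain edit distance.
    """
    pad = "$" * (q - 1)
    padded = pad + word + (pad if both_sides else "")
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def build_qgram_index(words, q):
    """Map each q-gram to the list of IDs of the words that contain it.

    A word that contains a q-gram twice is appended twice, so the lists
    behave like multisets without any extra work.
    """
    index = defaultdict(list)
    for word_id, word in enumerate(words):
        for gram in padded_qgrams(word.lower(), q):
            index[gram].append(word_id)
    return index


# "Frankfurt" is a hypothetical extra entry so that "$fr" has two matches.
words = ["Chichibu", "Freiburg", "Ibusuki", "Malibu", "Frankfurt"]
index = build_qgram_index(words, 3)

print([words[i] for i in index["ibu"]])  # ['Chichibu', 'Freiburg', 'Ibusuki', 'Malibu']
print([words[i] for i in index["$fr"]])  # ['Freiburg', 'Frankfurt']
```

Storing IDs rather than the words themselves keeps the lists small, as with the inverted index for documents.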
And you can use the exact same code from lecture one. I wanted to do it together with you now, but for reasons of time I will not. Anyway, it's the first part of the exercise sheet to just compute an inverted index for q-grams. It's almost the same code. This is the inverted list for ibu. It's all the words in your dictionary which contain the 3-gram ibu anywhere: Chichibu, Freiburg, Ibusuki, Malibu. As usual, like we did for the inverted lists in lecture one, you will not store the whole documents or words here but just IDs. So it's just like the inverted index we have seen in lecture one, except that now these are not documents but words and word IDs, and these are not words but 3-grams. How do we use that index? I will show it to you by an example, and then that's the last slide. Oh my, my voice is failing. So let's say the user has typed brei, and we will do prefix edit distance, and we are interested in matches where the prefix edit distance is at most one. So similar things, and we can make one mistake, deletion, insertion and so on. So what I'm doing is I'm padding to the left with this dollar, this was this one trick. It doesn't change anything, but it gives me more q-grams, and it helps my algorithm to be more efficient. And now I take 2-grams here for example, so I have four 2-grams: $b, br, re and ei. And now I get the following lists. And this slide will make it very clear how the algorithm works. So now I have the inverted list for this 2-gram $b, which is also in what I typed. And what does it contain? It contains all the words which start with b: Bangalore, Beijing, Berlin. I also have here the 2-gram br, and here I have my inverted list, all the words in my 2.6 million which have br in them: Brisbane, Brussels, Gibraltar. It doesn't have to be at the beginning.
For re I have Bangalore again, Freiburg, Singapore. For ei I have Beijing, Freiburg, Marseille. They are not similar at all, right? These are just inverted lists. The point is, I have typed this, I have these 2-grams, and I only have to look at these words. I can ignore all other words because they have no 2-grams in common with my x. I know that the number of q-grams in common is zero. So I can exclude a lot of words, I don't even have to look at them any further. I know they have zero q-grams in common, so they won't be similar. If I now take my bound from the corollary, and we don't have to look at the details now, the bound will tell me: if I want a word where the prefix edit distance is at most one, it must have two 2-grams in common. So I will now compute the union of these lists, and you have already done computing the union of lists. And while you do the union, for each word you count how often it occurs. So it's a variant of the union, maybe you already implemented that, it depends on your implementation for exercise sheet 2. You just count how often each word occurs. And which words occur twice here? It's Bangalore, why? Because it shares two 2-grams with x. Beijing, because it shares two 2-grams with x. And Freiburg. Only these three have two 2-grams. And by our lemma we know that we don't have to look at the others, because they don't have enough 2-grams in common with my x. So these three are left, but I'm not done, that's the one-sided thing. One-sided in the sense that all the other words I don't have to look at anymore, there's no way the prefix edit distance can be one or less. For these three I have to check. It's not guaranteed yet that the prefix edit distance is at most one, and here we have an example, obviously I did it on purpose, where you share two 2-grams but the prefix edit distance is actually pretty large. Why is that? Because the re 2-gram appears at the end here, in Bangalore.
It starts with b, but the re comes at the end. Just because you have two 2-grams in common doesn't mean you are similar. So now, for these three words, I actually have to do this pretty expensive dynamic programming thing with a table which we have seen, which is an expensive calculation. I only have to do it for three words in this example, all the other words I've already excluded. Let's do it for these three, and that's our last deed for today. What's the prefix edit distance between brei and Bangalore? Seven. Seven. Are there other opinions? It's prefix edit distance, so we have to take all prefixes of the second word and take the one which gives you the smallest edit distance. Zero. You only have to take the prefixes of the second one, not the first one. Other bets? I know it's late, but it's the last slide. Three. I hear the number three, and that's correct. The number is three, and it's actually this prefix here, right? So it's brei and bang, you can't get any better, you get an edit distance of three. So out of the race, no match. It had two 2-grams in common, but when we actually compute the prefix edit distance, it is too large. Brei and Beijing, and it's the last thing, I promise, and we are done. What is it? One. I agree. So that's a match. One, and what's the prefix? The prefix is bei. Yeah, why not take bei? It's bei and brei, it's just the r missing. So that's a match. If we type brei we expect to get a match for Beijing. Maybe we just mistyped the r. And the last one? One, that's correct, it's one, and the prefix is just Frei. In this case, the engine does not know what I mistyped. And that's another match. So that's the algorithm, and you see how everything came together. We had to do all this beautiful theory to get these bounds here. In the end we can just use them without understanding.
But then it gives us this, and then we have just a few candidates to check. There we do the expensive computation, and we are done. Yes please. I think when we are describing the inverted index and the q-grams we are using a regular set, because we do not pay attention to how many times the q-grams occur, but all our proofs were for multisets, so isn't this important? Yeah, that's very true. So the comment was multiset or not, and I think all you have to do, and this is actually the simplest algorithm from lecture one, is this: if a word contains re twice, you need to have it twice in the list, and then you have satisfied this. Because what you want in the end, when you compute the union, is to know how many q-grams from the query the word contains. So if you have re twice in your entity name or in your word, you would have it twice in that list here. You can just do that. And not only that, if you implement it, it even happens naturally. You would have to do something for it not to happen. Like in the very first lecture, if you remember, if a document contained a word three times, it would just appear in the list three times if you don't do anything. Very good question, just pay attention to it, but I contend that if you do the natural thing, then you don't have to do anything extra for it to be right. And of course, here's the sheet. It's a long sheet, but it's only long because we're giving very detailed instructions on how to do it, and there are also unit tests which help you to see whether you really did it correctly. So it's just the algorithm, it's all in individual parts, and then you do it. And it's a very beautiful algorithm and very useful in the end. You do this little search engine like in the beginning. You type a word and then you get matches.
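The whole brei example can be replayed in a short sketch. Everything here is illustrative rather than the reference solution: the four inverted lists are hard-coded excerpts as they were read out from the slide (not complete lists, and with words instead of IDs), the function name `prefix_edit_distance` is my own, and the threshold is the bound "number of query q-grams minus q times delta" that follows from the lemma.

```python
from collections import Counter


def prefix_edit_distance(x, y):
    """Smallest edit distance between x and any prefix of y (full DP table)."""
    m = len(y)
    prev = list(range(m + 1))  # row for the empty prefix of x
    for i in range(1, len(x) + 1):
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            cur[j] = min(prev[j] + 1,                            # delete x[i-1]
                         cur[j - 1] + 1,                         # insert y[j-1]
                         prev[j - 1] + (x[i - 1] != y[j - 1]))   # replace or keep
        prev = cur
    return min(prev)  # best over all prefixes of y, including the empty one


# Excerpts of the inverted lists as read out in the lecture.
inverted_lists = {
    "$b": ["bangalore", "beijing", "berlin"],
    "br": ["brisbane", "brussels", "gibraltar"],
    "re": ["bangalore", "freiburg", "singapore"],
    "ei": ["beijing", "freiburg", "marseille"],
}

x, q, delta = "brei", 2, 1
padded = "$" * (q - 1) + x
grams = [padded[i:i + q] for i in range(len(padded) - q + 1)]

# "Union with counting": how many query q-grams does each word share?
counts = Counter()
for gram in grams:
    counts.update(inverted_lists.get(gram, []))

threshold = len(grams) - q * delta  # 4 - 2 * 1 = 2
candidates = sorted(w for w, c in counts.items() if c >= threshold)
print(candidates)  # ['bangalore', 'beijing', 'freiburg']

# Only the candidates get the expensive computation.
matches = {}
for word in candidates:
    ped = prefix_edit_distance(x, word)
    if ped <= delta:
        matches[word] = ped
print(matches)     # {'beijing': 1, 'freiburg': 1}
```

Bangalore passes the q-gram filter but is rejected by the actual computation (its prefix edit distance to brei is 3), which is the one-sided nature of the test.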
Sorry for the overtime but maybe we have a week off in two weeks. Is there any other question for now? Okay, now we go to our heated homes. Have fun with the sheets. See you next week. Bye.
Welcome everybody to Lecture 6, Information Retrieval in the winter semester 22-23. Today's topic is web applications. It's a two-part topic. Today is the first part. First I will say something about your experiences with the last exercise sheet, which was fuzzy search, in particular fuzzy prefix search. There will be no lecture on December 6th next week, which means you have two weeks for the next exercise sheet, and I will have a slide about it with more information. And the contents today is how to build a web application, so in particular a search web application, but what I will show you in the lecture is mostly web application, and we will build a whole web application from beginning to end, not using some library which does everything for us but really from the ground up. So with socket communication, so talking between two machines, and then you will learn all these concepts, HTTP, media types, HTML, style sheets and so on. So each individual thing which you see today will be relatively simple, but actually understanding how a web application works, with these two machines and a browser and stuff being sent back and forth and the interaction, that's actually quite intriguing, and I believe that it's very good to understand this from the ground up, because if you just use some framework or library which does everything for you, you don't really understand it, and today you will understand it. And the exercise sheet will be to build a web application on top of the fuzzy search which you have done for the last exercise sheet. And what that means, I will explain it to you, it will become clear during the course of the lecture.
And this will also become clearer for this exercise sheet you will do something static, so for those of you who know JavaScript, you will not yet use JavaScript, there will be enough other stuff. And then for the next exercise sheet in two weeks you will add JavaScript to it and make it dynamic. You will be amazed by the result already of your sheet. For this week you will be even more amazed by the result of your sheet in two weeks. So how was your experience with the fuzzy search? It was a good experience, so most of you, almost all of you liked it a lot, found it quite doable. You also liked the topic, the explanations and you were happy to be coding again. Here are some quotes, really cool exercise. Many of you said that it was impressive how much this data structure could speed everything up. Most interesting exercise. So happy to be coding again. But as usual there's a spectrum of opinion. I miss the math after the last two sheets. Coding rather easy. Following the ideas was less stressful than the math sheets. Somebody wrote that I wonder how many people feel like that. There was a very detailed explanation on the sheet how to do it. My impression is that most of you like that. We already had it in the past in different ways with much less instructions. But my feeling is that many of you like it if it's more detailed but then there is always the opposite opinion that's life, somebody says it's too much. And here we have a very competitive comment. So how stressed are you? That was one question so I would say overall the stress level was normally distributed, slightly skewed towards the stressed side I would say. But not especially so. Most of you opted for a break next week so we are having a break next week with more information on the next slide. Here are some quotes. Would like a little break to revise all the topics. Told good call. I could use more time to get Christmas presents. Not stressed at all. Best time of the year cookies, Glue-Vines, chocolate. 
I'm fully relaxed at the moment so there was the whole spectrum. I always liked several people said that. Oh no, yeah a break is fine but I don't want to miss if we drop interesting material because of that please don't have a break. So there were certainly ten people saying that. Some of you mentioned that the workload so far was just fine and some people are also very stressed because it is a normal distribution. And somebody said it would be amazing because it's their birthday on the next date. So these were your comments. So there will be no regular lecture on the next week which actually fits because this exercise sheet if you do it nicely it's a little more work so you have two weeks for it. But what we will do, there will be a Q&A session and it will not be, let me just show a little more. There will be no Q&A session this Friday and the next Friday, instead we will use the slot from the next lecture, the usual time, but let's start at a quarter past on December 6th by me, because Natalie is at a conference and then we can take as much or little time as you need, it will not be a lecture with new material, it will be likely online only, not recorded. Here's an agenda but it's preliminary, you can tell me what you wish. Several of you said that you would like more, know more about mathematics. I mean I'm doing this a little bit in the lecture but I can't give a whole math lecture if the actual topic is something else. So in this Q&A session I can spend a little more time on proof, how does one prove these little things, where should one pay attention and just all these math, these very basic math questions we can talk about that and as usual I will do it with a lot of examples. So I think that will be very interesting for many and then of course any questions you might have about this lecture, the past lectures, exercise sheet, past or current and life in general. And I hope to see many of you at this Q&A session. 
And there will be also a separate announcement in the forum. Okay so that's it for the organizational part. There are no questions about this, you know you can always use the chat or your mouth or whatever. So let's start with web application stuff. So web applications, we have two machines talking machines talking with each other, very futuristic. How does that work? So it can be the browser, and if you have two machines talking with each other you do that with something called a socket. So each of the machines has to create a socket, it's like the endpoints which are then, that was too loud, you can lift the chair more quietly, thank you. And the socket always has two components because a machine can have, unlike humans, many conversations at the same time, and how is it done, each machine has ports, so there are different ports on the same machine, so if two machines are talking, then each machine has the name of the machine at a particular port, talking to the other machine at a particular port. And that's usually, no that's also too loud, you have to lift the chairs up, then move them in the air, then put them down again, and then it's quiet. It's possible, I know, I've tried it. So the port is an integer number and we will see that. So, but that's just, you will see it live in a second. How does it work? How does communication between two machines work on the socket level? You have to create a socket like here is something from which I can talk to another machine you have to say this is now port 8,847 on this machine and then you listen for now other people can call you like a phone call on that machine on that port. And then you wait for something, you listen for what they have to say, then maybe you do some computation and then you send something back. And on the other side it's very similar, so they also, no it's not, it's asymmetric, I mean the server has the socket, the other side can now call this machine. 
It just needs to know the name of the machine and the port. And then if it does that we will see that you automatically get a port on that machine which calls the other machine and then you send something, you wait for the result. Okay but that's just high level stuff, we will do it low level in a second and then it will all become very clear. This is just, so in Python we will use socket that's built in library just for those of you who want to know. In Java the standard library for this is JavaNet service socket and C++ we use Boost.Asio. You can do it even more low level but anyway we won't do it in the lecture and I don't think, is anybody in the room using Java or C++ for the exercise sheets? I don't think so, right? You're doing it all with Python. And we will provide this code as a starting point for your exercise sheet and it will also be the starting point for our code today. Actually in the lecture today we will do a lot of live coding and we will do a lot of stuff which you also need for the exercise sheet. But we will not give you that, we will just give you the starting point because especially for this topic if you just copy the code and then ah yeah it, you don't understand it, you have to do this yourself. At least once you have to program a web server yourself and see in all these little steps what can go wrong and then how you fix it and then you understand it and you have some aha moments. So it's really, of course you will have the recording where you can see how did I do it but also there I would not recommend to just blindly copy but try to understand, integrate it into your code, do it step by step like we do in the lecture today. This is just for reference, this is where we will start from, we will see it, yeah, and this is just for reference now. I will go to the coding now immediately, thank you Frank, everything works fine so far. And I've prepared something here, this is a Python script. It's a little different set up today. 
So this window is more narrow. It's also one font smaller. I hope you still have good eyes in the back row. We tested it before. It's still visible. And this is, I prepared some code to save some time. So this is the code which was also on the slide. Let's maybe briefly look at it and go through it. So it's a class which is called Search Server. You initialize it with a port which means this program will There's a pretty strong draft and I don't think It's much colder now so we dangle. It's strong draft and then, oh it says, okay. So we are creating this socket here, you don't have to understand the details, it's now we want to start a server on the machine. Let's not go too much into the details of these commands. There's comments here, it's also on the slide. We want other machines to be able to call us. So this argument says how should they call us? Maybe this machine here has several names. So it's the machine called Tura, but maybe they also call us under the name localhost or a fully qualified name, it's in a particular network which also has a name and so on. This just says this address means any name is fine and I will answer then binding which just means I will listen on that port and then I'm listening. Okay so now the socket is created we can just let's just see how we will so this is our search server which so far does nothing really I have to tell, give it a port, that's down here in the main program, in the main program, the usage is I have to specify a port, then it will create an object of this class with a given port and then it will run, it will call this function. And right now it does nothing, it just, I created the socket which listens and that's it. So now let's go a little bit further. So now we have a socket, so how does this continue? The first thing one does with such a socket is you listen for, wait for connections in an infinite loop. 
That's the typical server loop, so we have an infinite loop. Let's maybe first write something so that we see it. We are starting slow. Let me use a formatted string: waiting for connection on port, this will be for our log, and this is self port. Let's see, it's a little bit narrow, so I shouldn't write so much. And how does it work? Let's look at it on the slide: server socket, accept, yeah, it's accept. So what this does, accept means my socket is now waiting for phone calls from the outside on this port. And this will block, so the execution will block at this point. Actually we can check this if we run the program right now. So now the program is in a very sad state, it's waiting for somebody to call it and nobody calls it. So that's where it is. So this is blocking, accept is blocking. And when somebody calls it, it will return two things. It will return, it's also written here, an object which we will use for the connection, and it will also return the address. So we can afterwards write: incoming connection, client connected from. I don't know actually what kind of object this is, this address object. But let's just print it. Okay, but anyway, we will not get there if I now run it again, because nobody is calling us. So how do we call this? I will do this as follows. I can just exit the editor for a moment and go to a shell. There's a program called Telnet, which is very old, which essentially just phones other machines on a given port. So let me just do that. I just say please phone this machine, tura. I'm actually on the same machine here, so I could also use localhost, but it doesn't matter. I could also phone from another machine. And please call it on port 888. And now something happened, right?
You see here: client connected. So somebody called me, and it shows the address. This client address was a combination of two things. The machine that called me had this address, 10.8.150.108. You can see that it's an internal address, not a public address but a local network address; it doesn't matter. And it also called me from some port. That's what I explained earlier: that's just a port that's created when the client wants to talk to another machine. Okay, and then nothing happens, because we haven't done anything. So now I'm on the phone, I called this machine on the right, and I say hello, and there's nobody. You see, nothing happens, right? Because this machine has already moved on; it didn't do anything. So let's change that. It's actually not so easy to exit Telnet, because everything you type will be sent to the other machine. Let's go back to the editor. Okay, now that we have accepted the phone call, the first thing we should do is somehow process what we get from the other side. This is done by this command, connection receive. The name differs between libraries; in Python the method is recv. It reads a batch of data from the client. So the client is saying something, let's call that a request, and when you use that command you have to say how much you're going to read. You don't know how much they're going to talk, right? So let's just take, I don't know, 64, maybe less. Let's have a variable here, request data. And now here's another important thing: what do I receive? Is it a string? No, it's actually not a string. This is stuff from the slide, for reference. This is something important for the whole lecture: bytes versus strings.
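The accept-then-receive step can be sketched like this. It is only an illustration under assumptions: the function name `accept_one_call` and its return value are made up here, and it reads just one batch, as in the first version written in the lecture.

```python
import socket


def accept_one_call(server_socket: socket.socket, batch_size: int = 64):
    """Wait for one client, read a single batch, and return what we learned."""
    # accept() blocks until a client connects. It returns a new socket for
    # this connection plus the client's address as a (host, port) pair.
    connection, client_address = server_socket.accept()
    print(f"Client connected from {client_address}")
    # recv() returns at most batch_size bytes, as a bytes object (not a str).
    data_batch = connection.recv(batch_size)
    connection.close()
    return client_address, data_batch
```

The port in the client address is the one opened on the calling side for this conversation, not the port the server listens on.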
It's actually easy to understand, you just have to be aware of it. Whatever I send across the network is a sequence of bytes. It can be anything: an image, something compressed, or text, but fundamentally it's bytes. I get bytes from the other machine and I send back bytes to the other machine. In Python you can use bytes objects. There's also bytearray if you want to change the bytes; here we don't want to change the bytes, so we are just using bytes objects. And how does a bytes object work in Python? You can use bytes, the built-in type, but you can also just use the string notation and write a b in front, and other programming languages have something similar. Okay, so this is just bytes, really individual bytes. Then what's the difference to a string? Well, in a string you have characters. Here's a string, Knödel, a wonderful German word, and each letter here does not necessarily fit in one byte: a character can take one byte, two bytes, three bytes, we don't know. For example in UTF-8, which we will talk about in the next lecture, an ö, this German umlaut, is represented by two bytes, and some characters need three or four bytes. So what I've specified here is an encoding, saying how characters are represented in bytes. You don't need to understand that in depth for this exercise sheet. You just have to understand that there's a difference between bytes and strings and that you have to convert between the two, because I'm sure, and we will see it today too, you will get these error messages where it says: look, I'm expecting a string here and you're giving me bytes, or vice versa. The two commands which Python provides for this are encode and decode. Why are they called like this? Encode means I have a string and now I have to say how to turn it into bytes: how are these letters encoded? So here I'm saying: encode it using this correspondence of characters to bytes.
And then I get the sequence of bytes. So that's string to bytes, and the other direction works the same way. Here I have two bytes; if I want to specify bytes explicitly, I can do this in hex notation. We don't need this today. These two bytes would actually be the German umlaut in UTF-8, so after decoding this would contain one character, length one, although it's two bytes. So encode and decode: for this lecture we will always assume UTF-8. Actually in Python, if you omit the argument here, it's automatically UTF-8, because that's just the standard encoding nowadays; it's very universal. And we will have a whole part in the next lecture, in two weeks, where we'll explain how UTF-8 works, because it's really important to understand. So for now our request is just binary data, and let's just print it: data batch. And I'm going slow in the beginning because it will get more complicated very quickly. So let's see what I get now. This time I'm not going to the shell; I can invoke shell commands from within nvim by using the exclamation mark, so I can run telnet tura 888 from here. Now I'm temporarily in a terminal. You see, now it's waiting here; it's not going to the next accept, because it's actually calling receive. So if I now type hello, okay, but after the hello it went on to the next one. If I now type: anybody talking to me? Nobody is talking to you. I'm doing this at a low level so that you see what kind of complications can arise. It's asynchronous: this side has already moved on to listening for the next connection, and here I'm still talking. And if you imagine this side, I mean, this side doesn't know what's happening on that side, right?
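To make the bytes-versus-strings point from above concrete, here is a small sketch using the word Knödel from the lecture. The variable names are chosen for this illustration only.

```python
# A string consists of characters; its length counts characters.
text = "Knödel"

# encode() turns the string into bytes using the given encoding.
data = text.encode("utf-8")

# In UTF-8 the ö takes two bytes, so there are more bytes than characters.
num_characters = len(text)  # 6
num_bytes = len(data)       # 7

# Bytes can also be written explicitly in hex notation with a b prefix.
# These two bytes are the UTF-8 representation of ö.
umlaut_bytes = b"\xc3\xb6"

# decode() goes the other way: two bytes become one character.
umlaut = umlaut_bytes.decode("utf-8")

print(num_characters, num_bytes, umlaut)
```

Mixing the two types is what produces the typical error messages: for example, concatenating `text + data` raises a TypeError.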
So all kinds of complications arise and you will be confronted with that when you try to solve the exercise sheet. So it's something one has to understand when you have these two machines talking with each other asynchronously a lot of things can go wrong. So let me show you one more thing. How do I exit this? Quit, yeah. Now I'm back here. So let's actually do this again and now let's maybe type something more, I don't know. Okay this didn't send everything now, right? It just read 64 bytes and I read a lot of things here. So the next thing that we should do and that's just important to understand, that you have to read in rounds. You can't just read everything at once because you don't know how much the other side is going to talk. Maybe it's a very talkative relative and sending, I don't know, gigabytes. So what you have to do is read in rounds. And there's no way around that. So there's no receive everything because you don't know when they're going to stop talking. So you have to do that. So let's do it like this and now we need another while data batch, that's just the batch now. Batch, let's call it data batch because we don't have such long lines today. Data batch. And now let's just append this to our request data. And we'll remove all these debugging output request data. I can just, this is append, I think there's also extend, but let's just do it like this. Okay. And now the question of course is how do we exit this loop? But let's just try it, let's go slow in the beginning. And here of course I have to restart the server. Now let's call again. And now let's, hello? Yeah. Are you talking to me? Apparently not. So the other side is, so I'm doing something here, the other side gets something there. And I'm still sending something here. Hello? Yeah, nobody. Ok, it goes on like this. Now let's see what happens when I close the connection. Let's quit, okay. What's happening now? This is still running, what's happening? 
You see that's also an interesting effect and I wanted to show you that this is now running on and on and on and what's happening? This side has now put down the receiver, not talking anymore and when this happens, that's just by definition, then what you get here has length zero. So when it's length zero, you can break the loop. This is what we just saw. So if, let me just put it here, so if the length of this data batch, and this is something which you can rely on, so when the length of this data batch is zero then you can break the loop. It means the other side stopped, ended the connection, they hung up. And note that it happened, let me also show that again, let me start the server again. And this is, there are a lot of subtleties here and there are even more subtleties which I don't talk about. So even with this very simple talking to each other, there are so many things which can happen and which can go wrong. So actually I'm, okay now I'm down here. How do I, let me, yeah. The interesting, what did I want to show when I, now I just stopped sending something, I haven't ended the connection, this is not putting down the receiver right, this is just here when I quit, now I'm putting down the receiver, now the connection has been closed from this side, even says it here, connection closed. Okay, so now let's send something back. What would happen if Telnet would crash if you would type in control C or something like this and crash it? Yeah, you can't type in control C because in Telnet everything you type will be sent as a character, but you have another sequence. Yeah, what will happen? That's a, I mean the connection will just end, right? It will be, no it's a good question. The question is how, then I need a third window I think right to, let's maybe save this for later but it's a very good question or you can try it out yourself. 
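The reading in rounds with the length-zero check can be sketched as follows. This is an illustration, not the code from the lecture: the function name `read_until_closed` and the use of a bytearray for collecting the batches are assumptions.

```python
import socket


def read_until_closed(connection: socket.socket, batch_size: int = 64) -> bytes:
    """Read one request in rounds until the other side hangs up."""
    request_data = bytearray()
    while True:
        # Each round reads at most batch_size bytes.
        data_batch = connection.recv(batch_size)
        # An empty batch means the other side has closed the connection.
        if len(data_batch) == 0:
            break
        request_data.extend(data_batch)
    return bytes(request_data)
```

Note that this loop only ends once the client has closed the connection; a client that merely stops typing keeps recv() blocked.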
The question is what happens if this somehow crashes in the middle of sending a request. I'm actually not quite sure, but I think this side will also see that the connection has ended, and it will not be hanging. But it's very good that you asked this, because there are scenarios where something just hangs. One side just doesn't say anything anymore, and then the other side doesn't know what's up: whether they have ended the connection, or they have just died, or whatever. So it's a very good question, but I don't know the answer right now. We could try it, but I think it would cost too much time; I would need a third window. So let's handle the request in a separate function, which returns what we send back. So there's a handle request, we give it the request data, and it should return something. When we handle a request, typically we listen to what was sent to us, we do something, and we send something back. So let's write that method now. This is handle request. It's a member of the class, and then we have the request data. Let's write a comment here at the top: process the request and return a message. Let's do something very simple for now. The message is, let's also use a formatted string: thank you for this, I think it's always very polite to say thank you, then a newline so that you see it, and then: please send more. I think that's very polite. I'm totally ignoring the other person, but I'm saying thank you, please send more. So now I have the message here, and everything else we will be doing in the rest of the lecture will happen in this handle request function. Let's just see how that works.
Now I have to send it back. That's actually easier than the receiving, because now I know what I have. I can just say sendall. But I have to pay attention, because my message is a string and I want to send it back as a byte sequence, and for this I need to encode it. This is what I already explained. So let's see whether it works. I received the request here, this method just constructs a simple message for now, and then I send it back. Shouldn't it be in the while loop? No, definitely not, it should not be in the while loop. The while loop is reading the request data in bits. It could also be very small bits, like one byte at a time; the batch size is just an arbitrary parameter. I'm just reading in packets until there is no more to read, until the batch size is zero, and then I have the full request. Look, I'm extending the request here. The request just comes in batches. This while loop is not many requests, it's one request in many bits. Is that clear? But when the batch size is zero, the connection is closed. That means you can't answer, or am I understanding the concept wrong? Yeah, that's a very good question. You have a point there; let's come back to that in a minute. Let's first see it in action, and then I think it will become clearer. So we should close the connection when we are done. This is now a very simple communication: I get something, I compute something, I send something back, and then we are done. Connection close. So let's see how this works now. Actually, that's good, now I get an empty screen. So: hello, are you talking to me? Nothing happens, I'm not getting a result yet, and that's exactly what you said, because the connection is not yet closed.
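A minimal sketch of the handle-then-send-back step, under assumptions: the function names and the exact wording of the message are illustrative, and in the lecture `handle_request` is a method of the server class rather than a free function.

```python
import socket


def handle_request(request_data: bytes) -> str:
    """Process the request and return a message (here: a polite thank-you)."""
    text = request_data.decode("utf-8").strip()
    return f"Thank you for {text}\nPlease send more\n"


def respond_and_close(connection: socket.socket, request_data: bytes) -> None:
    """Send the answer for one complete request, then hang up."""
    message = handle_request(request_data)
    # The message is a str, but only bytes can go over the network.
    connection.sendall(message.encode("utf-8"))
    connection.close()
```

The sending happens once per request, after the reading loop has finished, not inside it.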
So it looked natural, but you spotted it: what we did is wrong. I stopped sending something, but I haven't closed the connection yet. And when I close the connection, then I'm not there anymore to receive something. But okay, now I've made a mistake. Where's my mistake? self handle request, request data. The error message here is unrelated. Can you please tell me the line number? In 72? Can you say that again? The indentation is wrong. Oh, yeah, thank you, the indentation is wrong, you are absolutely right. But why doesn't it complain? It was just stupid. Thank you very much. So we do it again. Reset just resets the screen so that we're at the top. So: telnet tura 8888. And now let's try again. Hello. Blah. Okay, now let me stop sending. I've sent something, now I quit the connection. Ah, now I have another problem: NoneType has no attribute encode. What did I do wrong now, in line 66? Can you say that again? Line 79? Yeah, thank you. You can say it right away when you spot it. So we try again. You will of course have the same problems when you do this, but it's all very instructive. So let's do it again. And now we are not so talkative anymore: blah, blah. Okay, now let's quit; you get to know Telnet very well. Nothing happens while the connection is still on, and once we have closed the connection, the result is sent to something which is no longer there. So apparently this doesn't work. And there's a good reason for this. A connection is not the same as sending one request and one result back and forth; a connection is something which can last for a longer time. Actually we will switch to web browsers in a second. You can just keep the line open, and then you want a back and forth.
I say something, you say something, and then I say something again. So I have to signal: okay, I'm done now with my message, now you can send something. And this is just something you have to negotiate. For example when you talk to each other on the phone, there's typically a short break or something which indicates: okay, I'm finished. Usually you don't say: end of message, you can speak now. You have these culture-dependent hints. But here we just have to agree on something, and let's do something which the protocol we will talk about in a second also does. Let's agree that I send an empty line, a line with nothing on it. And actually you can see here that the newlines are two characters each. So if my request data, how do I do that, find, match, what's the right one? I always have a newline at the end of a line, and this, I think, means I have an empty line. So if I have something like this, or I could probably even say ends with, but let me just say it like this, then I break the loop. Let's try whether it works. So let me try it again. Hello. Ah, you see, this is an error message you will also get a lot. My request data is binary and this is looking for a string. This doesn't work, and I don't get this at compile time, because Python doesn't check the types ahead of time. I have to use binary here, right? This is binary, so this has to be of the same type. Anyway, we can speed up a little now, because we have seen these things many times now. Oh, okay. You see: thank you for hello, please send more, connection closed by foreign host. Done. Okay, that sounded polite at first and then it was a bit harsh, and maybe I should also send a newline here. So this is now our protocol, and let's do it one more time. But actually I'm a bit surprised: why did it break the loop even though I didn't enter an empty line here? I'm not sure I understand why it did that. Does anybody understand it?
I'm confused myself right now. Do you have an idea? Because this was not in my data, and still. Maybe you have to encode the string when you already encoded? No, no, this is the data I received here, so I don't know why. The return value of, oh, of find, it's minus one. Yeah, when I tried this out I didn't use find. Oh yeah, find just gives me the position, right? And minus one when there is no match, which counts as true in the condition. Yeah, I think that's the reason. Okay, so this was a Python thingy. Hello, okay. Are you talking to me? Okay, now it works fine. I'm sending something, it's collecting the data. Now I'm done sending. Now I typed an empty line, and it appended all of this, and now it worked. I can do it again. Actually I don't have to restart my server, right? I can just do it again here. So: hello, blah, blue, please. I can talk, it will just collect the data, and once I type the empty line, it's done. So my convention now is: I type an empty line, then the other side knows, and it responds with what it has collected so far. Okay, great. So we have established something that works, a basic back and forth. Our server is still running. And now let's do the following, and this is the next step: let's go to a web browser. Let's just enter this here: the name of the machine and the port, with a colon in between. And you see, something is happening. So now I also got something. And I can look at what I just got by pressing F12 here, which tells me what the browser was doing. It was actually sending something here to the other machine. So what happens in a browser? This is something you should understand, in case you didn't know it yet.
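The empty-line convention and the find pitfall from the live coding can be sketched as follows. The helper name `request_is_complete` is hypothetical; the behaviour of `bytes.find` is standard Python.

```python
def request_is_complete(request_data: bytes) -> bool:
    """Return True once the request contains an empty line.

    Each line ends with the two characters \\r\\n, so an empty line shows up
    as two such sequences in a row.
    """
    # find() returns the position of the match, or -1 if there is none.
    # The comparison with -1 matters: -1 on its own counts as true in an if.
    return request_data.find(b"\r\n\r\n") != -1


# The pitfall: using the result of find() directly as a condition.
no_match = b"hello\r\n".find(b"\r\n\r\n")  # -1, and bool(-1) is True
print(no_match, bool(no_match))
```

The search pattern must be bytes as well; searching bytes for a str raises a TypeError, which is the error message seen in the lecture. Writing `b"\r\n\r\n" in request_data` is an equivalent check that avoids the -1 issue.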
I type a host name and a port in the browser, and maybe something else after it, we'll come to that in a second, and the effect is the same as for Telnet: it's just calling that machine on that port and sending something. And as you can see, it's sending a lot of information here. To understand this a little better, let's now print this a little nicer. Before, I was just showing you the batches; I don't want that anymore. What's my request data? That's what I collected, so let's just decode it: request data decode UTF-8. Okay, now I will not show it in bits, but I will show the whole thing after I've received it. Let's see how it looks. Let's start again here. Let me go to my browser again and, bam. Okay. As you can see, just by typing this in the address bar and pressing return, the browser sent me this whole thing here, and here I have the newline and another newline, which I probably printed. And in this browser window the browser also tells me what it did: if I open this console with F12, it's on the slides, it tells me what it sent to the other machine, and it looks suspiciously similar to what we have here, right? We will not look too much into these individual headers; that's just the request the browser sends, and here we receive it. For the rest of the lecture, we're actually only interested in this very first line; the rest of the lines we can ignore for now. And what's the very first line? What's the typical thing you enter in a browser?
Typically you enter a host. You typically don't enter the port; that's implicitly 80 if you don't type anything. Here we will specify the port, and sometimes you see that. And then you enter something like this: search.html. And what's the effect? The effect is that the browser sends this line here and then a lot of other stuff. So I get this path, the part after the slash, in my request. For the rest of this lecture, let's just print this first line so that it's a little easier to see, because for this lecture we are only interested in the first line of the request. So how do we do this? We take our request and we decode it. We want to decode it, yes, I think we do. And let's split it by these newline characters, and then we want the first entry; I think that's one way to do it. Oh my. Oh, this was a key combination which exits, I'm sorry. And now I'm not sure: does this work in Python, or do I need a line continuation now? Request data received, let me just put the request here. So what I'm trying to do is extract the first line. I'm a bit unsure here, so let's just try it. Yeah, that looks good. Strangely enough, the browser also asked for something else here: this is actually the icon which it wants to display in the tab. We will come to that later. So it's actually sending two requests. Let's look at the network tab again, in raw form. What I now get here is just the first line. It's called a GET request, and it contains the keyword GET, then everything after the slash here, and then some information about the protocol, which for today's lecture is HTTP/1.1. Okay, let's see if there's something else on the slide. Actually, where are my slides? I left them. Okay, let's quickly go through something which we have already seen now. We have not yet spoken.
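Extracting the first line of the request, as done above, could look like this. The function name `first_request_line` is an assumption for this sketch, and the sample request in the usage note is made up.

```python
def first_request_line(request_data: bytes) -> str:
    """Decode the raw request and return only its first line."""
    # Lines in the request are separated by \r\n; the first entry of the
    # split is the request line, for example "GET /search.html HTTP/1.1".
    return request_data.decode("utf-8").split("\r\n")[0]
```

For a request such as `b"GET /search.html HTTP/1.1\r\nHost: tura\r\n\r\n"`, this returns the string `GET /search.html HTTP/1.1`; all header lines after it are dropped.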
Okay, we have seen the first part of HTTP. The browser is now talking to us, and it also has to follow some protocol. I already mentioned that you somehow have to say when you are done with sending something; it does that with this empty line. And it sends us GET requests. A GET request is just a string of this form: the keyword GET, then whatever comes after the slash in the address bar, and the rest we ignore. And now we are supposed to send something back. Let's do the sending back and then we have a break. Ah, okay, I'm already sending something back actually, the thank you. Let me not send the whole request here, just the first line. So now I'm just processing the first line and sending it back. Okay. And you can see what happened in this network tab of the browser console. It sent this here, these headers, of which we ignored everything except the first line. And then this is the response which it received and which it will then just show in the browser window. And actually I didn't adhere to the protocol right now, because I'm supposed to send some headers to tell the browser what it is that I sent. So I'm a bit surprised that I don't get any error messages. But let's do that now, and after that have a break. So what are you supposed to do when you send something via this protocol? You get a request like this, and you should send a response like this. You do not just send your content; the browser could have just ignored what I sent. You have to send some headers with some information. The first line should be a status code; 200 means everything worked fine. Then you have to send the length of the data you're going to send, so that would be the length of our message. And then you have to say what type it is.
So let's just do it; we will talk more about that later. So let our handle request return several things. Let me first write the handle request. So here I have a status, and let me just say the status is 200 OK. I have a type, let me call that media type, which is for now text/plain; I have a slide on this, I will explain it in a second. And let me return all three, maybe in the order in which they appear in the header: the status, the media type, and the message. So I'm returning three pieces of information now, which I have to read here: status, media type, message is equal to the result of handle request. And now let me send it properly. I have to send HTTP/1.1, then the status, and then a newline. A newline should always look like this. And if I concatenate strings in Python, I can do it by just writing the string literals next to each other. And I have to say Content-Length. That's the length of my message, so it's len of message. And actually I have to make sure that my message is data, right? Right now I'm returning a string here. So in case my message is of type string, I think that's the way you check it in Python, I'm not quite sure, please correct me if you know it better, let's turn the message into a byte sequence. This is what I explained earlier. So if it's a string, encode it; if it's already data, then I don't need to encode it. Now I can just take the length of my message here. I send this, and then the Content-Type, and this is the media type. This will be text/plain for now; I'm just telling the browser this is plain text and nothing else. And then there is something that's not here on the slides, maybe I should write it in with my pen.
What comes here is an empty line; there must be an empty line here, ended by this sequence again. This is just so that the browser knows: okay, these were the headers, and now comes the content. There can be many more headers, so you somehow have to say when your headers stop, and you do this with an empty line, like we already did before. So let me do it like this, and let me write it on a separate line here. Okay, this is now a string, so I also have to turn it into bytes again. Let me do it like that. And now I add the message. I have a subtle question here. Why did I do it this way? Why did I first encode the message and only then add it here? What would have gone wrong if I had done it the other way around: encode it here and not before? Do you see the difference? Yeah, the length might change, exactly. That's a very dangerous mistake, right? If I had done the encoding only here, then this would have been the length of the string, and the length of the string is not the number of bytes. So many pitfalls. So this is actually very important. Let's see what we get now, if it compiles. Okay. Oh, nice. Now we have some response headers here which weren't there before. This is what the browser received, and this is what it expects. So it was a miracle that it showed something before; it wasn't supposed to, but it did it anyway. Now it received proper headers. It says: ah, you're following the protocol, you sent me something which is 57 bytes long and it has this content type. So the browser called me, and I talked back to the browser according to the HTTP protocol. And as you can see, it's actually very simple, right? You read a line, and then you send back some lines in the proper format. So it's a very simple protocol.
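Putting the pieces of the response together, a sketch could look like this. The function name `build_response` is an assumption; the header names and the empty line between headers and content follow HTTP as described above.

```python
def build_response(status: str, media_type: str, message) -> bytes:
    """Build a complete HTTP response: status line, headers, empty line, body."""
    # Encode first, so that Content-Length counts bytes, not characters.
    if isinstance(message, str):
        message = message.encode("utf-8")
    headers = (
        f"HTTP/1.1 {status}\r\n"
        f"Content-Length: {len(message)}\r\n"
        f"Content-Type: {media_type}\r\n"
        "\r\n"  # the empty line that ends the headers
    )
    return headers.encode("utf-8") + message
```

With a message like "Knödel" the difference matters: the string has six characters, but the body is seven bytes long, and it is the byte count that belongs in the Content-Length header.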
And I think that's it for the first half, so we'll have a break now. Are there any questions for now? Maybe you can think about questions in the break, and then we do more sophisticated stuff after the break. So see you in five minutes. So, what we have seen now: the browser sent a GET request, and that's what happens when you type something in the address bar and press return. The HTTP protocol knows a lot more request types, like the POST request; this is for when you want to send data along with the URL. This does not happen when you type something into the address bar, but it is used in other contexts; we don't need it here. There are many other request types. There are also many more headers, in this case response headers, which you can send. For example we just sent 200 OK; you can also send not found, or forbidden. Actually you can send any status text you like, let me just make that clear. For example here I can also send 200 blah blah; let's see whether the browser complains. Now it just says status 200 blah blah; it's also okay. So you see, this is a very basic format, and I can write there whatever I want. It's really just strings, some of which are given meaning; the string after the number the browser doesn't actually care about. And it's just a convention that you use certain codes for certain events. For the exercise sheet you should implement these, and what this means will become clear in the following. We have already seen these media types, or content types as they are called in the HTTP headers. They used to be called MIME types, because this first came up in the context of mail. In a mail you have text, but you can also attach all kinds of things: images, PDFs, whatnot. And then you have to say what kind of data this piece of data is. That was called a MIME type at the time; nowadays one calls it a media type.
And there's just a convention for how these things are named. They typically consist of two parts; it's actually a bit more complex, but that's the most fundamental part. The first part tells you what kind of data this is, for example text, and the second part is more specific. So that's the semantics: the format or the kind of data. We'll use several of these today, not all of them. So far we were just sending text. And this tells the browser how it should interpret the data; we will see that in a second. We have already seen the console, and it will be very useful for the exercise sheet. Usually you don't have this open in a browser: you just type something in the address bar and you enjoy the content. But here, and this will also be valuable in two weeks, you can see exactly what's going on. It's a very, very valuable tool when you want to understand what's going on, when you want to debug, and so on. Here you can see everything: the raw data which the browser sent to our server and what it received, and all kinds of interesting stuff. There is also timing information, how long it took and so on; here it's on the same machine, so it's very fast. F12 is usually the key on all the browsers; here I'm on Firefox. It doesn't really matter, the browsers are very similar in this respect. The most important section for us now is the network section, which tells you what is being sent back and forth, because that's the topic of today's lecture. There's also this console here, which will be important when we do JavaScript, not for now, and the elements section, which we will come to when we have HTML and want to look at the elements of the page. Okay, let's now go to HTML. What is HTML? Well, usually you don't see content like this in a browser. It looks fancier, with a bit more layout.
And so you need a language to specify: this is a header, this is an image, please show this in this way. The language for that is HTML. I will not give you an HTML tutorial now, but we will just write an HTML document together and I explain some of it; it's a very simple markup format. So let me start by writing the body here; there's also a head, which I will write in a second. For example, this is a title. It uses this tag notation, which comes from the XML world. So I'm just enclosing this piece of text in these h1 tags. And what h1 means is: this is a header, a first-level header. And this just gives the browser a hint how it should display it. And you already see an abstraction here. It doesn't say, please show this in Helvetica 14 point. It just says this is a header, and then there is another layer of abstraction with default values in the browser, which says: okay, a header, I will show it in this font, in this size and so on. So let's just keep it that simple for now. So that's a simple HTML page, and there should also be a head with some meta information, for example the title. Now where does the title appear? Let's see where it appears by just writing something there. So that's metadata, and we will see in a second where it appears. So that's a very simple HTML page which has some metadata, the title, of which we don't know yet where it appears, and then what it shows is just the heading, our first search engine. Yeah, thank you very much. So it's an interesting question what the browser does when you mistype stuff here. Typically it ignores it; it will try to display it as best as it can. Okay, now how do we return this to the browser? We haven't done this yet, so that's what we will do next. I mean, for now we have just programmed something which says thank you to whatever comes: thank you for this request, please send more.
Now comes another element of abstraction: how do browsers work? Maybe you haven't realized this before, maybe you have. What this says here is: please, on your machine to which I'm talking now, on Tura 888, look whether there is a file search.html and return its contents to me. That's how this protocol started. But that's semantics, right? That's just the meaning I'm attaching to this. So I just take this as a file name, I look for it on my machine, and if I find it I return it. So let's program that now together in our request handler. The first thing we should do: we get this request here, so let's extract the part in the middle. And we only handle GET requests. Let's do it like this, and let me be a little bit faster now. So if this request does not start with GET (I'm not sure about startswith, you have to correct me if I'm doing something wrong here), then we just return: the status is, I don't know, 403 Forbidden, you can also write something else, go away. Media type text/plain, and then a message, maybe: we only support GET requests. Okay, now the path, so what comes after this GET. Oh my, I've mistyped here. How do we do that? I mean, there are some spaces here, so let's just try the following: path is request, split by space (the path itself does not contain spaces), and let's just take the part after GET. Let's maybe print it to check whether it works. Okay, that will do it. I think you have understood the setup by now. So I have now extracted the path, and let me check whether the other thing also works. Another way to talk to such a server is with curl. I can just send it a request via curl. I'm now on the same machine, so I can also do it with tura888 search.html. See, this also works.
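The request handling described above (accept only GET, then take the path from the request line) could look roughly like the following. It is a sketch under assumptions: the helper name `parse_get_path` is hypothetical, and it only looks at the first line of the request.

```python
from typing import Optional


def parse_get_path(request: str) -> Optional[str]:
    """Return the path of a GET request, or None for anything else.

    The first line of an HTTP request looks like
    "GET /search.html HTTP/1.1", so splitting it by spaces gives the
    request type, the path and the protocol version.
    """
    first_line = request.split("\r\n", 1)[0]
    parts = first_line.split(" ")
    if len(parts) < 2 or parts[0] != "GET":
        return None  # caller can answer "403 Forbidden" here
    return parts[1]


path = parse_get_path("GET /search.html HTTP/1.1\r\nHost: example\r\n\r\n")
```

A caller would send the "we only support GET requests" message whenever the function returns `None`.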
So this is just a command to send something to another machine from the command line, curl. Who knows curl, who has heard of it before? Only relatively few, OK. You have at least heard of it before. Now, I can also use curl to send other request types, so here I can send a POST request instead of a GET request. Let's see what's happening. Now I get: we only support GET requests. And here you can see it just said POST instead of GET, and there would also be accompanying data if I want. But that's just on the side. Okay, so the path was extracted successfully here. Now let's check whether a file with that name exists, and if so return its contents. So let me move that a bit to the top. How do I check? OK, first I have to use that name without the leading slash. It always starts with a slash, so let's just remove that. And now, how do we do it? I think the easiest way is to just open the file, right? Let's just try to open it for reading, and then, okay, message is, and how do I read from a file? Actually I'm not sure. File read, does this read the whole contents? You can check whether it's wrong. Say it again? It reads only one line? I'm not sure. Are you sure? And except: if this doesn't work, let's deal with that case. Let's do it like this: the code continues if everything works, and if something doesn't work we return, let's just say, not found. We already have this nice interface here, and let's just use text/plain, and I don't have to send a message actually. So now my message is the contents of the file, and maybe we can revert from blah blah to the more correct OK. So what did we do? I now check whether this thing here, which I received after the GET, without the leading slash, is a file. I'm looking for that file on my machine. If I find it I read it.
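The file lookup just described can be sketched as follows. In Python, calling `read()` on a file object without an argument returns the whole contents, not just one line. The function name `load_file` and the returned triple are assumptions for illustration; note also that this sketch does nothing to stop paths such as `../secret`, so it is not safe on a public machine.

```python
def load_file(path: str):
    """Map a request path to a (status, media_type, body) triple."""
    # The path always starts with "/", so drop that leading slash to
    # get a file name relative to the server's working directory.
    file_name = path[1:] if path.startswith("/") else path
    try:
        with open(file_name, "rb") as f:
            return "200 OK", "text/plain", f.read()  # read() = whole file
    except OSError:
        # Covers a missing file, a directory, an empty name and so on.
        return "404 Not found", "text/plain", b""
```

Reading in binary mode means the same function would also work for images later on.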
And that's now my message which I sent back. Let's see. Is there a mistake in here? If you see any mistakes in here you should tell me otherwise we just try okay there's already a mistake oh that mistake was that I typed something extra yeah okay let's see I'm curious myself oh yeah it worked wow it did work let's look in the network tab here. It worked on the first try. It sent something and it received, oh no that was the wrong one. It received the HTML. Now this doesn't look like the typical web page, right? So, strange. What went wrong? It shows us the source code, what did we do wrong? Any ideas? Media type, I hear the word media type, yes, exactly. This was on the previous slide, so somehow you have to tell the browser how to interpret it. Maybe you wanted to show this, right? Maybe this was the request, show me this file in its raw form. But somehow the convention in browsers is that when you have HTML you don't show it in raw form but you interpret it as HTML. And this is what the media types are for. So how do we do that? Well, I mean we can just use a convention and the convention is let's just look at the suffix of the file if it's HTML then we just take that as the media type. So let's just do if and let's maybe do it like this, if media type ends with.html then the media type is text HTML. It's still text we are sending, we are not sending image data or something bytes, I mean this is text but it should just be interpreted as HTML. Can you say it again? One hundred ten? I think that's correct. Thank you for paying attention. Let's just start our server again. Let's see what happens. Wow amazing. So our first search engine, okay it works now. We have sent HTML, it shows it as HTML. It's amazing. So let's see this again and here I can also, here actually I have a switch usually which shows me the raw data or the interpretation. So that's the raw data, the HTML or the interpretation. 
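The suffix convention for media types could be sketched like this. The table `MEDIA_TYPES` and the function name are assumptions for illustration, and the fallback to `text/plain` mirrors what the server did before. Python's standard library also offers `mimetypes.guess_type` for the same job.

```python
# Hypothetical lookup table: file suffix -> media type.
MEDIA_TYPES = {
    ".html": "text/html",
    ".css": "text/css",
    ".txt": "text/plain",
}


def guess_media_type(file_name: str) -> str:
    """Pick a media type from the file suffix, defaulting to text/plain."""
    for suffix, media_type in MEDIA_TYPES.items():
        if file_name.endswith(suffix):
            return media_type
    return "text/plain"
```

With this in place, `search.html` is sent as `text/html`, so the browser renders it instead of showing the source.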
Okay, now I will not give you an HTML primer. This I've already explained, there's this meta information. We were wondering about the title, and we just wrote something there. Where does it appear? Here in the tab, right? That's the thing which you see if you have several tabs, and the whole window, because it only has one tab, also shows it. So the title is meta information about the page. It's up to the browser how it displays it, but that's how this one displays it. Okay, great. So there is more stuff here. Of course HTML is a whole language; I'm not going to give you a tutorial here, just a few elements: a level-one header; a paragraph; an input field, which we are going to do now; and an arbitrary logical section, something which belongs together but without semantic information, that's just a div. You can look that up when you design your page. Let's do the following: we want an input field, we want to input something, so let's give it a certain size and see how it looks. We can do it like that, input size, let's just do that. Oh, and now I've changed the HTML file. I don't need to restart the server, right? I haven't changed anything in our server. I've just changed this file, and I do a reload. My server is still running, and now I get this wonderful input field. Let's also add a... What's the type of this thing? Text? I'm not sure. Is it type? I'm not so firm in this. Here I want another input, a button, and I want it to be a search button. I think that's the way to do it. These things after the tag name are called attributes. Let's just see. Oh yeah, it's missing a quote. We could see what happens when you miss a quote. Nothing, probably. Okay, great. And let's also add a paragraph, in 1990s look and feel. So that's how the first web pages looked.
Why is the ampersand shown in red here? Let's just see what it does. Yeah, so that's our 1990s web page, but it is a web page. For the exercise sheet, it's part of the task to have some styling, some design; we will come to that in a second. Why is the ampersand marked? The ampersand has a special meaning in HTML. The browser chose to display it anyway, but you have to escape it. So this is the proper way to do it: in HTML, if you want to show this special character, you write &amp;. Special characters, HTML entities, that's just on the side; it displays exactly the same thing. And note how I didn't restart our server here, right? Now the next thing of course is, I mean, the exercise sheet will be to connect this somehow with what you did for the last exercise sheet. Before that, one more thing: we want some styling information. Let me just quickly show that to you; that's also something for the head. If I want to associate certain styles, like colors, with certain elements, I do that here. This is a style sheet, and I have to give it an address. So I have to specify a file, which I call search.css, and let me write that file here as well. In that file I can now say, for example: all paragraphs, show them in blue. As I said earlier, these HTML tags are just semantic information: this is a first-level heading. Now I could say here: first-level headings, please show them in 40 point size, in red and in boldface. The browser has default settings, but I can override them here. So here I'm saying, please show paragraphs in blue. Let's just do that and see what happens. And it actually did it. It showed it in blue. And look what happened here. Now I have search.css here, and I have it twice; actually I don't know why I have it twice. But why did that happen? You see a lot of magic happening here. I don't know why this appears, why is that there? Now it's gone.
The browser gets this, it interprets this as HTML and in the head it says look I want to use a style sheet and the style sheet has this name. So what it does is because of this line when it interprets the HTML it issues another GET request. And because it's so excited for some reason it issues it two times. Who can explain to me why it issues it two times? I don't know. Let me see, there's a question. Why does the port of the client change? That's interesting, right? The port of the client change. Does it change all the time? Yes, it changes all the time. And the reason for that is, that's a very good question and that's also why I output it here. That's because the way we implemented our server is we get one request like search HTML, then we process it and then we close the connection. We don't have to do that, actually we could keep the connection open for longer and just okay, you ask something, I send you something, you ask something and I just keep it open. And actually that's what the browser would like and you can see it here. Then when it's here, when I look at the headers, it says connection, keep alive. There's this header, but we just ignored it. It says please, I want to send more stuff over this connection. But I'm just closing it after each request and then it has no choice. So the browser, depending on the browser, it could also have problems now, but this browser is intelligent enough to then just start a new connection. And whenever you start a new connection on this machine, this is also running on a machine, it has to find a new port. And as you can see, it's using a new port every time. It's not reusing previous ports, although the previous ports would be free again. So it's interesting. Interesting also that it increases by two. I don't know why. And it's not reusing the old ones. After some time it will, but I think the operating system needs some time to free them. But you see a lot of interesting things going on. 
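The "one request per connection" behaviour discussed above can be made visible with a small sketch. The function name `serve_once` is an assumption, and so is the explicit `Connection: close` header, which tells the client that the server will not keep the connection open. Every call accepts a fresh connection, which is why the client port differs from request to request.

```python
import socket


def serve_once(server_socket: socket.socket) -> int:
    """Accept one connection, answer one request, close it.

    Returns the client's port so the caller can print it and watch it
    change with every new connection.
    """
    conn, (client_host, client_port) = server_socket.accept()
    with conn:  # the connection is closed when this block ends
        conn.recv(8192)  # read (and here ignore) the request
        conn.sendall(
            b"HTTP/1.1 200 OK\r\n"
            b"Connection: close\r\n"
            b"Content-Length: 0\r\n"
            b"\r\n"
        )
    return client_port
```

A server that honoured the browser's `Connection: keep-alive` header would instead loop over several requests on the same `conn` before closing it.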
It's really interesting to understand them when you code it yourself. If you just use a library you don't see all that and you don't understand it. Ok so what's the next thing? So now we have styled our HTML page now, yeah, CSS there's also it's a whole language so to say, you can do everything, you can even do animations, CSS is very powerful so it's just styling a webpage and yeah every website has heavy CSS nowadays. And why? Let me just explain one more thing. It's a very nice abstraction that in the HTML you just show the contents and its semantics so you say this is an input field, this is a button, this is a header and how it looks like is in a separate file right? It's a nice separation of semantics. Here's the contents, here's how it looks like and that's why it's in two separate files. And it will be part of the exercise sheet to write to a nice style sheet. It's not hard to find documentation, it's a very easy language. Actually when we are done with this then we are done so it's not too much more content. Now what's missing? Well right now, I'm getting a bit confused with all the windows here. Let's write something here, I mean nothing is happening, we have a search button here, we can click on it, but nothing happens, we haven't given it any functionality. Now the usual way nowadays would be to have some code in the HTML, JavaScript, which says when you click on the button do this or that. For this exercise sheet we will do it the old fashioned way and the old fashioned way was to have a form. So for those of you who were already alive when the web was started you would have all these pages with form. Let's just see what happens when you, and then actually for the button I should have a name, let's just give it a name, that's our query button, it's for querying something and let's just see what happens when I, I've now just added these form tags around the input field and it actually looks the same. 
Let me just, oh, you see now something happened and what did happen and that's again semantics, I mean that's just how HTTP works and how HTML works in conjunction with HTTP in this case. When you put this form thing and you have a sub button here with a certain name, then what it does is it loads a new page which has the same name, but now it puts a question mark and afterwards now come parameters which are just key value pairs, something equal to something and the query is just what we wrote here. It's just the name of this button and then it's a search, okay. Did I type search? Now I'm a bit confused why it says search. I wanted blah blah here. Now what did I do wrong? I mean that's the... You have value search there, maybe that's the reason. Yeah, but the value of the submit button is what is this thing here that it shows. Yeah I have to tell it that it should, so how do I do it? I forgot it right now but this here needs the, I confused this. Okay it's on the slide, this needs the name and not this one. So I have to put both of them into the form, this will trigger the action here, the submit button, that's why it's type of submit and then what I, yeah, this here. It will take the contents, the value of this input field which is what I typed into it. I don't have to restart the server, let's just do it again, search and now I get query equals blah blah. Okay, very nice. But nothing happens. Why doesn't anything happen? Well I mean the server receives this and now it tries to find a file with the name search HTML query blah blah and it will just say actually yeah it says 404 right? That's what our server does, 404 not found. We implemented it that way. 
But actually that's not what we want right, what we want if you receive something like this, that's the semantics in HTTP of the question mark, it actually means this file and then these parameters so these are now parameters which do something and let's just implement that and for the exercise sheet you need something very similar. So how do we do that we're actually very close to this. Oh and here I went over the line, this is terrible. So where are we? Where do We read our file. and remove the leading slash, let's do that right here and you will, we can just remove that here and you tell me whether I do something wrong you think about yourself how you would do it Okay, now I first want to see is there check whether there are Command line whether there are arguments After the path these are called URL arguments the path, these are called URL arguments starting with question mark. Okay so I just do the following, how do I do it? If path find, we already had something like this, if I find a question mark. And let me even be more specific here, there could be now several arguments, actually there's a whole syntax for this, you can look it up, whether there is a URL argument query. Let's just deal with this special case here. So I'm just checking is there path find query equals. So that's just I'm just checking whether it's something like this and what comes after this. I don't know. And now what's the best way to... Now I want to remove it from the path and write this what comes afterwards in a variable query and I wonder how I best do this. Maybe I do it like this. Pause, I just find the position of this. Let me do it like this. And if it finds this, so if this position is greater, is not minus one, then I just for the path, I just take everything until that position. Now I'm not sure whether that's the right way to do it. And then the query is everything starting from that position plus seven. That looks like seven letters. Until the end. 
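Splitting off the query as described above could be sketched like this. The function name is hypothetical, and like the lecture it only handles the special case of a single `query` argument. The important detail is the order: the query is taken from the path before the path is shortened. For the general syntax with several arguments, Python's `urllib.parse.parse_qs` exists.

```python
def split_path_and_query(path: str):
    """Split '/search.html?query=foo' into ('/search.html', 'foo').

    Returns (path, None) when there is no '?query=' in the path.
    """
    marker = "?query="  # seven characters, hence "position plus seven"
    pos = path.find(marker)
    if pos == -1:
        return path, None
    query = path[pos + len(marker):]  # first take what comes after ...
    return path[:pos], query          # ... then cut it off the path
```

Using `len(marker)` instead of a literal 7 keeps the offset correct if the argument name ever changes.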
And let's just print this now. Where do I print the path? Let me print the path here afterwards, and let me print the query here, just to check whether I have found it: found query, query. Does this look correct? So if I find something there, I will extract it and remove it from the path, so that I return search.html, and the query, I don't do anything more with it for now. Now I have to restart my server. Let's just do it. I have blah blah. I do something. Yeah, that's fine. Now I get the page again, right? I should find it here. Ah, but it didn't show me the query. It did remove it properly from the path, but it did not extract the query properly. What's wrong? Oh yeah, do you see the mistake? Classical rookie mistake. Yeah, exactly: I should first take the part after this position and only then remove it from the path. Otherwise the path is already shortened and there is nothing left to extract. And so, yeah, now I have the query blah blah and I have search.html, wonderful. And by the way, what I'm also showing you on the side without saying it, so now I say it: do it incrementally like this, right? Nothing here is really hard by itself, but everything in combination is super complex and so many things can go wrong. So don't make the mistake of writing a lot of code, and then something goes wrong and you have no idea where it went wrong. Do it in this piecemeal fashion, that's the right thing to do. So now if we have a query, what do we want to do? Here's another trick, which you can also use for the exercise sheet, and I think that's on the last slide. What's the desired behavior, or the typical behavior? Google did it like this in the first ten years. Right now, if I type blah blah here, I get the entry page again, and it's empty. What would I like?
I would actually like to have the query here and then I would have the result at the bottom. That's what I would like. So let's just do that. So I want to modify the page. How can I do that? Well I can actually do that pretty easily in my code. So here at this point, And let's just do it totally hard coded, if there was a query and the path was search HTML, yeah, replace the templates in search HTML. And you will understand in a second what I, so what I do here, actually I can specify an initial value here and let me just write some placeholder here. And this actually percentage, this is just something which the server will replace. So that doesn't have any meaning, it's just something I give it meaning. And let's have here a logical section without any formatting where we'll have the result. And now what I will do, I will just replace these placeholders or yeah let me just call them placeholders. For search.html my, yeah it will just replace them, my code will just replace them. Let me move this a little bit to the top. So if my page is, and I will only do this for search.html, I mean this is now hardcoding some behavior. Let's just do the following, ok I need regular expressions now. In python that's called re, let's go back to the, where am I in my file, I'm sorry, here I am. So now what do I want? In the, I've already read the file, in the file to call it message, I should better call it, yeah, let me call it file contents. We called it message initially but now the message is always the contents of the file so let me call it file contents here. So if that was the, then I just take the contents of the file and I do this in the file, okay I need contents, yeah. And I think I need a new line here. Okay so if it's such HTML and I had a query then I just replace, and actually it's also fine if the query is empty then I'm replacing that by empty so it's actually fine. Let's just check this I need to I've changed something in the server so I need to restart this. 
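The placeholder idea can be sketched as below. The placeholder spelling `%QUERY%` and `%RESULT%` and the function name `fill_template` are assumptions; the lecture only says the placeholders are strings the server gives meaning to. This sketch also escapes the inserted values with `html.escape`, which goes beyond what was done live, so that a quote or an ampersand in the query cannot break the page.

```python
import html
import re


def fill_template(file_contents: str, values: dict) -> str:
    """Replace placeholders like %QUERY% by the given values.

    Placeholders without an entry in `values` are left untouched.
    """
    def substitute(match):
        name = match.group(1)
        if name in values:
            # Escape &, <, > and quotes so the value is shown literally.
            return html.escape(values[name], quote=True)
        return match.group(0)

    return re.sub(r"%([A-Z]+)%", substitute, file_contents)


page = fill_template(
    '<input value="%QUERY%"><div>%RESULT%</div>',
    {"QUERY": "6 * 7", "RESULT": "42"},
)
```

Doing all replacements in one pass avoids the case where the query itself contains the text of another placeholder.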
So let me just, oh, apparently that didn't work. Okay, I didn't replace the result yet, so this placeholder is still there. Let me do blah blah here. And now the blah blah remains. Actually it didn't remain: the server produced a new page, which had percent query percent in there, and replaced that by blah blah, exactly by that argument. That's why I did what I did here. Now let's also replace the result by something. So what do we do? If the query is a simple arithmetic expression, evaluate it and show the result. So let's assume we can type something like 6 times 7 or any other expression. How do we do that? Let's do the following: if the query matches, what's the regular expression? I have the digits from 0 to 9, I have plus, minus, star, this, and maybe spaces. It's just a regular expression for these characters, and I don't check whether the expression is well formed. If it matches this, then result is, let me just do eval of query. That's very dangerous: there could be Python code in here, right? Calling eval in a program which gets input from somewhere else is very dangerous. But we have a check here, so I think that's safe, but who knows, and I'm here in my own network. Let me show the result like this, maybe with a format string. And otherwise I just say invalid expression. And now I do the same thing here, just so that you understand this: I also replace this result placeholder. It's really just a placeholder, something I made up to realize this, and I replace it with the result. Let's see whether it works and how it works. And then we are, I think, almost done. Unable to connect. 'str' object has no attribute 'match'. This was wrong. You didn't tell me.
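The arithmetic check could be sketched as follows. Python strings have no `match` method, so the check goes through the `re` module. The exact character set (digits, `+`, `-`, `*`, `/`, space) is an assumption based on what was said, and the extra rejection of `**` is an addition of this sketch, since a chain of powers could keep the server busy for a very long time. Even with these checks, `eval` on outside input stays risky.

```python
import re

# Digits, the four basic operators and spaces. The hyphen comes last in
# the character class so that it is not read as a range like "+-*".
ARITHMETIC = re.compile(r"[0-9+*/ -]+")


def evaluate_query(query: str) -> str:
    """Evaluate a simple arithmetic expression, or report it as invalid."""
    if ARITHMETIC.fullmatch(query) is None or "**" in query:
        return "invalid expression"
    try:
        return str(eval(query))  # only reached for whitelisted characters
    except (SyntaxError, ZeroDivisionError, ValueError, OverflowError):
        # The pattern does not check well-formedness, e.g. "2 +" or "1/0".
        return "invalid expression"
```

`fullmatch` requires the whole query to consist of allowed characters, whereas `match` would only check the beginning.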
No I think it has to be re match, that's strange in Python you can't do string dot match you always have to use the name of the regular expression, if regular expression this matches query. I think that's the syntax. Let's see whether it works. No, it also doesn't work. Bad character range plus minus. Are you still paying attention? There's some. Okay, this is not a proper regular expression because there is a minus here and it's taking it literally I think. Maybe if I put it at the end here it works. I think that's the reason. Ok fine. Ah, blah blah, invalid expression. Let's do two times four. Oh my. Invalid syntax. Okay but now there's a real reason and this is the last thing we're going to solve and then we are done. Do you see something? something. You see in the URL actually what I typed, let's do it again, I typed two space star space four. Let's maybe do two plus, that's even nicer. And what it did, when it put it up here, and that's what the form, let's go back to this once more, that was the magic that happened because of this form. When I click on the submit button, what this here does, it reloads the page by appending query equals to what's written in the input field. But what's written in the input field can be anything. And in a URL you can't have anything. So there's a slide on this. And it's, so there's only a limited character set, it's very limited, allowed in a URL. And in particular spaces cannot occur in a URL and also, why is the plus okay? So if you have other characters you must somehow escape them, you must somehow, and you see how they are escaped here, you must somehow represent them in terms of the allowed characters and actually you can see it on the slide, let's just implement it for now. What happened here, so all the spaces were turned into pluses and the plus, because the plus stands for space, is turned into %2b. 
And I will just hard-code this now, and you can do a similar thing for this sheet, because that's the topic of next week. Where did I write the query? Oh yeah, here. Let's do some basic URL decoding of the query, and let's be more specific: very basic URL decoding of the query. So let's substitute any occurrence of a plus (and probably I should escape this, I'm not sure, let's see) by a space. So it receives 2 plus percent 2B plus 4: the pluses should be turned back into spaces, and the percent 2B, and again I have to pay attention to the order, should be turned into a plus. Let's see whether it works now. Nothing to repeat. Oh my. Oh, found query. The plus is the problem. Yeah, I also think I have to escape it here in the regular expression. Let's see whether it works. Okay, wow, it works. Blah blah: invalid expression. 6 times 7: 42, not bad. So now I think we are done. There's just this, which I mentioned earlier, whether you want to reuse the connections or not; you can look at this slide if you have any problems with that. Let me just quickly summarize, quickly go to the exercise sheet, and then we are done. So we have now done one whole cycle. We have explained how socket communication between machines works, and that the browser is also just one such machine, which talks to our server here. And if you follow the right protocol, then you get something which is familiar to you from using browsers. And here, with these forms, we pressed the button, and then another request was generated with this format, and we processed it. I think you can imagine how that fits with the exercise sheet, which I will show you now. The exercise sheet will just be: you want a web application (let me go back one more time), this is your page, you style it in a bit nicer way, maybe not 90s look and feel but at least 2000s look and feel, maybe also more modern.
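The very basic URL decoding from above can be sketched with plain string replacement, which sidesteps the question of escaping the plus in a regular expression. The function name is hypothetical and only the two cases from the lecture are handled. The order matters: pluses become spaces first, and only then is `%2B` turned back into a plus.

```python
from urllib.parse import unquote_plus


def basic_url_decode(query: str) -> str:
    """Undo the two encodings seen in the lecture: '+' and '%2B'."""
    decoded = query.replace("+", " ")  # 1. every plus stands for a space
    # 2. an encoded plus becomes a real plus (hex digits in either case)
    return decoded.replace("%2B", "+").replace("%2b", "+")


# The standard library handles all escapes, not just these two.
same = basic_url_decode("2+%2B+4") == unquote_plus("2+%2B+4")
```

Doing the two steps in the opposite order would first create a plus from `%2B` and then wrongly turn it into a space.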
Now you can type something here, a prefix, and then you get fuzzy search results of Wikidata entities, formatted in a nice way. So the first thing you have to do is redo everything I've done in the lecture, but do it yourself. Only consult the lecture if you have to, or to understand it, but don't just copy blindly from what I did. So much of these 10 points is essentially what I did today. And then also implement the HTML page; you should call it the same, search.html. We've also basically done this, but now of course you shouldn't just evaluate an expression, you should call your fuzzy search, right? You're not calling eval on what's written in the field, but you're using your fuzzy search to compute a list of results, which you then show. And then you should style your page. And let me just remind you, let me maybe go to this page here, that there is a lot more in the Wikidata entities TSV file that we used in the last lecture. Let's just show the first line. For the last exercise sheet you just used the entity names, but there is also a description here, there is a link to the Wikipedia page, and there is the name of the Wikidata entity, which you can also use for linking to a page. There are synonyms, which you could also already have used for the last exercise sheet, and there is a link to an image, if an image exists. This column is empty if no image exists. So all kinds of things which you can use. And by the way, you can also incorporate the images, just to be clear. You don't need a library or anything. If the browser requests an image from us, we could also just, oh, in this case I think the browser will load it from somewhere else. I don't think you have to serve an image here. But what you can do is this favicon here, which we never read, which is why no little icon shows in the tab. This you can create too; that's also a little image.
Yeah, so there's a lot you can do here to get your points. The minimum is described here, you just have to make it a little bit nice and everything. But this sentence is important: you have two weeks. Give free rein to your creativity. And there will be another iteration in the exercise sheet afterwards, exercise sheet 7, where you add JavaScript and make it dynamic and everything. So this will be with us for three weeks. I think it's a great exercise sheet, where you will learn a lot about how this works from the ground up. That's it from my part. Is there any question from your part? I have a question about the program. Is our program capable of having two connections in parallel? Because we only have one main loop, and if we are serving another computer, are we able to? I don't think so, but. Yeah, it's a very good question, it's actually what the last slide was about. Our server is a very simple server in many ways: it will only serve one request at a time. And this can be a problem, because when serving the request with your fuzzy search, maybe you type something where it has to compute a lot, and then it takes five seconds before it gives the result. If other requests come in in the meantime, they will be ignored, or who knows what the browser does. So there may be problems, but if you handle it properly, the browser will just get no reply for those, or it has to wait. So yes, we are not doing that. I mean, if you did it for real, you would have multi-threading, right? Whenever a new connection comes in, you would put it in a separate thread, where it is processed, while you can immediately accept new requests. But that's way too much for now. You can do it if you like, but I think for the application here it's not needed. It's also not so hard, I think. But as you correctly said, our script is very simple. Now something changed.
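The multi-threading idea mentioned in the answer (one thread per connection, while the main loop keeps accepting) could be sketched like this. It is not needed for the exercise sheet, and the function names and the fixed "ok" reply are assumptions for illustration. Python's standard library also ships `socketserver.ThreadingTCPServer`, which packages the same pattern.

```python
import socket
import threading


def handle_connection(conn: socket.socket) -> None:
    """Answer a single request on `conn`, then close the connection."""
    with conn:
        conn.recv(8192)  # a slow fuzzy search would run here
        conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")


def serve_forever(server_socket: socket.socket) -> None:
    """Main loop: hand each new connection to its own thread."""
    while True:
        conn, _ = server_socket.accept()
        # The thread does the slow work; the loop returns to accept()
        # right away, so other clients do not have to wait.
        threading.Thread(
            target=handle_connection, args=(conn,), daemon=True
        ).start()
```

Any data shared between threads, such as a cache of search results, would then need a lock.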
Any other questions? There's a question in the chat. Where do we send topic suggestions for the next lecture? You mean for the Q&A session? Ah, Baba, Natalie just reminded me. Let's just do an announcement post and then you can reply to that announcement post with suggestions. So if you have any wishes for the Q&A session next Tuesday, you can just write them there. So in the announcement forum I will just write a little post and then you can reply to that. Any other questions? For now? Okay, then I hope you have fun with the sheet and see you next Tuesday for the Q&A session. Bye bye.
Welcome everybody to lecture 7, Information Retrieval, in the by now freezing cold winter semester 22/23. Where is my mouse pointer? Here it is. I will say something about your experiences with exercise sheet number 6, which was web applications, part 1. You had two weeks for this, and it will come as a big surprise that the content of today's lecture is web applications, part 2. We will talk about dynamic web pages, multi-threading, vulnerabilities, and Unicode, and I hope we have time for all of it. The exercise sheet will be to continue what you did for the last sheet and make it dynamic. I think it will be less work because you have already done all the foundation work, and I think it will be fun. If for some reason you haven't done the last exercise sheet or you were not happy with it, you can also use the master solution and start from that. So let's start with your experiences. Most of you liked the sheet and also the lecture a lot. A minority was not too happy with the low-level web app programming. I will say something about this. For those who liked the lecture, I'm happy about it, because it's quite challenging to present this stuff live, but I think that's the way you learn it best. If you just teach web stuff in theory, it's like teaching software engineering in theory.
It goes into one ear and out of the other. You have to see it live and then do it live. This exercise was a lot of fun, didn't seem like a chore, corresponding lecture was great and well structured. I understood everything, but the small details were tricky, and there were strange errors which took me quite a while to solve. Many of you wrote that, so the small details, and it was exactly what I said in the lecture: in principle everything is simple, but when you put it together so many things can happen, and that's of course the point of the exercise, to make that experience and learn from it. Closed my knowledge gap on how this stuff works in practice. Many people said that they had done some web development maybe, but never on a low level. Nice to see the algorithms of previous lectures coming alive, so alive in a web browser. Design is not as easy as it seems. Several people said something like that, and one wrote, damn, it's hard being creative. Yes, it is. I'm worried about my ratio of time spent to credits earned. So several of you, I think, spent a lot of time making everything nice and really understanding everything. Let me assure you there are also questions about this in the exam. So there will be an exam question which maybe says write a small HTML page, write a little JavaScript. So it's not that this is only a practical lecture and not relevant for the exam. It is, and there's usually a question about this. And every year, so I'm doing this for many years now, although the content always changes a little bit and gets updated, there's always someone, one or two or three people, who say, this is not information retrieval, teach us more about information retrieval.
I respectfully disagree. Over the years I've built a lot of search engines and information systems, or worked at companies building these systems, and it has always been and still is an integral part to also build a web page and write the code which communicates with the server, and the server code which does the right thing with the web page. It's just an integral part, and it's so important to learn this and to understand how it works at the low level. So whenever you work with us later, maybe for a project or thesis or whatever, you will see that this is an integral part of everything we do. And you have to understand it. If you just use something out of the box, it's not enough. So I think it's really important to learn this at the low level, how we do it, and that's also why we have two lectures about it. So I'm pretty convinced of this. And a lot of you have put in quite some effort and made some really nice demos. I will show them at the beginning of the next lecture, when they will be dynamic and even more fun to play around with and look at. What are your experiences with 19 degrees in public buildings? When we came to the room it was actually 16 degrees, because of, I don't know, a sense of overachievement, but it felt warm, because if you come from minus 6 degrees then even 16 degrees is warm. Interestingly most of you, not all of you, don't mind the lower temperatures. Many, also not all, think it's a good idea. Some of you weren't aware that there's a temperature limit, but that explains why I was so cold lately. Hasn't bothered me at all. Cool, in the sense that it's okay, it's a good idea. You can always wear a jacket, many wrote that. Sufficient blankets in the Fachschaft. When I pass by the Fachschaft, they even have their windows open, so they have a lot of heat inside of them. Annoying, I'm freezing all the time. I'm also feeling like this, so for me it's pretty hard. When sitting for longer I get cold anyway, so everything is as usual.
That could also have been me. Okay, so that was the organizational part, and as usual at quarter past we start with the content. And it will again be a lot of live coding and doing stuff together, which we then don't give to you, but you can look at it in the recording if you need to. And the slides are for reference, so the core stuff is on the slides, but I will not look too much at the slides, we will code together. But let's start with slides. So what is JavaScript? That will be a main part of the lecture today. JavaScript is a programming language for code that runs inside a web browser, interacting with a web page. And we will write JavaScript in a second. Nowadays almost every web page contains JavaScript. When it started 20 years ago, I had colleagues, one working in security, who said, no way, you should turn off JavaScript, it's evil to use JavaScript in web pages. If you turn off JavaScript nowadays, you will not have a very pleasant experience with the web, because I think no web page will work anymore. It's also used nowadays as a programming language for anything, stand-alone outside of the browser. In principle, JavaScript is as powerful as any other programming language, Turing complete and everything. However, of course, when you run it in the browser it has limited access, right? You can't write code in the browser which then reads some files from your computer, that would be strange. So it's kind of running in a sandbox, but when you use it as a stand-alone programming language it can do all these things too. And of course that's the purpose of it: it can interact with the web page, somehow get input, get notified when you type a key, we will use that today, and then do something and change the web page dynamically as you do stuff. That's exactly what we will do today. So we will start any minute now.
Of course on these slides I will not give you an introduction to JavaScript, it's a full programming language. I will just explain some basics to you and differences to other languages. And please, when you work with it, the main stuff is on the slides, but there is the reference manual, so on the last slide you have references. Just google it and ask on the forum if you need to find specific things. So I don't have a full reference on the slides, of course not. It's an object-oriented script language, which means it's interpreted like Python, line by line. The syntax is somewhat similar to Java, hence the name, but it actually doesn't have a whole lot to do with Java. Speed is also similar to Python when interpreted, but nowadays JavaScript is usually not just interpreted. We have seen this for Python and PyPy, where you can do just-in-time compilation. Oh, this code is executed in a loop a thousand times. Let me compile it to machine code and then use the compiled code. You don't get as fast as C++, we had this in one of the last lectures, but you get speed similar to Java with this. Variables are untyped, so like in Python, it's typical for script languages: whether it's a number or a string, you just assign, and then the language somehow remembers what type it is. So here you see the most common types, a number, a string, so it's as you expected, also syntax-wise. An array is just square brackets, and a hash map or associative array, where you have key-value pairs, is curly braces. Okay, and then we will use a lot of these variable declarations, and I think that's the last theoretical thing I show, and then we go to the coding. There is var. So in the beginning of JavaScript there was just var, like you see here. Here's a new variable. I declare it. I don't give a type. I just say this is a variable declaration. Actually you should not use var, because what var does is this: when you declare it outside of a function, that variable is visible everywhere.
When you declare it inside a function, it's visible everywhere in the function, even if you declare it just in a for loop or something where you would expect that it's just visible there. var also has a number of funny side effects because of this, which are very surprising. So maybe I should just, as a symbolic act, cross it out: don't use var, because let is the way to do it. With let, you have some code in curly braces, and then that variable is strictly visible only there. Outside you can't use it, you have to declare it again. That's how you would expect it. And there is const, which is like let for when you assign a value and you know it is not going to change. You have to assign it at declaration and then it's fixed. And if you try to change it again you will get an error. So let and const, that's more like modern programming languages do it. As usual with these languages, this started somehow as some idea by someone, and then this stuff already gets used when it's not really mature, and then over time it gets more mature. Okay, so let's go to the coding now, and what will we do? So I'm here, hopefully in lecture 7, code from the lecture. And the first thing I will do is I will just copy the code from lecture 6 and see whether it works. So let me just copy it. That was wrong, OK, that was wrong I think. I'm sorry, I copied the wrong one. I think now it's correct. Yeah, there it is. That's what we wrote in the last lecture. We didn't really use the makefile. Let me just run it. I think I still have the command line here. I hope you remember it, this is what we wrote in the last lecture, it was this web page. By the way, this here, maybe that's the first thing we should do in the editor. So now our search.html from last time. I need a bit of time to, yeah. There's an attribute which is autocomplete equals off.
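The basic types and the scoping rules for var, let, and const described above can be tried out with a small sketch like the following. All variable and function names here are made up for illustration; they are not from the lecture code.

```javascript
// Basic values: the variable itself carries no declared type.
const answer = 42;                      // number
const title = "information retrieval";  // string
const words = ["web", "search"];        // array: square brackets
const counts = { web: 3, search: 5 };   // key-value pairs: curly braces

// var is scoped to the whole function, even when declared in a for loop.
function lastIndexWithVar() {
  for (var i = 0; i < 3; i++) {
    // loop body
  }
  return i; // i is still visible here
}

// let is scoped to the enclosing curly braces (here: the loop) only.
function indexVisibleWithLet() {
  for (let j = 0; j < 3; j++) {
    // loop body
  }
  return typeof j; // j does not exist outside the loop
}

// const must be assigned at declaration and cannot be reassigned.
function tryToReassignConst() {
  const limit = 10;
  try {
    limit = 20; // raises an error when this line is executed
  } catch (error) {
    return error.name;
  }
  return "no error";
}

console.log(lastIndexWithVar(), indexVisibleWithLet(), tryToReassignConst());
```

Note that the error for reassigning a const is raised when the assignment runs, not by a separate compile step.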
Let me do that and let's check whether that has an effect. Now I shouldn't get this drop-down with stuff which I typed earlier, which is a bit annoying here. So 6 times 7 is 42. Can it also do 6 times 8? That's what we did last time, right? Just to recapitulate: if you type something here and press search, then it sends something to our server, and then the result is sent back by the server, which is doing something very trivial. It's just checking, is this a mathematical expression, and if yes, evaluate it and send back the result. You could have done this in the web page of course, but for the sake of example we do it on the server side. So that's what we have. We have a lot of warnings here, and I think maybe that's the first thing we should do, we should fix these warnings. What kind of warnings do we have? Maybe it's a little small. The first thing it says here is that it's using Windows-1252 encoding. So the first thing we should do, I think it's also somewhere on the slides, and I already talked about encoding, and it's also the last part of the lecture today: we should say, this is just a sequence of bytes in a file and we want this to be UTF-8. So let me just write that there and now, yeah, now this is gone. Then it says this page is in quirks mode. We already had that last time, I think. I don't know, maybe it can be made a little bit larger. And it says please add this at the beginning, so let's just do this and see what happens. So this is just warming-up stuff, it doesn't have anything to do with JavaScript yet. If I don't write this, then the browser will be very relaxed with all kinds of mistakes I make, and we made a lot of mistakes last time, and I think we will see in a second what kind of mistakes. And this says, please be strict now. If anything is not quite right, don't do it. Let's just see the effect of the doctype html. If I reload, you tell me what's different.
First, the warning disappeared. Anything else that was different? The blue color disappeared, yeah. Why did the blue color disappear? Let's just look here. Oh, here it says the style sheet was not loaded because its MIME type is text/plain and not text/css. So this is also a bit to remember what we did in the last lecture. We had this MIME type thing, so when we return something, where is it? Here. Ah, okay. Yeah. If the path ends with html, send it as text/html. That was important, otherwise we just see the raw HTML. But we didn't add other MIME types. For example, if it's CSS, then the MIME type shouldn't be text/plain. In quirks mode before, the browser just accepted it, because it said it looks like a style sheet. And now I'm saying, OK, if the file I return is the search.css, then the media type should be text/css. Let's just see if that fixes it. No. OK. Why didn't it change? Why didn't it fix it? Yeah? Yeah, I have to rerun the server. So if I just change something in the CSS or HTML, I can just reload. This is always something to keep in mind. Now I changed the server code, so I have to rerun it. Ok, so now it's back to the 1990s look and feel. Let's also add a warning here, because that's really a don't do this at home. Please don't do this, you should do something nice for the exercise. Here is also something strange, some formatting disappeared. What happened there? Oh, because it was a div. Maybe we should, and I think, yeah, let's do the following, let's have a proper paragraph here, with both the query and the result. I also want to see the query here once more, and now I just have a span. We will use the span later, that's just a piece of HTML which I then can address separately. So let me just do it as follows. So just a little making it nicer and more correct. So the span as I do it right now has no effect.
It just says that what's inside there is now an element on its own, but you will see in a second why it's useful. You can ignore it for now. I have some hotkeys here which do strange stuff. Okay, so now, let's see, I don't have to rerun the server. Okay, so now when I run something here, oh! Now this is also blue, I don't want this blue. Oh, that's now because I said every paragraph should be blue. CSS can do a lot of things, so this just says the first paragraph, p:first-of-type. If you want to play around with CSS, there is a whole documentation on that too. Ok, now it's just the warning here and so now... Ok, so I get the query, I get... ok, the result also has "result" in it again. That's not something to be returned by the server I think, but written by the web page. Let me do it like this, run it again and... okay, great. The form was submitted in the Windows encoding. I thought we got rid of that. Maybe some of you know that. Maybe this is not the right way to write it. Meta? I thought it is. Maybe we can just google it: meta encoding tag. Oh, it's charset, I see. It's charset and not encoding. Okay, now we should get rid of that warning too. Oh, and now we have the favicon, it's always trying to load the favicon. This is what appears here in the tab. Let's just also try this, and I think, let's just get the one, I think I have it somewhere in my history. Oh yeah, let's just get the one from the University of Freiburg. It's an image which I can download. So now I have a favicon here, favicon.ico. What happens now? Okay, yeah, now I have no server running. So let's see what happens. Yeah, it didn't really work. Why didn't it work? Say it again. Yeah, that's true. So that's the first thing we should do.
We should probably, yeah, so if it ends with .ico, and I think the right media type, I think you can also use some image type, but x-icon, I think that works, let's just try it. Okay, it's still complaining, why is it complaining? I think it's complaining because we are not reading it properly, right? Oh yeah, actually I think I know what's going wrong. It returns a not found although it's there, right? Wait, where were we? Yeah, it says not found although it is on my machine. Let's just check whether it's here. Yeah, it's there, favicon.ico, but still it says not found, and I think the reason is that I'm reading the file here as a text file with "r", and this doesn't work. And what I did here is, when this doesn't work, then I have an exception and then it says not found for some reason. So let me just do "rb" here, read it as binary. But now I have another problem, because now I've read it as binary. The other files... Here I have the search.html, and when I load it I do all kinds of string manipulations. So I think I should just do, oh my, how do I do that? Since I read it in binary, I should probably just take the file contents and make a string out of it. How do I make a string out of it? I should decode this. So it's now binary and I want a string, and that's what the function decode does. Let's see whether it works. So we are still warming up. Okay, it worked. You see? How did it work for the CSS now? Very good question. Well, with the CSS there is no problem, because it doesn't really matter how I... Here the problem is that I'm reading the file and then I'm doing all kinds of operations with strings, right? String replacement. It says, look, this is a sequence of bytes, you can't do a string replacement.
I don't do any of this with the other files, I just read them and then I pass them on. But somewhere I think I have a line that saved me, because when I send it back, where do I do the sendall? Yeah, here, we did this clever thing. So when we get the message, sometimes we get it as bytes, sometimes as a string, and this ensures that whatever our message is, a string or a byte sequence, it will be converted to a byte sequence here. So that was a very smart move to have this line. That's why it now worked. And I think it's a great example: if you know what's going on, two small changes did it. If you don't know what's going on, you can spend ten hours on this, right, to find out what's going on here. You really have to understand bytes and strings, and that's typical for this. But when you understand it, then you can do things very quickly. So now it works, looks like we don't get any more warnings. We even have this favicon here from Uni Freiburg, you see it. Actually, I can do this now, right, and then I just get this picture here. So this also works great. So I think now we can start with the JavaScript. Now we have something that works and we are warmed up again. So let's go back to the slides. Okay, I think that's the first thing. So I want JavaScript in my HTML, how do I do it? The first thing I can do is I just have a script tag and then I put JavaScript code in there. And the second way to do it is to have a script tag and say, please include this file. And this is wrong here, because there are quotes missing. And let's just do the second one and see what it does. So just in the head, and it's important to note, we will see that in a second, this is now code being executed, and wherever I write it, it will be executed.
And this will become very relevant in a second. So it's also important to note that I can write relative paths here. So here I'm not adding any code in between, but I'm just saying, read this file search.js and then execute it. So let me create a file search.js. And I don't know, let's just write something here like alert. It should produce a pop-up, simple JavaScript code. I haven't changed anything in the server, so yeah, it works. I get an alert now. I'm a bit surprised that it didn't complain, because it's text/plain, right? So for some reason, you never know what the browser accepts. It didn't complain about the media type here, but it should have, in my opinion. So let's just say, if it's .js, then it should be application/javascript. You see, once we have everything in place, it's very easy to make small changes. And now we should see it here. I'm sorry, that was the wrong one I think. It's a bit confusing. It changed the order at the moment I wanted to... Oh, I didn't rerun the server. Now it's application/javascript, but it also worked with text/plain. Ok, now we don't want annoying pop-ups, that's an absolute no-no. It's like using Comic Sans when you write work documents. Old people do it, you shouldn't do it. So what else do we do? There's a very convenient thing which you will also want to use for debugging, console.log. This will write a log, and where will it write it? Well, let's check, we don't have to rerun the server. It will write it here. There is a tab here in the development console, Console, where you can just write messages. That's super useful for debugging and just saying what your script does. So this browser is nice enough that every time I reload, it deletes everything, otherwise I have this trash can here to delete everything. So what's next? Oh yeah, now comes the previous slide.
Now we want to use the code to change something in our web page, and for that the document object model is relevant. What is the document object model? So many windows here, I'm sorry. Let's see how this works. This is an HTML page. Where was it? Here. It has all these elements, a head, an h1 header. Each of these things is a thing in the document object model. And actually, if I go to inspect here, then I, okay, I think I will do the following for some things: I have created a proxy here. This now is slightly different, I just created something so that when I type this in the browser it's actually going to that machine and just forwarding it here, because there is a larger screen here. So what I can do here, I can do inspect, and now here you can see I get to the HTML code, and I can in fact do that for any web page, that's very useful. Let's just do it for the University of Freiburg web page, and I go to this thing, 19, and I get the element here, and maybe I don't like it, so let me just delete it. Ok, here it is misspelled. So let me just type university of Braheforsk. Okay, so you can change web pages, as you can see. So also if there are annoying pictures which you don't want, you just go there and delete them. Picture is gone. So that's the document object model, and this is also very useful for debugging: you can just do inspect and you see, ok, what kind of element is this actually. And understand that this is by design public, a web page can't hide it. I mean, that's the HTML sent to the browser, so it has to be like that. Even if you are on some strange sites, you can always do this, you can always do inspect and change the elements and see what's in there. It can't be hidden, there's just no way.
Okay, that's the document object model, so just everything is a thing here and you can manipulate it, and now we do this with JavaScript. And the way you do it with JavaScript is as follows. You can give things IDs, and that's all we do for now. You can also address them by other things like their names, but the easiest thing is to give every element which you want to manipulate an ID. So here, just as an example, result. And then if you want to manipulate it, you say in your code, I want that element and I want to change its contents. Let's just do it and see what it does. So let me just say document dot, and what element do I want to change? Maybe here I want to write no result yet or something, in the result thing, let me do that. document.querySelector, and now I have this thing called result. I want to change the innerHTML and I want to say no result yet. And semicolon. So let's execute that and see what's happening. And it's not working. document.querySelector is null. And this is an error message you will see, and who has an idea why this is happening? We didn't declare the ID, ok, let's check. Ok yeah, we didn't declare the ID, you are absolutely right. So let's add some IDs here. Maybe let's call it input, and this span, that's why we wanted the span, for this part we now want to give it a name. Giving things an identity. Span query, id result. Okay, these are the elements we want to do something with. Let's maybe get rid of this now. Okay, let's inspect. It's strange, I don't know, the inspect doesn't work so well here, let's maybe do it here. It says span id result, but somehow on the console it says cannot set properties of null. Any idea what's going on here?
Maybe make this a little bit larger, yeah? Maybe it would be easier to type the same line in the console, because then you can usually use the IntelliSense to see what you're doing. Yeah, okay, that's a very good idea. And that's also great for debugging, thank you very much. Let's just take this line. So debugging-wise, JavaScript is just great, much more convenient than other programming languages, because you have this console where you can interact with it. I can just write JavaScript here and execute it. And now it works, so it's no syntax error. So it did what we expected it to do, but when I do this, it doesn't work. Any ideas? You've included the script file before the HTML. Yeah, very good. That's what I said earlier. The script is executed at the moment you include it in the HTML. It's just top to bottom. Here is the script tag, this is loaded, as soon as it's loaded it's executed, and when it's executed this part of the page is not yet there. So how do we address that? Well, one solution would be to just have it down here, then it would work. We could do that. Another solution, and I will choose that for today, is to just write another script tag down here. There are actually many solutions for this, I will just take this very stupid one, but it actually works fine. Let me just have another script tag down here which just calls a function. As soon as I'm down here, all the elements are there, and then I'm calling this function documentReady. And now you also see our first function here, documentReady, and when this function is called, then this gets executed. And maybe we can comment that out, I mean we could leave it there but we don't have to. And now it works. And now I don't have any more error messages. Okay, so that is settled. Let's go back to the slides. What else do we want?
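The documentReady idea from above can be sketched roughly as follows. This is not the exact code from the lecture: the id result and the function name documentReady come from the lecture, the property innerHTML is my best reading of what was typed, and the doc parameter is an addition of this sketch so that the function can also be tried outside a browser.

```javascript
// Meant to be called from a <script> tag at the end of the body, so that
// the elements it touches already exist. The `doc` parameter is an addition
// of this sketch; in the browser you would pass the global `document`.
function documentReady(doc) {
  const resultElement = doc.querySelector("#result");
  if (resultElement === null) {
    // The situation from the lecture: the script ran before the element
    // was parsed, or no element has the id "result".
    console.log("No element with id 'result' found");
    return false;
  }
  resultElement.innerHTML = "No result yet";
  return true;
}
```

In the page this would be called as `documentReady(document);` from a script tag placed after the span with id result. If the call runs in the head instead, querySelector returns null, which is exactly the "cannot set properties of null" error seen in the console.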
We also want to do something in response to user action. What we want to do today, and also you for the exercise sheet: every time I type something I want to trigger an action. And that's the way to do it. Event listener. Let's write that now. And it's amazing how simple it is. So let's just do the following. And this actually, yeah, we can leave it there, why not. Now I want to do the following, there is a query selector here. I have an input element, I gave it the id input, so let me look at that element. This just means id, I can also address elements by type. And now add event listener. Now I have to say input again, don't ask me about the syntax, I find it a bit strange. This is not the type of the element, it's the type of the event. And now I want to tell it, okay, when something happens there, call this function. I write the function here, and now I write the code for that function. And when that function is done, then I do this, and then I do this. So this is another thing you will see in JavaScript a lot, that you have these anonymous functions. And that's very natural if you think about it. So what this code is saying, I've not written anything here yet: something happened to the input field, somebody did something, and when this happens, then execute this function. Of course I could write it as a separate function and then add the function name here, but I don't really need this as a separate function. I just want to write it right there. That's why I can do this anonymous function thing. It's like lambdas in C++. Anonymous functions which are defined right there are just very useful. So let's write a function, and the first thing we want to do is to just say, look, something happened. Let's use console.log for that. Yeah, input action detected. Let's see whether it works. Value of the input is, and let's just write the input here.
Okay, I need a variable and let me read the variable. How do I get the variable? Well, that's another query selector I guess. It's just the input field and I get the value of an input field with value. That's just something you have to know but you quickly find the references. Let's just check whether this works. Let's see what happens. Yeah, it works. Simple enough, right? So now I've already made my page dynamic. Whenever I do something, and you see I have this nice console output. And that's also how I recommend that you do the exercise bit by bit. And that's great with the development console. Make a little change, see what it does, make the next change. Don't write a lot of code and then everything goes wrong. You don't know where it goes wrong. Okay, so we have done that. That's great. Now, what's next? Well now, what do we want? We want to make our web page dynamic, so what it should actually do, it should take this, send it to our server, the server does something like before, gets it back, and what it gets back should be written here. That's what we want now. So now we want our JavaScript to communicate with our server. And if you remember, how did we do that last time with this anachronistic ancient form thing, that's how web pages were in the beginning. You have a form, you have this submit button, when you click it, then a request is triggered. So, let's go to the slides and see whether I have something to say. Oh yeah, what a coincidence, there's a slide about how to communicate with another machine. And this has developed a lot over the years, so actually when I started this lecture ten years ago, this set of slides looked very different. So nowadays you do it via Fetch and Fetch looks like this and I will explain these briefly in the following but let's first just do it and see what it does. What Fetch does is pretty easy to understand, I mean you write a URL here and then it sends to the server and then you get a response. 
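Put together, the event listener with its anonymous function might look like the following sketch. The id input, the event type "input", and the console.log message follow the lecture; the wrapper function registerInputListener and its doc parameter are hypothetical names added here so the code can be run without a browser.

```javascript
// Attach an anonymous function to the input field. It runs on every
// "input" event, that is, whenever the content of the field changes.
// `registerInputListener` and `doc` are names made up for this sketch.
function registerInputListener(doc) {
  const inputElement = doc.querySelector("#input");
  inputElement.addEventListener("input", function () {
    // Read the current content of the input field via its value property.
    const query = doc.querySelector("#input").value;
    console.log("Input action detected, value of input is: " + query);
  });
}
```

In the page, this would be called once as `registerInputListener(document);` after the input element exists, for example from documentReady.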
Let's just do it and then let's understand the strange syntax with async and await and then, because that's not easy to understand. So listen up, because you need it for the exercise sheet. So let's do it in the JavaScript. So what do I want? I want a response from my server. Now I have to write await. I have to wait, makes sense: I'm sending something to the server, it might take some time before I get a response, right? Now how did I call my, well, actually I did it like this, right? So we had it like this. That's what the form did: query equals and then the value. Let's do a, no, I think we have to do it like this, let's just do it like this. And then we have this funny then. Why do we need this? We don't know. We just do it for now and I will explain it in a second. That's a bit strange, but there are a lot of concepts hidden in that line. So what I'm doing here, I'm sending that to the backend and for now let's just... In the end it will be very few lines, but there's a lot to understand behind these few lines. You don't have to understand everything for the exercise sheet, but the more you understand the better. So here's the result. And let's just... or response... let me call it result. Why not response. How are we doing time-wise? Let's just see whether it works. So now I'm doing the same thing, I'm mimicking what my form did, I'm just sending this query, and then let's see what I get back. Let's just see. I'm typing something. Document ready is not okay. "await is only valid in async functions." This is something I will explain in a minute. What's missing here: I'm doing something asynchronous in this function, which means, I will explain this, I have to declare the function as asynchronous, otherwise compiler error. The page does not compile. Now it's gone. Now let me type something. 6. Okay. That worked. Amazing. It actually communicated with our server.
Here it says found, query 6, search.html, so it actually did send search.html with query 6 just by typing 6. I type more. So all these things are sent without the search button and everything. So the first thing we should do, we should get rid of the stupid search button and the form. We don't need it anymore. Goodbye. We don't need that stuff. You don't need a search button anymore. Okay, now we... Yeah, it also works without the search button. Okay, so what did I get now from my... Okay, this stupid server is sending me whole HTML pages, right? That's how we did it last time. That's how you have to do it with a form, right? You're reloading the web page with the result in it. That's how the web worked for ten years at least. Now my web page is already there, I just want the result, I don't want the whole web page. So let's make a change. And this is also a surprisingly small change. Let me not ask for search.html. Typically one calls this an API call. API just stands for application programming interface, so I'm just talking to another machine in a particular protocol. Let me call it api. So now if I do it like this I will not get a response, right? Because now if I type something it says not found. I asked for api, query equals 6. So now let's change our server so that it can handle that. And it's actually very simple. I don't want search.html, and as you will see it will become simpler now. You see it here, it now asks for api, and I have no file called api; that's how it worked so far, it's looking for a file. I recommend to listen, by the way, for those who are not listening at the moment; just a hunch that some of you are not listening. So api, so what do I want?
Actually I don't need to read the file, I don't have to replace anything. I can also get rid of this stupid percent stuff here, that's also anachronistic. Let's get rid of this, we don't need this anymore. You see, things become easier when you have more advanced technology. Let's go back to the server: I don't need to read a file, I don't have to replace anything. I just have to, okay, I have my query, the code above did that already for me, and now I evaluate the query, okay. And if it doesn't work, it's invalid. And then, okay, I think I shouldn't call it file contents anymore, because what I'm returning now is sometimes the contents of a file and sometimes something else. Let me just call it result, and here I also call it result. And then in the end I send result. I think that should work. So it's much easier now: if this is api, then I'm just computing this and sending it back. Let's just try it. What can go wrong? Let's just see what happens. Let's look at the console. 6. Okay, something did not work. Result is the empty string, and here it said something is wrong with my call. Let's go there. Not found. Okay, it didn't find it. Okay, I didn't rerun the server, that's a very good point. Thank you very much for being attentive. Let's try again. Six. Still doesn't work, not found. But the favicon now works, here it is. Found, query 6, api; so why not found? Let's go to the server code and see whether you... It's on this page, I think, the problem. I don't know what, let me get rid of that, yeah. Why is this? Why do I get a not found? Any idea? Yeah? The try block tries to read the file. Yeah, yeah, it first tries to read the file and then says not found, and so this should now simply come earlier, right? That's all. I mean, first handle the API query. So before we read it as a file, we first check: is this an API query?
If it's like this then it's like this, and otherwise, and maybe the path thing we just put up here, that was just a debug message anyway, and otherwise we, okay, if, else, if, I'm a bit surprised by this. So either it's api, then we are not actually reading a file called api, but we are just computing the query and sending it back, or we try to read the file and then we do this. Okay, let's see whether that works. Ok, what happens now? Result is 6. Ok, that works. 6 times 7. Oh, now we, ok, this works now. But the result is always invalid, because now we get a different encoding here: we get %20, which is the space. The form encoded the space as a plus; fetch now encodes it as %20. But since we understand what's going wrong, the form somehow turns this into a plus here, so let's just handle %20 in the same way. Let's try, oh, we have to rerun the server. Let's see, 6, okay, 7, 6 times, oh, the server crashed. Unexpected EOF while parsing. Okay, this is an incomplete expression. We didn't have that problem last time because we never sent a query that was incomplete. Here it just does the following: if it contains only these characters, then evaluate it, otherwise not. And now it tried to evaluate six star, it did the eval, and the eval doesn't work because six star is not a complete expression. How do we do this? I think we could do the following. We just put this in a try block, that's my recommendation. And otherwise, if this doesn't work, then... No, I think we do it as follows. I mean, there are several ways to handle this, this is just one of them. If it has this form, then we do this, otherwise we just raise an exception, and then in the except block we do this. And maybe we just don't write invalid but we just return nothing, because maybe it's just not an expression. So we only evaluate if it has this form. We don't want to run eval on any expression, so I want this if here as a protection. I don't want to eval arbitrary Python code here, that would be a security problem, we come back to that.
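The guarded evaluation described above can be sketched roughly as follows. This is a minimal sketch, not the code written in the lecture: the function name `evaluate_query` and the exact set of allowed characters are assumptions made for illustration. It decodes `%20` back to a space, only evaluates strings that consist of digits and basic arithmetic symbols, and returns the empty string when the expression is incomplete (like `6*`) or not an expression at all.

```python
import re
from urllib.parse import unquote

# Hypothetical guard: digits, whitespace and basic arithmetic symbols only.
ALLOWED = re.compile(r"[0-9+\-*/().\s]+")


def evaluate_query(raw_query):
    """Return the value of an arithmetic query as a string, or "" on failure."""
    # unquote turns "%20" back into a space; it leaves "+" alone, which is
    # what we want here because "+" is also an arithmetic operator.
    query = unquote(raw_query)
    try:
        if not ALLOWED.fullmatch(query):
            raise ValueError("not an arithmetic expression")
        # eval is only reached for strings that passed the character guard.
        return str(eval(query))
    except Exception:
        # Incomplete or invalid expressions end up here, e.g. "6*".
        return ""
```

For example, `evaluate_query("6%20*%207")` gives `"42"`, while `evaluate_query("6*")` gives the empty string instead of crashing the server. Note that a character guard like this still lets through expensive expressions such as long chains of `**`, so it is a protection against arbitrary code, not against every kind of abuse.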
And yeah, if it's not, there is an exception; if this can't be evaluated, I get an exception and I go to this part. Say it again. Thank you very much for paying attention. Okay, we run it again, we are very used to that now. Okay, that's six, seven, six times. Now nothing happens, it's just the empty string. Seven, forty-two. Great. Now one more thing. We shouldn't just send this as plain text, we should send what we send back as a JSON object. A JSON object is just a variable which you can use in JavaScript right away. You're sending something back, and you shouldn't send it back in some obscure form; it's just a number, but send it as a proper variable. So if you send an array, you send it like a JavaScript array. If you send a map, you send it like this. So let's just do this in a very simple way for now. So we just send it as follows. Oh, now I have to pay attention, I have to escape stuff here. You will see this: because this is a formatted string, I just write two curly braces and that will become one curly brace. So let me just do it like this. Let's see if it works, and if there is nothing, then I just send the empty string here. Let me do it like this and see what we get back. Now we would return it as plain text, but let's do the following: if my path was api, it's maybe not the cleanest way to do it, then I send it back as JSON. Let's just see whether it works, what happens. Okay, let me just do six. Oh yeah, now I get it as a nice, this is now a JavaScript object, it's a hash map, right? I just constructed it. There are whole libraries in Python for constructing JSON objects, but for the exercise sheet a simple map will do. Let's see what happened here. It said, yeah, I received it OK.
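As an illustration of sending the result back as JSON, here is a small sketch. It uses Python's standard `json` module instead of the hand-written formatted string from the lecture (where `{{` and `}}` in an f-string each produce a single brace); the helper names `make_json_body` and `make_http_response` are hypothetical and not taken from the lecture code.

```python
import json


def make_json_body(result):
    """Serialise the result as a JSON object with a single field "result"."""
    # json.dumps also escapes quotes and backslashes inside the result,
    # which a hand-written f'{{"result": "{result}"}}' would not do.
    return json.dumps({"result": result})


def make_http_response(body, content_type="application/json"):
    """Wrap a body in a minimal HTTP response that asks to close the connection."""
    payload = body.encode("utf-8")
    header = (
        "HTTP/1.1 200 OK\r\n"
        f"Content-Type: {content_type}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
    return header.encode("ascii") + payload
```

With the content type set to `application/json`, the browser knows what it received and the JavaScript side can read the field `result` from the parsed object.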
And actually, yes, application JSON: when I look at it here it will even display it in a nice way, right. It knows it's JSON. That's how it looks in raw form, but since it's JSON it will even show it nicely; if it has several fields, it shows one field below the other. Ok, that's great. Now we can go back to the JavaScript. Now my response is a JavaScript object and it has a member result. Actually maybe let me do it like this. So let me just get this. I should check whether it's really there, but let's just do it like this. And now, okay, so result is the result. And now let's write it into the result element. But that's easy now. We just set innerHTML equal to result. And let's also write the query, so that we see it again, just for checking that this was our query. Maybe we shouldn't call it value here but query: input is query, query and query. Let's just check. So now I type six. Okay, undefined. What went wrong? What did I do wrong? Maybe I was a little fast. It says, andivia. It's because you look for the result in the response, even though it sometimes isn't defined. Is result sometimes undefined? It isn't in this case, right? That's a good point, but actually I'm a bit... maybe it's some result, maybe like this. Maybe it was not as equivalent as I thought. Okay, now I'm a bit surprised myself, but let me just output my response object again. It looks like this, it's a JSON object, now I would expect that if I, OK, this is unexpected. Oh, yeah, yeah, yeah, yeah, thank you. We haven't talked about that. Yes, very good point. It should say JSON here. We haven't talked about this at all yet, but yeah, that's a good point. Let's just see whether it works now. So if something isn't quite right or doesn't work, you just change a few letters and then it works. That's what you should take away from this. Ah, it works. Nice, now we have, so yeah, not bad, right?
Okay, we have a fully dynamic page, so that was the main part of the lecture. Let me explain a few more things and then we go into our break. And thank you Natalie for... And it's because of this part which I have skipped so far and which I will talk about now. But let me first make the code a little bit... Okay, now it works like this. Result, result. Result, let me just check again if everything is fine now, 6. It makes a new line here, that's because it used to be the HTML page. Let me do it like this. 6. And actually no result yet. I'm sorry, I don't need that anymore. That was just for showing you something. Okay, yeah. Nice. So it worked now. Let's see what I haven't explained so far. This I explained. Fetch. Let's look at the fetch again and then let me explain a little more. There's a lot, lot, lot to understand about this fetch stuff. I will not go into all the detail. I will just explain enough that maybe you are encouraged and inspired to research more for yourself. Of course you can just use it and it will work, but as you can see, even for these few letters there is a lot you need to understand. So what did we do here: fetch, we said please send this to our backend. Note that it's a relative path, so what's actually sent is tura888 slash api and so on. Relative paths are important, you don't have to write the full URL here. And then it somehow said, okay, when you get your reply, then somehow interpret it as JSON. So what's this async, await, fetch, then, and what's this funny thing doing here? That's a lot of strange stuff. By tremendous coincidence there's something on the slides about this. So let's briefly explain those. What are asynchronous functions? Now it may come as a surprise, but JavaScript is not multi-threaded, it's single-threaded. What does single-threaded mean? It means that at every given point in time you execute one line of JavaScript code.
By design, what you have in multi-threading can never happen in JavaScript, never ever: that there's one part which executes this piece of code and another part which executes that piece of code at the same time. Sometimes this can speed up things. Maybe you have two different functions which are independent. Why not execute them in parallel, right? And it looks like your browser would do that, right? Here an image loads, there you type something, here something else changes. But that gives a whole lot of problems. If you execute this and this at the same time, a lot of things can go wrong. JavaScript doesn't do this, it's single-threaded, full stop. But it supports asynchronous functions. Now this is strange: how can it be single-threaded and still have things happening at the same time? Well, this is how it works. There are functions like fetch or setTimeout; setTimeout is this command: please execute this in three seconds from now. Very useful command, maybe you want to use it. Functions like that are async, and if you write such functions yourself, I have a slide about that, you can declare them as asynchronous functions. What do these functions do? They spend part of their time outside of JavaScript. So fetch is actually not written in JavaScript; you have to imagine it like this, that there is a layer on top of JavaScript. It's called the Web APIs, it doesn't really matter what it's called. When fetch is called, it delegates the action to the browser. Think of that layer as the browser. And the browser now does the talking to the server and getting back the result. And while it does this, my JavaScript engine can continue elsewhere. Maybe now it will just continue with another line of code. And here it will just wait. That's what the await basically says. It will not continue here. It will just do something else: execute another line of JavaScript code somewhere else.
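The lecture code here is JavaScript, but the same idea of being single-threaded and still asynchronous can be tried out in Python with `asyncio`, which also runs everything on one thread and switches between functions at `await` points. The following is only an analogy under that assumption, not the lecture's code; `fake_fetch` is a made-up stand-in for a slow request.

```python
import asyncio

events = []  # records the order in which things happen


async def fake_fetch(label, delay):
    """Stand-in for a slow request: the waiting happens outside our own code."""
    events.append(f"{label}: request sent")
    await asyncio.sleep(delay)  # control goes back to the event loop here
    events.append(f"{label}: response received")
    return label


async def main():
    # Both calls run on one thread; they interleave at the await points.
    return await asyncio.gather(fake_fetch("slow", 0.2), fake_fetch("fast", 0.05))


results = asyncio.run(main())
print(events)
```

Both requests are sent before either response arrives, and the fast one comes back first, although at no point do two lines of our own code run at the same time.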
And only when this layer on top is done and has a result, then it tells the JavaScript engine: if you have time, you can now continue the code here. So that's how single-threaded and asynchronous go together, right? You have something asynchronous, now something else happens on a different layer. My JavaScript engine can execute other code. When this is done, it says, look here, now you can continue there, and it continues there. So for example, fetch waits. That's what I just explained. So what do these keywords mean? All three of these have something to do with asynchronous execution. When a function does something asynchronous, you have to declare it as async, otherwise compiler error. This whole function here, which does something when I type a letter, contains fetch. Fetch is asynchronous, so this function has to be asynchronous. Await just means: this is an asynchronous function, I want the result, please wait for the result to be ready. While this is doing something elsewhere, you can do other stuff in the JavaScript. That's await. And what is then? Well, this function here first tries to talk to my server, to see if it's even there. And when this is done, then this then function is called. So this means: wait for this result, and when the result is there, then call this function. This is actually a function, I have another slide about this. And what it does is take the response and interpret it as JSON. I will explain this in a second. So that's what the then does. You need all this for the exercise sheet. If you're really interested there is more, but I will not explain it at that level. You have a question? Yes please. Okay, so the question is. Yeah. Yes. Yes. What you could do instead of the then, let me very quickly do it maybe, I could do the following.
First I get a fetch and now I do, I don't know what this is called, response is x, x, I think I could do this instead of these two lines and it would be equivalent. So the fetch first gives me something, oh, and this is also an await thing, this is also asynchronous. And for some reason this is a two-step communication: it first asks for the header and then it asks for the JSON. But this is now going more into detail. I could explain a lot more here, but I think it would take too much time. So actually let me just leave it here in the code. I'm pretty sure that's equivalent. But let me explain one thing. I think that's useful to understand. What does fetch actually return, the thing we await? Well, what it returns is a so-called promise. That's a very useful concept and it's very easy to understand. A promise is an object that has just two parts. One is the current status. There is something working in the background. I immediately get back an object, and the object contains information about whatever is happening there: is it still running, that's called pending; is it successfully completed, that's called fulfilled; or did it fail, that's called rejected. So these are the three states of a promise. And then it also has the result once it's done. So what actually happens here: if you call this function, you immediately get back an object, and what this await basically does is, ok, let me wait until either this is successful or it failed. If it's successful, just give response the result, if not, an error message or whatever. There is a lot more I could say about promises, you don't need it for the exercise sheet. You can explicitly work with promise objects by creating them, and if you create a promise you pass a function that gets two functions, resolve and reject, which say: what should I do when this is successful and what should I do when it isn't?
It's much easier to just work with async, await and then. And one more thing, because you should understand it when the code has these funny symbols here; this is actually very easy to understand and very convenient. In JavaScript it's very typical, and let me show that once more: you add an event listener, please do something when the user types something. And what you pass now as an argument is a function, an unnamed function. So you have these anonymous functions a lot in JavaScript, that's just very natural. So for example here, when the fetch returns from the first round, you get a response, and please interpret this response as JSON. So this is really a function, a function on the response. That's actually maybe a bit confusing, because it's the response of the first call to fetch, and then do something with the response. So I have a function with an argument, and then return something done with that argument. And this here is simply a shortcut for this, because you have it so frequently: function, argument, return, something done with that argument. That's it. Maybe I underline it to make it very clear. And it's on the slide in case you are... So this is nothing else but a shortcut for this: a function, an unnamed function. And it also works with multiple arguments here. So this is just an unnamed function with two arguments; it's just so frequent that you have these simple function bodies. Okay, but let's not go into any details, this we already had. We have a working application here, we now go into the break. Do you have any questions right now? Maybe you have some in the break also, yes please. So let me just... JavaScript now waits; no, it doesn't wait, it is free to execute other code now. It's not blocking. But that would be a different script then, so this script is not... No, no, that's not... it's a very good question. Let me explain it again. I didn't explain it properly, I think.
Let's assume you have other code in exactly this file, which says: if this button is clicked, do something. So here it's waiting, maybe it takes one hour before this comes back, it could happen, right? Now the JavaScript is just free to do other stuff. So if I now click on the button and the action there says, if you click on the button, make everything red, it will just do it. So this does not block my whole JavaScript. That's exactly an important point. But it will only do one thing at a time. So here it says fetch, wait for the result, and for one hour this is now handled somewhere else. In the meantime this very JavaScript can do anything else. And at some point this outside thing says: my response is here, as soon as you can, please continue here. But also not immediately. Maybe the JavaScript first has to finish some other stuff. But at least the function continues after that. Exactly. That's very important. The function never continues while it is blocked here, but other stuff in the JavaScript can be executed. Right, there could be ten event listeners which do something. It's a very good question, it's actually tricky to understand what this combination means in practice. There's another question in the chat: does it interpret the response as JSON or convert it to JSON? That's a very good question and I don't have a very good short answer, because it's tricky. You might think that this here gets the result and this just converts it to text or JSON, whatever it does, but it's not like this. What the first thing does is just a prefetch, but the details I can't explain now and I also don't fully understand them. You do a first probing of the server, and this is actually getting the response here. The first part is not yet doing the whole communication. For some reason, which I haven't fully understood myself yet, this is a two-step communication from the side of the JavaScript. So the actual action happens here.
This is actually reading the response from the server. At least I think it does. Any other questions for now? Okay, we have a five minute break and then we continue, but this was the heavy stuff now. Thank you. So, any new questions that came up in the break before I continue? Yes, please. I think I heard that you said that the function pauses, but do you mean that the entire function pauses, so no other code of that function can be run, or just the instance of the function? Oh, it's just the instance of the function, that's a very good point. So it's this one execution of the function that will not go on. But if I type another character, that new execution could go on, and will go on. So yes, there are a lot of tricky things. The thing about asynchronicity is that it's a very simple concept, but when it happens in real life, a lot of tricky things can happen, because now things are happening at the same time. Okay, multithreading. That's surprisingly simple and I just wanted to show it to you. Let's do the following, let's just mimic a slow computation, and I think that's easy enough. Right now my function is doing something very simple, but I can just, yeah, let me just sleep here for some time, and let me sleep for a random delay of up to two seconds, and I think I need to import time and random. So let me just sleep for a random period between zero and two seconds and let's see what happens. And you see what's happening. Yeah, this is no fun. And if you don't pay attention your exercise sheet will be like that. I think it's still doing stuff. Did I sleep for too long? Yeah, that's no fun. I stopped typing like a minute ago and the requests are still coming back because they take so long. And that happens pretty quickly. You type quickly, and for each request something happens.
Your edit distance computation takes a few hundred milliseconds per request and that's just no fun, right? But actually on your server side, any machine you will have, so let's just look at the machine here, it has 24 cores, so this machine should in principle be able to process things at the same time. Let's just see how we do this. It's actually surprisingly simple, but again you have to pay attention to a few things. The first thing is we should have a function which does the work. Here I somehow have handling the request in a function, but sending back the result I still have in my loop. Let me also put that in a function. So let me just do it very simply. Handle request and send result, and for that I think I also need the connection object. You will see it in a second. Okay, and what will this do? I think it will do this part here. Let me just copy it over there. That should work. Okay, this should now be like this. Ok, what I'm doing now should be functionally equivalent, so here I should now just handle the request and send the result. So everything I do here is now in self handle request and send. And I have to give it the connection, because that's needed for sending the result. And I have to give it the request. Okay, let's see, this should have exactly the same functionality as before. No, I didn't want htop. Let's just see whether it does. Oh yeah, it's now a little bit slow, but yeah, it works. Maybe I made it too slow. Okay, it works, but with a delay. And now all I have to do, I mean the motivation is obvious, we already saw it. Yeah, it's very sluggish if you do one after the other. Actually, Python has several packages to do threading. That's actually the...
but I will not talk about this. Threading is the native Python package for doing threading. But it doesn't really do multi-threading, confusingly. So multiprocessing is the way to go. I could explain more about this but I won't. So what I have to do is actually surprisingly simple. But it has a snag. Let me just comment this out. So that you send the result in a separate thread. Oh my. In a thread on its own. I think I didn't set my text width to something smaller. Yeah, in a thread on its own. Now what I want to say to the operating system, and now in the server, not in the JavaScript, this is now real multi-threading: I just say multiprocessing.Process and now I say which function, well, it's just the function here, it's called self handle, and you do it like this: you say which function it is and then you say what the arguments are, and you have to pass them as a tuple or as a list, I just do it like this: the connection and the request. So this is now creating a process, and I can immediately start it like this. This just says: like the line below, but do it in a separate thread. And now there is one very important thing and I do it right away, I don't show you what happens when you don't do it. You really have to be careful now to close the connection. We had that in the last lecture already, I mentioned it. In principle, when you open a connection you can reuse it for a lot of back and forth, but we are not doing it here, right? We are just opening a connection, sending a request, getting back the result, closing the connection. It's really important now that things are happening at the same time: we want to tell the browser, you are getting the result and now it's done. For the next request, a new connection. I mean, it could be made more efficient, but we keep it simple here. I have a slide on that, connection close. Let's just see whether it works now.
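The step just described, handing each request to its own process and closing the connection afterwards, might look roughly like this. It is a sketch under assumptions: the function names are invented for illustration, the "work" is a placeholder echo, and it explicitly uses the `fork` start method, which exists on Unix-like systems but not on Windows.

```python
import multiprocessing
import socket

# Assumption: a Unix-like system where the "fork" start method is available.
mp = multiprocessing.get_context("fork")


def handle_request_and_send(connection, request):
    """Compute the reply for one request and send it over its own connection."""
    reply = f"echo: {request}"  # placeholder for the real computation
    connection.sendall(reply.encode("utf-8"))
    connection.close()  # one request per connection, then we are done


def serve_in_own_process(connection, request):
    """Start a separate process for this request and return immediately."""
    process = mp.Process(target=handle_request_and_send,
                         args=(connection, request))  # arguments as a tuple
    process.start()
    # The child process has its own copy of the socket; the parent closes
    # its copy so that the client really sees the end of the reply.
    connection.close()
    return process
```

The main loop can call `serve_in_own_process` for every accepted connection and go straight back to accepting the next one, so slow requests no longer hold up fast ones.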
I mean, a very small change, I just replaced this call with this call. Let's just see whether it works. Okay, I think it's 6 times 7. You see, now things are happening at the same time, right? Now I've launched a lot of requests, and this is now happening in the layer on top of JavaScript. No, it's not happening in the JavaScript at all, I'm sorry, it's happening in the Python server. It's now using all 24 cores. Now I have another problem. I'm typing, now a lot of things are happening. Okay. Why don't I get six times seven, but six times 77, what happened? Yeah, they finished in a different order. I deliberately made it so that not all my requests take the same time. So now what can happen is: I send something earlier and then I send a later request, but that one for some reason is very fast, and the request from earlier took a very long time and now it comes back and overwrites my later result. This can happen and this will happen, and of course you don't want it, but that's something I leave to you to do for the exercise. It's actually very simple, you just give every request a number, right? You can just do it here, you just have a count, you pass the count here as an argument, and then you just pass it to your web application, and then it sees: oh, here's something from earlier, I won't use this to overwrite something from later. So it's actually relatively easy to fix and you want to fix it. Now if I'm slow, then it works, right? It only happens if I do something in fast succession. Okay, yeah, that's what I wrote here about closing the connection. If I don't do this, the server will not work, because if you don't tell it to close, it will have two things open at the same time, so it's very important. It will not work, we tried it.
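The numbering idea for out-of-order replies can be sketched independently of the browser. In the lecture this check would live in the web application; the Python class below only illustrates the bookkeeping, and the name `ResultDisplay` and its methods are hypothetical.

```python
class ResultDisplay:
    """Keeps only the result that belongs to the latest request seen so far."""

    def __init__(self):
        self.next_id = 0       # number given to the next outgoing request
        self.shown_id = -1     # number of the request whose result is shown
        self.shown_result = ""

    def new_request_id(self):
        """Hand out consecutive request numbers: 0, 1, 2, ..."""
        request_id = self.next_id
        self.next_id += 1
        return request_id

    def receive(self, request_id, result):
        """Show the result unless a later request has already been shown."""
        if request_id < self.shown_id:
            return False  # stale reply from an earlier request: ignore it
        self.shown_id = request_id
        self.shown_result = result
        return True
```

If request 1 comes back before request 0, the late reply for request 0 is simply dropped instead of overwriting the newer result.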
Here's another important thing, I don't have time to show it in the lecture, there will be a forum post about this, it's important for your exercise sheet. Think about this code. Here it's very simple, I'm just computing the result of a mathematical expression. For the exercise sheet you want to do something with your q-gram index. Your q-gram index had a big pre-computation, it's actually a pretty big data structure, several megabytes. And now for every process you start here, one of your arguments will be the q-gram index. And if you don't pay attention, for every thread it will copy these, I don't know how many, megabytes to the thread. That will work, but it will just be terribly inefficient. I don't know, maybe it takes a second or so, and you don't want that. And what multiprocessing can do: you can say, here's a global object, your q-gram index, which will only be read and not changed. All my threads should be able to use it, and you can just set it up so that there is this global object, and then the global object will not be copied but every thread can just access it. We will write a forum post about how you can do that, in case you encounter that problem; I think you will. It's also just two lines of code, but it's important. So you have heard it now. So now on to some more fun stuff. The rest is now relatively simple. Oh, something we haven't seen yet: this server is serving all kinds of files. So here's our, yeah, we haven't paid attention to that. So now I read /etc/passwd from our machine, that's not good, right? That should not happen. I mean, on this machine it doesn't really contain any password, so it doesn't matter, but yeah, somehow we didn't think of that when we wrote our server.
So I just specified an absolute path and it was so nice as to give it to me here. You see it here. Okay, let me read this file and return it. How do you solve it? It's easy, the problem is more to be aware of it. For example you say: oh, this path is absolute, it starts with a slash, don't deliver it. The safest way is to have a whitelist of files, like only these files, only the search HTML, the JS and so on are served, and everything else gets a forbidden. It's easy to solve, but it's easy to forget. Here's another one, that's fun, let's look at it. And let's maybe look at it over here, I think that's easier. Let's make the page a little bit larger. Let's make the size 50... Now it's a little bit larger. Okay, I can type here, this works, six times seven, the out-of-order problem. Okay, let me maybe reduce the time again in my server, that was just for demonstration purposes. Let me make it 0.2. So then it's not so bad with waiting and everything. Six times seven. Okay, now it's more responsive. And it's showing me the query again, that's nice. Let me just do the following. I also want a grinning thing with horns. I don't want a video. It's actually not called devil, it's called smiling face with horns. There's the HTML entity, let's just copy that for fun. Okay. Click me. Now if I click on it, let's just see what I get. It's devil.com. It's funny, I already had this joke three years ago. I thought it was a coincidence, but devil.com actually forwards to myvival.com. So they probably paid a lot of money for this, I don't know, they found it very important. Okay, so this shouldn't happen, right? There is now a link, you click on it, and unexpected things happen.
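Going back to the file-serving problem: the whitelist idea fits in a few lines. This is a minimal sketch; the file names in ALLOWED_FILES and the function name resolve_request_path are assumptions for illustration, not the names used in the lecture server.

```python
# Hypothetical whitelist; the actual file names depend on your web app.
ALLOWED_FILES = {"search.html", "search.js", "search.css"}


def resolve_request_path(requested):
    """Return the file name to serve, or None if the request must be refused."""
    name = requested.lstrip("/")   # "/search.html" -> "search.html"
    if name in ALLOWED_FILES:
        return name
    return None                    # caller answers with 403 Forbidden
```

Anything not listed is refused, which covers absolute paths like /etc/passwd as well as relative tricks with "..".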
I hope it's clear what happened. I can do inspect here again, and I see that inside this span, where it's just supposed to show my query like six times seven, it has now inserted HTML. It's the document object model, and what I typed is interpreted as HTML, right? So this is now a link with whatever. Okay, I think I have another one here. Let's see. Can you see it? So now it's not a link, I didn't click on anything, and now it's snowing here. How did I do that? That's something for the exercise. The page is now dysfunctional, it doesn't work anymore, it will snow and all I can do is remove some of the snow. And there's this funny image source, I don't know, you can figure it out for the exercise sheet. And now I didn't have to click a link, right, and this is something that could happen invisibly, who knows. So this was code injection; the slides are just for reference. I will not talk about the same-origin policy, it's just on the slides for reference. For our application, everything is happening on the same machine, right? The web page is on Tura, and this is also talking to something on Tura. So the server is on Tura and the API calls are on Tura. But in principle it could be that the serving of the web pages is on a different machine than my API calls, right? It could be that the pages are served on this machine but, in your case, the Q-gram index queries are processed by another machine. And now you have two different machines talking to each other, and by default the browser will not allow it. We don't have that problem here, and the restriction is of course very important. So here's one example: you could get an email with a link which looks like your favorite bank, but just a little different, so Postbunk for example. You don't think anything about it, you click on it, I don't know whether we also get to mybible.com.
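For the code injection shown with the span, the usual defence is to escape text before it ends up in HTML. Here is a small sketch using html.escape from the Python standard library, assuming the escaping is done on the server side. The function name render_query_span and the span id are made up for illustration.

```python
import html


def render_query_span(query):
    """Put the query into a span as plain text, never as markup."""
    return '<span id="query">' + html.escape(query) + "</span>"


injected = '<a href="http://devil.com">Click me</a>'
rendered = render_query_span(injected)
```

After escaping, the browser shows the angle brackets as characters instead of building a link from them.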
No, okay. Actually, interestingly, they bought it, right? So they are taking care now that they are buying these domains, and I think it's actually fraud if you just buy a domain which is very similar. But what you could do is, on this site, you could just write your own page, which now belongs to you, which looks like the Postbank page or whatever, and now people log in and enter their credentials or whatever they do, and now you send it to the actual page, which has a slightly different name, and then the browser will say: no, this is not Postbunk, this is Postbank, I will not communicate. And now you have a whole protocol which says when this is okay and when it is not okay. We will not talk about this, I will not show it here. The whole concept is called cross-origin resource sharing, CORS. And what you can do, just very briefly: I could add a header here which says this result you can use on any other site if you want, or this result you can only use on Tura, or you can only use it there. So I can send a header here which controls which website is allowed to do something here. And of course a bank, when it sends back confidential information, will have a header here which is very restrictive and says only this particular page can use it. If you are interested you can look at it on the slides. This is just to tell you that something like this exists. The only exception is actually JavaScript. A JavaScript library can still be loaded from anywhere, for example the jQuery library, which we used in the course until a few years ago. This you can always do: you can include a script from somewhere else and it will always load, because otherwise like 90% of the web pages would break. If you look at an arbitrary web page, it will have a lot of script tags at the beginning: include this library, include that library. Okay, here's the last part, which is not very hard. Ten minutes, I think, is enough. A bit of overtime, but I hope that's okay.
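The header mentioned here is Access-Control-Allow-Origin. As a rough sketch of where it goes in a hand-written server response (the function name build_response is an assumption, and a real server would set more headers than this):

```python
def build_response(body, allowed_origin=None):
    """Build a minimal HTTP/1.1 response as bytes, optionally with a CORS header."""
    payload = body.encode("utf-8")
    headers = [
        "HTTP/1.1 200 OK",
        "Content-Type: text/plain; charset=utf-8",
        "Content-Length: " + str(len(payload)),
    ]
    if allowed_origin is not None:
        # "*" allows any site; a concrete origin allows only that one site.
        headers.append("Access-Control-Allow-Origin: " + allowed_origin)
    return ("\r\n".join(headers) + "\r\n\r\n").encode("ascii") + payload
```

Passing "*" corresponds to "any other site may use this result", while a restrictive server would pass exactly one origin or leave the header out.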
We haven't talked too much about funny characters. We have seen them; the question is what happens if we type one. It actually works here. There already was something about the encoding, I'm sorry, here we had to specify UTF-8. What is this about? It's about the difference between bytes and strings, right? If you have bytes, it's just bytes, whatever they mean. Oh, actually here you see it, right? I typed ä and now I get %C3%A4. So something is happening here. That's what the last part is about. We need a standard for how to represent characters, especially of funny alphabets, right? For a very long time it was ASCII, but that's of course very limited. You have one byte per symbol. We can look at the ASCII table here. For every number it gives you a symbol... that's a very old ASCII table. I don't think we have to look at the ASCII table now. There are more than 256 symbols in the world; Chinese alone has tens of thousands of characters. How was it done before Unicode? You had the characters which everybody uses (which is also not true, it depends on the culture), and they were in the lower part of the character set, so half of it. And then for the upper part you had many different variants. So the lower 128 codes were always the same, and for the upper 128 codes you would have to say: okay, now I want this character set, now I want that character set. A very important one for this part of the world is ISO 8859-1. This is like 2 to the 16 for computer scientists: you should know ISO 8859-1 by heart. It's just the typical characters from the European languages, at least this part of the world. So the ä with umlaut and the Danish ø and stuff like that. You have it in this character set. So here are some of them.
The problem is, if you do it this way and you now need maybe Chinese characters or Cyrillic characters or whatever, then you have to switch all the time and you can't really mix them. So the Unicode solution is: you just have a huge set of numbers and everything has a number, everything. The common things have small numbers, and this should remind you of the encoding lecture, and the weirder it gets, the larger the number. So for example the Greek alpha here has 945, the Euro symbol, which came into existence like 20 years ago, has the number 8364, and also all kinds of icons have numbers. So a smiling face has the Unicode number 128,512. And the question is now: how do you encode this in bytes? The simplest way would be: well, these can be large numbers, always take four bytes. But always four bytes, that's very wasteful, especially since most characters only need one byte. And so there are different standards for how to do this, and the most common one is UTF-8, which you have heard a lot about now, and I will briefly explain it. It's a form of variable-byte encoding, we have seen that in the encoding lecture. It always uses a whole number of bytes, so it's not messing around with bits, but it's either one byte, two bytes, three bytes or four bytes. And typically one byte for most of the characters, so if you just have ABC and parentheses and commas and things like that, just one byte, so a very efficient format. And this is life-changing knowledge: it's something which you don't usually know, but it's very good to know, and it's actually pretty easy how it works. So if you have one of the frequent characters, which has one of these codes between 0 and 127, then this is a number with 7 bits, so this is just 7 bits here.
And the UTF-8 code is just: you start with a zero, and now the concept of prefix-freeness should come to mind, and then you have the 7 bits. If you have a code point from 128 to 2047, and you will see in a second why, this needs 11 bits. So you can't use one byte for it, you need two bytes, and in UTF-8 you just do it like this. The first byte starts with 110, prefix-free, so if you are coming from the left and you see this, ah, 110, this is a UTF-8 sequence with two bytes. So you just count the number of ones until the first zero and then you know how long this sequence will be. Here you have none, so it's just a one-byte sequence. When the byte starts with 110, then you know there will be another byte, and you know by the standard that it will start with 10. That's just how UTF-8 works, and in the remaining bits you have room for exactly 11 bits. This is 16 minus 5: 3 for this and 2 for this. And now we have a larger code point, this is now 16 bits, it goes up to 65535, 2 to the 16 minus 1. Now you need 3 bytes, and the first byte signals that this will be a 3-byte UTF-8 code, 1110, and these are just the continuation bytes, they always start with 10, and in the rest you just write the 16 bits. So it's very simple and very well thought out. With 4 bytes you can do up to 21 bits, and this will be the UTF-8 code. In principle this could go on with 5 and 6 bytes, but the powers that be decided some time ago that 21 bits is enough for everything, you don't really need 5 or 6 bytes. And you should understand some clever things about this. I mean, this is one encoding, you could have done it in a lot of ways, but this one is very clever for a number of reasons, and I have a slide about this. One reason is that the typical characters, not only are they one byte, but it's exactly the ASCII code, right? So if the space is 32 in a byte in ASCII, it will also be 32 in a byte here, because the leading bit here is zero. So you take ASCII text from decades ago, and it will also be valid UTF-8.
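The four cases described above can be written out directly with bit operations. This is a sketch for illustration: the function name utf8_encode is made up, and it does no validation (for example, it does not reject the surrogate range or numbers above 21 bits). In practice you would just call str.encode("utf-8").

```python
def utf8_encode(code_point):
    """Encode one Unicode code point as UTF-8 bytes, following the four cases."""
    if code_point < 0x80:          # up to 7 bits:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:         # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point < 0x10000:       # up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    # up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (code_point >> 18),
                  0b10000000 | ((code_point >> 12) & 0b111111),
                  0b10000000 | ((code_point >> 6) & 0b111111),
                  0b10000000 | (code_point & 0b111111)])


alpha_bytes = utf8_encode(945)      # Greek alpha, two bytes
euro_bytes = utf8_encode(8364)      # Euro sign, three bytes
smile_bytes = utf8_encode(128512)   # smiling face, four bytes
```

Each branch fills the leading-ones prefix first and then distributes the remaining bits over continuation bytes that start with 10.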
Now that's very useful, that's the one clever idea. What else is clever? Well, it's prefix-free, we already had that. You come from the left and you see immediately: oh, a three-byte sequence starts here. Just count the number of ones until the first zero. What's also great: you go into the middle of a string and just look at that byte here and you see, oh, it starts with one zero, so you know it's the continuation byte of some UTF-8 sequence. So you just go to the left until you find something which looks different, or if you want to skip this one, you go to the right until you find something. You can jump right into the middle, and just by going a bit to the left or to the right you can understand what's happening there. That's what I said. Here's another nice thing. Take this standard which was used for characters like the frequent Western European ones, of course with a strong cultural bias here. Let's look at, for example, the German umlaut ä. In ISO 8859-1 it has this code here, 11100100. This was in the upper range, beyond ASCII. And in Unicode, this is exactly the code point. So you just take the two-byte UTF-8 pattern, which is 110 and then 10, and you take these eight bits from the old ISO code and pack them in there, and that's the UTF-8 code. It's just the same number, you don't need to translate. So if you want to write a little program that translates ISO 8859-1 to UTF-8, you just take the bits and shuffle them around a bit and then you have the UTF-8 code. You don't need a lookup table like: ä is now here and this character is now there. And of course, like in the encoding lecture, the rare stuff needs a lot of bytes and the frequent stuff needs few bytes. Since we are over time already, I just want to show you one more thing. We have time at the beginning of the next lecture, I think I will finish this then. You don't have to understand it right now. I will tell you a little more about it in the next lecture.
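Two of the properties above are easy to try out: the table-free translation from ISO 8859-1, and finding the start of a character after jumping into the middle of a byte string. The function names latin1_to_utf8 and char_start are made up for this sketch.

```python
def latin1_to_utf8(data):
    """Translate ISO 8859-1 bytes to UTF-8 by moving bits, without a lookup table."""
    out = bytearray()
    for byte in data:
        if byte < 0x80:
            out.append(byte)                               # ASCII stays as it is
        else:
            out.append(0b11000000 | (byte >> 6))           # 110 + top 2 bits
            out.append(0b10000000 | (byte & 0b111111))     # 10 + low 6 bits
    return bytes(out)


def char_start(data, i):
    """Walk left from position i until the byte is not a continuation byte (10xxxxxx)."""
    while i > 0 and (data[i] & 0b11000000) == 0b10000000:
        i -= 1
    return i


a_umlaut = latin1_to_utf8(b"\xe4")   # ä: 11100100 becomes C3 A4
```

For ä the eight bits 11100100 are split into 11 and 100100, which gives 11000011 10100100, that is C3 A4.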
This is important, you need this for the exercise, we have seen it here. And then we are done, maybe you have a question. I sent the ä across the ether here, and now I don't get an ä here, I just get a sequence of bytes, and this is URL-encoded. These are characters which in principle can be included in a URL. And the character set which can be included in a URL, we already saw that in the last lecture, is very limited. It's basically only the Latin characters, lower and upper case, the digits and a few other characters. In particular, not even the space is allowed in a URL. Which means you now need an encoding of this, and the encoding is just as follows. This is why the meta tag was important, this charset UTF-8, because it says, first of all, how do I turn this into bytes, and this is now a two-byte sequence in UTF-8, namely C3 A4. And now I have to turn that into ASCII again, into normal characters, and you just do that with this percent-encoding. And now you see why the encoding is important, because depending on how I encoded this into bytes, the percent-encoding will look different. That's the last thing we can do together. If I write ISO-8859-1 here, let's just try that. And now I type in ä. Okay, this didn't work, I'm sorry. So there's something else going on apparently; I was hoping that I would get a different sequence here. I'm not 100% sure how it decides in which encoding to send it here. Apparently this was not relevant for how it sends this across. But it doesn't matter, there's another screw which you can turn and which will decide it. So if this is converted to bytes like this, then it will be sent like this in percent-encoding. If it's encoded like that, it will be sent like that. And this is of course something you have to specify, so implicitly it was UTF-8 here.
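The two steps (characters to bytes, then bytes to percent-encoding) can be reproduced with urllib.parse.quote from the Python standard library, which takes the byte encoding as a parameter. The wrapper name percent_encode is made up for this sketch; it shows the dependence on the encoding that the browser experiment was meant to demonstrate.

```python
from urllib.parse import quote


def percent_encode(text, encoding):
    """First turn text into bytes with the given encoding, then percent-encode them."""
    return quote(text, safe="", encoding=encoding)


utf8_form = percent_encode("\u00e4", "utf-8")          # ä via UTF-8 bytes C3 A4
latin1_form = percent_encode("\u00e4", "iso-8859-1")   # ä via the single byte E4
space_form = percent_encode(" ", "utf-8")              # the space is not allowed either
```

The same character comes out as %C3%A4 with UTF-8 and as %E4 with ISO 8859-1, so the receiving side has to know which encoding was used.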
And on the other side, and this you have to do for the exercise sheet, of course you have to decode this now, right? You get %C3%A4, and if you just send this to your Q-gram index it will not work. You first have to turn this back into the character again, which was an ä. It's actually simple. For the exercise sheet you are even allowed to use the built-in function, but it's also very simple to do it yourself. And the rest I will show you in the next lecture. I just wanted to mention this at least. So the exercise sheet, let me just show it to you and then we are done. And maybe you have a question before we... Okay, there it is. Very important: we updated the Wikidata entities file, please download it. And we made sure that you can see it in the first line, it's different. This is a new version of the data and we made some fixes. The first number is the score of the United States; with a small variation it is now 398 and not 400. So yeah, you see it here. It's now 398, unlike in the old version. So if you somehow copied it two weeks ago and it's still on your machine, just take the new file and throw the old one away. It's important for a number of reasons. And the exercise sheet is just to make your web page dynamic. It's all written there, as usual in quite some detail. And some fun stuff. Any questions about this? It will be a very nice sheet. Make an effort, because we will select the nicest web apps and show them at the beginning of the next lecture. Yes, please. I have a question. So you showed how we do encoding and decoding in Python. Do you also show the same in JavaScript, because we also get stuff sent back that is probably encoded?
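For the decoding step on the server, here is a sketch of doing it by hand next to the built-in urllib.parse.unquote. The function name url_decode is made up, UTF-8 is assumed as the byte encoding, and malformed input such as a percent sign followed by non-hex characters is not handled gracefully (it raises ValueError).

```python
from urllib.parse import unquote


def url_decode(text, encoding="utf-8"):
    """Undo percent-encoding: collect the raw bytes, then decode them as text."""
    raw = bytearray()
    i = 0
    while i < len(text):
        if text[i] == "%" and i + 3 <= len(text):
            raw.append(int(text[i + 1:i + 3], 16))   # two hex digits -> one byte
            i += 3
        else:
            raw.extend(text[i].encode(encoding))     # ordinary character
            i += 1
    return raw.decode(encoding)


by_hand = url_decode("M%C3%BCnchen%20Hbf")
built_in = unquote("M%C3%BCnchen%20Hbf")
```

Note that the bytes are collected first and decoded only at the end, because one character such as ü is spread over two percent-encoded bytes.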
What we send, yes, what we send back from the server is bytes, so we don't have that problem, right? Okay, let me answer that important question, but it's a quick answer. So in this direction, two things are happening. First I have to turn this into a sequence of bytes, and for that I need to know whether it's UTF-8 or whatever. And then, because this could also be sent via a URL, which is how we did it in the first lecture, it has to be URL-encoded, which is why I get it in this percent-encoding here. So there are two things happening here, and I have to handle both on the side of the server. But when I send it back, I always get a sequence of bytes. And then I only need the one encoding. Then I just need to know, okay, this is UTF-8, and then JavaScript handles it. At least I'm pretty sure that that is so. Otherwise this wouldn't work here, right? Because I don't see C3 A4 or something here, so it seems to have worked. But it's a very good question, and it's tricky to understand what kind of encoding it is, which of the two encodings, and I have some slides on this, but more about this in the next lecture, when you have gained a bit more experience with that. Yes you can, yes, there is a URL decode and you can use it. Just to make the exercise sheet a little bit easier, we allowed it. Is there any other question for now? So I think it's a very nice sheet. Have fun and see you next week. Bye bye.
Welcome everybody to lecture 8, Information Retrieval in the winter semester 2022/23. We will talk about the vector space model today, so change of topic. It's a very short lecture, but before that we will talk a little bit more about the last lecture, and in particular about your experiences with the last exercise sheet, that was Web App Part 2. I will show a demo of some of your web apps. I will talk a little bit more about encoding because we didn't have much time for it.
In the last lecture I just told you what you needed to do the exercise sheet, but there are a few more interesting things to say and I will be happy to explain them to you. And then the vector space model, and now we are switching to linear algebra mode. It's very nice and I'm sure you will like it. And exercise sheet 8 will go back to the roots, to the beginning, to exercise sheet 2. And then you just do everything with linear algebra now, and we will see the connection today. Let me just check whether the pen works while Frank is still here. It works, great. Okay, so your experiences, and let's also look at some of your demos. Most of you liked the sheet and the lecture a lot. It's not easy, some problems with multiprocessing. Here are some quotes. Very nice and very informative. Some problems, yes, because there are so many different things, that was also said in the lecture, but I really liked it anyway because of exactly this. And I like this comment, because that was kind of the purpose of the lectures and the exercise sheet. All these things are pretty simple by themselves, but the combination of so many things is really hard and so many things can go wrong, and the only way to really learn this is by doing it yourself. If you just listen, you don't learn it. Can you increase the line limit for these exercises? This is a recurring theme over the years. We have a pretty strict limit of 80 characters per line, actually only 79, which doesn't come from us. It's the default of Flake8, our Python style checker. And I wrote a post about this in the forum, which you can read. The essence is: every software project, every company, every organization has such rules, you just need them and there are good reasons for them. Yeah, unfortunately we also made that experience afterwards. Python is just not made for it. Python and multiprocessing do not go together.
And actually, by something I explained in the last lecture, one can even understand why. We learned that JavaScript is single-threaded. You can do asynchronous stuff, but it's inherently single-threaded, because JavaScript, like Python, is an interpreted language and the interpreter always looks at one line at a time, never two lines. So the interpreter is not multi-threaded, because then you would have all kinds of problems with shared data structures and so on. Python is the same thing, so there is a global interpreter lock: when the interpreter looks at this line of Python, it can't at the same time look at another line, which means you can't really run Python code in parallel within one process. The only way to do it is to fork the whole Python process, which means you have two interpreters, three interpreters, and then if you have three workers, you have the same thing running three times, which also means three times the amount of memory. That was the problem for the exercise sheet, because your Q-gram index was quite big, it used four gigabytes or so if you worked with the whole dataset, which means you then have two times, three times, five times four gigabytes. Has kindled an interest in building web interfaces. Nice. And many of you wrote that it's great to see how in these last two exercise sheets everything we did so far came together, from the first lecture to the fuzzy search and the web app stuff. So let's look at some of your demos. We have prepared some of them. So here we have Freichel. Okay, so many of you were inspired by the... okay, so what do we have here? We have a drop-down menu, let's see. So all kinds of interesting stuff you can do here, arrow keys, and then I can click on something and I get to the page. So what else do we have here? Okay, an error message. Interesting.
Here we have some Star Wars animation. Nice. And if I have a file name which doesn't exist, more Star Wars animations. Okay. That was one of the engines. Here we have another one, Goose Goose Go. Let's try that one. Okay. So I think here the innovation was the logo. Natalie, is that right or is there anything else? Okay, nice logo. And it's also responsive, that's also nice. Here's another one. So what else do we have? ChatGPT, okay, it's already in there, we have an up-to-date language, ChatGPT, no image available. You can click, you also get to the site. That's also just very neatly made and also quite responsive. And what do we have here? That's another one, also pretty responsive, with a nice table. And this is a good example. Okay, this one stalls sometimes, right? But that's also interesting: it's actually not so easy to make it not stall, because it can sometimes happen that requests start and then don't end, and you have to deal with all that. Here's another one, let me try that one. Here we have a little delay, but otherwise... Okay, and let me maybe go back to this one, where I just type a name. So I completely messed up the name, which is interesting. Let's go to Wikidata, where you also have this search thing, and let me also just type Arnold. And I don't find a match. So apparently it's not so easy to make a nice search-as-you-type interface with fuzzy search, because otherwise they certainly would have done this. So in this exercise sheet we have managed to do something with relatively modest effort which the Wikidata website has not managed to do. So that's quite nice. There were also some Easter eggs. Let's first look at your reactions. I was impressed by what one can do. My favorite one was the asteroids. We will look at them in a second. Matrix.
Scared that I was hacked, many of you wrote that, so when you saw it for the first time, it didn't work for all of you. We also have a common Harlem Shake brought back some memories. Disappointed that I had to disable the gorillas even though they did nothing wrong. Did not work because of the parsing I do of the JSON object. So for some of you it didn't work because you did something different or you did something special. Some of you then realized that there were these hidden code injections in the data and then fixed it. Some of you were not able to fix it. Then you couldn't see the wonderful animations, but I will show them. Yeah, very cool what you can make a browser do, but also worrying. Many of you wrote that. I feel violated. The data betrayed me. Okay, let's look at them. So yeah, and let me also use the opportunity to show you. This is the master solution. Let's type the matrix. Okay, this is the matrix. This is the matrix animation, so how does it, yeah. So apparently, yeah, I mean, and it's interesting, so for those of you who haven't thought about it, what is happening? We gave you a big data set by typing the query and then locating a few matching entities and then you are extracting part of the data, transferring here and putting it in your HTML. But it's a big data set, you haven't looked at every byte and somewhere in this data set stuff is hidden. Apparently in this case JavaScript and even pretty involved JavaScript which can then do anything. And it's not so easy to, yeah, then we have the gorillas, let me see if that is, ok that was too grossly mistyped. Ok now we have a whole video game here which, no I don't want to see the intro which you can play, so a little JavaScript which plays a whole video game. Okay angle here I think, I don't know, 10 velocity, 14. Okay I can throw banana and try to, okay that was, okay maybe 70. Okay now. Maybe 70. I think I won't play it. You get it. That was Gorillaz. Can you hear the sound, people on Zoom? 
Can you hear it? Yes, the people on Zoom can hear the sound. Okay. The next one: we have the Harlem Shake. Okay. I think, yeah, let me check. The Harlem Shake is from 2013. Okay. So who remembers the Harlem Shake? That's nine years ago. I'm not sure if everybody in the room knows the meme. Yeah. So a nice one. Then we have Windows. Okay. So it's so nice. Actually, I had it yesterday, Frank, on my notebook. But the new blue one, the real one, which was scary. We already saw the snow in the last lecture, it's also in there. And yeah, and some disabled, and then we have the turnaround. Okay, another one. It's all just JavaScript, right? Doing some interesting stuff. And the last one was asteroids. Yeah, that was also interesting, because some of you said nothing happened for asteroids, just a triangle appeared on the... It's not just a triangle, it's a rocket and it's a game. Maybe that's due to your age. I think it was one of the first video games, and of course you can shoot and delete your page. Okay, so that was that. So some nice injections there. ChatGPT, there was a question about ChatGPT. Let me first show you what you wrote and then show you a little bit about ChatGPT, because I think it's quite significant. It's a language model designed and trained by a company called OpenAI, a pretty small company, famously known to be called OpenAI but to be not really open, that can have conversations in writing with humans. I really find it interesting. It's amazing how powerful it is. I don't know what ChatGPT is. You should, you absolutely should. It refused to prove that the earth is flat. Good. Initially it did not use the metric system; when I asked it to use a proper unit system, it switched to the metric system. Very nice. So it can really have a conversation. We will see it. Was amazing, scared about my future as a software engineer. That's true. ChatGPT, I think, will change a lot. It's quite revolutionary. Someone built, check out this link, it's amazing.
And I'm fascinated, this feels like a very large step. And I agree, so I will show it to you in a minute. The last time I had this feeling that something so significant happened was in 1994, I still remember it, in my office, when I tried the first web browser, Mosaic by Netscape. Who knows Mosaic by Netscape in the room? Have you even heard of it? Okay, 1994, it's 1994, not 1894. So you saw the web browser, you saw the idea of a web, and it was clear. You were wondering: why is not everybody using it, why is it not changing the world right now? And interestingly, it took a few more years, like five years. That's quite interesting really. You see something, you know, oh wow, this is a huge step, but then nothing is happening for a while, because it somehow takes a while for these things to spread around. And then four or five years later, the web of course changed the world. This is very similar, I think. It even tells jokes. Why can't the bicycle stand by itself? Because it's too tired. And this happened in a conversation by someone where they talked casually about bicycles and stuff, and then the person said, do you know a joke, and then the joke was even a reference to the conversation so far. So I've logged into ChatGPT here, let me have a little conversation related to our lecture. Let me make it a little bit larger. How is the edit distance between two strings defined? And this is not prepared, I'm not doing anything special, I'm just asking ChatGPT. And they are having a pretty heavy server load. And it's a language model, it's producing one word at a time. Okay: between two strings it is the minimum number, okay, the set of allowed operations includes insertion, deletion, substitution. I get an example, kitten, sitting, three, okay.
The important thing to understand, and we will talk a little more about this in other lectures: it's just a language model, it's just producing the next word that's likely, based on everything it has seen so far and what you have typed so far. Is the edit distance symmetric? Let me just ask it that. It's generally considered to be symmetric, meaning that this is the same. Generally considered, that's a bit strange. Okay, the operations are reversible. So that's a good argument, they are reversible. And it's important: this is not text which is just copied from somewhere, it's made up on the fly as I do this. Let's see, yeah, this will probably not work: can you write a function in Python, and let me also say how it should be called, called edit distance, that computes the edit distance between two given strings x and y. So let me even give this stuff a name, right? This is like an exam question, right? Let's see. Certainly, here's a simple implementation of the Levenshtein distance in Python: def editDistance x, y. Oh, it's writing code, and it's using the names which I gave. So this is quite amazing, right? And it's a language model. This is not going to some module which somebody specifically wrote for writing code in response; it's always just producing the next word based on the words we have already seen. Now, is it correct? It's not necessarily correct. It's actually not so infrequently wrong, sometimes subtly wrong, sometimes totally wrong. But, okay. And this is now the iterative solution, right, doing the dynamic programming approach. And it's also a bit of a smarty-pants, a Klugscheißer, it's always giving explanations you didn't ask for. Yeah, but why not? Okay, wow, I am impressed. So that was the dynamic programming implementation.
Can you also give me the recursive one? You can also do it recursively, which is much simpler, I think we mentioned it in the lecture. And please call it editDistanceRecursive. So let me give it a name so that it can't just produce something which it already knew. So now I'm asking, can you also do it recursively? "Certainly", very confident, right? "Here's a recursive implementation." editDistanceRecursive(x, y, i, j). It should be a shorter function, right? Okay, the base cases are correct. If one string is empty, it's the length of the other string. If this one is empty, it's the length of the other. And now you just should have three cases. Okay. Yeah, and it calls itself recursively three times. If you haven't looked at ChatGPT or played around with it, you have to, because it's quite revolutionary and it's just a language model. So you tell me, while I type the question: should one take the recursive implementation or the dynamic programming one, and why? Please just tell me whenever you know it, which one is better, the recursive one or the iterative one? It's also an exam question. Typically the dynamic programming one. What's the problem with the recursive one? Yeah, it needs more time. What's the running time of the recursive one, if you call it recursively three times in every step? Let's see what ChatGPT says. So which one should I take? So the dynamic programming one is more efficient, especially for long strings. This is because the dynamic programming approach stores the results in a table and reuses them rather than recomputing them. This leads to a significant... Okay. Levenshtein, so the dynamic programming one was this m times n, that's correct. It's quadratic. And the other one is 3 to the n in contrast, that's also correct. I mean, if you don't find that amazing then I don't know what. We could spend the whole lecture on this. And you have to understand, I don't know how much you know about large language models, this is just a language model.
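For reference, here is a minimal sketch of the two variants discussed above. This is not the code ChatGPT produced in the lecture; the function names `edit_distance` and `edit_distance_recursive` and the choice to recurse on string prefixes are assumptions made for illustration.

```python
def edit_distance(x: str, y: str) -> int:
    """Dynamic programming variant, O(m * n) time for lengths m and n."""
    m, n = len(x), len(y)
    # dist[i][j] = edit distance between the prefixes x[:i] and y[:j].
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # delete all i characters
    for j in range(n + 1):
        dist[0][j] = j  # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution or match
    return dist[m][n]


def edit_distance_recursive(x: str, y: str) -> int:
    """Plain recursive variant: three recursive calls per step, exponential time."""
    if not x:
        return len(y)
    if not y:
        return len(x)
    cost = 0 if x[-1] == y[-1] else 1
    return min(edit_distance_recursive(x[:-1], y) + 1,
               edit_distance_recursive(x, y[:-1]) + 1,
               edit_distance_recursive(x[:-1], y[:-1]) + cost)
```

Both functions return 3 for kitten and sitting, but only the first one remains usable once the strings get longer, because the recursive one recomputes the same subproblems over and over.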
It was trained on a lot of data, and all it does is, given what it has seen so far and what you have typed so far, predict the next word, predict the next word, just predict the next word or part of a word based on what it has seen. That's also why what you see here, this animated thing, is not just a fake animation, that's actually how the thing produces it, right? So while it produces the first word, it doesn't know what comes later. It's like I play a game: I start a sentence and then you continue it with the next words, and then you continue it again. I mean, people work like that too, they just say what comes to their mind and then they just form words and go on and on. So this thing is amazing. So play around with it. So now before we go to the vector space model, let's have a bit more fun with encoding. And then we have a break and then we go to the topic of today, which is a small topic, we need only half a lecture for it. So where is my lecture seven? Yes, here it is. So we talked about Unicode. UTF-8 is a variable-byte encoding. This was fairly easy to understand I think. Let me go back to that slide and recap that really quickly. A little bit of history here. Every character in Unicode has a number, right? A unique number. So 128,512 is a smiling face. And the question is how do you encode that number? Of course what you don't want to do is spend four bytes for every character. There is such an encoding, you can do that, and that's actually called UTF-32, but that's extremely wasteful. Java sometimes does that. There's UTF-16 where you say I don't need all characters, 65,536 is enough, then you always have two bytes, but it's also wasteful because usually for the typical characters you need just one byte. So of course you want an encoding which uses few bytes when you only need few bytes, and more bytes when they are needed.
And this encoding is called UTF-8, which for a long time now has been the standard. And the nice thing is that you can understand it easily. We also had that in the last lectures. It connects to the lecture about encoding: you have these leading bits which tell you, okay, is this a sequence of two bytes, three bytes or four bytes, then you have these continuation bytes which always start with 10, and in the simple one-byte case it's the same as the ASCII codes. And we also looked at some nice properties. This is also a popular exam question, like do this yourself or write a program for this. I give you a code point, for example 228, and you should be able to tell me the UTF-8 code. It's a pretty easy exercise. I give you a code point, I tell you 2011, and you give me the UTF-8 code. So first you have to turn 2011 into binary, then you have to figure out, okay, is it one byte, two bytes, three bytes, and then you have to do the bit magic, and then you can turn that back into decimal numbers. One trick I should maybe mention here, in case you didn't know that: if I have this binary number here, what's the hex representation of that number? Let's first do it in decimal. The place values are 16, 32, 64, 128. So let's make a decimal number out of it, so this is then 192, 208, 209 I think, right? This should be 209. You should be able to do something like this in your head without a calculator. Something is not quite right with the pen here. So now I have it in decimal. Now what's the hex value if I want hex? First, how many hex digits does this number have if I write it in hexadecimal? Yes? Two. And what are the hex digits? D1. D1, okay. It's D1 in hex. And how did you do it? Just look at the first four bits. Yes.
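The exam-style exercise (code point in, UTF-8 bytes out) can be sketched as follows. The function name `utf8_encode` is an assumption for illustration; the sketch ignores the surrogate range and does not check the upper limit of Unicode, so it is not a full encoder.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point as UTF-8 by hand (sketch, no range checks)."""
    if code_point < 0x80:        # up to 7 bits: one byte, same as ASCII
        return bytes([code_point])
    if code_point < 0x800:       # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point < 0x10000:     # up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    # up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (code_point >> 18),
                  0b10000000 | ((code_point >> 12) & 0b111111),
                  0b10000000 | ((code_point >> 6) & 0b111111),
                  0b10000000 | (code_point & 0b111111)])


# The two code points from the lecture, plus the smiling face.
for cp in (228, 2011, 128512):
    print(cp, utf8_encode(cp).hex(" ").upper())
```

A quick way to check a hand computation is to compare against Python's built-in encoder, `chr(cp).encode("utf-8")`.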
Okay, what you can do: a hexadecimal digit has a range of 0 to 15, so 16 values, so you can just take four bits, that's one digit, and the other four bits are another digit. So this is one in decimal, which is also one in hexadecimal, and this, what is it in decimal, what we see here? 13, yeah, it's 8, 4 and 1, so it's 13. And A is 10, B is 11, C is 12, so 13 is a D. And this is the kind of thing which is important to know, because if in an exam you convert this to decimal and then you do modulo 16 and divide by 16, you will also arrive at D1, but it will take you 5 minutes instead of 5 seconds. So it's important to know that. It's just a simple trick. And here's also something worth knowing and important to know. You can certainly have a UTF-8 sequence for every possible Unicode code point, but it's not the other way around. Not every multibyte sequence is valid. For example, this looks good. It starts with 110, which means it's a 2-byte UTF-8 code, then I have a continuation byte, but still this is not valid. And why is it not valid? Because the code point here, so the thing actually carrying information, is just 7 bits, the other bits here are 0, which means you could have done it with just one byte. And that's true also for all the longer ones, right? If you use a long sequence and put in a code point which could have been encoded with fewer bytes, then this is actually invalid. That's just how the standard is defined. It could have been defined so that, okay, when you use that sequence, it's just the same as the corresponding one-byte sequence, but it's not valid. So that's important to know. And whenever you see this funny character, that's because some decoder goes through a sequence and tries to... And invalid can also happen when the leading bytes are somehow messed up, right? For example, you can't have a byte starting with 10 as the leading byte, right? That's also invalid.
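Both points can be checked in a few lines. This is a sketch: the helper name `is_valid_utf8` is an assumption, and it simply relies on Python's own strict UTF-8 decoder to decide what counts as valid.

```python
# Nibble trick: four bits are exactly one hex digit.
value = 0b11010001
high, low = value >> 4, value & 0b1111   # 1101 -> 13, 0001 -> 1
hex_digits = "0123456789ABCDEF"
as_hex = hex_digits[high] + hex_digits[low]


def is_valid_utf8(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Looks like a two-byte sequence, but carries only the 7 bits of "A":
# an overlong encoding, which the standard forbids.
overlong_a = bytes([0b11000001, 0b10000001])
# A continuation byte (starting with 10) in the leading position.
lone_continuation = bytes([0b10000001])

print(as_hex, is_valid_utf8(overlong_a), is_valid_utf8(lone_continuation))
```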
So that's also an interesting exercise, to think about how to characterize all invalid sequences. So if it starts with 10 it can't be a valid UTF-8 code. Okay, so we had that. This you already needed for the exercise sheet: URL decoding. For the URL encoding, because in the URL you have this limited character set, you first take the code in hex, which of course depends on the encoding. So the German umlaut ä in UTF-8 is C3 A4, which means when you URL-encode it, so in a URL with a limited character set, you just put a percent sign before each hex byte, and it looks like this. But when your encoding is the older ISO Latin-1, or ISO 8859-1, then it will just look like percent E4. So how the URL encoding looks depends on the encoding. And here is one more thing, it's a skill which is really useful and which you should learn, and I want to show it to you. Let's take this funny thing here in Python, which maybe you didn't fully understand so far. Let's play around with this to see how complicated this is. So let me do the following. I'm just specifying this, which is also the default. Let's see in a second what happens if I change this. Actually you should do this at the beginning of every Python file. This tells the Python interpreter how it should interpret the bytes here. And now, I don't do main or anything, let me just print an ä here. And let me just execute the program here. Okay, and now it looks like this. This doesn't look like an ä. And this stuff you see pretty frequently. You see it in emails which are quoted by other emails. You can see it, right? This is an A with a tilde and some funny symbol, and the question is why do you get that funny symbol? And now the interesting thing about encoding is, right now I think there are, I'm not even sure, five or six encodings going on here.
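As a small aside on the URL encoding just mentioned, the dependence on the underlying encoding can be sketched with the standard library's `urllib.parse`. The variable names are assumptions; the ä is written as an escape so that the example does not depend on how this file itself is stored.

```python
from urllib.parse import quote, unquote

a_umlaut = "\u00e4"  # the German umlaut ä

# quote() first encodes the string to bytes, then writes each byte as %XX.
utf8_form = quote(a_umlaut)                        # UTF-8 is the default
latin1_form = quote(a_umlaut, encoding="latin-1")  # ISO 8859-1

print(utf8_form, latin1_form)

# Decoding needs the same encoding that was used for encoding.
assert unquote(utf8_form) == a_umlaut
assert unquote(latin1_form, encoding="latin-1") == a_umlaut
```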
There is, how's that string encoded in Python? And the basic thing you have to understand: there are sequences of bytes, and then there is how you interpret sequences of bytes. So here we have, how is this file even stored? How is this ä stored in the file? This file here is a text file with bytes. How is it stored? Let's look at that. That was actually on one of the slides. The slides in many ways are also references, although I'm not always going through all the details. I use them myself a lot for that, so I know, oh, I did that in one of my lectures, and I just go there and see how it works. This is a very useful tool for looking at a file in its byte view. So let me just do xxd, and let me call the file encoding.py, and let's look at it again. So let me look at the file, just at the contents of this file, which is a small Python program. And there it is. Okay, and there are a few parameters here which I can choose. Maybe in each line, show me four bytes and show them in groups of one. Okay. So this is just the contents of the file. So here I just said show me four bytes per line and group them as one. If I do group four then it will group four together, but I want to see the individual codes nicely. But xxd is the thing to remember. So now what I see, I just see the characters here, right? So I see the hash. And what's interesting to me is what happens for the ä here. And here I see the ä is actually two dots here. This is just how it shows non-printable characters. And here I see, ah, it's C3 A4, right? So apparently it used UTF-8. So let's change this here. So will this change now? Will it change now or will it look the same? So the yardstick for whether you really understood these things is that you can say in advance, right, this is like the yardstick, the test of knowledge: can you say in advance what's going to happen, not afterwards.
Will I get the same output now or will I get a different output, what do you think? Yeah, I will get the same output. Of course the part with the ISO changed, but here it's still C3 A4, it's just in different lines now. This is not how this is stored in the file, this is how the Python interpreter will interpret this. Let's see what the Python interpreter does now. Now I get three characters, and actually four because one is hidden. And actually I was a little mean because I changed the character. The terminal also has an encoding. You are sending stuff to the terminal. The terminal just gets sequences of bytes, which are also interpreted. So now I set what the terminal does to UTF-8 here. So now everything the terminal gets is interpreted as UTF-8. So let's go back to this again. So what we have now: in the file, I have the ä encoded as UTF-8, Python will interpret it as UTF-8, and when I print it to the terminal I will see it as UTF-8. If I change this here to ISO 8859-1, now the terminal will get C3 A4 and it will interpret this as two characters. And let's just check that, let's look at ISO 8859-1. This is for a computer scientist, it's like 65536. You should absolutely know it by heart. Let's just look at the table. C3, you see, C3 is this A with a tilde. So that's why when you donate to Wikipedia, C3 and the other one was, how was it represented? A4, yeah, A4, it's this one. That's by the way the generic currency symbol. So if you understand the encoding then you actually understand why you're seeing what you're seeing, right? It's not just some random characters. Do we have the ä in the table? Yeah, we have the ä here. In UTF-8 the ä is C3 A4, it's a two-byte sequence. And if you interpret this as ISO 8859-1, then the C3 will be interpreted as capital A with a tilde and the A4 as the generic currency symbol. Which is why you get this if it is interpreted like this.
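The effect just described can be reproduced without a terminal, by encoding with one encoding and decoding with the other. This is a sketch with assumed variable names; `unicodedata.name` is used only to spell out which two characters come out.

```python
import unicodedata

a_umlaut = "\u00e4"                       # ä, written as an escape on purpose
utf8_bytes = a_umlaut.encode("utf-8")     # what ends up in a UTF-8 file: C3 A4

# A reader that assumes ISO 8859-1 turns every byte into one character.
misread = utf8_bytes.decode("iso-8859-1")
names = [unicodedata.name(char) for char in misread]

print(utf8_bytes.hex(" ").upper(), misread, names)
```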
Now let's go back to UTF-8 here. And this is actually really important to understand when you write code that does anything with strings and languages, because otherwise you get some mistake and then you are just hacking around, and there are so many places where it could go wrong that you are spending a lot of time. So now I'm doing this and now, yeah, I get this. So now Python is just interpreting this as two characters, that's why I get this. Now there's another dimension. Yeah? "I'm confused, because you said that the terminal now shows you the bytes, I think. And Python sends the two bytes to the terminal. Why doesn't the terminal interpret the bytes as UTF-8?" You are confused why you get this? Okay, yeah, I understand. Let's just see. So the contents of the file is like this. And actually there are two more dimensions which I didn't talk about. One is how the file is stored. So the file encoding here: the file is storing this as two bytes. And I can change it in Vim to ISO 8859-1. And what will I see now? Okay, and now I probably have to save it. Oh, now it says converted, see. That's also another frequent source of error. You're writing a file and you're typing all kinds of funny characters. Now the question is how it is stored on disk. So now it's only one byte here, it's E4. Let's go to the table, E4 is actually ä, right? So now I changed the file encoding, and what will we get now? Now we get an ä, and we are in UTF-8 here. Because that's another dimension now, and that's maybe your question. What this line is just saying is, how should I interpret the bytes in the file at this point? And the bytes in the file here is one byte now, because I changed the file encoding to ISO 8859-1, which means Python now interprets it exactly in the way the file is stored.
But when Python outputs the string, it outputs it as UTF-8. That's still another dimension, how it outputs it. I mean, we saw that you could specify the encoding. I could do encode and decode. You can now do a lot of things. So there is how the file is stored, how Python interprets it, how Python prints it, then how the terminal interprets it. And then there's the editor, how the editor interprets it. And that's I think another one. I think we now have six dimensions, and I strongly encourage you to play around with this yourself. So now, I'm actually not sure. Ah, you see? Now I get a question mark here. Let's see. And this is why I showed you xxd, because it's very hard to understand what's happening now. What is the editor storing now? That may be hard to understand, so let's just look at the contents with xxd. So the editor is storing E4, and my encoding is set to Latin-1, which is a synonym for ISO 8859-1, and my file encoding is also set to that, yeah. So the editor is now interpreting this as Latin-1, as it should, but this is in a terminal which is set to UTF-8, right? So that's why it's displayed in a strange way here. Actually, I can see it here in Vim, I can see that's the character E4. So actually now everything is in sync here. So the file is storing it in ISO 8859-1, the editor is interpreting it that way and showing it that way, and Python is also interpreting it that way. Just the terminal is not. We can also go the wrong way. So if I switch to this now, now I have to write it, and now, okay, now I get, hmm, yeah, you see funny things happening here, right?
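The opposite mismatch, Latin-1 bytes reaching something that expects UTF-8, is what typically produces the question mark. A minimal sketch, assuming the consumer replaces invalid bytes with the Unicode replacement character the way Python's `errors="replace"` does (a terminal may render this differently):

```python
a_umlaut = "\u00e4"                       # ä

# Layer 1: how the editor stores the character on disk.
stored = a_umlaut.encode("iso-8859-1")    # a single byte, E4

# Layer 2: a consumer that insists on UTF-8. E4 announces a three-byte
# sequence, but no continuation bytes follow, so strict decoding fails.
try:
    stored.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lenient consumer substitutes U+FFFD, usually drawn as a question mark.
shown = stored.decode("utf-8", errors="replace")

print(stored.hex().upper(), strict_ok, repr(shown))
```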
And these things can pile up, as we saw. Let this be the last thing we do. No, I don't want the error bells. Oh, the encoding is back to UTF-8, ah, because I left the editor, and when I entered it again it shows the default encoding. OK. And, yeah, it's really tricky. But it's important to understand that these things are going on. So, no, I actually want to convert this all back, and this is how the editor interprets this. What's this? OK, this is the file encoding, this did not change. File encoding back to UTF-8, and this is the last thing I want to show you, and then we make a break and continue with the other stuff. Let me just type the German umlaut again here. Okay, I've messed it up so much. Yeah, now I'm confused myself, I think I have to leave the editor. Oh, the terminal is still in ISO, yeah, you see. But at least you know where to check, right? I think the take-home message is, it's so confusing: there is the terminal interpreting stuff, there is how the file is encoded, how the editor interprets it, how the editor shows it, how the Python interpreter shows it. And you just have to know that all these things are at work. And if you know that, then you at least know where to look. So the last thing we show: now it's again two bytes, UTF-8, UTF-8, UTF-8, UTF-8, that's of course the safe way. I now get it here. And now I let Python interpret this as two characters. And here the terminal also interprets it that way. Now I get four symbols, right? And you can even compute which ones. It's two bytes in the file, Python will interpret it as ISO, so as two characters, then it will print it as UTF-8, and then the UTF-8 will again be interpreted as ISO, which means I get four characters here, one of them invisible.
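The four symbols can indeed be computed. The sketch below chains the same steps as in the demo (UTF-8 on disk, read as ISO 8859-1, printed as UTF-8, shown by an ISO 8859-1 terminal); the variable names are assumptions.

```python
a_umlaut = "\u00e4"                                   # ä

on_disk = a_umlaut.encode("utf-8")                    # C3 A4 in the file
seen_by_python = on_disk.decode("iso-8859-1")         # two characters
sent_to_terminal = seen_by_python.encode("utf-8")     # C3 83 C2 A4
seen_on_screen = sent_to_terminal.decode("iso-8859-1")

print(sent_to_terminal.hex(" ").upper())
print([hex(ord(char)) for char in seen_on_screen])
```

The second of the four characters is U+0083, a control character with no visible glyph, which is the "invisible" one.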
And you see that stuff a lot when you're actually working with text and printing, and in the web browser you also see these symbols and you wonder what's going on. So very useful knowledge I think. Play around with it yourself, it's very interesting. Also for reference, here is how you do it in the various programming languages. Every programming language has to handle this somehow. In Python, we have just seen it. In more efficient languages you don't always want to waste too many bytes, so in C++ you have string, and then you have wstring when you use more characters, so you have another class for this. And in Java you have to pay attention to what the length is now, right? This is also very important to understand and a frequent source of error: what is the length of the string, is it the number of bytes or the number of characters? So if you take this here, you get length 2. Which is pretty confusing, because it's a four-byte UTF-8 sequence, and you get length 2. And why is that so? Yes? Yeah, it's UTF-16 in Java, right? It's UTF-16, which means Java uses two bytes, which works for most characters, but not for the smiley face, which needs more than 16 bits, so it needs two 16-bit codes, which is why the length is actually two, and not one, as you would have expected. Is there a reason for the double open? No, this is just wrong, but it's good to ask because you never know, right? Maybe there is a reason for this. It's just a mistake, thank you very much. And also in Java you can convert between encodings. Take-home message: there are really a lot of things going on, and if you have any problems there you have to look deeper, and now you know how to look deeper and how to understand it. Yeah, and in Python here are some more examples of what I showed you. Okay, any questions about Unicode or anything we did in the last two lectures before we now go to the next topic? Yes please. Oh yeah, thank you. In the last line. Here?
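The different notions of "length" for the smiling face can be compared from Python. This is a sketch with assumed variable names; the UTF-16 count divides the byte length by two, which corresponds to what Java's `String.length()` reports.

```python
smiley = chr(128512)  # the smiling face, code point U+1F600

code_points = len(smiley)                            # what Python counts
utf8_bytes = len(smiley.encode("utf-8"))             # bytes in UTF-8
utf16_units = len(smiley.encode("utf-16-le")) // 2   # 16-bit units, as in Java
utf32_bytes = len(smiley.encode("utf-32-le"))        # always four bytes

print(code_points, utf8_bytes, utf16_units, utf32_bytes)
```

The `-le` variants are used so that no byte order mark is added to the encoded bytes.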
Yeah, probably I just thought, just in case. Thank you. Thank you very much. And it becomes more complicated when you communicate with a web browser, because then the web browser also has its own encoding and stuff. Yeah? "But why, in UTF-16, is the length not one?" It's a very good question, the question was why it is not one. Because a char in Java is a UTF-16 code. Which, if you ask me, was a very bad design decision. So the char in Java is a UTF-16 code and not really a character. Which means this is two chars. So a character in Java, char, is not what you think it is. It's not an actual character, it's a UTF-16 code. Which means some characters take two chars in Java. Which I think was a terrible design decision, but I'm sure they had their reasons for it. But the problem is, when you, like us, build real search engines and stuff, and you really want to do it efficiently, that's not easy. Because then you have to deal with UTF-8, and doing that in a language like C++ is not easy. And there are so many more questions. So in a company like Google, I've worked at Google for some time a while ago, you have a whole department dealing only with Unicode. It's a big department, they're doing only this stuff, because it's so complex but so important, right? You want to display and handle this stuff properly. Any other questions before we make a break and go to the next topic? Okay, let's make a break. Five minutes, and then the next topic, which will be relatively short. Yeah, back to lecture 8, vector space model. So we just have nine slides, and this is now the start of something new, connecting to what we have done earlier, but we are now going into the wonderful, beautiful, enchanting world of linear algebra.
And this is a very lightweight introduction into this before the Christmas break, and also the exercise sheet will be very lightweight, and we deliberately made this its own small lecture so that it starts easy. So we will now represent documents as vectors. Here's our running example for today. You have to understand it, but it's easy to understand. So we have six documents or text records, each column here is a document, and the rows are now words, or terms as they are usually called in this context. Term, T-E-R-M. So this just means this document contains the word internet once, the word web once, the word surfing once and so on. And this is like the term frequency we had in the first and second lecture, which means D4 contains the word surfing twice. And maybe before we continue, let me just make an example.txt. If I have a file with one line per text record, let's just check how that file would look. So the first document has the words, let me just write them here, and we will write code for parsing it, internet web surfing, right, that's the first document. The second one is internet surfing. The third document is web surfing. This one is internet web surfing surfing beach. So it's like this. And actually the order of the words is not important, so I could also have a totally different order. So you have the word surfing twice. So I just do it in this order. So now I have surfing with a 2. And now I have two documents which are the same. Why not? Two documents can be the same, and they are both about beach surfing. Yeah, surfing beach. Let me do it like this. Surfing beach. And this one is the same but with the words in a different order, but it's the same document for our purposes. That's also how we did it so far, remember, it didn't matter in which order they occurred. Some of you considered proximity in your extensions, but here we actually ignore the order.
So this is exactly the document collection we see here. And also understand why I chose this example. This will be important for the next lecture: I have here two words which mean more or less the same thing, and one word that's called a polyseme, which means different things in different contexts. That's not important for today but for the next lecture. But that's the reason for the peculiar example. Internet and web are synonyms, words which mean basically the same thing and can be used interchangeably. Surfing is a polyseme, it means different things in different contexts. And language is full of these: different ways to say the same thing, and the same word meaning different things in different contexts. Okay, and the zero entries. In this example, we have relatively few zero entries. If you think about a real collection, here you will have basically all the words in the dictionary, and most documents only contain a fraction of them, which means you have a matrix which is mostly zeros, not in this toy example, and that's important and we will see in a second why. And we just use TF scores, so these are just whole numbers here which say how often the word occurs. For the exercise sheet you will use BM25 scores. You can also write 0.14 here, right, and it just has a different meaning. Okay, so this is often referred to as the vector space model, why? Because, yeah, these things are now vectors. Here in this case, a document is a four-dimensional vector, so we can just see them as points in a four-dimensional space. The whole thing becomes a matrix if you write them side by side, and then you are in the linear algebra world. So in this case you have six vectors in four-dimensional space. So it's a vector space. And this is the term-document matrix. Of course you could also have a document-term matrix, but historically one has the terms in the rows and the documents in the columns and one calls it like this, so let's also call it like this.
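To make the running example concrete, here is a minimal sketch that builds the term-document matrix with TF scores from the six text records. The list `records` stands in for the lines of example.txt, and ordering the terms by first occurrence is an assumption made so that the rows come out as internet, web, surfing, beach.

```python
# One text record per entry, as they would appear line by line in the file.
records = [
    "internet web surfing",
    "internet surfing",
    "web surfing",
    "internet web surfing surfing beach",
    "surfing beach",
    "beach surfing",   # same document as the previous one, different word order
]

# Terms in the order in which they are first encountered.
terms = []
for record in records:
    for word in record.split():
        if word not in terms:
            terms.append(word)

# One row per term, one column per document, entries are term frequencies.
matrix = [[record.split().count(term) for record in records] for term in terms]

for term, row in zip(terms, matrix):
    print(f"{term:10}", row)
```

A real collection would need a sparse representation, since almost all entries are zero; the dense list of lists is only reasonable for a toy example like this one.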
And a term for our purposes is really just a word. So they are also synonyms. So let's see, are we just being fancy or does it have an advantage? It has lots of advantages, and it will be fascinating to explore this, to write these things as vectors and matrices. Let's first start with something very simple. Now I have my six documents and this is a query, and the query is just: where I have a one, this is a word in my query. So if you think of exercise 1 or 2, I type web surfing now and I want my hits for web surfing. And now let's not compute the hits with an inverted index but by doing a little linear algebra. So I'm now taking the dot product of these vectors. And let's just do it. So let me take the dot product of this vector here with this vector here. And you tell me what's the dot product. Dot product is component-wise multiplication and then adding it up. What's the dot product? Two. Let's do it slowly: one times zero plus one times one plus one times one plus zero times zero. What's the dot product here? You tell me the dot product. Someone else maybe? One. One. How does it go on? Two. Three. One. And one, because it's the same vector. Okay, so these are now our scores. And they should look familiar to you, because these are exactly the TF scores from the second lecture, slide 11. Before we introduced IDF, this was exactly what we did. Think about it. So if you take D4 for example, my query is web surfing, and I have web here with a score of one and surfing with a score of two, and it's just adding them up. It's just what we did in lecture two, but now with linear algebra. So just writing a query like this and computing the dot product gives you the exact same thing.
And now this is actually something you can do in linear algebra, and we implement this together now, but before we do it you should understand the most basic thing about linear algebra. Maybe it's a bit rusty for you, then we will unrust it in the next lecture. Let's write this vector as a row vector, so let me write it like this. That's now my 0 1 1 0. Actually I don't need commas. One usually doesn't write commas when one writes a vector. So this is now my q transpose. Transpose because below it's a column vector, and now it's a row vector. And let me write my matrix now. Let's take our time, let's do it slowly for once, although it's very easy, but it's important to understand this. So I'm just copying the matrix now. The matrix, not that Matrix, the matrix below. Column by column: 1 1 1 0, then 1 0 1 0, then 0 1 1 0, then 1 1 2 1, and then I have twice 0 0 1 1. So this is now our matrix A, like it's below. Let's also look at the dimensions here. What are the dimensions? This has one row and four columns, and this has four rows and six columns. Basic rule of multiplication in linear algebra: this has to match. So this only works if this is equal to this. Then you can do the multiplication. And now let's just do the vector-matrix multiplication. So first, what will be the outcome? The outcome will be a 1 times 6 vector, right? Basically you just multiply 1 times 4 with 4 times 6. How does it work? Let's also do this very slowly. I'm taking this here and multiply it with this column here, and then I get the first entry. And what's the first entry? Two. Yeah, and if I multiply it with the second column, I get the second entry, which is one, which is what we have seen. Let me take the third one here and I get the third entry, which is two, and so on, and the other ones, let me just write them in blue, we already had them before, so it will be 3 1 1. These are exactly our scores from the previous slide.
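The multiplication done by hand above can be sketched in plain Python. The helper name `vector_times_matrix` is an assumption; in practice one would use a linear algebra library, but the loop makes the "multiply the row vector with every column" rule explicit.

```python
# Term-document matrix A: rows are internet, web, surfing, beach,
# columns are the documents D1 to D6.
A = [
    [1, 1, 0, 1, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 1, 2, 1, 1],
    [0, 0, 0, 1, 1, 1],
]
q = [0, 1, 1, 0]  # the query "web surfing" as a vector over the same terms


def vector_times_matrix(vector, matrix):
    """Compute the row vector times the matrix: one dot product per column."""
    # Dimension rule: a 1 x k vector needs a k x n matrix.
    if len(vector) != len(matrix):
        raise ValueError("dimensions do not match")
    num_columns = len(matrix[0])
    return [sum(vector[i] * matrix[i][j] for i in range(len(vector)))
            for j in range(num_columns)]


scores = vector_times_matrix(q, A)
print(scores)  # one score per document
```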
This is very simple, but it's just something you have to understand, and it should come very easily: when you are doing vector-matrix multiplication or matrix-vector multiplication, when it fits, you are just multiplying the vector with the rows or columns of the matrix. If I would write this as a column vector on the other side and it fits, then it would be the rows multiplied. So this here would not fit, for example. Right, here it is six wide, so in this format I just wrote it like this to show it's the same dimension. That's why I have to transpose it and put it on this side. I could also have done A transpose times q, it would also work, then I would get the result as a column vector. So that's the important message here, quite simple, but you should understand this, you should be able to do this with no effort. Vector-matrix multiplication is just: multiply the vector with every row or column, depending on the order. And then you get the scores, you get the same thing as we have seen in the first or second lecture, and you get it with a linear algebra operation, which is great. In the first lecture we did loops and that kind of stuff, and we will see a little bit of this now. Now you just write, and let me write this here, it's already written on the slide, this is just q transpose times A. So I have a query and it's just one linear algebra operation and I get all my results. And this is the whole magic also later for the learning stuff: when you do deep learning, you just do one operation and it does all kinds of stuff. And this can be a whole document collection, right? It can be a million documents with 10,000 terms. Still it's just one operation, and if that is implemented efficiently, which it is, then this can be quite fast. So let's implement this together now. So let's go to our code, and I've already copied it to save some time here, because it's boring anyway.
This is exactly, almost exactly our code from the first lecture, which does nothing, I think, because I just have to give it a file, in this case example.txt. I've also copied this, and I made a little change. Let me just make this no larger than it has to be. In the first lecture we had a file with many columns, and you should also do it like this again for the exercise sheet. Here I have just the text record, not the additional info and the title and so on. So that was our example collection from lecture one or two. It's a movie movie, a film, movie. Let's look at the unit test here where I build the inverted index. The word "a" occurs in documents one and two, correct, lines one and two. We lowercase everything, so capital or lower case letter doesn't matter. "Film" occurs only in document two, correct. "Movie", and here also note, let's keep the repetitions again, like in lecture one, just for the sake of the demo: it occurs twice in document one, so 1 1 3. Let's just check whether the unit test works by just doing it the simple way, inverted index. No, it doesn't. Oh yeah, because I renamed the file, I renamed it to example.txt because it doesn't have columns anymore. Yeah, so that works. So it's just building the inverted index. And now let's do the following. Let's have a function build term document matrix. It's a member function; it just takes the inverted index and turns it into a term document matrix, built from the already constructed inverted index. Okay, we want to build a term document matrix, so we need to know the dimensions. So let's make a few amendments here. Here we somehow need the number of terms.
And note I am speaking of terms now not of words because it's called term document matrix and I also need the number of docs, we used to call them text record now they are called documents, I already changed that here record id to doc id, doesn't really matter right I'm just calling them documents now because this thing is called a term document matrix and not a word text record matrix. Yes? That's a very good question. How do we, so in our matrix this here is of course just written for explanation. We just have the matrix, right? So we need to store somehow, if we have document identifiers, we have to store them. And the terms we will actually store. Actually let's do it right now, since you asked. Let's explicitly store the terms. And how do we do it? And we will see in a second why this is useful. So how do we store the terms? Let's just do it, let's just store them in the order we encounter them first. And we actually already have that here. So here is if we see a term for the first time if it's not yet there and here we can just do self terms append term. So now in the end self terms will just be the four terms. In the, and let's also check that, so terms should be which array? So what's the order? We have four terms I think. No, three. In which order will we have them if we do it like this? Yeah? A movie film. A movie film, yeah exactly. It will be in that order just by the way we did that. There's also always something, typical problem when you work with these things, here it's a hash set but you have to pay attention that you get them in a certain order. That's why I write sorted here so that I can test it because in the dictionary they are sorted in any order. But here we want a particular order. And here of course it's important you can't just change the order of the, unless you also change the order the same way for the query. It's important which row belongs to which word when you interpret it later. 
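The idea of storing the terms in the order in which they are first seen can be sketched as follows. The function name, the whitespace tokenization, and the three example lines are assumptions for illustration; the lines are merely chosen to be consistent with the unit test mentioned above, and this is not the lecture's actual code.

```python
def build_inverted_index(lines):
    """Hypothetical helper: returns (terms, inverted_lists).

    terms lists each term once, in order of first occurrence;
    inverted_lists maps a term to its doc ids (1-based, with repetitions).
    """
    terms = []
    inverted_lists = {}
    for doc_id, line in enumerate(lines, start=1):
        for term in line.lower().split():
            if term not in inverted_lists:
                # First time we see this term: remember its position.
                inverted_lists[term] = []
                terms.append(term)
            inverted_lists[term].append(doc_id)
    return terms, inverted_lists


# Assumed toy collection, consistent with the unit test from the lecture.
terms, inverted_lists = build_inverted_index(["A movie movie", "A film", "Movie"])

# The query vector must use the same term order as the matrix rows.
query_terms = {"movie", "film"}
q = [1 if term in query_terms else 0 for term in terms]
print(terms, q)
```

Because the position of a term in `terms` is its row index, the query vector has to be built from the same list; reordering one without the other would silently give wrong scores.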
Okay, we also need to know the number of terms. Oh, it was already written here. I've got to delete that. Self num terms. Yeah, what's the number of terms? I have these inverted lists, which are just a map from words to a vector. So it's just the length, the size of this thing. So this here should give me the number of terms. And the number of docs, how do I get the number of docs given this code? If you look at the code, what's an easy way to get the number of docs? Maybe someone else, thank you. Yes? Yeah, it's just doc id, right? I start with doc id 0, I increase it every time I see a new document. So let's just take doc id, and let's include it in the test. Let's do it here, so it's self num terms, self num docs, and for this example we should have three terms and three docs. Okay, it's both three, but that's just how it is. Let's check it. No, it doesn't work. Oh no, it's not self here, it's the inverted index. Okay. Yeah, it works. So now let's start simple. And now we use a library, the linear algebra library of Python, and it's called NumPy. I will talk a little bit about NumPy in a second, let's just use it for the moment. Let's just create an empty matrix with zeros with the right dimensions. And there's actually a NumPy function for this, zeros. And I just tell it how many rows, that's the number of terms, and how many columns, that's the number of documents. Can you say it again? Oh yeah, it's self, thank you. Thank you for paying attention. And this function just builds it, and let's return it, or let's do it like, I don't know, let's do it like this. So build the term document matrix and show it, and I will now show a few other things. So let's get it here. You will get this code. You can't use it one to one for the exercise sheet, but it will be useful. We will give it to you. And let's just do some prints now. So this is now the term document matrix A. And let's just print it and see what happens.
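Creating the all-zero matrix of the right dimensions looks roughly like this. The variable names are assumptions; the one detail worth noting is that `numpy.zeros` takes the shape as a single tuple, rows first.

```python
import numpy as np

num_terms = 3  # one row per term
num_docs = 3   # one column per document

# The shape is passed as one tuple: (number of rows, number of columns).
A = np.zeros((num_terms, num_docs))
print(A.shape)
```

By default the entries are floating point numbers, which is why the printed matrix shows each entry with a decimal point.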
Now I would expect to see a matrix with the right dimensions. Let's just try it: inverted index with my example txt. Okay, I'm now reading from file. Okay, it's now a matrix with all zeros. First thing we should do before we continue: this is ugly, right? It's writing each entry as zero point something. It's very important, when you play around with it or you want to debug it, that you can look at it in a nicer form. So now I have to, and I'm not really prepared, so help me if you know it: you can say how matrices should be printed, you can give it a formatter. Let me just try to guess like a language model how it could work, so now I'm ChatGPT. Okay, I think you should say these are floats, and for the float you probably have to specify a lambda function which says how to format it. Okay, let me try x, and now you should say how you format it. Maybe let's do 4.1: use 4 characters for each thing and 1 after the decimal point, maybe. And this should probably not be curly braces but something like this, x. Now let's see whether this is probably wrong. Yeah, it was wrong. Incomplete format. So what could be wrong here? Float formatter equal... Natalie, do you see what's wrong? Does anybody see what's wrong? Formatter... I just want to say how to print the characters. I thought... Let's see. Maybe sometimes it pays to read the error message. I usually don't read the error messages; when I get an error, I just go into the code and see what I did wrong. This is a field by itself: compilers are so bad at writing error messages. Incomplete format. Lambda x. What did I do wrong? Let me search: numpy set print options, let's see an example. Oh, we have ChatGPT, why am I even using Google anymore? "Can you give me an example usage of numpy set print options for float?" Certainly, of course. Let's see. Oh, I have a typo in print options. Let's see whether it can deal with the typo.
Oh yeah, that's a good one, I could have done that. Certainly, here's an example of how to use to change the way printed. NumPy, np, set precision, set precision 3. Ah, okay. Yeah, yeah, yeah, that's not what I wanted. Okay. Okay. But that's not what I wanted. I wanted it with a formatter equal. Oh, it's doing, it's even reading my mind. Float custom formatter return, oh it should be return probably. Okay, yeah. Maybe it's, is it return here? Is it return? I don't have to return it. Ok I want it with a ok. Here is what I tried and it didn't compile. Can you fix it? And let's just take it. I have no idea where this would be most amazing. And it didn't compile, can you fix it? And you can also help me in case chat, yeah, you know what it is? Oh, there's an F missing, you're completely right, yeah. Now that you say it, let's see what, oh now it's sweating. Now it's, it looks like the syntax error in the code. This is quite amazing, right? It's a 4.F, it recognized the problem and I mean this is eerie. That's, I don't know, is it even giving me the, oh, it's giving me the correct version. It can't, yeah. I mean, this is not, this is a real, yeah. So, apocalypse is near. Yeah there we go. So let me just say this, I mean nobody, like Google since its inception, 1999, it didn't have a competitor, right? Basically everybody who came afterwards didn't have a chance anymore. As long as it stuck with this keyword search thing, this will now change everything. I mean if you have this, why should you go to Google and do a keyword search? You rather paste this, can you correct this, what's wrong? I mean people will even teach people, you prefer this. And it's not that Google hasn't tried to do something like this. So this, I am sure, will change the world. And it's a major threat for Google, the first one in over 20 years. So very interesting times. Yes, please. How will they, in future advertising, will they give you any issues or something like that? 
Oh, the question is how do they make money, OpenAI? So OpenAI is this company which is famously called Open, but it's not open. This here is called ChatGPT. The precursors are these language models, GPT, and they were already monetizing them. What the previous models, which were not for conversation mode, were already extremely good at is writing text and summarizing. They're extremely good at summarizing, which is a big business. You have an application, 20 pages, summarize this. "Summarize this", colon, text, and it will give you a nice summary, which is most amazing. No, it doesn't belong to Google. It was founded as an open non-profit organization, but then they were overwhelmed by their own success and turned somewhat for-profit pretty quickly. So OpenAI, yeah, it's really like this. It's a very interesting company. Let me show you a little bit. They have this blog which you should absolutely read, because for years now they have also produced things like playing games; DALL-E is also by them, producing pictures automatically. And a while back they had this sentiment analysis; we have worked on that, I don't know how far. Where's the sentiment, unsupervised, interesting. Is this here? Yeah, this was a thing five years back where they would just do sentiment analysis. You have a text and then you color each letter depending on the sentiment. So this starts very positive, the typical passive-aggressive way: you start very positive and then you say what you really think. So worst disparity, blah blah. And what I wanted to show is that they have this blog where they talk about their stuff pretty early. So they've done something, they've achieved some breakthrough, and they communicate it, not necessarily only with papers; the papers also come with very accessible blog posts that everybody can understand, journalists but also tech people.
And as you saw on the previous list, they had a lot of these posts. And initially it was just interesting stuff, but pretty quickly they had stuff which actually worked and which was useful, like the language model, the GPT. And then they started on the side to monetize this, and now they're actually already now earning a lot of money, they don't have to worry. So GPT has a web API, so as a company you can just say I want to use this for summarizing whatever in our, and then you just pay for it like 100,000 summaries so and so much. So that's one way to monetize this. And the big advantage they have, let me also say this, and that's also interesting, that will also change the world somewhat because right now with Google and these companies came the advertising business, right? Because how do you monetize a search engine that's free? You do it with advertisements. Here it's different because to compute this language model, this took, cost like many million dollars, right? Which is a protection by itself. We can't do it. You can't train such a language model. Even if you understand how it works technically, because you need 1,000 GPUs, a lot of energy, and millions of dollars, and a university. So these companies can do it. And then once you have the model, then of course it's very precious. But then you just say, here's an API. you can ask the model, we have it and you pay per API call. And now this is actually the first time that they made such a demo public. It will not be public forever I think. At some point this will be behind pay. So use it while it's still there. I don't think this will be public forever. So possible models will be like you pay something, you log in, you pay per request, I don't know. Interesting times, this will change a lot and it fixed our error. Quite amazing but you also fixed it in parallel. 
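For reference, the print formatting that was fixed above can be sketched like this. The format string is an assumption about what was typed in the lecture; the point is only that the float format needs its trailing `f`.

```python
import numpy as np

# "%4.1f": at least 4 characters wide, 1 digit after the decimal point.
# Without the trailing "f", i.e. "%4.1" % x, Python raises
# "ValueError: incomplete format".
np.set_printoptions(formatter={"float": lambda x: "%4.1f" % x})

A = np.zeros((2, 3))
text = str(A)
print(text)
```

Using `"%4.0f"` instead drops the digits after the decimal point, which is enough when the matrix only contains whole numbers.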
But very soon of course we will not be able to compete. I am amazed that it fixed this. So now we have, okay, and actually we only have integers here, so let me just set this to point zero, and this should be enough, I think. Yeah, now I have a nice one. Now let me just run it on the one which I already created. Let me just show it here again. It was the example 2, which was exactly the one from the lecture. Let's just look at the term document matrix. Okay, oh no, that was the doc test, I'm sorry. Okay, it's all zeros, now let's fill it with life. And then we are almost done for this lecture. So it's all zeros, this is just to create the right dimensions. Well, what do we do? I think we should go through our terms: for term in self terms. So we want to go through the terms in order, and note how I did it. What's the order of the terms? Internet, web, surfing, beach. They occur for the first time in the exact order I had them on the slide. That was deliberate. So I can just iterate over the terms here. And actually, besides the term, I also want the index. So let me do it like this. Is it i comma term or term comma i if I do enumerate? I'm not sure right now. If you know it, just tell me. Enumerate in Python also gives you the index. Okay, now I have this, and now I go through the inverted lists. Up here I have them in the unit tests. They are just sequences of doc ids, 1 1 3 and so on. So for doc id in self inverted lists of term. So I need the term and the index here, and maybe let me not call it i but term id. And now I can just set, colon here, A at term id, doc id, I just set it to 1. So in my inverted list, so what am I doing here?
I meet this 1 here, which just means the term movie occurs in document one, so I just write a one there in my matrix. So you see, very little code; the exercise sheet will be like that, super little code. Let's just do it. Unfortunately it's wrong. What did I do wrong? Let's not ask ChatGPT, let's think for ourselves for a change. Index out of bounds. Look at this code and tell me why the index is out of bounds. What do you think? Yes? Did we start counting the doc id with one? Yeah, we started counting the doc id with one, and of course when you use it for indexing, it's all zero based. So here in the enumerate it's right, but here we should take doc id minus one. So for a change this was an error message which was useful. It also happens sometimes. Okay, that looks almost like our matrix, but only almost. What's the problem? This is just setting the entry to one; if you see the word for the second time, we also set it to one. We start with an all zero matrix, so with a very small change: whenever I see a word again, and it happens like in this example, 1 1 3, then I want a 2 for this entry in the term document matrix. So let's just do it. And there we have it. There we have our term document matrix. Now let's just do one more thing, and that will be it for the coding already. Let me also set our query vector. So how do I set a vector? Actually in NumPy there are arrays and matrices, and you shouldn't use matrix, which is super confusing, but you should use array. Matrix has its problems. There's something on the slides about this, but just follow the example for now. So it's this, it's a column vector now, I just write it like this, I initialize it with a Python array, and yeah, let's just do it. Print, and let me just copy these two lines. So this is my query vector q, which is a row vector, and let's just see it. There we go. Yeah, so it's my query vector, and now let's multiply, so now we get scores. And what do we do?
We just do the dot product of q with A, and we use a function dot. So it's not like this, you write it like this, in function notation. Actually if you use numpy matrix then you could use the dot notation, but matrix has all kinds of problems, and it's a bit confusing that NumPy has matrix and array. So this is just computing the dot product, so one linear algebra operation, and it's even doing it efficiently. Now it's small matrices, so it doesn't matter. But this is not doing like two loops in Python and doing the computation; this dot function is something which is written in C or even Fortran or I don't know what language they had 50 years ago. Linear algebra stuff is often written in very old code, but anyway, it's compiled machine code, and you have a nice interface. So this will also be fast when the matrices are big. And now I see the scores which we have seen. And as you can imagine, now I can do the exact same thing which we did in lectures 1 and 2. We can just do it with one matrix operation. And we will see that recurring in the next lectures: stuff we did earlier, and more complicated stuff, is just a few linear algebra operations which handle maybe large matrices. And also note one thing here: why didn't I do A dot q, does it work? No, it doesn't. And it tells me exactly what I expect, right? A is a 4 times 6 matrix, q is a 4 times 1, it doesn't write the 1 here, and it doesn't match, right? The second dimension here should match the first dimension here. I would have to transpose A and then it works. So probably this would work, transpose, I'm not sure whether it's called transpose or transposed. Yeah, now it works again. So this also works, but let me write it the other way around because it's shorter. Say it again. Dot T, in the old one which I deleted now? Should I delete it now? Ah, dot T, there's a shortcut for this, yes, that's true, there's a shortcut for this.
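Put together, the steps from this coding session look roughly as follows. This is a sketch with assumed variable names and the assumed three-line toy collection, not the code handed out with the lecture; the two Python loops are fine for a toy example but are not how a large matrix should be built.

```python
import numpy as np

# Assumed toy data: terms in first-seen order, doc ids 1-based with repetitions.
terms = ["a", "movie", "film"]
inverted_lists = {"a": [1, 2], "movie": [1, 1, 3], "film": [2]}
num_docs = 3

A = np.zeros((len(terms), num_docs))
for term_id, term in enumerate(terms):  # enumerate yields (index, item)
    for doc_id in inverted_lists[term]:
        # Doc ids start at 1, array indices at 0; += counts repeated occurrences.
        A[term_id, doc_id - 1] += 1

q = np.array([0, 1, 1])  # query "movie film", same term order as the rows of A

scores = q.dot(A)        # q^T A: one score per document
scores_t = A.T.dot(q)    # A^T q: the same scores, using the .T shortcut
print(A)
print(scores)
```

With a non-square matrix such as the 4 times 6 example, `A.dot(q)` fails with a shape mismatch, which is the error seen in the lecture; in this square 3 times 3 toy case it would run but compute something different, so the order of the operands still matters.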
Actually, now that you are mentioning it, this will not be the last lecture where we use NumPy; it's the first lecture in a series of lectures. If you do machine learning, already now or later, you will use NumPy a lot. It's just the linear algebra library of Python, and it's quite good and easy to use. The NumPy documentation is quite big because you can do so much with linear algebra. So we have over the years created a cheat sheet with the most typical stuff which you need. You may want to go to that document when you do the exercise sheets, and there's always a link to the reference. And here you see SciPy, because there are actually two libraries, and I will come to that in a second. But what I wanted to say: there is this cheat sheet on the wiki with the most basic stuff which you need for the exercises. So if you are wondering how one does this or that, and I will come back to the exercise sheet in a minute, you may want to look at the cheat sheet first. So that's what we did, numpy array and dot, not star. Yeah, we did all that. This is how you can install it; if you use our virtual machine, it's already there. So that's the last important thing. I told you in the beginning: in our toy example, we have few zeros. If you think about a real matrix, most of the entries will be zero. Even for the data sets which we have in this lecture, which are not huge, if you store the whole term document matrix like this with real zeros, it would be too much, right? Because it's like quadratic. Not really quadratic, but one also calls this quadratic: you have something times something else. That's just too big even when the somethings are not huge.
You have a million documents here, you have ten thousand terms here, that's already ten thousand times a million, it's ten billion, like ten giga. If you need eight bytes per entry, you already have eighty gigabytes, and a million documents is nothing, it's a toy collection. So you can't do this, but fortunately most entries are zero, and there's something called a sparse matrix representation, and you should absolutely use that for the exercise sheet. It's actually very easy. So again, conceptually very simple things in this lecture. Let's just give the rows indices here, and let me use this colour. These are rows 0, 1, 2, 3, and let me also give the columns indices, so this is 0, 1, 2, 3, 4, 5. You start with 0; whenever you have indices, you start with 0. And what do we have now? This just says the entry in the matrix at 0, 0 should be 1. So it's just row index, column index, entry. A very simple format. You just specify the entries which are non-zero by giving the index pair and the entry. So here 2, 3: entry 2, 3 is just 2. That's what it says. And they are highlighted in bold here, the ones I have in the list. Of course if you have a dense matrix, where you have very few or no zeros at all, that would be a terrible representation, right? Then you would rather store it as a dense matrix. But if you have few non-zero entries, then this is a good representation. Yes, please. Ah, interesting. The question is: let's assume we have a matrix where most of the entries are 52 and only a few are different; could you then also say the default entry is 52, I'm not storing it, and just the others I'm giving explicitly? Of course you can do it with linear algebra hacks. I'm not sure whether you can do it directly; what you could do is you could just write your matrix plus 52.
If you just do plus 52 then it would just add and then you have to subtract 52 from the, so there are certainly hacks to do this, but I don't know if you can, I would say yes because NumPy and SciPy they are so, they are functions for everything. So if there is anything where you wonder can you do it, you can probably do it. There's probably one, one liner which can do amazing things. And it pays off to learn a little bit about this. So this is a simple representation. Note that you have a choice here. Do you do this in row by row or column by column? So this is known as row major or column major. And here is a very interesting thing, very simple but very important to understand it. Let's do row major, which means I'm storing it row by row. Row by row means first all the non-zero entries of the first row, of the second row. Well it's easy to see if I store the entries of the first row, I basically get the inverted list for internet, right? It's just in our inverted list we didn't bother to store the documents which don't contain the word. We say oh there's also this document, it doesn't contain internet. We only store D1, D2, D4. Which means if you have row major order, row by row, then 0, 1, 3 is just my inverted list for the term 0. 0, 2, 3 is just my inverted list for the term 1. 0, 1, 2 is just my inverted list here. And here I have my term frequency. So, row major and this is so simple but interesting and very good to know, sparse row major representation is just like the inverted list concatenated, it's just the same thing and also has the same memory consumption, it's just good to know that it's like this. And of course in NumPy or SciPy you can specify, and I think the default is always row major. And of course, let me come to that in a second, this is unfortunate but for historical reason NumPy was doing the basic stuff and then more advanced stuff was in a library called SciPy, scientific Python. 
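The triple format and the row-major layout described above can be tried out with SciPy. This is a sketch under assumptions: the toy matrix is the one reconstructed from the lecture's description, and the variable names are made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Dense storage cost from the lecture: 10,000 terms x 1,000,000 docs x 8 bytes.
dense_bytes = 10_000 * 1_000_000 * 8  # 80 billion bytes, about 80 GB

# (row index, column index, entry) for every non-zero entry, row by row.
rows = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3]
cols = [0, 1, 3, 0, 2, 3, 0, 1, 2, 3, 4, 5, 3, 4, 5]
vals = [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1]
A = csr_matrix((vals, (rows, cols)), shape=(4, 6))

# CSR (compressed sparse row) is the row-major sparse layout:
# indices[indptr[i]:indptr[i + 1]] are the column indices of row i,
# which is exactly the inverted list of term i (with 0-based doc ids).
internet_docs = A.indices[A.indptr[0]:A.indptr[1]].tolist()
web_docs = A.indices[A.indptr[1]:A.indptr[2]].tolist()
surfing_tf = A.data[A.indptr[2]:A.indptr[3]].tolist()

# Scores for the query "web surfing"; with a sparse A it is safest to
# call dot on the sparse matrix itself, here A^T q.
q = np.array([0, 1, 1, 0])
scores = A.T.dot(q)
print(internet_docs, web_docs, surfing_tf, scores)
```

Only the 15 non-zero entries are stored, so the memory use is proportional to the total length of the inverted lists rather than to the number of terms times the number of documents.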
So there are two libraries, NumPy and SciPy. They overlap a bit, they are both about linear algebra. It's a bit unfortunate, it should be in one library, but historically it's two. So for sparse matrices you need SciPy and not NumPy. But you use both NumPy and SciPy, because SciPy builds on NumPy. And now it's important, it's on the slide, so you have heard it: when you construct a sparse matrix, don't, absolutely don't do it like this. I just did it like this for demo purposes. That's the worst way to construct a matrix. That's super slow. I have two loops in Python. Just imagine this is 10,000 terms, 100,000 documents. It will take forever. You should absolutely use some built-in stuff. And the built-in stuff here is: I give three arrays. If you go back to this picture, it's all the first entries here in one array, all the blue entries in one array, and all these in one array, and then I tell SciPy: make a matrix out of these three. CSR, it should be written here, is just compressed sparse row; it's the row major representation. There's also compressed sparse column, CSC is probably how it's called. So you have to create a sparse matrix, and then you can do the same thing, q dot A, but now this will be efficient. I mean, you can try it with a dense matrix and it will take forever. If you do it with a sparse matrix it will be fast, because as I said, this is actually machine code. And the last slide; if you have questions, I will be happy to answer them. Now that you have a matrix, you can do normalization stuff. Let's just go back to this one here. What did we do in lecture 2? We did IDF. What is IDF? For one term, multiply all entries by something: maybe this is a frequent term, then multiply it by something small; this is a rare term, then multiply it by something larger.
And we did it with a loop when we did this, like multiply all entries here by some factor to emphasize or de-emphasize the term. In matrix speak, this is like normalizing the row: divide a whole row by something. This is called normalization in linear algebra. And for this exercise sheet, let's look at normalization. What you can do: you can normalize each row or each column such that a certain norm is one. What is a norm? For example, you could just sum up the entries, ignoring signs; that would be the L1 norm. So just scale the entries such that, if they are all positive, they sum to one. L2 norm: such that the squares sum to one. And you can do that for the columns or for the rows. So you have four possibilities. That's also interesting because for implementing the norm in Python, you have to think a little. It's again these one or two liners, but yeah, it's fun to play around with it and find out how to do it. And never do these things with for loops. That takes forever. Find the operations which do it fast for you. And let's go to the exercise sheet, it should be there now, yeah, there it is. So the exercise is, I'm not reading it, just showing it here: take exercise sheet two, like we did it with the code from lecture 1 now in the lecture, take the code from there and now change to the linear algebra world. Do the exact same thing, and that will be very little code. Like for our lecture here, this is basically what we added, and then we can answer a query with just q dot A. So it will be like that in all the next lectures. Clever code, but very little code. And then just try out the four combinations, and I've given the names here because you should have a command line argument: when I pass rl2 it just means normalize the rows by L2 norm, when I say cl1, normalize the columns by L1 norm. Just try out all four and see, maybe one of them gives better results than BM25.
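The four normalization variants can be sketched for a small dense NumPy array as follows. The function name and the handling of all-zero rows or columns are assumptions for illustration, and for the exercise sheet a sparse matrix would be needed instead; the mode names follow the command line arguments mentioned above.

```python
import numpy as np


def normalize(A, mode):
    """Hypothetical helper: mode is one of "rl1", "rl2", "cl1", "cl2"."""
    axis = 1 if mode[0] == "r" else 0  # "r": per row, "c": per column
    if mode[1:] == "l1":
        # L1 norm: sum of the absolute values.
        lengths = np.abs(A).sum(axis=axis, keepdims=True)
    else:
        # L2 norm: square root of the sum of the squares.
        lengths = np.sqrt((A * A).sum(axis=axis, keepdims=True))
    lengths[lengths == 0] = 1.0  # leave all-zero rows or columns unchanged
    # Broadcasting divides every row (or column) by its own norm, no loop needed.
    return A / lengths


A = np.array([[1.0, 1.0, 0.0], [2.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
rl1 = normalize(A, "rl1")
cl2 = normalize(A, "cl2")
print(rl1)
print(cl2)
```

The `keepdims=True` argument keeps the norms as a column (or row) vector so that the division lines up with the rows (or columns) of the matrix.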
It's a very lightweight exercise. I mean, you still have a few days before Christmas and a few days afterwards. It's not much, it's a very lightweight intro into linear algebra work. That's it from my side. Are there any questions right now? Natalie, do we have a Q&A session on Friday? Yeah, maybe we should check whether anybody wants to come, but yeah, we can have a short one, why not? So I guess there will be one on Friday, if you have any questions. You should come, it's fun and it's short and you can participate via Zoom, so no effort. Any questions for now? Ah, there's a question, yes. Are there any advantages or disadvantages for row or column major? Yeah, that's a very good question, and I think we can answer it, certainly, like ChatGPT, of course. And actually you can see it here in this picture, I think. It depends on what you do with the matrix. So let's assume I'm doing these operations a lot, then column major would be good, right? Because just think of this as a thing with 100,000 entries. And then when I do this multiplication, I'm going through these 100,000 entries. And now the question is: are they contiguous in memory or are they spread out in memory? The difference will be a factor of 100. So this operation will be much faster if it's in column major than in row major. So memory layout plays a big role if you have huge data, and in this case column major would be better. But for the exercise sheet you can ignore it. But why not try it out and see if you see a performance difference? That would be a worthwhile thing to try. Any other questions? So, then I wish you happy and relaxing holidays and see you next year. Bye bye.

Welcome everybody to lecture 9, Information Retrieval in the Winter Semester, now 2023. There I said it, happy new year. May it be a normal year, a very daring wish after the last three years. Let's see.
So what are we going to do today? I will say something very briefly about your experiences with the last exercise sheet which was about the vector space model, about the exam. You should absolutely register because you only have less than a week left. And today we talk about some truly beautiful and amazing linear algebra. You will see what all this means. And the exercise sheet is another pencil and paper sheet, whether you really do it with pencil and paper is up to you, but it's mathematics and it's even a special kind of mathematics, it's the computing, calculating part of mathematics. So like 6 times 7 is 42 but with matrices. In the past years we did a coding exercise here but that had problems. I will maybe talk about this later but it's a really beautiful topic. Vector space model. So the last lecture before the Christmas break was pretty lightweight. We did some UTF-8 which was left over from the previous lecture and then it was a very gentle introduction into linear algebra. Most of you, not all of you found it relatively easy. The normalization was a bit tricky. Let me remind you it was just take the second exercise sheet and now don't do it with an, just do it with linear algebra, with matrix computations, where you can do everything very elegantly. I enjoyed this sheet a lot, it was a very good way of getting back into linear algebra, that was exactly the goal of the lecture, a little bit rusty, this is what I expected, but I hope I can brush it up in the upcoming lectures. I hope that too, and that will also be one purpose, because we will have linear algebra in a number of lectures. Now, most of the time spent debugging sparse matrix stuff, several people said it wasn't really hard, but of course you had to understand how to use this stuff. The efficient normalization, this is also what we expected, there was very little code, you had to write code for normalizing a matrix and you had to figure out how to do that. 
Maybe let me very briefly show that to you. The solution is linked here, since it's now past the deadline. Here's the master solution; this was basically all stuff that was given to you. Let's look at the routine here for L1 normalize, so that's now the solution. Most of it is just doctests here, we gave you a lot of doctests. And this is the code, and this is very, very typical for linear algebra in general and in particular if you do it in NumPy and SciPy. Here you compute the sum of each row or column, depending on what you specify here. I won't go into any details. And here you do the multiplication so that everything gets normalized. So it's like three lines of code, you could do it in one line. Look at it, it's very typical. Of course you have to know how it works, but you don't even need to know special commands here, and then you can do it in a few lines. And of course it's super important not to do this with for loops. I think I said it, but many of you also realized it while working on the sheet. These matrices are implemented internally in NumPy in an efficient way, as long as you use these operations. But when you go over a matrix with a Python for loop, then it's super duper inefficient. You shouldn't do that, you should do everything with linear algebra operations. And that was also one purpose of the exercise sheet, to make that experience. One of you wrote: "I found it difficult to find the right NumPy functions. If I ever had a friendship with SciPy and NumPy, it is over now." So remember, I also said it in the lecture, there is a cheat sheet which we prepared for you, which we gathered because NumPy and SciPy cover everything in linear algebra and are huge, so the documentation is huge and it may be hard to find what you need. So here we have what you typically need for the purposes of this lecture, very nicely explained. So this is on our wiki. I said it already in the last lecture, in case you missed it, now I said it again. So that should be helpful.
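As a small sketch of the L1 normalization just described (sum along one axis, then scale), here is a dense NumPy version. It is not the master solution: the sheet worked with SciPy sparse matrices, and the function name `l1_normalize` and its `axis` parameter are assumptions made for this illustration.

```python
import numpy as np


def l1_normalize(A, axis=0):
    """Scale A so that the absolute values along `axis` sum to 1.

    axis=0 normalizes each column, axis=1 normalizes each row.
    All-zero columns (or rows) are left unchanged.
    """
    A = np.asarray(A, dtype=float)
    # One vectorized sum instead of a Python for loop over the entries.
    sums = np.abs(A).sum(axis=axis, keepdims=True)
    sums[sums == 0] = 1.0  # avoid division by zero for empty columns/rows
    return A / sums


A = np.array([[1, 0, 2],
              [3, 0, 2]])
print(l1_normalize(A, axis=0))  # every non-zero column now sums to 1
print(l1_normalize(A, axis=1))  # every row now sums to 1
```

The division broadcasts the sums over the whole matrix, which is the "do everything with linear algebra operations" point from above.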
And in particular everything you need for the lecture should be there. The exam is on February 28th, a Tuesday, the usual day but not the usual time. It's at 11. We don't know yet whether it will be two hours or two and a half hours, we will tell you that in time. It's in this room and the one below, and it will be announced in time in which room you sit. As usual, in the last lecture I will spend half an hour or a little more talking about the exam, doing some exam tasks together with you, telling you what typical exam questions are and so on. And very, very important, because this is a hard deadline and our Prüfungsamt does not accept late registrations: it's January 15th, so that's I think Sunday, Sunday midnight. So if you want to write the exam, you should register. And I've seen that many people have already registered. So don't forget this deadline please. So that's it for the organizational part, so let's start with the content, which is really, really nice and exciting. It's about latent semantic indexing. Funny name, we will see what it means. So this is our matrix which we already used in the last lecture. Let's try to understand this a little bit better, because there is a reason I chose this matrix. We didn't really need that in the last lecture, now we need it. There are these two words here, internet and web, which for the purpose of this example mean the same thing. You have that pretty frequently in language, two words which in a certain context mean more or less the same thing. You have a word surfing, which can mean different things in different contexts, surfing the web, surfing at the beach. And now I have six documents. Three of them, D1, D2, D3, are about web surfing, just using different words. Here it uses internet and web, here only internet, here only web. D5 and D6 are about surfing at the beach and D4 is kind of about both. So that's the purpose of this example, of this toy collection, easy to understand. And now let's see what the problem is with the approach, how we did it so far.
So this is just introduction, very easy to understand, then we come to the mathematics. So here we have a query and it's a query about web surfing. So if we would now do an evaluation like we did it before, then we would say the first document is certainly relevant, right? It has the words internet, web, surfing. The second document is also relevant. It doesn't use web, it uses internet surfing, but it's relevant. This one is relevant. The fourth one is maybe a little bit less relevant. It's about web surfing, also about beach surfing. And these here are not relevant. They are about some other surfing. And now let's look at the scores, how we computed them. Let's maybe take wonderful orange here. So if we just take dot product similarity, you tell me, between this query vector and the vector of each of the six documents, what do we get for the first one? What's the dot product? Just this vector times this vector. It's a number. Two, yeah, that's two. Here? One. Here? Two. And here, three, one, one. So if you would rank according to this, we see a problem here, and the problem is: these four are relevant and they get pretty good scores, except this one, right? So this one here is, it's a bit hard to write at the bottom here, too low. This is too low and it's easy to understand. I mean, this is a relevant document but the score doesn't reflect it. It gets the same score as these not relevant documents, and why? Because it doesn't contain the word web, it contains the word internet, right? So it contains a different word meaning the same thing. And this is a very deep problem with search. So how do we solve that? Here's a very simple conceptual solution. You just fill in the gaps.
You say yeah, document two contains just internet, it should also contain web because it means the same thing, and vice versa for D3, I just fill in the missing word. So now let's do the same thing. Let's do dot product similarity now. So now what do we get? Here we get two. Here we also get two, the first three documents are now exactly the same, right? Here we get three, here we get one, and here we get one. And let me maybe also write again: this one is relevant, this one is relevant, this one is relevant, and this is not relevant and this is not relevant. So now the scores reflect the relevance. So now by adding the synonyms to the documents we somehow made it work. And now the goal of what we are doing today is to do this automagically. So fully automatically, not by some dictionary or someone telling us these words are the same thing, but just from the matrix. So in learning you would say unsupervised. And how do we do that? This miracle we will now make happen. So here's a simple but powerful observation. If you look at this matrix, where we now changed the entries to what we would like to have, this has rank 2, column rank 2. And what does column rank mean? It means, now this is linear algebra, there are two vectors, and I wrote them here on the right side, so that you can express each document on the left side as a linear combination of the two things on the right side. Let me just do that. So D1 is actually just B1, right? It's just B1. D2 and D3 are also just B1. What is D4 expressed as a linear combination of B1 and B2? Yeah, it's just the sum of the two, and that's a linear combination. So in this case, the linear combinations are really simple. It could be 0.2 times one plus 0.7 times the other, but it's just a simple example and here it's just... Yeah.
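The score comparison from the two slides can be reproduced in a few lines. This is a sketch only: the slides are not part of the transcript, so the matrix below is reconstructed from the spoken description (term order internet, web, surfing, beach; D4 containing surfing twice) and should be read as an assumption.

```python
import numpy as np

# Term-document matrix reconstructed from the spoken description (assumption).
# Rows: internet, web, surfing, beach. Columns: D1 ... D6.
A = np.array([[1, 1, 0, 1, 0, 0],
              [1, 0, 1, 1, 0, 0],
              [1, 1, 1, 2, 1, 1],
              [0, 0, 0, 1, 1, 1]])

# "Fill in the gaps": add the missing synonym to D2 and D3.
A_filled = A.copy()
A_filled[1, 1] = 1  # D2 also gets "web"
A_filled[0, 2] = 1  # D3 also gets "internet"

q = np.array([0, 1, 1, 0])  # query "web surfing"

scores_before = q @ A        # dot product of the query with every document
scores_after = q @ A_filled

print(scores_before)  # D2 scores as low as the beach documents D5, D6
print(scores_after)   # D1, D2, D3 now get the same score
```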
So now we have proven that this matrix has column rank at most two and it has column rank two because, for example, the first one and the sixth one are linearly independent. One is not a multiple of the others. So what we have here is a low rank matrix. And if you look at the matrix on the previous slide, if you would compute the rank, it's a full rank matrix. And full rank means the rank is 4, the rank cannot be 6. You have a question or comment? Oh yes, thank you very much. This matrix here has rank 4 and we will compute it and in a second we will do some coding and we can just compute the rank with numpy. So yes. And one important linear algebra fact, we will do a lot of simple linear algebra and some things I will just mention and not prove. So this year the original matrix had column rank 4, this matrix can never have column rank 6, why is that? Because the column rank and the row rank is the same, that's not trivial to prove and it looks a bit magic like so many things in linear algebra. I mean what do the rows, if you just look at the row vectors, that's vectors in six dimensional space, what do they have to do with the column vector? It looks like nothing. But it's a fact that you can have at most, so in this case you have four linear independent column vectors and this you can prove then you also have four linearly independent row vectors. So the column rank and the row rank is the same, which is why one can speak of the rank of a matrix. If they would be different you would have to say which rank, column or row. But you can just say column or row. Is it too loud? Yeah, okay, thank you. And we can make this a little more precise, what I wrote here. I can have two vectors and each vector is a linear combination of the two. This is exactly what matrix multiplication does. And let's understand for a moment how matrix multiplication works. 
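The rank claims above can be checked with `numpy.linalg.matrix_rank`. A minimal sketch, again assuming the toy matrix reconstructed from the spoken description of the slides:

```python
import numpy as np

# Rows: internet, web, surfing, beach. Columns: D1 ... D6 (reconstructed, assumption).
A = np.array([[1, 1, 0, 1, 0, 0],
              [1, 0, 1, 1, 0, 0],
              [1, 1, 1, 2, 1, 1],
              [0, 0, 0, 1, 1, 1]])
A_filled = A.copy()
A_filled[1, 1] = 1
A_filled[0, 2] = 1

rank_original = np.linalg.matrix_rank(A)
rank_filled = np.linalg.matrix_rank(A_filled)
# Column rank equals row rank, so transposing does not change the rank.
rank_transposed = np.linalg.matrix_rank(A.T)

print(rank_original, rank_filled, rank_transposed)
```

With four rows, the rank can be at most 4, which is why the original matrix counts as full rank even though it has six columns.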
I think you all know how matrix multiplication works, but for the purpose of this lecture it's also important to understand how it works in the context of something which is real, like documents. I mean, that's always the hard part, understanding mathematics in terms of the real world. So I'm saying that this matrix is a matrix product. Let's first maybe verify the dimensions, maybe let's do that in orange too. Whenever you do a matrix multiplication you should always, always check that the dimensions work out. This is a 4 times 2 matrix, these are just headings. This is a 2 times 6 matrix. And this is a 4 times 6 matrix, not the original one, the changed one. So this works. 4 times 2 and 2 times 6, this has to match, otherwise you can't multiply them, and it gives a 4 times 6 matrix. And how does matrix multiplication work? Let's just produce these four entries and see how it works. And I do that slowly, just in case you forgot how it works. Let's maybe do this in green first. So what I could do here, I could take this (1, 0) and then multiply it with (1, 1). It's all dot products. You take a row of this matrix and take the dot product with a column of this matrix. So it's (1, 0) times (1, 1), that's 1 times 1 plus 0 times 1, and this gives this 1. So this is how this 1 originates. So let me do it for all four so that it's clear. This is again (1, 0) times (1, 1) and this gives this 1. And then let's see whether we find so many different colors. This is (1, 1) times (1, 1), it gives this 2, a dot product. And the last one, maybe let's take it in orange here: this (0, 1) times (1, 1) gives this 1. And also note the following properties. Everybody, I think, can do matrix multiplication, but some things are then not so obvious. The effect of this (1, 1), by what I just did, is really to take this plus this, which is what I told you earlier. If you look at it slowly, this is really B1 plus B2.
One times B1 plus one times B2 and this is exactly this one and this one. This is just the effect of the matrix multiplication. This column is just this times B1 plus this times B2. And this is what I wrote here. And kind of if you want to interpret this then you could say, here I have a collection of many documents, here it's just six, I just have two different concepts and if you now go back to the terms, then this B1 is kind of the web surfing concept and this is the beach surfing concept. And now each document is just a linear combination of the two, and which linear combination this matrix tells you. D1, D2, D3 are just B1, it's just web surfing, D4 is both and D5 is just the second one. So that's just understanding matrix multiplication in terms of documents and terms. Okay, now what's the goal of latent semantic indexing, the goal is to do, yes to somehow get this matrix automatically and what I've told you so far already gives you a hint. The original matrix has rank 4 and this matrix has rank 2 and it's good that it has rank 2 because then we can express everything as a combination of few concepts. And the idea is that the high rank of the original matrix was actually not so much information but more noise. For the purposes of search we didn't really care whether web or internet was used. We kind of want to get rid of that noise. It's good to have lower dimension. That's a trick by the way which is also very fundamental to many learning methods. And this is what LSI does. So you have a term document matrix, any matrix now, and now you have a lower rank, for example two. So our rank here originally was four and now we have a two. And now you want to find a matrix of the smaller rank which is as similar to the original matrix as possible. This is what we did here. We just had a rank 4 matrix, we changed these two entries to ones and then we had a rank 2 matrix. 
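To make the "documents as linear combinations of concepts" reading concrete, here is a sketch of the factorization from the slide. The names `B` (concept vectors as columns) and `C` (one coefficient column per document) are chosen for this illustration, and the entries are reconstructed from the spoken description, so treat them as assumptions.

```python
import numpy as np

# Columns of B are the two concepts (rows: internet, web, surfing, beach).
B = np.array([[1, 0],   # internet
              [1, 0],   # web
              [1, 1],   # surfing
              [0, 1]])  # beach

# Column j of C says how document j is combined from the two concepts.
C = np.array([[1, 1, 1, 1, 0, 0],   # weight of B1 (web surfing)
              [0, 0, 0, 1, 1, 1]])  # weight of B2 (beach surfing)

product = B @ C  # (4 x 2) times (2 x 6) gives a 4 x 6 matrix

# Column D4 is exactly 1 * B1 + 1 * B2.
d4 = 1 * B[:, 0] + 1 * B[:, 1]

print(product)
print(d4)
```

The product is the "filled in" matrix: D1, D2, D3 equal B1, D5 and D6 equal B2, and D4 is their sum.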
So make as little change as possible to the original matrix so that we have a matrix of smaller and in reality much smaller rank. Maybe in reality you have a 10,000 times 1 million, you have rank 10,000, and now you want to reduce it to rank 100. You want to change as little as possible. Now we have to make it mathematically precise, what it means to make it as similar as possible, and we do that by subtracting the two matrices. If we would subtract these two matrices from each other, all that would remain is the ones here and the rest would be zero. And now we have a matrix with small entries, we take a norm. The norm we take here is the Frobenius norm, which is just, so you take the difference and then you take the squares of all the differences, you sum them up and you take the square root and you wonder why that norm, why not just the absolute of all difference summed up and the answer is very simple, it's purely that way the mathematics works out. That's very often the reason why in computer science you use a particular approach, because if you use that then you can do mathematics, otherwise you can't. And we will see that in the upcoming lectures too. So you just take this norm, but it's not really important for now. What is important is you want to find a matrix of small rank that's close to the original matrix. So how do we get such a matrix? And is there any question so far? Or was everything clear so far? This was the motivation, Yes please. Yeah, this is exactly right, so there is, when you see something like this, there is always a relation to Gaussian noise, so in a sense this method is optimal if you have a process like you have a real matrix, like what you actually have in the beginning when the world was created and then some noise was added by a Gaussian process. 
And for the Gaussian process, you would now have to define what it means for a matrix, and this would now add noise to everything, and now you want to remove that noise again. And then this would also be the optimal method. So yeah, that's correct. We don't need that property, but yes. So how do we compute this approximation? And now comes a very, very beautiful mathematical construct, and I will tell you a lot more about it in the second and third part: the singular value decomposition. Oh, there's a question in the chat. Yeah, does it make a lot of documents equal? I mean, in the example I just had two concepts. If you do it in practice you have a hundred concepts, and a linear combination is also 0.1 times the one concept plus 0.2 times another, so it will not happen that they are exactly the same. This just happens in my toy example. Another question: on the slide it said factorization. Is there always a factorization? Yes, this is exactly what the singular value decomposition is about. The singular value decomposition is a very particular factorization. And we will talk about it now. It's a magical thing, and it exists for any matrix. And that's quite a popular exam question: does it exist for any matrix? It's for any matrix. It can also have negative entries or whatever, and it also does not have to be a square matrix or anything, really any matrix. For any matrix A of a particular rank r, you have three matrices so that you can write A as the product of these three matrices. Right now we have no idea why three and why that is good. The three matrices have the following properties, and let's just write down the dimensions as I introduce you personally to these matrices. So this here is my original matrix, it's M times N. U is an M times R matrix, where R is the rank. The next one is S, that's a diagonal matrix, so it has entries only on the diagonal, the rest is zero. It's R times R. And then we have another matrix V, which is R times N.
And now that's always, that's what you should always do when you multiply matrices, check the dimensions that they match. M times R, R times R, so this gives an M times R. You can multiply it with R times N, gives M times N. So this works out. And now each of these three matrices has special properties. S is as I said diagonal and U and V have this property that if you take the transpose times the matrix or here the other way around you get the identity matrix. And I will show you on the next slide what that means because it's very important and then I will show you an example. And moreover this decomposition is unique. Now matrix multiplications can never be completely unique and why is that? That's easy to understand I think in our example. Look I could just switch B1 and B2 and switch these two rows and I will get exactly the same result. If you can just switch B1 and B2 and the two rows here, I get the same result. And if you have bigger matrices, you can just permute the columns here and permute the rows in the same way, you get the same result. So it can never be unique, can only be unique up to permutation. But here there is a way to make it unique, namely these diagonal entries. You sort them in a particular way. We will see it in a second. They are positive. You can make it so that they are positive and then you sort them. So largest first. I should maybe write that. Largest first, writing down here is always a bit hard. And I think before I go on I will show you an example. Largest first. Let's first see an example of this, just so that you have seen an example and then we go back and forth from the slide. What I have already done for you, I have just copied the code from the last lecture. One to one, the only thing I did, this was called inverted index.py, I called it LSI.py. Apart from that it's exactly the code from our last lecture. And let me just run it. What it did was, okay I have to call it, I have this, let's also recall our example 2. 
Let's make this a little larger. This was exactly the collection which we used, and now it will just show me the term document matrix. This is exactly the matrix from the slides, the original matrix. And then last time we just did some dot products, right, just for this query, which is also the query in this lecture. And now let's not do any of that, let's just compute the SVD and show it. That's actually super simple: it's U, S, V equals numpy, then there's a linalg submodule, svd, of A, it couldn't be much simpler. And now let's just print it: print U, print S and print V. You don't need it for the exercise sheet, but you might find it helpful to play around with it a little bit to understand this. This is U, S, V from the SVD of A. And let's also do an empty line here so that it looks nicer. And because these matrices now will not be purely integers, let's print one digit after the dot. And let's for now add another new line here, so that they are not glued together. So now I have three matrices, I have this matrix, I have this matrix. And here you see that NumPy actually gives me just the diagonal entries. This is just the diagonal entries, and I can actually show you the whole matrix. Let's just do that. I think it's diag of S. And we will make the check that this is actually the singular value decomposition, we will multiply them again. So let me do that, and then let's check that U times S times V is indeed A. I just do U, matrix product with this, matrix product with this. I did something wrong. Yes, you might notice, and I have a slide later about this, NumPy gives me something slightly different than what I had on the slide. A is a 4 times 6 matrix. On the slide it said U should be a 4 times 4 matrix, number of rows times rank, and S is 4 times 4, rank times rank.
And V should now be four times six, but it's six times six. Actually these two rows here, I don't want them. And I will tell you later why they are there. And I can just cut off these last two rows. I just take the first four rows, where four is the rank. In Python that's super nice, you can just do it like this: this just says the first four rows, zero, one, two, three, not including index four, and this just means all columns. Oh yeah, you absolutely have to say that. Yes, and let's maybe add a new line at the end here. And right now it's completely okay to wonder what we are doing here and why these matrices are interesting. Let me also write on top that we compute U times S times V and check that it's A again. So I have my original matrix A, printed with a lot of .0 but it's actually all integers. Now I have three matrices here, U, S, V. S is diagonal. If I multiply them together, I get something that looks like the original matrix. Here it says -0.0, and why the minus? Because if I would print it with more digits it would be something like -0.00002, because the method to compute this is a numerical method which only computes an approximation. So there will be small rounding errors. So now, let's maybe check that U transpose times U is the identity matrix. This was one claim on the slide, and this is a pretty magical property. Let's compute U transpose times U. So you multiply the transpose of U with U itself. And let's see what that means. See, I multiply the transpose of U with U, I mean it's a square matrix here so I can do that, and now I get ones on the diagonal and the rest is 0 or almost 0. Let's do the same thing with V, except that I now do V times the transpose of V. I also get an identity matrix. So these matrices have a very particular property. And let's look at that property. So what does this mean? It means the following, and that's important to understand. So how do I do this?
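The live-coding steps above can be condensed into the following sketch. It uses `numpy.linalg.svd`, which returns the singular values as a 1-D array and the third factor already with the singular vectors as rows; the toy matrix is the reconstruction from the spoken description, so it is an assumption, as are the variable names.

```python
import numpy as np

# Rows: internet, web, surfing, beach. Columns: D1 ... D6 (reconstructed, assumption).
A = np.array([[1, 1, 0, 1, 0, 0],
              [1, 0, 1, 1, 0, 0],
              [1, 1, 1, 2, 1, 1],
              [0, 0, 0, 1, 1, 1]], dtype=float)

U, s, V_full = np.linalg.svd(A)  # shapes: (4, 4), (4,), (6, 6)

r = np.linalg.matrix_rank(A)     # 4 for this matrix
V = V_full[:r, :]                # keep the first r rows, all columns
S = np.diag(s)                   # turn the 1-D array into a diagonal matrix

A_again = U @ S @ V              # (4 x 4) (4 x 4) (4 x 6)

np.set_printoptions(precision=1, suppress=True)
print(s)        # singular values, largest first
print(A_again)  # equals A up to rounding errors
```

Passing `full_matrices=False` to `numpy.linalg.svd` is an alternative that returns the third factor directly as a 4 times 6 matrix for this input.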
So I have my matrix U here, and let's look at the columns. It's an m times r matrix, right, so m terms and r columns, which means, let me write it symbolically like this, it has these r columns. And this here is now U transpose, so the columns now become rows. And then I have U again. I multiply U transpose with U, and here I number the columns 1, 2, up to r, and the result is this matrix, it's the identity matrix, so everything off the diagonal here is zero. And what does that mean? As usual in matrix multiplication, if I take the second row here and the second column here, then I get this entry here. So this here is just the product of the second column with itself. If you compute the product of something with itself, it's just the squared norm, right? Let's also do that. If I have a vector x1 to xn and I multiply it with itself, what do I get? I get a scalar, and that scalar is just the sum of the squares. And this is 1 according to this equation, which means the L2 norm of this vector is 1. So this diagonal means that all the columns multiplied with themselves give one, they are all normalized. And the other entries mean: if I multiply two with three or two with four or one with three, and these are the same vectors, just here they are rows and here columns, then it's zero. And zero means they are orthogonal. So what this means, U transpose times U equals the identity matrix, is what's written here: the columns of U are all L2 normalized and pairwise orthogonal to each other. So U is a very special matrix. And the same for V times V transpose. Now it means the rows of V are normalized, so each row has norm 1, and they are pairwise orthogonal to each other. So if you look at the matrix U here, if you just take the four column vectors and take any product, this times this or the first times the fourth, it's zero. They are all orthogonal.
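The orthonormality statement can be checked entry by entry. A sketch under the same assumptions as before (toy matrix reconstructed from the spoken description, third factor cut to the first four rows as in the lecture):

```python
import numpy as np

# Rows: internet, web, surfing, beach. Columns: D1 ... D6 (reconstructed, assumption).
A = np.array([[1, 1, 0, 1, 0, 0],
              [1, 0, 1, 1, 0, 0],
              [1, 1, 1, 2, 1, 1],
              [0, 0, 0, 1, 1, 1]], dtype=float)

U, s, V_full = np.linalg.svd(A)
V = V_full[:4, :]

# Entry (i, j) of U^T U is the dot product of column i and column j of U.
gram_U = U.T @ U
# Entry (i, j) of V V^T is the dot product of row i and row j of V.
gram_V = V @ V.T

col_norms_U = np.linalg.norm(U, axis=0)  # L2 norm of every column of U
first_times_fourth = U[:, 0] @ U[:, 3]   # dot product of two different columns

print(np.round(gram_U, 1))
print(np.round(gram_V, 1))
print(col_norms_U, first_times_fourth)
```

Diagonal entries of 1 say "normalized", off-diagonal entries of 0 say "pairwise orthogonal".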
And the norm, that's why you have these not so nice numbers, will be one. If you take the sum of the squares, it's one. And what's the row? Here for you it's the columns and here it will be the rows. So that's the property of this singular. It's still not clear why this is useful, but now it's at least clear that this is a very special decomposition, right? Do you have a question or no question? This is a very special decomposition. It's also very important to understand, and you have to do this for yourself, ut times u is not the same as u times u transpose. You can also compute u times u transpose, but that's a completely different matrix with completely different properties. And if you think about it, I leave that as an exercise, this cannot be the identity in general and also this. But this can, yeah, this maybe let me write the, at least write this here, this is an M times R matrix and this is an R times M matrix, which means this is now an M times M matrix. And I mean if M is greater than R, it just can't be the identity, then it would have rank M, which is not possible. And here it's an N times R and an R times N. So if you do it the other way around, it's something completely different. Okay, but that was just for some, yeah? Yeah. Can you say that again? The rank is always smaller than the minimum of m and n. I mean if I have an m times n matrix the rank can be at most the number of columns or at most the number of rows because of what I said earlier. It could happen, yes it can happen, that's completely correct. By some magic it could happen if it would be a square matrix that this is also the identity. But then of course you would have a very special matrix. So that's a very, exactly right. So my statement was in general this cannot be the identity but it could be in very special cases. You are completely right and here also. 
Yes, but in general this is not the identity, and when m is greater than r it can't be the identity, and when n is greater than r it can't be the identity. So these are all these small puzzles or things to think about. And now the magical thing is: if you have the singular value decomposition, which here we just take for granted and which on the exercise sheet you will compute by hand, then we can do the following. So now we say, okay, we have our A, it has rank 4, and we now want a rank 2 matrix which is very close, where you wonder how you do that. Let me first show it and then do it in Python. Just take the first k columns of U, so the first two in this case, just take the first two diagonal values of S, they are sorted by size, so the two largest, and take the first two rows of V. And now, as I told you, the columns of U are pairwise orthogonal and normalized, and if I take a subset, they are still pairwise orthogonal and normalized. I'm just taking something away. And the S is still diagonal. And now I take these matrices, and let me just write the dimensions here: this is m times k, this is k times k and this is k times n. So the product has again the same dimensions as my original matrix. And now this matrix has exactly this magical property: it's a matrix of lower rank which is as close in Frobenius norm to the original matrix as it possibly can be. And let's just do that and see what comes out. So let me just remove these checks here. This was just for checking some stuff, and now let's compute the rank 2 approximation. So what I do now is I take the first k columns from U. Let's make k a variable here. U_k is U, and I want all the rows, let me just write it like this so that everything will be aligned, and I take the first k columns. For S, it's a diagonal matrix.
So I take the upper k times k portion, I will do it like this. And for V, I just take the first k rows. So this should work now. And now I just take the product of these, so I take U_k times S_k times V_k, and I should use dot to do the matrix multiplication. Let's first check that it works. Yes, it works. And now let's print it, and maybe print a new line afterwards, and I don't print anything else. Let's see. So now we have just done what I said on the slides, and what happens? Look at the original matrix, and what did I get here? So it's not quite how I did it manually. When I did it manually I actually put a 1 here and these all became equal, but it did something similar, right? So this entry now becomes 0.9, and this also answers the question from the chat. So it's not like in my toy example, where everything became perfectly just one of two concepts. Now this first one, 1, 1, 1, 0, became 0.9, 0.9, 1.1, 0.1. So it didn't change it much. In this one, 1, 0, 1, 0, the one became a little less, but the zero now became a 0.6. So it kind of did figure out that web and internet are the same thing, it made these entries the same. The other one didn't change much, it gave me a little bit of beach here for whatever reason. Here it did the same thing, it just added the 0.6 here, and this became a little less. It didn't change this one much and it didn't change these at all. So by what I told you so far, at least you have some intuition of why it did that, right? It just says, okay, let me try to do something of rank two that is as close as possible to the original matrix. And apparently what I did manually was not the best in Frobenius norm, otherwise it would have found it. So in Frobenius norm, this is actually closer to the original matrix than if I just replace these entries by one. Yes, please. I have a question. Now, we took the first entries of the matrix S, 3.8 and 1.5, right?
And it's sorted. And do I understand it right? It's important that it's sorted and that you take the highest values, and, let's say, these are the most significant components, and the others we ignore because they are not so important? Yeah, now you are asking already about the mathematics, and we will come to the why in the next part, but it's exactly like you said. This is doing something like a component analysis of the matrix, and this is the most important component, with a weight of 3.8. The next important one has 1.5. So if you've heard of principal component analysis, this all has something to do with eigenvectors. And this one is less important and this one is even less important. So that's why it would not work if I would take any two, I have to take the two with the largest values, absolutely. And it has something to do with important directions or important components of the original matrix. But let me just say that one thing again. You may wonder why it didn't do this nice fill-in of the two ones here. Because Frobenius norm wise that would be worse. I mean, you can actually compute the Frobenius norm for that: if you fill in the two ones and take the difference, then you have a matrix with two ones here and the rest is zero. If you sum up the squares you have two, you take the square root, square root of two. So this here will have a Frobenius distance of less than square root of two. And the SVD is just doing the mathematically optimal thing. But it's pretty good. And I will show you at the end of today's lecture or in the next lecture that when you do that in practice, it actually does something nice. I think it's time for a break now. You can think about more questions and then... Do you have a question now? Yes please. I'm unsure if you said it, but I didn't hear it. You just said that for some reason the matrix had extra rows or columns that were cut away. And you said you would get back to that later.
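The rank 2 truncation and the Frobenius comparison with the manual fill-in can be sketched as follows. The toy matrix is reconstructed from the spoken description of the slides (an assumption), and the names `U_k`, `S_k`, `V_k` follow the lecture's wording rather than any fixed API.

```python
import numpy as np

# Rows: internet, web, surfing, beach. Columns: D1 ... D6 (reconstructed, assumption).
A = np.array([[1, 1, 0, 1, 0, 0],
              [1, 0, 1, 1, 0, 0],
              [1, 1, 1, 2, 1, 1],
              [0, 0, 0, 1, 1, 1]], dtype=float)

# Manual rank 2 matrix: fill in the missing synonym in D2 and D3.
A_filled = A.copy()
A_filled[1, 1] = 1
A_filled[0, 2] = 1

U, s, V_full = np.linalg.svd(A)

k = 2
U_k = U[:, :k]        # all rows, first k columns
S_k = np.diag(s[:k])  # the k largest singular values on the diagonal
V_k = V_full[:k, :]   # first k rows, all columns
A_k = U_k @ S_k @ V_k

# Frobenius norm: square all entries of the difference, sum, take the root.
err_svd = np.linalg.norm(A - A_k, ord="fro")
err_manual = np.linalg.norm(A - A_filled, ord="fro")  # two ones differ: sqrt(2)

print(np.round(A_k, 1))
print(err_svd, err_manual)
```

Here `err_svd` equals the root of the sum of the squared singular values that were dropped, which is why keeping the largest ones gives the smallest error.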
Are you ready to do that? Absolutely, I will get back to that after the break. Yes. Okay, let's have a five minute break and maybe get in some air and then we continue. So we're ready to continue, and since the exercise sheet will be mathematical, I will now give you some mathematics background and also a bit of calculation background. So how does one compute the SVD? Do we have mathematics students in the room? Ah, great. So you probably know about the SVD, but have you ever computed one by hand? Probably not, right? So you will do that for the exercise sheet. So, computing the SVD. Mathematically you can compute it from the eigenvalue decomposition, or just eigendecomposition, and I will show you how to compute that. Eigenvalues and eigenvectors are probably more familiar than singular vectors, which you probably haven't heard of before unless you study mathematics. In practice, as we just saw in the code, you use numerical methods, which means iterative methods which stop after some time and compute an approximation. Just very quickly, because this is not a numerics lecture: if you only want the largest eigenvalues and the eigenvectors that pertain to those, you use a method called Lanczos. And it has the nice property that if I only want the top k eigenvalues, eigenvectors, singular vectors, it has complexity k times the number of non-zero values in my original matrix. So that's kind of the best one can hope for. And of course in a typical term-document matrix most entries are zero. So if you use what I used just now for the toy example, and I have a slide about this, it won't work: that is for dense matrices, and on a large matrix it takes a very long time. And I will also mention SciPy. So let's first compute it mathematically. Let's go back one step: eigenvalue decomposition or eigendecomposition. It's not actually called EVD in the literature, but I just call it like this so that it's similar to SVD.
So let's just look at real symmetric matrices, and I will give you an example in a second, and maybe that second is now. Let me just call it B, why not. Let's just take any matrix that comes to my mind. Right now let's take 2, 5, 5, 2, so first row 2, 5 and second row 5, 2. So this matrix has real values and it's symmetric, which means you can mirror it across the diagonal and it will be the same, which means if you take the transpose it's still the same. And then this has n, so in this case 2, since this is a 2 by 2 matrix, let me always write this down here, so it has two eigenvectors which are orthogonal to each other. And how do you compute them? Well, maybe, who sees an eigenvector here? If you know what an eigenvector is, you see it immediately. If you don't know it, you will see in a second what one is. So one eigenvector which you see immediately is u1 equals (1, 1). And why is that an eigenvector? I mean, the definition is written above. If you multiply the matrix with the vector, B times u1, so for once let's do it, the matrix with rows 2, 5 and 5, 2 times (1, 1), what's the result? (7, 7), exactly. So of course you always get some vector, but this vector is special because it's just a multiple of the original vector, and that's very special. Most vectors will not have that property. If you view it geometrically, B is a linear transformation of my space. It will do something to my space, it will stretch it and change it. And here this vector is only stretched, but the direction remains. And this factor here, the 7, that's called an eigenvalue. This is the eigenvalue of that eigenvector. So eigenvalue 7.
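The eigenvector definition from this example can be checked numerically. The following is a minimal sketch with NumPy; the variable names (`B`, `u1`, `w`) are chosen here for illustration, and `w` is just an arbitrary comparison vector that is not from the lecture.

```python
import numpy as np

# The 2 x 2 symmetric example matrix from the lecture.
B = np.array([[2.0, 5.0],
              [5.0, 2.0]])

# Candidate eigenvector u1 = (1, 1).
u1 = np.array([1.0, 1.0])
Bu1 = B @ u1
print(Bu1)  # [7. 7.], a multiple of u1

# The stretch factor is the eigenvalue.
lam = Bu1[0] / u1[0]
print(lam, np.allclose(Bu1, lam * u1))  # 7.0 True

# An arbitrary other vector (assumption, for contrast) is not just stretched.
w = np.array([1.0, 0.0])
print(B @ w)  # [2. 5.], not a multiple of (1, 0)
```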
And now, if you have all eigenvectors but the last one, you know that they are orthogonal to each other. Now there is only one vector left here. What's a simple vector that's orthogonal to (1, 1)? Take u2 equals (minus 1, 1). And since there is only one direction here in 2D which can be orthogonal to this one, this is also an eigenvector. It just will be, I know it, which is not obvious. Why should the orthogonal vector to that one also have the property that when I multiply it by B, it's a multiple of itself? Eigenvectors are truly miraculous and they have very, very deep meaning for our world. So all of quantum physics is about eigenvectors and eigenvalues, as you might know. So what's the result here, B times u2? I hear a 3, yes. And the other one? Minus 3. So it's also a multiple, right? And in this case it's minus 3 times u2. So as you can see, and maybe I should rewrite this because I was a little bit generous with the space here: this is eigenvalue 7, and this is now eigenvalue minus 3. And I chose that example so that you see that eigenvalues can also be negative, right? They don't have to be positive. So now you wonder, why are my singular values always positive? We will get to that. Oh, and actually, okay, let's normalize. Let's maybe do that here: the normalized versions of u1 and u2.
I mean what should be obvious is if I take a multiple of this, if I multiply with any constant, it's still an eigenvector, right? Because everything is linear. If I take (2, 2) I would get (14, 14) here. What's the normalized version, L2-normalized, of u1? So let me write it like this. I think it's one over square root of two times (1, 1). And yeah, one over square root of two squared is one over two, and 1 over 2 plus 1 over 2 is 1. If I do it for u2 it's the same but just with a minus sign: one over square root of two times (minus 1, 1). Ok, and now by these properties this holds. I can now write B in a certain form, and this just follows from what I just did, it may not be completely obvious but you could compute it. So my B, which is 2, 5, 5, 2, I can write as follows. I now define a matrix U, and U is just a matrix where I put the two normalized eigenvectors side by side, as columns. And maybe I should give these two eigenvalues a name, you usually call them lambda. So this is eigenvalue lambda 1, the seven, and this is lambda 2. Then S is the diagonal matrix lambda 1, zero, zero, lambda 2. And then comes the transpose of U. So it's U times S times UT. So the difference to the singular value decomposition is that the matrix on the right and on the left is the same. So this is equal to 1 over square root of 2 times the matrix with the two columns (1, 1) and (minus 1, 1), times the diagonal matrix 7, 0, 0, 3, and now again the transpose. The factor is not touched by the transpose. Now I transpose it, which means the row here becomes the column like this. Minus? Oh yeah, absolutely, minus three. Yes, this should be minus three, so the diagonal matrix is 7, 0, 0, minus 3. Yeah, I was just trying to do the proof in my head why you can write it like this. It follows from the eigenvalue and eigenvector properties, but I don't have a one sentence explanation now, doesn't matter.
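The decomposition of B into U times S times UT can be verified by simply multiplying it out. Below is a small sketch with NumPy, assuming the normalized eigenvectors (1, 1)/sqrt(2) and (minus 1, 1)/sqrt(2) as columns of `U`; the call to `numpy.linalg.eigh` at the end is only a cross-check and not part of the by-hand calculation.

```python
import numpy as np

B = np.array([[2.0, 5.0],
              [5.0, 2.0]])

# Columns of U are the L2-normalized eigenvectors u1 and u2.
U = (1.0 / np.sqrt(2.0)) * np.array([[1.0, -1.0],
                                     [1.0,  1.0]])
# Diagonal matrix of eigenvalues, in the same order as the columns of U.
S = np.diag([7.0, -3.0])

B_rebuilt = U @ S @ U.T
print(np.allclose(B_rebuilt, B))        # True
print(np.allclose(U.T @ U, np.eye(2)))  # True: the columns are orthonormal

# Cross-check: eigh is NumPy's routine for symmetric matrices,
# it returns the eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(B)
print(eigvals)  # [-3.  7.]
```

Note that the order matters: swapping the two diagonal entries of `S` without swapping the columns of `U` gives a different matrix.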
You have this now, you need to do the exact same thing for the exercise sheet, so it's good that you have it. So now we can do something like this for symmetric matrices. You can just compute eigenvectors, you can guess one, you can compute the others. And for the matrix, let me quickly show you the exercise sheet, so where is it? Oh, here it is, yeah. So that's the matrix you have. Doesn't look so easy, right? It's a three times five matrix and you should compute the singular value decomposition by hand. But it's a rectangular matrix, and I only showed you how to compute stuff if you have a symmetric square matrix. Ok, first one thing, I was a bit too fast. Here I told you I see one eigenvector, but actually why do I see it? I mean if you see these values here, with the same value twice on the diagonal and the same value twice off the diagonal, (1, 1) is always an eigenvector because of the symmetry. So that was easy. Let's say you don't see it, how do you compute it? Well, you can always compute it via the determinant, and let's prove that this is correct. Why are the eigenvalues the zeros of the determinant of B minus lambda times the identity, where you consider lambda as a variable? So let me write that here, lambda is the variable here. So this gives you a polynomial equation and we will see it in an example. So what does it have to do with a determinant? Well, let lambda be an eigenvalue. It's actually easy to see that each eigenvalue is a zero of this function. Then we have some vector u which is not zero such that B times u is lambda times u. That's the definition of an eigenvalue: we have an eigenvector which, if you multiply B with it, gives a multiple of u, and the factor is lambda. So B minus lambda times the identity, times u, is 0. And now I have a matrix times a vector which is not 0, and the result is 0. And if a matrix sends a non-zero vector to zero, this means the determinant of the matrix is zero. I won't prove that but that's just how it is.
So that means the determinant is zero, which means this matrix doesn't have full rank. That's also a magical thing of linear algebra, these determinants and what they have to do with not having full rank and so on. But this is stuff which you do in the linear algebra lecture. OK. So this is just the proof. It might look intimidating but it actually follows directly from the definition of eigenvalues and eigenvectors. So let's just do it for our example matrix. So we have B was 2, 5, 5, 2. And now let's just do B minus lambda times the identity. Well, that's just subtracting lambda from each diagonal element: 2 minus lambda, 5, 5, 2 minus lambda. And now let's just compute the determinant of that. The determinant of a 2 times 2 matrix is not very hard, determinants of large matrices are no fun. But by some magic the determinants which you get when you do the exercise sheet will also be nice. You don't have to compute ugly determinants. So what's the determinant of this one? Who can tell me or write in the chat what the determinant is? Determinant of a 2 times 2 matrix? So, 2 minus lambda, squared, minus 5 times 5, exactly. And let's compute that. That's lambda squared, right? What do we have here? Here's a 4 minus 25, which means we have a minus 21 in the end, right? And then we have a minus 4 lambda, right? Is that correct? I think it is, yeah. So it's lambda squared minus 4 lambda minus 21. And now we can find the zeros. Of course there's a formula for that, but let me show you this. It's a little bit crooked, I'm unhappy that it's not completely horizontal, but I won't write it again. Okay, so if p and q are the zeros, then you would have lambda minus p times lambda minus q. So if I just compute this I get lambda squared minus p plus q plus p times q. So if you have these integer numbers, you can usually see the zeros by looking at it.
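The characteristic polynomial of the example matrix can also be set up and solved numerically, as a check on the calculation by hand. This is a minimal sketch; it uses the fact that for a 2 x 2 matrix the determinant of B minus lambda times the identity is lambda squared minus trace(B) times lambda plus det(B), and `numpy.roots` to find the zeros. On the exercise sheet the point is of course to do this by hand.

```python
import numpy as np

B = np.array([[2.0, 5.0],
              [5.0, 2.0]])

# det(B - lambda * I) = lambda^2 - trace(B) * lambda + det(B) for a 2 x 2 matrix.
# Here: lambda^2 - 4 * lambda - 21.
coeffs = [1.0, -np.trace(B), np.linalg.det(B)]

# The zeros of the characteristic polynomial are the eigenvalues.
roots = np.sort(np.roots(coeffs).real)
print(roots)  # approximately [-3.  7.]

# Sanity check: at each zero, B - lambda * I has determinant (numerically) zero.
for lam in roots:
    print(lam, abs(np.linalg.det(B - lam * np.eye(2))) < 1e-9)
```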
I have to find two numbers so that the product is minus 21 and the sum is 4. So no PQ formula. So find p and q, let me just write it, such that p times q is minus 21 and p plus q, and there is a minus here, is 4. And the two numbers are, 3 and minus 7 would give minus 4, so yeah, it's 7 and minus 3. Of course you could have cheated because these are our eigenvalues from the previous slide, right? So the determinant is, yeah let's just continue here, this is just lambda minus 7 times lambda plus 3, right? So the two eigenvalues are... Say it again. There's a lambda missing? Oh yeah, absolutely, thank you. That's unforgivable. So it's lambda squared minus p plus q times lambda, plus p times q. So very often, in case you didn't know that trick, if you are given a quadratic and you know that there are nice solutions, you can always see them like this, you don't need a PQ formula. So 7 and minus 3 are our two eigenvalues. And now from that you can compute the eigenvectors, by solving a system of linear equations for each of these two. But I don't think I will do that, this you can figure out. Let's maybe continue because there's more interesting stuff. So now we know how to compute, either by divine intervention or by computing the determinant, eigenvalues and eigenvectors for a square matrix. But we want singular values. Okay, here's one more thing, I'm again ahead. Let's assume our matrix has a rank r, and the rank may be smaller than n. This is also something I wanted to show you. It's again basic linear algebra calculation stuff. So let's assume my B, how do I write this? My B is an n times n matrix, I have a U, I have an S and I have the U transpose, and this is n times n and this is n times n and this is n times n. Now, for the picture let's take a smaller rank, just so that the picture is interesting. So r is somewhat smaller than n. So then I have zero eigenvalues, so then my matrix will actually look like this.
So this is r now, and here I have some interesting values, and here I'll have zero. So this part of the matrix is zero. I mean off-diagonal everything is zero anyway. Let's maybe write the r inside here. So this is r and this is r. So here I have zeros, here I have zeros. Everything off-diagonal is zero. But on the far end of the diagonal the values are also zero. So if this is r here and this is r here, I'm now claiming the following, and if you don't believe it, you can try to prove it yourself. If I have something which is all zero here, I mean think about it, let me maybe take one green thing here: if I take this line here and then I multiply it with this matrix, then everything beyond the r will be multiplied by only zero, and the same thing happens on the other side. So if in my diagonal matrix there are zeros, and maybe there are three zeros, then the last three columns of U and the last three rows of the transpose, I can just throw them away, it will make absolutely no difference. Which means what I can do if the rank is smaller, oh this curled up, ok: n times r, r times r, r times n. So I just omit the eigenvectors beyond r. They exist, because I always have n. I mean these are the ones spanning the kernel, so they just span the space that is sent to zero by this linear map, but that's not important now, I can just ignore them here. So I can just write n times r, r times r, r times n. And this answers the original question about what NumPy computes here. These four, we haven't come to that yet, but these are also eigenvectors of some matrix. Which matrix, we will see in a moment. So the rows here are eigenvectors of some matrix, and there are just four of them, and it's a six-dimensional space, which means there are two more eigenvectors. And NumPy just gives them to me in case I want them.
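What NumPy does with these extra vectors can be seen directly from the shapes it returns. The following sketch uses a 4 x 6 term-document matrix that is an assumption made here for illustration (the exact toy matrix from the slides is not reproduced in this transcript). It shows that `numpy.linalg.svd` with `full_matrices=True` (the default) returns six rows for V, that the two extra rows are sent to zero by A, and that dropping them changes nothing.

```python
import numpy as np

# Hypothetical 4 x 6 term-document matrix (4 terms, 6 documents).
A = np.array([[1, 1, 0, 1, 0, 0],
              [1, 0, 1, 1, 0, 0],
              [1, 1, 1, 2, 1, 1],
              [0, 0, 0, 1, 1, 1]], dtype=float)

# Full form: V has as many rows as there are columns in A.
U, s, Vt = np.linalg.svd(A, full_matrices=True)
print(U.shape, s.shape, Vt.shape)  # (4, 4) (4,) (6, 6)

# The extra rows of Vt are mapped to zero by A ...
extra = Vt[len(s):, :]
print(np.allclose(A @ extra.T, 0.0))  # True

# ... so they can be dropped without changing the product.
A_rebuilt = U @ np.diag(s) @ Vt[:len(s), :]
print(np.allclose(A_rebuilt, A))  # True

# Reduced form directly from NumPy.
U_r, s_r, Vt_r = np.linalg.svd(A, full_matrices=False)
print(Vt_r.shape)  # (4, 6)
```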
I don't need them because they will be sent to zero by a certain linear map, but it just gives them to me. But I don't need them for the same reason that I can omit these things here. I have a question. Yes please. Which choice do we have? We don't have a choice. The eigenvalues are fixed, right? If I move all the eigenvalues, eigenvalues can also be zero. And if I move all the eigenvalues which are zero to the end, that's a well-defined number. Maybe one thing wasn't clear, and maybe then you ask your question again. The order of these things here must be the order of the eigenvalues otherwise it doesn't work. Maybe that's something I didn't make clear. Is that maybe a problem? Here I can't just switch the two, then it's not correct. If I write here the first eigenvector and the second eigenvector then the diagonal I must write the eigenvalues in this order otherwise it doesn't work. So I wrote it implicitly here but I didn't explain it properly. So I can't just permute them anyway I want and I still get the same result. Does that answer your question? with the three time free makers, that the third of our vectors is a linear combination of the first two forms of them. Yeah. And now there is kind of an order because we can derive each of them by a pair of other twos of them. Yes. And I've just tried to understand Yes. of our dimensionals of our rectangle, one of the i's are here, and now I'm a bit confused why they are fixed, but I think I have to do the top again. Maybe part of the confusion is this B is not our term document matrix, so here it's not about vectors being linear combinations of others. We will now come to this on the next slide. So here it's not about vectors being linear combinations of others. We will now come to this on the next slide. We had a rectangular matrix and here we are talking about symmetric square matrices. So I haven't told you yet how we get from our A to the B. 
And the B actually is something where you don't have that intuition which we had earlier. So maybe let's go to that and then maybe it will resolve, or if not maybe you just ask it again. So how do we get now from eigenvalue decomposition, which we will do for the exercise sheet, to singular value decomposition? That's actually simple. I have now an arbitrary m times n matrix A, and how do I turn it into a square matrix? I just take A times A transpose. I mean why am I doing this? We will see in a second. This matrix product doesn't have a good intuition, but it's certainly a square matrix. You can see it, it's an m times m matrix. And now, symmetric means the transpose is again the same matrix. So let's take the transpose of this matrix, A times A transpose, transposed. So in general, in case you didn't know this, if you have two matrices, and this is something which is proved easily by just doing the basic math, the order changes if you take the transpose. The transpose of a product is the product of the transposes but with the factors switched. Which means here I have A transpose transpose times A transpose, and A transpose transpose is of course just the original matrix A. So I get A times A transpose again. So yeah, symmetric. Which means it's a symmetric matrix, which means it has a reduced eigenvalue decomposition like this one. So this is now an m times m matrix and it's symmetric, we have just proved this, which means we can find the non-zero eigenvalues. So if this has rank r, it has r non-zero eigenvalues, we take the r eigenvectors, write them as columns here, write them as rows here, and this is what we get.
We get this decomposition. I can do the exact same thing. Yes please. Yes? So if you say it's n times n in dimension, and you say it's, no, you don't. OK, sorry. I was just thinking about the dimensions beforehand, how our matrix is. OK. And for the other one the sizes are the other way around, rows become columns. You can do the same thing the other way around, AT times A. This is also symmetric, now n times n, same proof. This also has rank r, it will also have r non-zero eigenvalues. So these are now r eigenvectors of this funny matrix, and they are different eigenvectors. And now, by some magic, this already gives me everything for the singular value decomposition, and it's actually easy to see. This is now not a complete proof, but almost complete. Let's just ignore the blue stuff for a second, and let's assume that A is U times S times V with the SVD properties, and the properties are: U transpose times U is the identity matrix, S is a diagonal matrix, and V times VT is also the identity matrix. Right, so these are the SVD properties, and let's just do A times AT. A times AT is then, well, it's U times S times V, and now the transpose of everything but with the order changed, which means it's V transpose times S transpose times U transpose. If you transpose a diagonal matrix nothing happens, S transpose is equal to S, I mean you transpose around the diagonal. Well, what do we have here? In the middle we have V times V transpose. This is exactly the identity matrix, which means it disappears. Then I have S times S. A diagonal matrix multiplied with itself, you just square the values on the diagonal, that's also easy to see, which means here you get U times S squared times U transpose. Right, that's the usual linear algebra magic.
You write something and then suddenly stuff disappears and you get nice expression. Let's also do it the other way around because it's so nice. Of course you have to be careful that it's actually true what you're writing down, now we are writing the transpose which is VT times S times UT times A which is U times S times V. Now here we have, it's exactly what we need here, so this is now the identity which means in the product we can drop it and this is now Vt times S square times V. Yes, and this proves, so if something like this exists and you can, by the same way you can, the eigenvalue decompositions give it to you, then U, the matrix U from the singular value decomposition is just the eigenvectors from this A times AT, and this is also how you do it for the exercise sheet, you just compute this matrix and compute its eigenvalues and eigenvectors and then the V matrix you just get it as the eigenvectors and eigenvalues from this matrix. And what this also shows is that the eigenvalues of this and the eigenvalues of this are actually the same. So what we, yeah, the singular, so the two matrices S1 and S2 from the previous slide are actually the same. This is also easy to see if you have a, let's say, I mean it follows from this proof but you can also prove it directly. Let me just write it here. U be an eigenvector of let's say A times AT, which means you have A times AT times U is lambda times U, right? Which means if you just multiply the whole thing by A transpose, you get A transpose times A times A transpose times u is equal to lambda times A transpose times u because everything is linear. Now I have now I have constructed a vector here, I've done two things at the same time here. So now I have a so ATU is now an eigenvector of this other matrix right with the same eigenvalue which means if it's an eigenvalue for AAT, it's also an eigenvalue for ATA. 
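The argument above can be turned into a small computation: take the eigendecomposition of A times AT, use the square roots of the non-zero eigenvalues as singular values, and get the rows of V by mapping each eigenvector u with AT and rescaling. The sketch below does this for a small 2 x 3 matrix that is made up here for illustration (it is not the matrix from the exercise sheet). The rescaling by 1 over sigma is one way to get unit-length rows that follows from the calculation above; it is not necessarily the trick written on the sheet.

```python
import numpy as np

# Hypothetical small rectangular matrix (assumption, not from the sheet).
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])

# Step 1: eigendecomposition of the symmetric m x m matrix A A^T.
eigvals, U = np.linalg.eigh(A @ A.T)       # eigh returns ascending order
order = np.argsort(eigvals)[::-1]          # sort descending
eigvals, U = eigvals[order], U[:, order]

# Step 2: keep only non-zero eigenvalues; singular values are their square roots.
keep = eigvals > 1e-12
eigvals, U = eigvals[keep], U[:, keep]
sigma = np.sqrt(eigvals)

# Step 3: A^T u is an eigenvector of A^T A with the same eigenvalue;
# dividing by sigma normalizes it. Rows of V are these vectors.
V = (A.T @ U / sigma).T

print(np.allclose(U @ np.diag(sigma) @ V, A))       # True: A = U S V
print(np.allclose(V @ V.T, np.eye(len(sigma))))     # True: rows of V orthonormal

# The non-zero eigenvalues of A^T A are the same as those of A A^T.
other = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1][:len(sigma)]
print(np.allclose(other, eigvals))                  # True
```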
Of course not with the same eigenvector, so this even gives you a transformation for how you get the eigenvectors of the one matrix from the eigenvectors of the other matrix. Actually for the exercise sheet you don't need to compute this one, it suffices to compute this one, and there's a trick to get the eigenvectors of this one, it's written on the exercise sheet. And I think it's very important to have an exercise for this, where you actually compute this, because otherwise you just hear it, you may believe it or not, but you only really understand it when you do it, when you do the calculations. And you just need everything which I explained to you here. And it's actually pretty nice to do it by hand with a small example. Yes? Now I'm confused by V times VT is the identity. Which one? This one? Oh, so I'm assuming that I have an SVD for A. So this is the SVD of A, U times S times V. And the SVD has these properties, that U transpose U is the identity and so on. So if I have a singular value decomposition, that means here the columns are pairwise orthogonal, and here the rows are pairwise orthogonal. And then from the singular value decomposition I get the eigenvalue decompositions of these two funny matrices. This is what I'm showing here, and it's also working the other way around, which I didn't really show. Okay, and now comes a lighter part, but first one more comment. Doing it in Python, I showed it to you, you can use numpy.linalg.svd. That's for dense matrices, everything in NumPy is for dense matrices. This will take forever on a large matrix. I mean already for the last exercise sheet, if you would have used NumPy, it would have taken forever, it's made for dense and not very large matrices. And also NumPy gives you the non-reduced form. I've already explained it to you, and we have learned now that these rows are actually eigenvectors of this AT times A matrix.
And it just gives you the two other eigenvectors as well. They have eigenvalue zero, which means we don't need them. But since the method which NumPy uses computes them, it gives them to you whether you need them or not. But we can just drop them. And, I mean this is not part of the exercise sheet, but let me just tell it to you: in SciPy you have the same for sparse matrices, it's not absolutely analogous, it's called scipy.sparse.linalg.svds. This is not using a particularly efficient algorithm, it uses the Rayleigh-Ritz method, I'm not talking about this here, you can check the documentation. Anyway, you don't need it for the exercise sheet, that will be non-coding. And maybe here's the time, before I go to the last smaller part, to say why. In the past we did coding exercises for this, which is very nice, because if you do this for a real document collection, then you actually see how just by linear algebra magic you find concepts, like you find real things in your data. And I will show it to you, I think, next lecture for an example collection. Right, these web surfing, beach surfing things, you actually find them, like you find underlying topics. Just by doing singular value decomposition, you find topics in your text collection. And I mean, the algorithm just gets the numbers, it doesn't know what the words mean or anything. And it's very nice to see that. But if you just implement this, then you can implement it without understanding anything, right? That's the problem. You just call scipy.sparse.linalg.svds, you do the matrix calculation, but you don't really understand how it works. That's why I think a math sheet is better, one where you actually calculate. So let me go once more here: what you actually do is you take this particular matrix and you actually compute the singular value decomposition for this one.
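For completeness, here is roughly how the sparse variant mentioned above is called. This is a sketch under stated assumptions: the 4 x 6 matrix is a made-up toy matrix, and it is far too small for sparsity to matter; the point is only the call pattern of `scipy.sparse.linalg.svds`, which computes just the k largest singular values and vectors. The order in which `svds` returns the singular values should not be relied on, so the sketch sorts them explicitly.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Hypothetical 4 x 6 term-document matrix, stored in a sparse format.
A_dense = np.array([[1, 1, 0, 1, 0, 0],
                    [1, 0, 1, 1, 0, 0],
                    [1, 1, 1, 2, 1, 1],
                    [0, 0, 0, 1, 1, 1]], dtype=float)
A = csr_matrix(A_dense)

# Only the top k singular triplets are computed (k must be < min(A.shape)).
k = 2
U_k, s_k, Vt_k = svds(A, k=k)

# Sort by descending singular value, to match the usual convention.
order = np.argsort(s_k)[::-1]
U_k, s_k, Vt_k = U_k[:, order], s_k[order], Vt_k[order, :]

print(U_k.shape, s_k.shape, Vt_k.shape)  # (4, 2) (2,) (2, 6)

# Cross-check against the dense routine on this tiny example.
s_dense = np.linalg.svd(A_dense, compute_uv=False)
print(np.allclose(s_k, s_dense[:k]))  # True
```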
Which means you compute A times AT, then you compute the eigenvalues and so on, and then you also check, is this really pairwise orthogonal columns, are these really pairwise orthogonal rows, is this the diagonal matrix, and I think it's a great way to learn it. So last part, it's not so long so maybe we could make a break but then maybe it takes too long. So how do I use it now for document retrieval? Let's just go and that's now a little bit lighter again, I mean this was the math, now it's again using it. Where was the, I mean the whole point was this right, now I changed the matrix here in the beginning, I just did it by just adding this one and now I get better scores. I do the exact same thing, I do dot products with a changed matrix and I get better scores. So that's the simple way, you just, and this will be variant one which I will talk about in a second, just replace your matrix with the new one, with lower rank and then just do the same thing as before. But as we will see, this has a pretty obvious problem. So here actually I did it, that's exactly the matrix, if you remember, which we get here. It doesn't do the 1-1 thing because that's not optimal in the Frobenius norm. It does these funny numbers here. So why don't we just do this? We replace this by that and now we do retrieval with this matrix. Do you see a problem? Yeah, it's not a sparse matrix anymore and that's really a problem, right? If this is 10,000 times 1 million, now you have an, and this is what the linear algebra will do, it will not leave any non-zero value. Because it's now doing, yeah, here actually that's a super coincidence for this toy example. But usually nothing will be, and that's written on the next slide, it's a dense matrix. All, almost all its entries will be non-zero. Here's just, so this is huge right? 
Let's just say one million terms, ten million documents, now I have ten T terra, so if I have eight bytes per entry I would have eighty terabytes of data, you can forget it. So that's not practical. But what could we do? And this is also something for the exercise sheet, you will just do it by hand, all three variants just to check that it really works, that the same thing comes out. Just do the, we'll just give you, that's the second exercise, you just get a query vector here, very simple one, one one one. This has three terms, this matrix, and now you just compute similarity scores according to the three variants which I quickly explain now. matrix here, this V matrix, in this simple example, where do we have, ok I haven't shown you the, no it's just the first two rows here, If I take this here, this gives me, these are exactly my concepts, how I started with, right? They now tell me take the, for document one, take minus 0.4 times concept one and minus 0.5 concept two. Now it's not like 101011, it's a real linear combination, but this is what this matrix V gives me. So, yeah, I've copied the value here. So instead of working with the original matrix, and maybe let me go back once more to make it absolutely clear. So in this case, instead of working, or let me, I don't know which picture do I take, instead of working with this matrix, I could just work with this matrix here. This is equivalent to this one if I know the concept, right? Instead of this one which is in four dimensional space, I could work with this one. It has the exact same information. And it's much simpler. It tells me these three are just concept number one, this is concept one plus concept two. So I could do that, and that's what I do for variant two. But now I also have to map my query to the lower dimensional space. I mean my query also has four dimension, if I want a query on this matrix I need a query in two dimensions. And just with a little bit of linear algebra I can do that. 
Okay, it's much better now, let's just skip this. So let's just do the matrix multiplication here, this is my query vector. This is just replacing my original matrix with my rank k approximation. So that's just the approximated dense matrix. This I don't want to compute directly, it's terrible because this is dense and very large. But how did I compute this? I computed it like this, with these truncated U, S, V matrices: first k columns, so first k eigenvectors here, top k singular values, first k rows. Now let me change it a little bit. This product which I would like to compute is equal to this product, and here I have my Vk, which means if I get a query I could simply compute this thing here, which gives me this, right, this is the same as this. This is just my qk transpose below. So I get a query, I multiply it by Uk and then by Sk, and then I get a query which is now in the lower-dimensional space, and let's maybe also check that. So q transpose is a 1 times m vector, my original query was m times 1. Uk is m times k, oh I want to do this in orange, and Sk is k times k. So this is 1 times m, times m times k, times k times k, which means here I get a 1 times k vector. And I will skip these considerations on the runtime. So this is more meaningful now. I just store this matrix here. So this is what I could do. I just store this matrix, and then when I get a query I map it from four to two dimensions by what I just showed you. And there's the last variant and that's the last thing we do. I can compute this matrix. And this is the matrix which I talked about earlier. If I do it the other way around I get the identity matrix. If I do it this way around, I get a matrix which is not the identity matrix in general, but it's a matrix that has a very interesting property and at least I wanted to briefly show it to you.
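The equivalence of the first two variants described above can be checked on a toy example. In the sketch below, the 4 x 6 matrix and the query are assumptions made for illustration (term order internet, web, surfing, beach). Variant 1 multiplies the query with the dense rank-k matrix; variant 2 first maps the query to k dimensions with Uk and Sk and then multiplies with the small k x n matrix Vk. By associativity of the matrix product, both give the same scores.

```python
import numpy as np

# Hypothetical term-document matrix; rows: internet, web, surfing, beach.
A = np.array([[1, 1, 0, 1, 0, 0],
              [1, 0, 1, 1, 0, 0],
              [1, 1, 1, 2, 1, 1],
              [0, 0, 0, 1, 1, 1]], dtype=float)

# Truncated SVD: first k columns of U, top k singular values, first k rows of V.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, S_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

# Hypothetical query containing "internet" and "web".
q = np.array([1.0, 1.0, 0.0, 0.0])

# Variant 1: scores against the dense rank-k approximation (m x n, expensive to store).
A_k = U_k @ S_k @ V_k
scores_v1 = q @ A_k

# Variant 2: map the query to k dimensions, then use only the k x n matrix V_k.
q_k = q @ U_k @ S_k          # shape (k,)
scores_v2 = q_k @ V_k        # shape (n,)

print(q_k.shape, V_k.shape)              # (2,) (2, 6)
print(np.allclose(scores_v1, scores_v2)) # True
```

With variant 2 only the k x n matrix `V_k` and the m x k matrix `U_k` (plus the k singular values) have to be stored, not the dense m x n matrix.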
One can prove with a little bit of linear algebra, which I will skip now, it's also not important for the sheet, that just by using the same small calculations as before, that my reduced rank matrix is just this MxM matrix times my original matrix. So that's one way to view how I get from my original matrix to my matrix with a smaller rank. I just multiply it from the left by a big matrix. And this is the proof, you can look at it yourself, it's not important for the sheet, but it's a nice interesting linear algebra proof. And what does it mean, and I wanted to show you that. What does it mean to multiply a document, so that's now, here we have a four times four matrix and that's the last thing I will show today. This is here, write it as a column vector. And here I've already written words along the matrix because there's a very interesting intuition here. What happens if I have a document vector in the original space, I multiply it by the 4 times 4 matrix and I get another vector in the original space which would be a vector of my reduced rank matrix. at this entry here maybe, this entry here. I mean what happens is when I multiply this matrix and it's a particular matrix now which has almost the identity matrix and it has a 1 here and a 1 here. And if I do this multiplication what happens is that this zero gets replaced by a one and why does that happen? It happens if you think about it let me 2, 3, 4, so this is just name the dimensions here and this is, you will see a second while I'm drawing that in. How do I get that entry? This is the second entry and I get it by computing the dot product of this with this. This is just how I get the... sorry that was not nice. This is how I get the second entry. It's by computing this times this and the second entry, this is one times one plus one times zero plus zero times one plus zero times zero. So the only thing of value here that gives something non-zero is really this one here, right? It's this one times this one. 
One times one gives this one and the rest is just a multiplication where there's a zero either here or there. So that's just by the way the matrix multiplication works, and it's actually easy to see, but it's interesting that this is the intuition. The effect of this one is, what's the effect? The effect is: if there is internet in this document, add web to the result document. That's what it says. If there is internet in this one, add web to the output document. That's exactly the effect, right? If there is internet, yeah, it's always good if there is internet, add web to the result vector d_i prime. That's just by the way it works, that's the effect of this matrix. Which means you can interpret this matrix: look at the off-diagonal entries which are large and they will kind of identify synonymy information. What this method here figured out, if this would have been the result, is that web and internet kind of mean the same thing. And you can just compute the matrix and interpret the values like this. So this was an exercise sheet we had in the past, you could actually just do it for fun: you compute it and look at the pairs where you have large values here, and you will find that they are words which mean the same thing. Which is like magic, right? You just computed something and then you get word similarity without any understanding of what the words mean, right? The thing just gets the matrix. Yes, and computing this is terrible. This is very expensive to compute and store.
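To make the "if there is internet, add web" effect concrete, here is a minimal sketch in plain Python. It uses the idealized 4 times 4 matrix described above (identity plus two off-diagonal ones) rather than a matrix actually computed by LSI. The term order and the two terms other than "internet" and "web" are assumptions made up for illustration, and the helper names `mat_vec` and `synonym_pairs` are hypothetical.

```python
# Term order is an assumption; only "internet" and "web" come from the
# lecture, the other two terms are hypothetical placeholders.
TERMS = ["internet", "web", "surfing", "beach"]

# Idealized matrix: the identity plus a 1 at the two off-diagonal
# positions that link "internet" and "web".
T = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a column vector (list)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def synonym_pairs(matrix, terms, threshold=0.5):
    """Return term pairs whose off-diagonal entry exceeds the threshold."""
    pairs = []
    for i in range(len(terms)):
        for j in range(i + 1, len(terms)):
            if matrix[i][j] > threshold:
                pairs.append((terms[i], terms[j]))
    return pairs


# A document that contains "internet" and "surfing" but not "web".
d = [1, 0, 1, 0]
d_prime = mat_vec(T, d)

# Second entry: 1*1 + 1*0 + 0*1 + 0*0 = 1, so "web" gets added.
print(d_prime)                  # [1, 1, 1, 0]
print(synonym_pairs(T, TERMS))  # [('internet', 'web')]
```

With a matrix really computed from data, the off-diagonal values would be real numbers rather than exact ones, so the threshold would need to be chosen by looking at the values.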
This is again great for understanding, you would never do this, you would not actually compute this thing and then multiply this here, but it's a great way to understand the method. Because what I was saying here in this last part: variant 1, which is just replace the original matrix by the rank 2 approximation, terrible because all the zero entries go away and the matrix becomes dense. This variant, terrible because this is huge, you can't store it, you can't compute it. The second one is actually practical, but they all do the same thing, which means this is a great way to understand, another way of understanding what the method does. What the method does, implicitly, whether you do it that way explicitly or not, it's kind of finding synonymy information between words and adding that to the documents, which is exactly what we wanted in the beginning, right? We said, look, these two words are the same. If there's a one here, add a one there. And it's really doing that. And you can show it by just looking at the linear algebra a little bit closer that it's doing that. But you wouldn't compute it that way, and if you just take variant two, it does the same thing. I mean that's the method you would actually use. You would just take the representation in the lower dimensional space, which is dense, but if you have a small k, not bad, and then do all the computation there. And this is much more than just this LSI and information retrieval, this is a very general idea which you find in all of deep learning also, right? You have a lot of data and you reduce it to some lower dimensional space, get rid of all the noise, and then you have the core information in a very compact form, so that's what's behind this.
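The claim that all three variants give the same scores can be checked on a tiny example. The following is a sketch in plain Python under strong assumptions: the factors U, S, V are made up by hand (orthonormal vectors with entries plus or minus 0.5) so that no SVD routine is needed, and the query is hypothetical. It is not the matrix from the exercise sheet.

```python
def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][l] * b[l][j] for l in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


# Hand-made factors (an assumption for illustration): U has orthonormal
# columns, V has orthonormal rows, S holds the singular values 4, 2, 1.
U = [[0.5, 0.5, 0.5],
     [0.5, 0.5, -0.5],
     [0.5, -0.5, 0.5],
     [0.5, -0.5, -0.5]]
S = [[4, 0, 0], [0, 2, 0], [0, 0, 1]]
V = [[0.5, 0.5, 0.5, 0.5],
     [0.5, -0.5, 0.5, -0.5],
     [0.5, 0.5, -0.5, -0.5]]
A = matmul(matmul(U, S), V)          # original m x n matrix (m = n = 4)

k = 2
U_k = [row[:k] for row in U]         # first k columns
S_k = [row[:k] for row in S[:k]]     # top k singular values
V_k = V[:k]                          # first k rows

q_t = [[1, 1, 0, 0]]                 # hypothetical query as a 1 x m row

# Variant 1: multiply the query with the dense rank-k approximation.
A_k = matmul(matmul(U_k, S_k), V_k)
scores_1 = matmul(q_t, A_k)[0]

# Variant 2: map the query to k dimensions, then work on V_k only.
q_k_t = matmul(matmul(q_t, U_k), S_k)    # 1 x k
scores_2 = matmul(q_k_t, V_k)[0]

# Variant 3: multiply the original matrix from the left by U_k U_k^T.
T = matmul(U_k, transpose(U_k))          # m x m
scores_3 = matmul(q_t, matmul(T, A))[0]

print(scores_1, scores_2, scores_3)
```

Only variant 2 avoids building an m times n or m times m dense matrix, which is why it is the one you would use in practice.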
And that's a, so I've already shown you what the exercise is, I think it's a super nice exercise, it's a calculation exercise, so no, there are a lot of small things which you can prove if you want to, but the exercise is really just calculation, but you have to understand what you are doing. Yes please. Very important question, you also have that in all kinds of learning methods, it's so called hyper parameter, which is super important, right, for the what do I take, 50, 100, and there are all the black magic sets in, and somebody will tell you you should take 47 to always give the best result. So there's no simple answer for that. It's a hyper parameter of the method, which has a big, and there are lots of papers about this, you should choose this or that, but in reality it's very hard to say. I think in practice if you really use this method you would pick it rather small because everything is linear in this, if you take 100, everything takes 100 times. If you take a 10, so you would take a relatively small one. If you think about, I don't know if you know a little bit about deep learning, you have the same word vectors or something, you reduce every word to a vector in dimension k, and the question is which dimension do you take. There you take 300 for example, it's a typical dimension. But it's a value which comes out of nothing. Somebody else might say 100 is also good. But so low hundreds are typical values, but there's no mathematical reason for that. It's just you need a certain dimension to distinguish things. Yeah, but it's a very deep question and hard to answer. And there's no mathematical reason to prefer one over the other. Maybe there's another question. One thing that's written here, really important, as usual in mathematics, for the simplest thing you can write three pages, then you're doing something wrong or you can do it in one line. 
So if you end up doing very long calculations with square root of five plus pi over e and imaginary numbers cropping up, you're probably doing something wrong. There's a reason for this matrix, it's very well behaved and gives rise to nice calculations if you do it correctly. So if it gets very long and very complicated, check again or ask in the forum, you probably did something wrong. Any other question for now? Okay, so have fun with the sheet and see you next week. Bye.
Welcome everybody to lecture 10, Information Retrieval in the winter semester 2022 and now for some time already 2023. I will say something about your experiences with exercise sheet 9, which was about beautiful linear algebra, latent semantic indexing, eigenvectors and so on. The official course evaluation has started, there will be more information next week, but feel free to do it already. But if you do it already, here's some basic advice: please take your time. You have taken so much time for the lecture, listening, doing the sheets, you can take some time to do the evaluation, like 20 minutes I think is an appropriate amount. Be honest and also be fair, so if you do the evaluation you should have slept well, the sun should be shining and you should feel in a generous mood, like you want others to feel towards you, I think that's then a fair attitude. But I will say more about it next week. And today we will talk about classification. I will say a few general things about classification. As usual, since the math for many of you is some time ago and you forgot much of it, there will be a little crash course in probability and then we will talk about Naive Bayes, which is a very old, basic and simple learning method, but very good to learn a few things. And for the exercise sheet you get some movies, I will talk more about it, and then just from the description of the movie you should predict the genre, whether it's comedy, documentary, horror, romance or science fiction.
So what about the last exercise sheet, latent semantic indexing? Most of you found it very interesting and quite doable, with few exceptions. Nice, definitely helped for understanding the lecture. Very interesting, matrices are more fun than logarithms. The math sheet I enjoyed the most up to this point. Several of you said that it was nice to improve the mental arithmetic. I cannot emphasize enough that doing calculations in your head has benefits way beyond doing calculations in your head. So like two digit numbers, 17 times 36, doing this in your head. I mean it's mathematically simple, but it's a great training for so many things, for doing things in your head, for focusing on something, and it will also as a consequence improve your math skills. I cannot emphasize enough, if you are for some reason not good at it or you have problems, practice it. It is something you can practice like muscle work. My god I hate pen and paper math, but it got better, so there was at least a silver lining. Took a long time, mixed feelings regarding calculations. I think just take it as an opportunity to practice. Fun doing linear algebra and seeing the practical uses. A world without eigenvectors would be pretty eigenartig, wonderful German word. It's a shame that in modern capitalist societies everyone only thinks about their eigenvalues. That's a very good one I think. Several of you asked, why do eigenvectors crop up everywhere, in learning, computer science, mathematics, physics, everywhere? They are so universal. Why is such a very theoretical construct so universal? One can give I think a whole course about this, but let me give you a very short glimpse maybe. Again, I said some things in the last lecture.
So what eigenvectors and eigenvalues characterize is what a matrix does to the space. So if you have a matrix, and let's take a symmetric matrix, it's a linear mapping and it does something to the space, it distorts it somehow, maybe it rotates it, maybe it stretches it, it does things to the space. And what the eigenvector, eigenvalue theory says, at least for a symmetric matrix, is what every such linear mapping does. You have for example this three dimensional space, take any linear mapping from a symmetric matrix, then it will stretch the space by a factor of three in this direction, it will shrink it here by a factor of 0.1, and in another orthogonal direction it will stretch it by a factor of 10. And all such linear mappings do this. And it's kind of magical that you do some linear distortion of the space and it always does this. You can always find these three orthogonal eigenvectors and in these directions it will either stretch it or squish it. And this by itself is already interesting, that these linear mappings do nothing else, and the other interesting thing is that you can use it to reduce the dimension. Let's say in one of the dimensions it squishes it by some factor, then you just squish it completely and you leave out that dimension. That's what dimension reduction does. Let's just look at the dimensions where a lot of things happen. Where you stretch it by a factor of 10 or so. That's where the most action is. And all the other dimensions you just ignore. And actually here is one, this A and x should be blue. Here is one calculation, I could talk a lot about this, where you can actually see this.
So linear mappings, they do something, you can apply them to a vector. Very often in physics for example, also in other sciences, this can be a process: you map something to a vector, you get a new vector, and you can do this repeatedly, which means you get a power of the matrix. So let's just assume we have the eigenvector decomposition, A equals U times S times U transpose. So this is the eigendecomposition, which means U is orthonormal and S is diagonal. And now let's do A squared. What's A squared? So it's just A times A, it's U times S times U transpose times U times S times U transpose. Now you have this linear algebra magic. This here in the middle, U transpose times U, is just the identity matrix, which means what you get is U times S squared times U transpose. And S looks maybe something like this, let me just draw it in two dimensions, it would be the same in many dimensions, you have lambda one and lambda two along the diagonal. How does the square of a diagonal matrix look like? It's very easy to square a diagonal matrix, right? So you just get lambda one squared here and lambda two squared here, and here you have zeros, so that's very easy, and it's also very easy to see what happens when you take this to the kth power. Let me not write it there, let me write it to the right of this. So A to the power of k is just U times S to the power of k times U transpose, where S to the power of k has lambda one to the power of k here and lambda two to the power of k here. And what happens now, and I'm not doing the full math here: if k becomes very large and then you normalize this matrix again, and let's say this one is the larger one and this is the smaller one, then the difference becomes emphasized. That's easy to see, right?
Let's say this is 10 and this is even if this is 10 and this is 1, then this grows incredibly more than this one, which means this one becomes emphasized and this one becomes de-emphasized. And it even happens when this is 10 and this is 9.9, right? If you take powers of it, the gap between them becomes huge, which means in the limit, and I won't write this down, only this one remains. Which means in the limit, if you apply this mapping again and again and again, everything will go in the direction of the first eigenvector. So the first eigenvector has a very, very special meaning. Everything will go in the direction of the first eigenvector, magically. And that happens in a lot of processes. You just apply something again and again, you let the process run for some time, and then what happens is dominated by the first eigenvector, by this strange theoretical concept, and then it gets a very practical meaning. I could talk so much more about this, there are of course also YouTube videos, let me point, I mean 3blue1brown, who knows this channel, this YouTube channel, many of you know it, that's very good, because it's a very sophisticated channel with beautifully made videos by Grant Sanderson on many math topics, they give nice glimpses of intuition, for example here is one about eigenvector which show you how, and you have nice visualization of the space, we don't look at it together. Now, but here is a very important warning, I give it at least once in every lecture, learning from videos you don't learn anything from videos. Even for the intuition, these videos are great, they are beautifully made, but it's very very dangerous. They can be very entertaining, it may appear like you listen to it and you say, now I understood eigenvectors. You listen to it because everything is so nice and you understand it. Similar to some of our lectures here, maybe, but you only really understand it by doing it yourself. 
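The "apply the mapping again and again" argument can be tried out directly. This is a minimal sketch in plain Python, usually called power iteration. The 2 times 2 matrix is a made-up example (an assumption, not from the lecture) with eigenvalues 3 and 1 and first eigenvector (1, 1) divided by square root of 2.

```python
import math


def mat_vec(a, x):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(m * v for m, v in zip(row, x)) for row in a]


def normalize(x):
    """Scale a vector to Euclidean length 1."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]


def power_iteration(a, x0, steps):
    """Apply the mapping `steps` times, normalizing after each step."""
    x = normalize(x0)
    for _ in range(steps):
        x = normalize(mat_vec(a, x))
    return x


# Hypothetical symmetric matrix with eigenvalues 3 and 1.
A = [[2.0, 1.0],
     [1.0, 2.0]]

for steps in (1, 5, 50):
    print(steps, power_iteration(A, [1.0, 0.0], steps))
```

The start vector (1, 0) has equal parts in both eigenvector directions. After each step the part along the second eigenvector shrinks by a factor of 1/3 relative to the first, so the result approaches (0.7071, 0.7071). If the two eigenvalues were 10 and 9.9, the same thing would happen, only much more slowly.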
And a simple yardstick is look at one of these videos after a month, you don't do anything with it, you will just have forgotten it. You won't remember because if you don't do it yourself, it's not anchored, you forget and you never really understood it in the first place. So it's super deceptive and I think that's a huge change because there are so many videos now for school, for everything. You watch the videos and you have the feeling, oh now I understood it. You only understand it by doing it yourself, absolutely. The only good thing is, okay, let me get some additional, and even intuition. If you feel you get some intuition, then you also have to deepen it yourself. This is so important, this is so dangerous. About exactly the point I made, there's a very nice, here's a link to a YouTube video scene. Who knows the movie A Serious Man by the Coen Brothers? Okay, who knows the Coen Brothers? Who knows movies? Okay, at least someone. Movies by the Coen Brothers, there are very, yeah, a lot of them. Fargo, I think, who knows Fargo, the movie or the television series? Okay. Yeah. Okay, that was or the television series. That was the first 10 minutes, maybe 12 minutes today. Let's talk about, oh no, there was one more thing, I wanted to show you a LSI demo very quickly. And I always like to show this because it's a student project which was done in my group 20 years ago. It was written in OCaml, obscure language, and it still runs today. These guys, they wrote a Windows installer without change, it still runs today. They wrote it in OCaml because they wanted to prove to me that OCaml is as efficient as C++. They lost that bad but everything else was great and they agreed that they lost it, so it wasn't as efficient but they wrote a UI and everything in OCaml which is amazing, it's a functional language. This is a movie data set, we have worked with movie data set, you will also get one for this exercise sheet again. 
So here it's a matrix which you have in latent semantic indexing, this is a variant of the method probabilistic latent semantic indexing. And I just want to show you just by doing a matrix decomposition, here I just take movie descriptions, similar as for this exercise sheet for this week, and let's just find 20 concepts by reducing it to dimension 20. And here we can see the concepts and what you see here, so this for example is such a vector and it's sorted by, so just remember the two vectors b1 and b2 from the last lecture and just look at the entries with the largest numbers. So here you have murder, supernatural, child, occult, robbery, ghost, revenge, death. Okay, that looks, and let's just look, now we can say, let's look at documents which are similar to this concept. And we see, okay, a lot of horror movies, also some fantasy movies, and so on. Okay, of course we also see the usual dirty stuff, relationship, father, son, mother, daughter, brother, sister and so on. Let's look at the movies here, documentaries, action and so on. But the point is, here we have war history, World War II, Nazi France, so this is probably documentaries, we get a lot of documentaries. And just so that it's clear again, this was a completely unsupervised method. You just give it the text as a term document matrix and let it figure out 20 concepts by itself and it will find very meaningful things. We can make this downloadable, there are also other text collections here. It's just a yes, I really want to quit. So that was just a, because the last exercise sheet was theoretical, I at least wanted to show you once how it looks like in practice. So today we talk about classification. Classification, now you have objects and classes and you want to predict for each object the class. And let's just start with an example. So you have a training set and a test set and in the training set you are told for each document what the class is. 
For example in a small island off the American coast, the Wockleeds live in an old mill where mysterious bloody beings, so what comedy, science fiction, no it's a horror movie. So you see, you can get an idea from the, a starship crew in the 23rd century goes to investigate the silence of a distant planet's colony. Which of the five genres? Comedy, ok, it's science fiction. Two hearing protection products sales reps have mixed fortune in the exercises of their trades, they first have two. That's not so obvious, right? Some are more obvious than others because just from the words and from the flow, mysterious bloody being, it's probably not a documentary, could be. Okay, so now you get these, you can learn from these, you get a lot of these, you can learn from these and of course the computer doesn't know, these are just letters and words, doesn't know what they mean. now you get this one. Professor Iris discovers a secret in an ancient stone and when he opens the crypt he revives. Ok, what's this? Western. What's this class? You have to predict it from just the words given the training. That's the problem we are going to use. I will show you how to solve it theoretically and then you will implement it using naive Bayes and see how that goes. And this was the real example which you also use in the exercise sheet for showing this stuff. I will work with an artificial example which looks like this, not so interesting but easier to work with. So these are my documents and understand how it's meant. This is a document with three words, so the letters here are the words I could have put spaces in between but I didn't. This is a document with three words and one word is ABA. I can have the same word twice. And these are the classes, so I have here six documents. This gets class A, this gets class B, A, B and so on. I can learn from these six and now I get a new document, ABABABA, which class, B A, which class? B A A A, which class? 
This is also good because this lets you see the problem a little bit more from the perspective of the computer, right? Because the computer, for the computer this is as real or abstract as this one, right? Because it doesn't understand what the words mean. And this is not so clear, right? What's now the right class if you have a lot of As? Does the A documents have more As or the B? So, before we go into the algorithm, let's briefly discuss the differences to what we already saw. So, one thing is clustering. We didn't do, I did it in an earlier, many years ago, I also did clustering in the lecture. K-means for example is a clustering algorithm where you really, you have your objects and you, you divide everything into clustering. This is sort of clustering what we are doing but not quite. Soft clustering is what we did in latent semantic indexing. That's actually soft clustering in the sense you have these, think of the last lecture with B1 and B2, these concepts and now every document can be a mix of the concepts. Like this is mostly B1, mostly B2, this is half B1, half B2, this is called soft clustering. You can belong to different clusters, to different proportions. So, and what's the difference to what we are doing here? So here we are not, yeah, in clustering it's usually unsupervised, so you have no learning phase, today we are doing something with learning. And in clustering you don't have names, you just say here's my data, cluster it in 10 clusters please. Here we have names, right? We give things names, labels. In clustering you don't have that, you just say please divide it into 5 clusters whatever they are. LSI, no learning here and today we are doing something with learning. We are having a training phase and then a prediction phase. 
Let's also, you need this for the exercise sheet, very quickly talk about quality evaluation, and maybe just consider this example again. Now you did this, you learned, you trained here, and now you have a lot of documents to predict. Let me maybe also show you the real data set for once. So let's look at the test data set maybe. And it just looks like this. So it's tab separated values, one record per line. The first column is just the genre, so this is comedy. And of course for the test set this is not what you should use for the prediction, you use just the text. A shy San Francisco librarian and bumbling falling girl. And now you should predict what it is, and sometimes you are right, sometimes you are wrong. And let's maybe also just look at the distribution of these things, maybe for the training set here. If I just cut the first column I just get this. Let's maybe sort this and count the size of each group with uniq -c. So you see it's an uneven distribution. We have over 10,000 comedy plots, documentary, horror, science fiction is the rarest, which corresponds to the IMDB data. So we took a sample, not too many, so that you don't have problems efficiency wise. So when you do the quality evaluation you will predict sets, which means for example for class comedy you have the documents, the movie plots, which are really comedy, and then you have the documents which you predict as comedy. Ideally they are the same. Let me draw a picture. This is for example for comedy, this is the set of movies which are really comedy, and this is the set of movies which you predict as comedy. Hopefully they are very similar, but there is a part which you don't get, these are comedy but you don't predict them, and there are some which you predict as comedy but which are something else. So what do we do? We can compute for each class precision and recall.
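The same count per genre can also be done in a few lines of Python. This is a sketch assuming the format described above (tab separated, genre in the first column, one record per line). The function name `genre_counts` and the sample records are made up for illustration, they are not lines from the actual data set.

```python
from collections import Counter


def genre_counts(lines):
    """Count how many records there are per genre (first TSV column)."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        genre, _, _text = line.partition("\t")
        counts[genre] += 1
    return counts


# Hypothetical sample records in the assumed format "genre<TAB>plot".
sample = [
    "comedy\tTwo sales reps have mixed fortunes on the road.\n",
    "horror\tA family moves into an old mill on a small island.\n",
    "comedy\tA shy librarian meets a bumbling stranger.\n",
    "sci-fi\tA starship crew investigates a silent colony.\n",
]

for genre, count in sorted(genre_counts(sample).items()):
    print(count, genre)
```

For the real file you would pass an open file object instead of the list, since iterating over a file also yields one line at a time.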
I will not talk about it too much here. I don't think I will talk about it at all. You can just read off the formulas and work with them. Maybe just one thing. So you have precision and recall, two numbers, and they just measure somehow the overlap, and then you have the F1 measure, also called F1 score, which turns this into one number. And why does one take the F1 measure? Well, let me just take the average of the reciprocals of these, so 1 over P plus 1 over R, divided by two. If I multiply this by P times R in the numerator and denominator, what I get is P plus R over 2 times P times R, which is just 1 over F1. Which means for F1 you just take the average of the reciprocals and then the reciprocal again. So that's called the harmonic mean. The mean would just be P plus R over 2, but for F1 you take the harmonic mean. That's just why this formula looks maybe a bit strange, it's actually quite natural. This also might be an exam question, or at least sometimes I have it in the exam. This is one of these many little proofs: this single number is exactly 100% or 1 if both of these are 100%, this is the only way it can be 100%. So it's a good single measure. You will evaluate it, and then you get for each class a number which tells you how well you are doing with respect to that class. One more thing before we dive into the algorithm, what's the difference to what we did in lecture 2? We had quality evaluation, precision at k, average precision, discounted cumulative gain, BPREF. Well, the setting there was that the ground truth was a set, you have a query and here are my relevant documents, but what you computed was a ranked list. When you have a ranked list and your ground truth is a set, then these measures from lecture two are the right measures. What we have today is that the ground truth is a set, all comedy movies, and what you predict is also a set.
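As a small illustration of precision, recall and F1 for two sets, here is a sketch in plain Python. The function name `precision_recall_f1` and the movie ids are made up, and returning 0 for an empty set is a convention chosen here, not something stated in the lecture.

```python
def precision_recall_f1(truth, predicted):
    """Precision, recall and F1 of a predicted set against a ground truth set."""
    overlap = len(truth & predicted)
    # Convention (an assumption): a measure is 0 if its denominator is 0.
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical ids: movies that really are comedy vs. predicted as comedy.
really_comedy = {"m1", "m2", "m3", "m4"}
predicted_comedy = {"m3", "m4", "m5"}

p, r, f1 = precision_recall_f1(really_comedy, predicted_comedy)
print(p, r, f1)

# Harmonic mean check: 1/F1 is the average of 1/P and 1/R.
print(1 / f1, (1 / p + 1 / r) / 2)
```

Here two of the three predicted movies are correct (precision 2/3) and two of the four real comedies are found (recall 1/2), which gives F1 = 4/7.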
Here's everything I recognized as comedy movies. And in that case you use precision recall and F1. And here is a comment which you may not fully understand now but as you work on the sheet and later, very often you find papers or also maths thesis where you just get a table and then it just says here and here is precision and recall and F1. So if you say precision, recall, F1, you always have to say what are these two sets. Otherwise one cannot understand what you are doing. It's always with respect to two sets, right? You have the ground truth set and what you predict. That's what the red comment is about. Is there any question? This was just some basic stuff, setting the scene. Yes please. So in clustering we have hard and soft clustering, do we have something similar in classification or do you always classify it to one thing? No actually we have exactly the same thing and it will be apparent today. You will always have a soft score, I mean you can always turn soft into hard by just taking the largest one, right? And that's what we will do today. So naive Bayes will also give you with probability so and so it's this, so and so it's this, you can just leave it there and in all of deep learning you also have this, you always get these scores and now you can either say okay take the largest one, then it's hard or leave it soft. So very good question, you basically always have that distinction. Are there any other questions or comments? So Naive Bayes is based on probabilistic assumptions so and for that you should understand two things which you should have heard before at some point, kindergarten, school maybe, maximum likelihood estimation and conditional probabilities of Bayes' Theorem. Very very basic fundamental stuff, maybe a bit rusty. So I will give you a crash course as usual. So maximum likelihood estimation. That's just to understand what this is, a very simple example. 
Coin flips, heads or tails, somebody has flipped the coin 20 times, each flip independent from all others, 5 times heads, 15 times tails. And now I have to say, is this a fair coin, or more specifically, which probability distribution of heads and tails is the most likely? Ok, I mean there is 5 times H, if you would have to guess you would probably say this, right? It doesn't look like a fair coin, with a fair coin I would expect 10, 10, but it's 5, 15, so it looks like tails is 3 times more likely than heads. It looks like this, but is it like this? In which sense are these probabilities the most likely? And this you can do mathematically, and let's just do this. I don't have anything else here so I can start writing here. So p is the probability of heads and then the probability of tails is just 1 minus p. I just have one degree of freedom here because it's always a probability distribution. So then the probability of tails is 1 minus p and I want to find the best p. It's again sheaf. Ok, so what's the probability of this sequence, H, H, T, T and so on, the sequence above, in terms of p? So what's the probability of seeing exactly this sequence, if p is the probability of heads? Yes? I think it should be that we take the probability of H, that means p, to the power of the number of times that H occurs. Which is? 5. Yeah, p to the power of 5, that's good. And 1 minus p to the power of 15, yeah, that's correct. So combinatorics in a nutshell is figuring out where to multiply, where to take to the power and where to add. So if you don't know combinatorics then you are guessing, 5p, p to the 5, p plus 5, but that's correct. You are multiplying the probabilities, there are 20 probabilities here which you multiply, and 5 times it's p. That's correct. And so now we want to find p such that, let's just call this star, star is maximized. That's what maximum likelihood estimation is.
And that's a whole mathematical field and also a practical field. You observe something, you have underlying probabilities, and you ask: when is this observation the most likely? That's maximum likelihood estimation. Find the parameters such that the thing I observe is the most likely. Okay, first thing we do equivalently, and we have already seen this: I don't maximize star directly, because now I want to compute derivatives, and the derivative of p to the power of 5 times 1 minus p to the power of 15 is not so nice. Let me just take the logarithm of this. So equivalently, find p such that f of p, which is ln of p to the 5 times 1 minus p to the 15, is maximized. And this is the same, because ln is a monotone function, right? Where this is maximized, the original one is also maximized. That's easy to prove and we don't prove it now, we have already used this before. Okay, and now you see why in all of deep learning you always take the log: because this is much easier to handle, and that's not the only reason we use the log. So f of p, and these are the logarithm laws, you should know them by now, is 5 times ln p plus 15 times ln of 1 minus p. So the product becomes a sum and the powers become constant factors, much nicer. Okay, let's take the derivative. f prime of p is: what's the derivative of ln p? It's 1 over p, so we get 5 over p. That's why we take the ln, because it has a particularly nice derivative. And for the second term we have 1 over 1 minus p, and the inner derivative is minus 1, so it's minus 15 times 1 over 1 minus p. Yeah? Okay. And now we want to find the p where this is zero, where the derivative is zero. So this is again one of these simple calculations which one cannot practice enough. So we have f prime of p equals zero, let's write an equivalence here, that's equivalent to saying 5 over p is equal to 15 over 1 minus p.
And this is a... And let's not take p equal to zero or p equal to one, because if p is equal to zero, f of p is what? If p goes to zero, where does this go? Minus infinity: it's ln 0, it's minus infinity. And when p goes to 1, where does it go? Same, yeah, exactly. That's also very typical, it's p times 1 minus p, it's kind of symmetric, not exactly symmetric, but similar things happen on both ends. So we don't look at these. We want to find a maximum value, so if we find a finite value in between, which we will, it will be larger than minus infinity, so let's not bother with the borders. Now if p is not 0 or 1, then it's really equivalent, so we can multiply out: 5 times 1 minus p is equal to 15 p. This is equivalent to 5 minus 5 p equals 15 p, which is equivalent to 5 equals 20 p, and you can already spot the solution, which is equivalent to p equals 1 over 4, which is exactly what we guessed in the first place. So the only zero of f prime is 1 over 4. One should also check that the second derivative is negative at p equals 1 over 4, so that this is really a maximum. We don't do that here, but you can easily compute it. So p equals 1 over 4, which is what we wanted to prove. So this was simple but also not trivial, and this was just one variable. Now imagine this with several variables: you don't have heads or tails but maybe a die which you roll, six possibilities. Now you have five degrees of freedom, or you have six probabilities which together sum to one, and then this becomes again Lagrangian optimisation, which crops up a lot in this context and which you have also seen. OK, so that's maximum likelihood. A comment on the ln: I think we mean that ln is strictly monotone, because the zero function is also monotone and wouldn't work. Very good, it's true.
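The coin-flip derivation above can be checked numerically. The following is a minimal sketch, not part of the lecture material: the function name `log_likelihood` and the grid of candidate values are assumptions made for illustration. It evaluates f(p) = 5 ln p + 15 ln(1 - p) on a grid strictly between 0 and 1 and picks the largest value.

```python
import math

def log_likelihood(p, heads=5, tails=15):
    """f(p) = ln(p^heads * (1-p)^tails) = heads*ln(p) + tails*ln(1-p)."""
    return heads * math.log(p) + tails * math.log(1 - p)

# Candidate values strictly between 0 and 1 (the borders give minus infinity).
candidates = [i / 1000 for i in range(1, 1000)]

# The candidate with the largest log-likelihood.
best_p = max(candidates, key=log_likelihood)
print(best_p)  # 0.25
```

The grid search agrees with the closed-form result, heads divided by the total number of flips, which is 5 over 20.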
This is very true, it has to be a strictly monotone function, yes, because otherwise you are making things equal which were not equal in the original one. Yes, very good. Strictly. Which ln is, right? Just so that everybody notices, I need more space at the bottom. Ln looks like this: ln x, and this is one. Okay, any more questions or comments before we go to the next mini probability crash course? And the next one is conditional probabilities. I know people have trouble with this, but actually it's very easy, and when you want to understand a basic concept it's always good to look at a simple non-trivial example, which I will do now, and then try to understand it for that example and work your way up from there. Like computing with two-digit numbers in your head. Start with the simple stuff and work your way up from there. So let's assume we have a probability space and two events. Let me just ignore this for a second, let's start with an example right away and then let's see what's written on the slide. So our random experiment, and that's always a good one, is: roll a die. Roll a die once, just once. Which means our omega, and you don't understand probability theory unless you are always clear about what's my omega, what's the space of possible things that can happen. And you have to give these names; here we just give them the natural names, which is just the numbers 1, 2, 3, 4, 5, 6. And that's also such a basic thing about math. When people have problems with math it's often because basic concepts are not clear. It's not clear in your mind what the probability space is, and when it's not clear you can't compute. It's like when you program a computer, you have to be very specific. So the probability space is a set, the set of possible elementary outcomes. And for each x in omega, the probability of x is 1 over 6. So that's a particularly simple example.
It doesn't have to be equiprobable, here it's equiprobable. Now let's look at two events. And an event is a subset of omega. Let's look at this event A, and you can also give events names, but it's not important. This is the event "even number". The event, I'm sorry, that an even number is rolled, which means 2, 4 or 6. An event is just a subset of omega. And let's look at another event B, which is another subset, and that's just "small number". So this is also a subset of omega. It could be the whole set, but here it's: I roll a small number, namely less or equal 3. And now, before we talk about conditional probabilities, let's just talk about normal probabilities. What's the probability of A? If you have these elementary events and they are all equally likely, then it's just the size of A divided by the size of omega, so just do counting. So in this simple setting, and eventually all of probability reduces to this simple setting, it's just combinatorics. It's just counting how many things do I have in my whole set, how many things do I have in my event. I know this is very basic, but I think this is good for many of you. So that's just 3 over 6, which is 1 over 2. And the same thing for the probability of B. If the problem becomes more complex, immediately people struggle. And the only reason they struggle is because it's not clear what they have to compute. As soon as you ask yourself what's the omega, what are my events, what are the subsets, then you're back to counting, and then maybe you have a counting problem, maybe not, but at least you are on a level where you can make progress. So this is the relative size of B, which in this case is also 3 over 6, it's 1 over 2. And now, in this context you can understand very easily what a conditional probability is. So A given B.
So what does this mean? After this vertical bar we have what is given, which means we have 1, 2 or 3. This is given, we are already in a smaller space now, it's 1, 2 or 3. Now what's the probability that A occurs? So what would you say? We are in the space where only 1, 2 and 3 are possible, so what's the probability that also A occurs? And maybe before I ask you, let me write down the definition of this. Now I only look at the part of A that's actually in B, because I'm already in B world now, right? So I take A intersect B, and it's very natural to define it that way because my probability space is now only the B's. And now I take the relative size over all B's. So I have three things in B and now I just have to look: of these three things, how many are in A? How many are in A? Yeah, just one. So it's just one over three. Let me just do a transformation and then I prove Bayes' theorem for this simple setting, but the general proof is not more complicated. I can just divide by the size of omega both in the numerator and the denominator. So this is the size of A intersect B divided by the size of omega, over the size of B divided by the size of omega; I don't write a double fraction here. And what is this? This is the fraction of omega that each of these takes up. This is just by definition the probability of A intersect B divided by the probability of B. Yeah? And actually I haven't written this here. What's the probability of A intersect B? It's the size of A intersect B divided by the size of omega, and A intersect B is just one element, right? It's one over six. So this would also work. I mean, obviously that's 1 over 6 divided by 1 over 2, which is also 1 over 3. And Bayes' theorem is also sometimes formulated in a second way. So from this, let me just put it up here.
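The die example above reduces to counting, which a few lines of code can make concrete. This is a minimal sketch; the helper names `prob` and `cond_prob` are assumptions made for illustration, and exact fractions are used so that the results match the numbers on the board.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}   # all elementary outcomes of one roll
A = {2, 4, 6}                # event: even number
B = {1, 2, 3}                # event: small number (less or equal 3)

def prob(event):
    """P(event) = |event| / |omega| for equally likely outcomes."""
    return Fraction(len(event & omega), len(omega))

def cond_prob(event, given):
    """P(event | given) = |event intersect given| / |given|."""
    return Fraction(len(event & given), len(given))

p_a_given_b = cond_prob(A, B)
print(p_a_given_b)            # 1/3
print(prob(A & B) / prob(B))  # 1/3, same value via P(A and B) / P(B)
```

Both ways of computing it give one third: counting inside B directly, or dividing the probability of the intersection by the probability of B.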
I mean what this says, what I've proven here, is: the probability of A intersect B is, if I just bring the denominator to the other side, the probability of A under the condition B times the probability of B. But since A intersect B and B intersect A is the same thing, I can also reverse the roles of A and B here, right? It must be the same as the probability of B under the condition A times the probability of A, by the same logic, which means these two are the same, and that's Bayes' theorem. So it looks like, okay, what is this, how is it proven? It's this very basic thing: you are in a constrained space. Actually the most fundamental math in the universe is always like this. If you boil it down, Heisenberg's inequality is also something as simple. The most basic laws of nature are like super simple math, if you just start from basic things. Only if you look at them from the abstract side do they maybe look scary. Any questions about these conditional probabilities? We will use them in the following. OK, so let's start. So this is the second part of the lecture, let's maybe start a little bit and then have a break and then we continue. So there are just these 13 slides left. Naive Bayes. This is our setting now: here's the training data. This is what we are given, we want to learn from this, texts with labels. So now we assume some underlying probability distribution. It might not be there, we just assume it, it's an assumption, which means it's wrong. We just assume it so that we can do mathematics. And then we just do the mathematics whether our assumptions are correct or not. We also did this with latent semantic indexing when we used the Frobenius norm. We made some implicit assumptions. They are not right, but still something useful comes out. We assume that we have a probability distribution over the classes.
Some classes are more likely than others and we will try to estimate this. We will see this in a second. For each class we are assuming a probability distribution over the words. So maybe for horror, the words bloody, spurt, splatter, I don't know, will be more likely than other words. So we have a probability distribution over the words. And now we assume that documents are generated by the following process, and you will see how that assumption comes into play in a minute or after the break. You have a question now? About p of w given c, yes. I think it's not clear right now. First, this is just a probability distribution over words for a particular class. Every class has a different distribution, and now I will explain how we use that probability distribution, namely to generate a document. I want to generate a document. First I pick a class according to this probability distribution over classes. Let's say it's equidistributed. We have five genres and I pick comedy with probability one over five. And now I pick one word after the other, independent of the other words, and I pick each word according to this probability distribution. So I just say, let me pick the first word. And I just pick it with this probability distribution. And if I've picked class horror, then I will pick the words blood, spurt, splatter with higher probability than some other word. And now I do this one after the other. And what I haven't said here, and I'm wondering myself right now, is when I stop. Let's just ignore that for a second. The length of the document I think is just fixed. This is not part of the model. Let's talk about this later, if it becomes an issue. I have a document of length 12, so now I want 12 words, and each word I generate independently of the others according to that probability distribution.
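The generative process just described can be sketched in a few lines. This is only an illustration under stated assumptions: the two genres, the tiny vocabulary and all probabilities below are made up, the function name `generate_document` is hypothetical, and the document length is passed in as a fixed number because the model itself does not say when to stop.

```python
import random

def generate_document(class_probs, word_probs, length, rng):
    """Pick a class, then pick `length` words independently for that class."""
    classes = list(class_probs)
    # Step 1: pick the class according to the distribution over classes.
    c = rng.choices(classes, weights=[class_probs[k] for k in classes])[0]
    # Step 2: pick each word independently according to the class's word distribution.
    words = list(word_probs[c])
    weights = [word_probs[c][w] for w in words]
    return c, rng.choices(words, weights=weights, k=length)

# Hypothetical toy distributions, not taken from the lecture data.
class_probs = {"comedy": 0.5, "horror": 0.5}
word_probs = {
    "comedy": {"funny": 0.5, "movie": 0.4, "blood": 0.1},
    "horror": {"funny": 0.1, "movie": 0.3, "blood": 0.6},
}

label, doc = generate_document(class_probs, word_probs, length=12, rng=random.Random(0))
print(label, " ".join(doc))
```

Running this a few times with different seeds shows how unrealistic the output is: repeated words and no sentence structure, which is exactly the point of the bag-of-words assumption.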
And I really want you to understand how unrealistic that model is. This means I could now generate blood, blood, blood, blood, blood, or and, and, the, the, and, the, blah. In two ways this is super unrealistic. One is: words are not independent, right? When I have science documents and I have chosen relativity, then probably theory will be more likely. This model where you just have independent words is also called bag of words, but in reality words form sentences, they depend on each other. And also you have word order, right? With a normal sentence, you can't just permute the words or sort them lexicographically, then it's not a sentence anymore. So it's a super unrealistic model, but that's very typical in all of learning. You have these super unrealistic assumptions, but at least you have something to work with, and you can do the math, and then you just do it and hope for the best. Is there any question about this simple model? And that's why it's called naive. Why it's called Bayes we will see in a second, but it's certainly naive. Yes please. Is p of w, c now the probability of w intersected with c, or how do we translate it? We will have an example in a second and then it will become super clear. We have an example where you just see all these probabilities. I could now give you another example, but then I would just repeat what you will see in a second. So if it's not clear, the example will help you. Let's maybe do the example and then the break, that's a good way to have a break. So we have a number of things now. When I'm in training I can just count. Let's look at this one here: I just count how often I have comedy, how often I have romance; I call that T c. This is now combinatorics, counting. And now I do another count: in all documents that are comedy, how often does the word Punjabi occur? You can just count it, right? Maybe it occurs three times in comedy plots.
And then I count how many words all documents from class comedy have altogether. I count this. And now I just do the following. This is quite natural: my probability for a class I estimate like this. So maybe I have 10,000 movies, 4,000 are comedies, then my p of comedy is 0.4. And it's not equidistributed on purpose here. And the same for the word probability: maybe all the comedy movies together have 1000 words and just one of them is Punjabi, then the probability for Punjabi is 0.1%. And maybe 10 of them are the word movie, then movie has 1% probability. And that I take these proportions is exactly maximum likelihood. If I don't know anything else, this is exactly what I did with the heads and tails thing. This is the most likely underlying distribution to generate what I saw. We can just compute these numbers. And this is just a forward pointer to later: these of course can easily be zero, right? Maybe blood never occurs in a comedy movie, then p of blood for comedy would be zero, because the count n would be zero. We will see about that. And now let's do the example. And after the example we have a break. We now do the training of Naive Bayes for this example. And the first thing we do: we just count how many documents of class A I have. You tell me all the numbers. 3, yes. How many documents of class B do I have? I also have three. Do I have a variable for this? Oh yeah, training set T of objects, ok, I gave it a name, which means I have six objects in T in total, which means the probability for A is 3 over 6. It's super simple if you write it down like this: they are just one half, both of them, right? P of B is the same. These are maximum likelihood estimates. So in my model I just take these to be one half.
Now I count, and that's your question, it becomes absolutely clear now. So I have two words, only two words, my vocabulary is a or b. First, the number of a's in documents of class A. So I just look at all the documents of class A, there are three of them, and I just count how many a's I have. How many a's do I have? Ten. I think that's correct, 10. And let's also count the number of b's in A. How many b's do I have in documents of class A? I hear 5, I hear 8, should we take the average or harmonic mean, maybe geometric mean? I think it's five, right? Just so that we are all on the same page, in A it's this one, this one, this one, this one, this one. And you see the order doesn't matter, it doesn't even matter in which document they occur. Very strange thing we are doing. And then we add another number, which is just how many words overall are in the documents of class A. How many words are in A? Yeah, 15, it's just the sum, right? And I can use these to estimate my probabilities. So this is now the probability of a in A, so this is my p of w for c; the capital letter is the class, the small letter is the word. And by the maximum likelihood estimate it's just the proportion of this word over all words, which is 10 over 15, just 2 over 3. And the probability of b in A is just n of b in A over n of A, which is one third. Yeah, I just have two words a and b and my probabilities have to sum to one. Two thirds of the words in documents of class A are a, one third of the words are b. Simple enough. Let's do the same thing for class B. How many a's do I have in documents of class B? One, two, three, four, five, six. Six, I think you are correct. How many b's do I have in documents of class B? 12, wow, what a coincidence. And how many all in all? 18, the sum of them. And now let's compute the probabilities again. Let's be less generous with the space here. P of a in class B is just the proportion of n of a in class B over all words in class B. So it's one third.
So one third of the words in documents of class B are a's, and then the rest must be two thirds. So p of b in class B is just n of b in B over n of B, one has to concentrate here so as not to make a mistake, and that's just two thirds. So, just to summarize: as the result of the training we have learned six numbers. We have learned the distribution over the classes. The rest was just intermediate, we don't care about the n's anymore. We have a probability distribution over classes, and for each class we have a probability distribution over the words. So for class A we have this probability distribution: two thirds are small a's and one third are small b's. And for class B it's exactly the other way around: p of a in B is one third and p of b in B is two thirds. And I hope it's absolutely clear that you can do the exact same thing if you have 17 classes and 500 words. Nothing is different, this was just to keep it small. So that's the result of the training, you just compute it from the counts. That you take these proportions was just the maximum likelihood principle. If you just have the counts, this is the most likely distribution to assume. We haven't used Bayes' theorem yet. Is there any question about this before we go into the break? Or you think about questions in the break. So, five minute break. We are back online again. Two degrees colder. And here the T is missing in training. So we were talking about Naive Bayes, we did the training, the result of our training was just six numbers. Did any question occur to you on the training process and on these numbers? So we have both classes equally frequent and hence equally likely, and in the A documents we have two thirds a's and one third b's, and in the capital B documents the other way around. Simple enough. And the same you will have with words and genres for the exercise sheet.
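The training step just summarized is nothing more than counting and dividing. The sketch below assumes a function name `train` and a set of six toy documents that are hypothetical: the actual documents are on the slide and not in the transcript, so these were constructed only to reproduce the counts mentioned above (10 a's and 5 b's in class A, 6 a's and 12 b's in class B).

```python
from collections import Counter, defaultdict
from fractions import Fraction

def train(labeled_docs):
    """Maximum likelihood estimates for p(c) and p(w | c) from (label, text) pairs."""
    doc_counts = Counter()              # number of documents per class
    word_counts = defaultdict(Counter)  # n_wc: occurrences of word w in class c
    for label, text in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(text.split())
    total_docs = sum(doc_counts.values())
    p_class = {c: Fraction(n, total_docs) for c, n in doc_counts.items()}
    p_word = {}
    for c, counts in word_counts.items():
        n_c = sum(counts.values())      # all words in documents of class c
        p_word[c] = {w: Fraction(n, n_c) for w, n in counts.items()}
    return p_class, p_word

# Hypothetical documents chosen to match the counts from the lecture example.
training_data = [
    ("A", "a a b a a"), ("A", "a b a a b"), ("A", "a a b b a"),
    ("B", "b b a b b a"), ("B", "b a b b a b"), ("B", "a b b a b b"),
]

p_class, p_word = train(training_data)
print(p_class["A"], p_word["A"]["a"], p_word["A"]["b"])  # 1/2 2/3 1/3
print(p_class["B"], p_word["B"]["a"], p_word["B"]["b"])  # 1/2 1/3 2/3
```

Note that the word order and the document boundaries inside a class play no role in the result; only the totals per class matter.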
Now we do prediction. Now that we have learned these six numbers, how do we do prediction? And now comes the Bayes part of Naive Bayes; we have already talked about the naive part. We want to compute this. We are given a document. With conditional probabilities, you can understand it as: the right side you are given, the left side you want. That's not quite correct, but it's a first approximation. We are given a document D, what's the most likely class? Now we can turn this around. You can always derive Bayes' theorem yourself, you don't even have to remember it. I don't remember it, I always derive it myself like this. You can say probability of A under the condition B times probability of B, or you can do it the other way around, probability of B under the condition A times the probability of A. So this way or that way around, and what that means is: if you want the one thing, the probability of A under the condition B, you get it from the other thing by just bringing the factor to the other side. It's the probability of B under the condition A times the probability of A divided by the probability of B. And if you wonder whether it's this way or the other way around, just derive it from first principles again. So that's how we are using it. We want it this way around: given a document, what's the class? What we know from training, you will see in a second that we can derive it from training, is: given a class, what's the probability for that document? These were our assumptions, they were in the direction: I have a class, now generate a document. So this is what we are given, and here we have these probabilities, the probability of a class, we have learned those, and this thing we have no idea what it is, but we will see in a second that it doesn't matter. Now what's this? The probability of seeing that document is the probability that the first word generated is the first word of the document, and so on for the other words.
So now we are asking ourselves what's the probability of seeing this. It is the probability that the first word is comedy, that's just the probability of comedy for class comedy, times the probability of movie for class comedy, times the probability of is for class comedy, and so on. So we just multiply these. Here this is a capital Pi, the product sign, but this is nothing else than what we have here. So that's the probability of word 1 for class C times the probability of word 2 for class C and so on. Of course w1 and w2 could be the same, you could have a word repeated here. It's just the product of these probabilities. No, it's not what we did for the example yet, I will show it to you in a second. So that's just the probability of seeing that document, it's just the product of these word probabilities. And this was just this part here. Then I have to add this, which happens on the next slide. So the probability which we are actually interested in is the product of the word probabilities for the words we have in the document, times the probability of the class, divided by the probability of seeing the document independent of the class. Now I will come back to that later, ignore that note for now, it's just for reference. This denominator is the same for all classes. We don't know how to compute it, but we don't have to compute it, because in the end we just want to find the class where this is the largest. So if there is always a factor of 0.3547 for each of them, we can just ignore it. It's similar to the monotone function when finding the maximum. We just want to find the maximum class in the end. So we will just ignore this, which means we will just compute this thing, which is in bold here, for every class, and then pick the one where it's largest.
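Written out, the prediction rule described above looks as follows. The notation is an assumption made for this summary (the slide itself is not in the transcript): d is a document with words w_1, ..., w_m, c is a class, and the last line uses the fact that P(d) is the same for every class.

```latex
\begin{align*}
P(c \mid d) &= \frac{P(d \mid c)\, P(c)}{P(d)}
  && \text{(Bayes' theorem)} \\
P(d \mid c) &= \prod_{i=1}^{m} P(w_i \mid c)
  && \text{(naive independence assumption)} \\
\hat{c} &= \operatorname*{arg\,max}_{c} \; P(c) \prod_{i=1}^{m} P(w_i \mid c)
  && \text{($P(d)$ does not depend on $c$)}
\end{align*}
```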
And we will just do that for an example now. And you will see how easy it is. What did we learn? Let me write it down again. We learned these probabilities during training. P of A is one half, P of B is one half, let me write them down again. This I think was two thirds, I hope so, right? This was two thirds. This was one third. Now that's the third time that I'm writing them down, and I think the last time in this lecture. This is what we learned. And now we use it to predict the probability of class A for the document a a b. And my notation is not 100% consistent here, I'm not writing class is equal to this, I'm just writing what I observe. So what's the probability of class A for the document a a b? By the formula on the previous slide, I can just turn it around. I take the probability of this document given class A, and this is something I can compute, times the probability of class A, divided by the probability which I cannot compute, which is the overall probability of seeing that document. And this probability of a a b given class A I can break down: this is the probability of word a for class A, times this probability again, times p of b for class A. You see there is no problem with repetitions here, and you see a nice thing which I will come back to later. If I have a word here five times, I will have that probability to the power of five. And the five is just the term frequency, right? So the term frequency will appear as a power here, if I would write it that way. And then times the probability of class A, divided by this probability which we do not know and do not have to know. And let's plug in the numbers, so that's two thirds times two thirds times one third times one half, divided by the mysterious probability, which we don't know. And now we do the same thing for class B and a a b. Again we turn it around: what's the probability of seeing this document?
It's the probability of the same document for class B, multiplied by the probability of class B, divided by the mysterious a a b probability. And here we use Bayes, right? Maybe I should write it down: this is using Bayes' theorem, and we are also using it here. No, that's not true what I said here, I think I should be more careful. Maybe I should do it like this: this step is using Bayes' theorem. And this step here is using the independence assumption, right? The independence assumption from Naive Bayes. So in a sense you could say this is the Bayes and this is the naive, right? I could have also written it that way. The first one is the Bayes and the second one is the naive of Naive Bayes. Ok, now I compute, and I do the same thing here but now relative to class B. So it's p of a for class B, times p of a for class B, times p of b for class B, times the probability of class B, divided by the mysterious probability. That's one third times one third times two thirds times one half, divided one last time by this mysterious probability. So now I don't have the exact numbers, but I do have relative numbers. What I can compute now, let me do that at the top, is the odds ratio, the quotient of the two. Let's take the probability that it's class A divided by the probability that it's class B. I just want to know which one is larger, so I want to see whether I get a number larger than one, smaller than one, or one; if it's one, it's a tie. And the nice thing is, if I divide them, then this strange probability which I don't know just falls away, it cancels out. And what do I get? Let's even write it down very explicitly. In the numerator I get two thirds times two thirds times one third times one half. And in the denominator I get one third times one third times two thirds times one half. So you don't even have to compute them, because most of this cancels out. And the result is 2. Which means, what do I predict? A.
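The worked prediction above can be reproduced directly from the six learned numbers. This is a minimal sketch; the names `score` and `predict` are assumptions made for illustration. The unknown probability of the document is left out entirely, since it cancels in the comparison.

```python
from fractions import Fraction
from math import prod

# The six numbers learned in training.
p_class = {"A": Fraction(1, 2), "B": Fraction(1, 2)}
p_word = {
    "A": {"a": Fraction(2, 3), "b": Fraction(1, 3)},
    "B": {"a": Fraction(1, 3), "b": Fraction(2, 3)},
}

def score(words, c):
    """p(c) times the product of p(w | c); the common factor 1 / p(d) is omitted."""
    return p_class[c] * prod(p_word[c][w] for w in words)

def predict(words):
    """Class with the largest score (on a tie, the first class listed wins)."""
    return max(p_class, key=lambda c: score(words, c))

doc = ["a", "a", "b"]
print(score(doc, "A") / score(doc, "B"))  # 2
print(predict(doc))                       # A
```

The quotient of the two scores is the ratio computed on the board, and it is larger than one, so class A is predicted.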
So for this document a a b I predict A. We will come back to this in the next lecture, it's not important for now. What do you think, is there a simple way, by just looking at the document, to see when it will predict A and when it will predict B? Yeah, that's exactly what it will do: it will just predict A when there are more a's and B when there are more b's, and when they are the same it will be a tie, and you can actually prove that from these numbers. And it's fairly easy to see. This was not so obvious when we started this, right? But this is what Naive Bayes does in this case. And there is also a deeper truth behind this. So, that's Naive Bayes. It's as simple as that. This was an example with two words and two classes, now you do the same thing with five classes and ten thousand words or so. Yes? Exactly. You see here all these assumptions come into play, right? The order is totally irrelevant, we could sort all the documents, and there is also a hint on the exercise sheet about this. And you see it here because it's just the numbers multiplied, right? The order is ignored, which is totally unrealistic, naive. But still, yeah. Also in deep learning, the models driving ChatGPT, oh my, I have no power here. Actually a paper just came out about this: these transformer deep neural models have positions somehow encoded, and now someone has shown that you don't need it. You can even ignore the position; the neural networks which drive things like ChatGPT learn it themselves, you can even give them this stuff in another order. So that's the basic working of Naive Bayes, and now come some refinements which are important for the exercise sheet, so you should definitely listen. Maybe before I go to that let me show you the exercise sheet, here it is.
It's pretty obvious what you should do. You're getting these collections, one for training and one for testing, and I hope it's clear that for training you can of course use the labels. In training we use them to compute our probabilities and stuff, and when you test you just use them to evaluate your algorithm. When you test, you just get this sentence here and you are supposed to predict this label, and you use the true label only to evaluate how well you did. And so what you should do: you should write code which does the training, which computes these numbers, the class distribution and the word distributions; you write a method predict which does the prediction as we just did for the simple example; and then you write a method to evaluate it. And what you should also do is output the words with the highest probabilities, because that should give you an idea of how well the method did. So it's train, predict, evaluate. And then you should just run your code on the data. That's exercise 4, and then there's again a table. Ok, and here are some important refinements. They are so important that if you don't do them, it won't work. So one thing is: when you see a lot of probabilities being multiplied, there are two things, and that's universal, which can go wrong or which are a problem. One is: you have a thousand probabilities here, one of them is zero, everything is zero. Is that a problem? Yes, that's a problem. Think about it. Your comedy class, as I said earlier, maybe blood never occurs in comedy. And now in prediction you have the word blood, and the prediction for that class will be zero. Even if all other words shout that it's a comedy. Just one word will make your whole probability zero. That's bad and you need smoothing here. And the way to do smoothing is: when you compute these probabilities from the counts, you just add an epsilon.
That's a very heuristic method, and let's just briefly understand why you also add something in the denominator. So you just add an epsilon to every count n_wc, which means it's now strictly positive, and just understand what I write here. Let's do the math quickly. If I sum these up for a class, so if I sum n_wc plus epsilon over all words w, I get the sum over all w of n_wc plus the sum over all w of epsilon. And the first part is just the number of words in all the documents from class c. We have already seen it, that's just n_c. And the second part is just the number of words times epsilon, so epsilon times the size of the vocabulary. Which is why I have to add it here in the denominator; if I don't add it here I don't get a probability distribution, as simple as that. So for the exercise sheet, it's also written on the sheet, just take 0.1. That's one of these hyperparameters which can make a huge difference in practice. I mean, if you take it very large then you smooth out everything, you make everything equal. Communism probably doesn't work. If you make it very, very small, then you are not very far away from the zero case. Then again, if a word doesn't occur, the probability is so small that that class never gets predicted. Now there is another case: there are these word probabilities, and there is also the class probability. This can also be zero. Let's look at that case too. So what about p_c equals zero, is that a problem? What do you think? What does it mean that in training p_c is zero? Yeah? We don't have to divide, right? Here we just multiply, we only divide by a strange probability which we ignore. Yeah, exactly: p_c is the count of that class over all things, so p_comedy equals zero means we didn't see a single comedy plot. If we didn't see a single comedy, how can we predict it? We know nothing about it. So in that case it makes sense, right? We didn't ever see it. And that's written on the slide. So we didn't see it, and so it's actually meaningful.
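The smoothing step just described can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the count matrix `counts` (one row per class, one column per word) and the function name `smoothed_word_probs` are made up for this example and are not the names used in the exercise code.

```python
import numpy as np

def smoothed_word_probs(counts: np.ndarray, epsilon: float = 0.1) -> np.ndarray:
    """Return p_wc = (n_wc + epsilon) / (n_c + epsilon * |V|).

    counts[c, w] is n_wc, the number of occurrences of word w in class c
    (hypothetical layout chosen for this sketch).
    """
    n_c = counts.sum(axis=1, keepdims=True)   # total number of words per class
    vocab_size = counts.shape[1]              # |V|
    return (counts + epsilon) / (n_c + epsilon * vocab_size)

# Toy counts: 2 classes, 3 words; word 1 never occurs in class 0.
counts = np.array([[3.0, 0.0, 1.0],
                   [0.0, 2.0, 2.0]])
p_wc = smoothed_word_probs(counts)
print(p_wc)              # every entry is strictly positive
print(p_wc.sum(axis=1))  # each row sums to 1, so each row is a distribution
```

Leaving out the `epsilon * vocab_size` term in the denominator would make each row sum to more than 1, which is the point made above.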
I mean, depending on the application you could argue, ok, let's sometimes predict comedy even if we never saw it, but I think here it's reasonable to say: if we never saw it, how can we predict it? We never predict it. We should ensure that we see it at least once. So at least for multi-label classification with few labels that's very reasonable. We expect to see everything a few times at least. Here's another important thing. I said two things are dangerous when you multiply a lot of probabilities. One is zero, the other one is numerical. If you multiply many probabilities, and think about it, the probabilities in our toy example are two thirds, but even two thirds, if you multiply it a thousand times, gives you small numbers. But here it's a probability distribution over a vocabulary with ten thousand words. The probabilities will be something like 0.0001 and you multiply a lot of them. And think about it, what's the smallest number that you can represent? That's actually a very interesting research topic by itself. The smallest positive number you can represent with an 8-byte double is 2 to the minus 1074. That's actually smaller than the 2 to the minus 1024 which you would expect because you have an 11-bit exponent, maybe let me write that here, I think I have a link below. So if you know how floating-point numbers work: for the exponent you have 11 bits, 1 bit for the sign, which means you have 10 bits left, which means you would expect 2 to the minus 1024, and then you have a thing called subnormal representation. So if you are interested in that kind of stuff, it's actually super interesting, and it's also relevant when you're programming low-level C++; it's quite important to understand how numbers are represented in a double. The IEEE standard has all kinds of fancy stuff. It's a bit like UTF-8, a lot of magic going on there.
So you have subnormal representation, so that you can represent smaller numbers than the exponent bits actually give you, and there is a Wikipedia article I've linked to here. There are also things like: you can store infinity, not a number, minus zero, plus zero, lots of interesting things. So this is a number which you reach fairly easily. Let's say you have probabilities which are less than one over ten thousand, that happens quickly if you have a ten-thousand or one-hundred-thousand word vocabulary. You multiply about 80 of them, which also happens quickly, you have a document of length 80 or so, and the product is already zero. So it happens quickly. So what do you do? You just take the logs, then your product of probabilities becomes a sum of logarithms. And then the problem goes away. So the logarithm shows up here as a saviour in yet another way. Not only is it easier to derive things and products become sums, but some numerical stability problems also disappear. Yeah, because it's monotone, it's just as good, right. You find the thing with the largest log or the thing with the largest non-log, same thing. And I think your comment is that it should be strictly monotone. That's important here too, it wouldn't work if it weren't strictly monotone. And God forbid, I mean, you could now say, ok, just to compute it let me take the log, and in the end I take exp again so that I get the real probability. Don't do it, because exp, let's just maybe verify it here. python3 -c, let's just write: import math, print math.exp. And now we are talking about small numbers, so minus 700 should still work. Yeah, that's a small number. What did I claim? 746, so minus 745 should still work. That's still a positive number but already very small. Minus 746 is already 0, right? And minus 746 is not a large number. So you say, yeah sure, of course I can take the exponential of that. No, you can't, that's of course zero.
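The numerical point above can be reproduced with the standard library alone. A small sketch, where the probability `1e-4` and the document length `100` are illustrative values chosen for this example, not numbers from the exercise:

```python
import math

p = 1e-4   # a typical per-word probability (illustrative)
n = 100    # document length (illustrative)

# Multiplying the raw probabilities underflows to exactly zero ...
product = 1.0
for _ in range(n):
    product *= p

# ... while the sum of logarithms is a perfectly ordinary number.
log_sum = n * math.log(p)

print(product)   # 0.0
print(log_sum)   # roughly -921

# Converting back with exp does not help: the result underflows again.
print(math.exp(-745) > 0.0)    # True, still a (subnormal) positive number
print(math.exp(-746) == 0.0)   # True, already zero
```

Because the logarithm is strictly monotone, comparing `log_sum` values across classes picks the same class as comparing the products would, without ever forming the products.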
So don't take exp, just work only with the logarithms. Okay, so this is important, you have to do this, otherwise it won't work: you have to do smoothing, add an epsilon, and you have to use logarithms, otherwise things become zero quickly. Here are a few things which you could do but you don't have to do. You don't have to take words as features, right? I mean, you can do anything; this is now the problem of what you take as input features. I could also take the characters as my input features, right? So this text here has so many L's, so many O's. I could take the bytes; maybe there are special characters here, so I just take it as a UTF-8 sequence and I take the bytes. Actually there is a whole research area on this for neural networks. How do I feed my text into the neural network? As words, as bytes, as characters, as sub-words; there is also the variant where you break the words into smaller parts and you feed those smaller parts. That's actually what you do. You take smaller parts, that works the best, byte pair encoding it's called. So for the exercise sheet just do it with words, but I just wanted to mention it. You can also add more features, like, why not give it a little bit of order? For example, when you have World War, you take it as one feature, the whole phrase, when the words occur next to each other. You could do it with everything that occurs next to each other, or only with things that frequently occur next to each other. You could do that, you can input anything as a feature. You could also input as a feature that this is the second word of the document. Or whether the second word of the document starts with a capital G, or something like this. You could just exclude words like "and"; these are called stop words. They are frequent words, you could just ignore them. You have a lot of "a", "and", "in", "as"; you just act as if they weren't there, ignore them.
You could not count frequencies but use more fancy scores. What we did, yeah, you just count frequencies; actually tf occurred, I mentioned it. Just replace it by tf-idf or something and see what happens. You could do that; for the exercise sheet it's optional, but you can do it. The only thing you have to use is... oh no, you don't even have to use the stop words for learning, but just to filter what's shown in the end. In the end you should show the words with the highest probability for each class, and if you don't filter this it will be "and", "the", "is", just because they are the most frequent. Naturally they are not very discriminative between the classes, but you don't want to show them, so just filter them out. Everything else, and whatever else you can think of to make it better, do it. And here's the last thing, and listen carefully because that's important: of course you should implement this with linear algebra operations. And I think I've shown it to you; I mean, you have done it on the exercise sheet two exercise sheets ago, where it was actually just very few lines of linear algebra with which you computed the scores for all documents, and something similar you can do here. In particular I will show it to you now for predict, I'll show you the basic idea; you have to figure it out yourself for training. So let's again assume the documents are given as a term-document matrix, why not? And you don't have to write that code. If you implement this, it's actually the hardest part, and we give it to you as a late Christmas present. So constructing the term-document matrix, we just give it to you, with tf entries now, and you can just proceed from that. And then everything can be done with linear algebra, very elegantly. And this is something you should really take away; if it wasn't clear to you so far, it should be clear to you now: whenever you have weighted sums, and many weighted sums, this calls for linear algebra.
You shouldn't program a for loop and do this, but you should think: oh, this is matrix-vector stuff and I should use a library for matrix-vector stuff. Because if you do this yourself with a for loop, you are implementing linear algebra very inefficiently, which means it won't work on large collections. And I now show you how to do it for prediction; figure it out yourself for training. This is what we computed, this should look familiar now. It's just a little different in that I didn't write the words here, but just the index of the words; otherwise it's the same thing. And this is a bit misleading, I'm just realising this now, so please pay attention when you look at this later: this is not the same m as before. Before, m was the length of the document; now m is the size of the vocabulary. Note that if a word doesn't occur at all, then p_ic to the power tf_i is just 1; it's to the power of 0, so it's just 1. So this is nothing else than what I had before, right? What this says is: I have word number 3, so these are word number 1, 2, 3, up to 10,000. Word number 3 occurs 3 times, then I just have the probability 3 times here, that's what it says. Just the probability for word 3 to the power of 3. If I have the word 7 times, I will have that probability 7 times here. We have seen it in the example. If the word doesn't occur, this is just one. And then I multiply it with this, and this I ignore. And this is just the number of occurrences, just the term frequency. And we give you all the code that computes these numbers. Now you should take the logarithm, because otherwise these numbers become too small, and we can ignore this because it's the same for all c; we just want to find the largest one. Then what do we have? Then we have this number, right? If I take the logarithm of this one, the product becomes the sum, and the tf is, yeah, we have seen this earlier, right?
Actually we have seen this in the maximum likelihood thing: the logarithm of p to the 5 becomes 5 times the logarithm of p. So you get this here, and now you have a sum of products, and a sum of products is always matrix-vector stuff. This is just the product of these two vectors. (This should be capitalized on the slide.) This is the dot product of these two vectors. Just look at these two vectors, take the dot product, and it's tf_1 times ln p_1c plus tf_2 times ln p_2c plus and so on. And here we have a 1 in the end, which gives the plus ln p_c. This is also wrong on the slide; this should be ln p_c here. And this is for one class, so you have a dot product for one class, and you can do this for all classes simultaneously with a single matrix-vector product. So for prediction, you do one matrix-vector product where your vector is the document and the matrix just contains, and look at this, I mean this here is just all the probabilities for one class, and now you have this for every class, which gives you a matrix. A single matrix-vector product gives you all the scores for all the classes, and then you can just read off the largest one. So prediction in Naive Bayes, if you implement it yourself like this, if you do it with matrix-vector operations, is a one-liner. So one line, and you have to find the largest entry. And it's very often like this. And in NumPy, if you know the right function, everything is one line. Training is not quite as simple. So here's the last slide. Just an overview of the things, and please recall that we made this very nice cheat sheet for you, I think I said it already in the last lecture. Because for NumPy and SciPy (you need SciPy again because of large sparse matrices and so on) the documentation is huge, but you just need a subset, and this page here gives it to you. So just as a reminder, these things can be done in one line in NumPy.
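The prediction step described above, one matrix-vector product followed by reading off the largest entry, might look like the following sketch. The two-word, two-class probabilities are made-up toy values (loosely following the A/B example), and all variable names are hypothetical; the exercise code may organise things differently.

```python
import numpy as np

# Toy model (assumed values): rows are classes A and B, columns are words A and B.
p_wc = np.array([[2 / 3, 1 / 3],
                 [1 / 3, 2 / 3]])
p_c = np.array([0.5, 0.5])

# Append ln p_c as an extra column, matching the trailing 1 in the tf vector.
log_p_ext = np.hstack([np.log(p_wc), np.log(p_c)[:, None]])

# Document "AAB": tf = (2, 1), with a 1 appended for the class prior.
tf_ext = np.array([2.0, 1.0, 1.0])

scores = log_p_ext @ tf_ext          # one score per class
predicted = int(np.argmax(scores))   # index of the best class
print(predicted)                     # 0, i.e. class A

# Several documents at once: columns are documents ("AAB" and "ABB").
td = np.array([[2.0, 1.0],
               [1.0, 2.0]])
td_ext = np.vstack([td, np.ones((1, td.shape[1]))])
predicted_all = np.argmax(log_p_ext @ td_ext, axis=0)
print(predicted_all)                 # [0 1]
```

For a large vocabulary the term-document matrix would be a SciPy sparse matrix rather than a dense array; the dense toy version here only illustrates the shape of the computation.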
Multiply a matrix (this is always from the perspective of a matrix) with a vector, transpose it: one line. You want to set it up, that's a sparse matrix construction, you are given indices and values and construct it: that's a one-liner. Component-wise operations, like multiply everything by three, add one to every entry: it's a one-liner. Apply an operation: you have a matrix, you want to apply ln to every entry, it's a one-liner. Never ever write for loops. I mean you can, right, you can do for all rows, for all columns, this entry equals log of this entry; it will take forever. Just do log of the matrix, it will do the right thing. Compute for each row the sum, for each column the sum: it's a one-liner. For each row compute the largest value. Sort each row, sort each column: it's a one-liner. You have a vector and you want multiple copies, like you have one row vector and you want it ten times as a matrix: that's a one-liner. Yeah, and there's the cheat sheet. So that's it. It's a very nice method, very nice exercise. Any questions about the topic or the exercise sheet? Anything in the chat? Oh, there was one question, oh yeah, it resolved itself. Okay, so have fun with the sheet, see you next week, bye bye.
Welcome everybody to lecture 11, information retrieval, winter semester 22-23. A few more people in the room today; I attribute it to the warmer temperatures, maybe, maybe it's also the topic. On the agenda: experiences with exercise sheet 10, which was naive Bayes, and the official course evaluation. I have a slide on that. The deadline is on Sunday at midnight, but please do it before, ideally today. Then we generalize what we did last time to a theory of linear classifiers. So a lot of super fascinating linear algebra again. And then I will talk a little bit about history, naive Bayes again, and then we talk about logistic regression, which is something similar to naive Bayes but mathematically more well founded and, for classification, more the method of choice.
It will be very interesting, with super nice mathematics. Exercise sheet 11 will also be nice: it's the exact same task as on exercise sheet 10, which is, you have these movie descriptions and five genres, and now you use logistic regression. It will be very interesting to see where the code is similar or different, and to just run it and compare the results against Naive Bayes to see how different it is. And again, like for Naive Bayes, it will be very little code, but you have to understand it, so a nice exercise. So, exercise sheet 10: very little code, but the linear algebra was tricky for some; for others it was relatively easy. "Really nice to see how a lot can be done in a few lines." Indeed, if you look at the master solution for predict, it's really one line for computing the probabilities and another line for computing the maximum, so it can't be any simpler, and it's not that simple an approach. "I particularly liked doing it based on matrices." Some people have said that they had heard of naive Bayes before, maybe not fully understood it, but now doing it with linear algebra was really nice. "It was a nice and challenging exercise, I really liked it." "I learned a lot by solving this exercise." "Shouldn't man or woman be added to the stop word list?" I will talk about this on another slide. "The idea was super simple but it took a lot of time to find the NumPy function." So where people struggled, and there were a few, it was mainly because of that: not so much translating it to linear algebra, because I had a slide on that, but actually doing it in NumPy. "Matrices and NumPy is a hill too tall for me to climb." "Very interesting lecture, Professor Bast seems to be very happy being allowed to do so much math, which was very contagious." I like the formulation "being allowed to do", but yeah, I like to do math. But I also like to do coding. Videos by MrBeast. Why did I ask that question? Maybe it will become clear in the end. I don't want to talk too much about it, although I could give a whole lecture about it.
Most of you know him, I was not surprised. Most of you like his videos, I was a little surprised. A few critical voices; my voice will be a tiny bit more critical. "Squid Game, interesting way to try and spend all the money." As you know, he likes to give away a lot of money. "His videos are enjoyable, especially the Squid Game one", that was mentioned a lot. "I watched his videos, they are funny and expensive." He likes to give away a lot of money, but he can do it because the videos have 100 million views, and then you also earn a lot of money, so you can spend a few million per video. "He puts a lot of effort into learning what people like." So these videos are optimised for 100 million views, perfectly made to reach as many people as possible. "I like seeing people using money to actively improve the world." In part he is also a philanthropist; mostly he gives away money for stupid things, but he also does some philanthropy on the side, like saving forests, cleaning oceans and so on. "Interesting person, but the videos are not for me." So there were a few critical comments: "not the style and type of content I enjoy". My opinion, and I'm sorry, it's my personal opinion: when I watch these videos the fast cuts drive me crazy. It's also like when there is action and it's filmed with a shaky camera, right? I know that's a style, but it's hard for me to watch. It's also very typical of social media content nowadays, so I'm not surprised that it's so popular, but it's also symptomatic in a way; it looks to me like an expression of severe attention deficit disorder. I think he has severe attention deficit disorder. I've read articles by him, I've watched a long interview with him. He says he can't listen when somebody else is talking, he basically can't listen, so a conversation with him is a one-way street. And if you watch the videos, you don't have time to think about anything, right?
It's bam, bam, tak, tak, tak, just at maximum speed. And of course it's optimized to grab your attention and to keep your attention. I find it interesting because if I watch my own videos, they are now maybe the opposite of MrBeast's videos. I think they used to be a little more like that in the past; I think ten years ago I put a little more weight on entertainment and also on making it faster. And it's interesting, I mean I've been doing this for a very long time now, I've been giving lectures for over 30 years, and I've actually moved away from it, doing it slower, with not so much weight on the entertainment, because I see a big danger there. I told you last time about math videos. You watch it, you find it entertaining, but you don't really learn anything. And especially for educational videos that's very dangerous. Of course it's nice if you also have a little bit of entertainment, if it's enough to get you motivated, but it's surely not the main goal. The main goal is to learn something. And I know his main audience is teenage kids, teenage boys, also girls, but teenage kids, and I think if you watch a lot of this content it really does something with your brain, so I'm a little bit worried about this. I would like to discuss this more with you, but there's no time for this now; I did want to mention it, though. So here are the results for exercise sheet 10. We had 5 genres. I don't show the data again, it was movie descriptions from IMDB, and we picked these 5 classes.
Here were the results. The frequencies were interesting: half was comedy. We did this on purpose because it's very common in learning that you have an uneven distribution, so you had a lot of comedy but only 4% science fiction, and, also interesting, it was actually the distribution from IMDB. So you have a lot of comedy there and not so much science fiction. Here are just the precision and recall values, and F1, which is a kind of average of the two, it's the harmonic mean. Comedy interestingly was easiest to predict, but there's another slide on this. I'll just say a little bit about it: producing numbers is one thing, but understanding what they mean is another thing. When you do a project or thesis with us later, this will become an important part, understanding what the results mean. Does it mean that comedy was easiest to predict? Not necessarily, because it was so frequent, right? You can just always say comedy, comedy. Then you are right half of the time because half of the lines were comedy. And let's look at the words in a second. Also, what does it mean when recall, like here for documentaries, is much higher than precision? It means, I wrote it here, that the naive Bayes classifier tended to predict documentary a lot. Just imagine the extreme: if you say documentary all the time you get perfect recall, you get all the documentaries which are there right, but most of the time you are wrong, right? 89% of the time you will say documentary when it's not. But for the 11% which are documentaries it will be right. So getting high recall and low precision, think about it yourself, means you tend to predict that class a bit too much, and the other way around. So it's good to have this balanced somehow. Of course that's not an explicit goal of naive Bayes, but it manages somehow; they are not too far apart.
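The "always say documentary" extreme can be checked with a few lines of plain Python. A minimal sketch, assuming a made-up test set of 100 documents of which 11 are documentaries (matching the 11% mentioned above); the function name and the label strings are hypothetical:

```python
def precision_recall_f1(true_labels, predicted_labels, label):
    """Precision, recall and F1 (harmonic mean) for one class."""
    pairs = list(zip(true_labels, predicted_labels))
    true_pos = sum(1 for t, p in pairs if t == label and p == label)
    num_predicted = sum(1 for _, p in pairs if p == label)
    num_true = sum(1 for t, _ in pairs if t == label)
    precision = true_pos / num_predicted if num_predicted else 0.0
    recall = true_pos / num_true if num_true else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# 11 documentaries among 100 documents, classifier always says "documentary".
true_labels = ["documentary"] * 11 + ["other"] * 89
predicted_labels = ["documentary"] * 100

precision, recall, f1 = precision_recall_f1(
    true_labels, predicted_labels, "documentary")
print(precision, recall, f1)   # precision 0.11, recall 1.0, F1 about 0.198
```

The harmonic mean stays close to the smaller of the two values, which is why a perfect recall does not rescue the F1 score here.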
You also maybe noted that some classes have more specific words than others. For horror, there were words in the top 20 like killer, mysterious, old, death, night, town; that seems to fit. For romance it was more like man, woman, story; these words also occurred in the other classes. Let me just go to the terminal here. Here I have the master solution; if I just execute it with a normal stop word list, we have things like woman, man, you find them a lot here. What I found really interesting is young, I mean super bias alert, right? Young is among the most frequent in all classes, so when you make a movie it's important that you have young and good-looking people there. So, interesting bias, that young is a stop word for movie plots. And I've done one thing, and some of you have maybe experimented with this: there was the stop words list, and I just added the words man, woman, young, old, life, story, family, father. What is a stop word, right? Like a word without particular meaning; it depends on the context, and in the case of movies these words are not particularly predictive of a particular class: daughter, friends, it's just people occurring in movies. So if I take that list, so if I take extended here as the last argument, I get slightly more interesting words here. So now with romance I have marriage here in the end, and it becomes a little bit better. I just wanted to quickly show you this. Science fiction looks very predictive: scientists and mysterious. Okay, mysterious we also have for horror, I think. Yes. But so overall it works; it's not great, right? So it's also interesting to see that it's not easy to solve such a task, but it's also not easy as a human, right? If you just get the text and you have to predict the genre, that would have been an interesting experiment too. Do it for yourself, try to predict the class and see. Oh, we could do that when we do this the next time. Human performance.
Okay, here's a video about bias in movies if you want to watch it. "Which precision is considered good?" Just one more slide for understanding the results. So one baseline is that you just guess uniformly at random. So I just guess without looking at the words. Then what do you get? You get one over the number of classes. Just think about it: I have a movie, I have no idea, I don't even look at the words, I just roll a die with five faces and I say comedy; chances are one over five that I'm right. And actually it's easy to do better by just always predicting the most frequent label if you have an uneven distribution. So here I think it was actually 50% when you round it up. Here were the frequencies: 50% of the movies were comedy, we did that on purpose, which means when I don't want to do anything special I just always say comedy and I get 50% overall precision. Because for 50% of the movies it will be right. And I say that because when you read research papers it's actually pretty frequent: you have some result, and you should always ask yourself, ok, what's the baseline when I do something really simple? And it might be that you have very uneven distributions and the baseline is already 80% or 90%. Such problems exist; spelling correction is an example. Most of the text is correct, you just have a few spelling mistakes. So you can have a very high baseline by just doing nothing. Here's another baseline: you pick the label according to the distribution in the training data. I will not do the calculation; this is actually not better than the previous one, so baseline 2 is the best if you don't want to do anything special. I leave it as an exercise, or maybe an exam question, to compute what comes out then. Okay, there's one more organizational thing, the course evaluation. Who has received, or rather who hasn't received, this email in January? Oh, it seems to have worked this time.
Everybody in the room has received an email about the evaluation. That's great. Okay because, so before I say what you should do for those, anybody on Zoom hasn't received it? That's great, but for those of you watching this and not having received it, I will say something in a minute. We are very interested in your feedback, so please take your time. I've already said it last week, I say again, you have spent so much time on this course, please spend 20 minutes on this evaluation and be honest, concrete and fair. So try to be in a good mood when you do it, try to be fair in both directions, concrete, I mean not just yeah nice or no I didn't like it, that's not concrete and of course honest. We particularly like the free text comments, grades are important too, but there are all these text fields where you can write something and it doesn't say here that you get, where does it, does it say it on the exercise sheet Natalie? Yeah I hope so, oh yeah I should have said it on the slides but here it is. You get something for this, right, you get 20 points which replace the points of your worst exercise sheet. And you don't, it's enough if you write in your experiences txt I did it. Of course you should be honest there also but we will just believe you. We will not check but please do it and be honest and then you get the 20 points. So just as an encouragement for really doing the evaluation. The deadline is Sunday so it's not the deadline for the sheet which is Tuesday noon before the lecture, it's Sunday midnight so please do it until then. And it's centralized so it's not run by us, it's run by the university and if it's over it's over, we can't do anything about it anymore. If you have any problems, you didn't receive the mail, any other problem we have a sub forum on Daphne evaluation, you should just check as quickly as possible if it works for you, if not right there we will see what we can do. After the deadline there is nothing we can do. 
Any other questions about this part before I move on to the actual content? Okay, so I move on to the actual content. It starts relatively lightly and will get more mathematical as we go along. So, linear classifiers. Last time we talked about one classifier, naive Bayes; we will now generalize this a little bit. So now we just have objects in d dimensions and we have two classes. Last time in the example I called them A and B; this time I call them minus one and plus one, you will see later why. And let's say I'm in 2D, and here I have some minus-one thingies, so these are now points in 2D. And here I have some other points in 2D, and they have the plus-one label. So this is a point, this is a point, I just write the label beside it. And now I have another point here with a question mark, let me maybe draw the point in red, and the question is: should this now get plus one or minus one? And I will talk about how to generalize this to more classes, but we will not do it today, just two classes today. So what do we try to do, what's linear about it? Now we have 2D and I try to separate the classes by a hyperplane. I will talk about what a hyperplane is on the next slide, but in 2D it's just a 1D thing, which is a line. So I just want to find a line which separates the minus ones from the plus ones. Of course that's not always possible, I also have a slide about this, but if it's possible then I can do this, and now I just say everything to the right of this line is plus one and everything to the left is minus one. And of course for this point it depends on where I put the line, right? I could have also put the line like this, I don't draw it so as not to mess up the picture, and then this would have been minus one. So it's not that this line is uniquely defined by my samples; there are still different ways to do it. Okay. So that's just our framework for today.
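The picture just described can be mimicked numerically. This is a small sketch under assumptions: the line is described by a normal vector `w` and an offset `b` (the representation introduced later in this lecture), and the points, the two lines and the function name `classify` are all made up for illustration.

```python
import numpy as np

def classify(x, w, b):
    """Return +1 if x lies strictly on the positive side of the line w.x = b, else -1."""
    return 1 if np.dot(w, x) > b else -1

# Made-up 2D training points for the two classes.
minus_points = [np.array(p) for p in [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]]
plus_points = [np.array(p) for p in [(5.0, 4.0), (6.0, 5.0), (5.0, 6.0)]]
query = np.array([3.5, 3.0])   # the point with the question mark

# Two different lines that both separate the training points.
line_1 = (np.array([1.0, 0.0]), 3.0)   # vertical line x1 = 3
line_2 = (np.array([1.0, 1.0]), 7.0)   # tilted line x1 + x2 = 7

for w, b in (line_1, line_2):
    assert all(classify(p, w, b) == -1 for p in minus_points)
    assert all(classify(p, w, b) == 1 for p in plus_points)
    print(classify(query, w, b))   # +1 for line_1, -1 for line_2
```

Both lines classify every training point correctly, yet they disagree on the query point, which is the sense in which the samples alone do not determine the line.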
Linear classification: I have points with minus one, points with plus one, I want to separate them and then use that for prediction. And it doesn't have to be 2D, but for my drawings I will use 2D or 3D. Here's a definition of a hyperplane. And that's important because it's kind of the basis for everything else. So, a hyperplane in d dimensions. What we have just seen was a hyperplane in two dimensions, and a hyperplane always has one dimension less, so in two dimensions it's a line. Let me draw a picture. This is maybe the most common definition which you know. Let me try to draw something three-dimensional here. So we are now in three-dimensional space, and this thing here is now a plane, it's my hyperplane. Here's my origin, it's maybe somewhere here. And here is a point a. So this hyperplane doesn't have to go through the origin. What I have here is a vector a which lies on the hyperplane; it can be any point of it. And now I have vectors which span the plane. In the case of a hyperplane which is two-dimensional, these are two vectors. Let me call them h1 and h2, so I have a basis here. This is supposed to be a right angle here; it looks tilted because it's a perspective drawing. And now how can I write any point? It's written here: any point x, for example I have a point x here, can now be written as a, so my base point, plus alpha one times h1 plus alpha two times h2, a linear combination of my two base vectors. I think that's the easiest definition of a hyperplane to understand. So for every point on the plane, I go to this a and then I add some linear combination of these two vectors.
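The first definition, x = a + alpha_1 h_1 + alpha_2 h_2, can be played with directly. A minimal sketch with made-up vectors in three dimensions; the names `a`, `h1`, `h2` follow the drawing, everything else (values, function name, the cross-product check) is an assumption for illustration:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])    # anchor point on the plane (made-up)
h1 = np.array([1.0, 0.0, 0.0])   # first spanning vector
h2 = np.array([0.0, 1.0, 1.0])   # second spanning vector, orthogonal to h1

def point_on_plane(alpha1, alpha2):
    """x = a + alpha1 * h1 + alpha2 * h2."""
    return a + alpha1 * h1 + alpha2 * h2

# In 3D, the cross product gives a vector orthogonal to both h1 and h2.
w = np.cross(h1, h2)

# Sanity check: every point generated this way has the same dot product with w.
for alpha1, alpha2 in [(0.0, 0.0), (2.0, -1.0), (-3.5, 4.0)]:
    x = point_on_plane(alpha1, alpha2)
    print(x, np.dot(w, x))
```

The constant dot product seen in the output is exactly what the second definition of a hyperplane, with a normal vector and an offset, expresses.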
This is something you should have learned in school, but I'm not sure. Here is another definition which uses a normal vector, which is a little harder to understand but mathematically equivalent, and we will prove it. You have some normal vector w, let me draw it, which is a d-dimensional vector, and it's normal, which means it's orthogonal to every direction within the plane. Up to scaling there is only one such direction, because the plane is two-dimensional and we are in three dimensions. And I have an offset b, which I cannot draw here, it's just a number. And now my hyperplane is simply all points x where the dot product of the normal vector and the point is b. I mean, this is not so intuitive and it doesn't look the same as the first definition, but actually these two definitions are the same. And because in learning, and also in today's lecture, you always work with this second one, I think we should take some time to prove that the two are actually equivalent, and that's what we will do now on the next slide. What I will use several times in this proof is the following: if I have some vectors in d-dimensional space, let's say two vectors in three-dimensional space, then I can always find another vector that's orthogonal to them. In particular if they are pairwise orthogonal, but also if they are not, I can always find the remaining vectors, so I can always complete them to a basis. I don't prove that. So let's see whether we can prove the equivalence. Let's take our first definition. We have a hyperplane H, which is all points x which I can write as a plus a sum, and since I'm in d dimensions the sum goes from i equals 1 to d minus 1, of alpha_i times h_i, where the h_i are my basis vectors and alpha_1 up to alpha_(d-1) are any coefficients, which means they are real numbers. Ok, now we pick a w such that w is orthogonal to all these d minus 1 vectors, and this I can do by what I just said.
I have three dimensions, two vectors which span something, and now I take a third one which is orthogonal to them. And I pick b: I just define b as the dot product of this w and the anchor point a. Then H prime is the set of all points x where the dot product of w and x is b. And now we want to prove that H and H prime are the same. I'm sorry, I'm thinking and writing at the same time, that always goes wrong. So H uses definition one, H prime uses definition two, and they are exactly the same set. And I think it helps to understand by just doing the proof. So let's do the proof. To prove that two sets are the same, you prove that each one is a subset of the other. So let's do H is a subset of H prime. How do you prove that? Well, you take an element from H and then you prove that it's also in H prime. So let's take x from H, which means there exist alpha_1 up to alpha_(d-1), and please pay attention that I don't make any mistakes or logical errors, such that x is a plus the sum from i equals 1 to d minus 1 of alpha_i h_i. So I can write x as the anchor point plus this linear combination. And now I want to prove that it's in H prime. So how do I go about these proofs? Well, H prime says something about the dot product of w with the point. So without much thinking I just take the dot product of w and x and see what happens. Everything is linear, so it's the dot product of w and a plus the sum of the alpha_i times the dot product of w and h_i. Note that the h_i are vectors and the alpha_i are scalars, so real numbers, so I can just pull the w into the sum because everything is linear. That's why it's called linear algebra. So now I have the w times h_i.
So this here, let me do that in beautiful orange, is zero, because w is orthogonal to each h_i, that's just how I picked it. And w times a is b by definition. So w times x is equal to b, and hence x is in H prime. Done, that's it, that was the easier direction. Let me do the other direction as well. Oh yeah, please. Where does the b come from? Yes. So in the one definition I have an a and these h_i, and in the other definition I have a w and a b, these are just my parameters. And I want to say that the definitions are the same, so I have to relate the w and the b to these things. And I say if you relate them, then the b is just w times a. But it's good that you ask, because it gives me the opportunity to make one meta comment. It's good to follow these proofs when I do them, but to really understand them you have to do them yourself. I think it's a very important comment for learning in general, and I will say it again in the last lecture: if you want to check whether you learned this, you have to put it aside and then do the proof yourself. Try to prove H equals H prime yourself and see if you can do it. And it's better not to learn it by heart but to learn the ideas. Maybe you want to do it a little bit differently. OK, so let's try the other direction. So now I have an x which is in H prime. And you can already see that this definition is a little more unintuitive but much easier to work with.
For H you have to say: there exist these coefficients, many of them, such that I can write x as this ugly sum. For H prime I can just say that the dot product of w and x is b, which I can write very compactly. Ok, and now I want to show: if x satisfies this, then it's in H, so I can write it as a plus this sum. How do we go about this? We can write any x from d-dimensional space, where we are now, as a sum of these h things, so the sum from i equals 1 to d minus 1 of alpha_i times h_i, plus alpha_d times a. For this we have to assume that a is linearly independent from h_1 up to h_(d-1). I first thought it would be enough to assume that a is not zero, but that's not quite right. If you think about the picture with h_1 and h_2, what I want to say is that the plane does not go through the origin. If the plane does not go through the origin, then a is linearly independent from the h_i, and together they form a basis. If that's not the case, then H contains the origin. Let me just note that; that case is actually an easy case. I mean, in general a two-dimensional plane in 3D does not go through the origin, these are very special planes. So let's continue with the proof and see where we need this. We can write x like this, and now we can just plug this in. So let's write it the other way around: b is w times x.
I now just take this and plug in this x and pull in the w, so it's the sum from i equals 1 to d minus 1 of alpha_i times w times h_i, plus alpha_d times w times a. I have a question. Yes, one second. So w times h_i is again zero, we have seen this, and w times a is again b. Yes, please. I was thinking about how I should best write the brackets, because I get a bit confused. What are the bracket rules for the dot product? So I'm being a bit sloppy here, because in linear algebra you can basically do whatever you like with scalar factors, you can pull them into any vector. Let me write it up here. Let's say I have two vectors x and y in R^d and I have a scalar alpha in R. And now the question is what happens if I have alpha times the dot product of x and y. With parentheses that is well defined, but because of linearity you don't have to write the parentheses. It is the same whether I take alpha times x and then the dot product with y, or I take x and the dot product with alpha times y, or I take the dot product first and then multiply by alpha. And this even has a very simple intuition. What does a factor do to a vector? Think about a factor larger than one, then it will stretch it, right? And let's also write down the dot product: the dot product of two vectors is just the sum of x_i times y_i over the components. So if I multiply a vector by a scalar, each component gets multiplied by that scalar, and it doesn't really matter where I multiply and which one I stretch. So here I am not writing the parentheses because it's all linear.
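To make the bracket rule concrete, here is a small numeric check, just a sketch: the vectors and the scalar are made-up values, and the helper names `dot` and `scale` are mine, not from the slides. Exact fractions are used so the three results can be compared without rounding issues.

```python
from fractions import Fraction


def dot(x, y):
    """Dot product: the sum of x_i * y_i over all components."""
    return sum(xi * yi for xi, yi in zip(x, y))


def scale(alpha, x):
    """Multiply every component of x by the scalar alpha."""
    return [alpha * xi for xi in x]


# Made-up example values.
x = [1, -2, 3]
y = [4, 0, -1]
alpha = Fraction(5, 2)

first = alpha * dot(x, y)         # dot product first, then the factor
second = dot(scale(alpha, x), y)  # stretch x, then dot product
third = dot(x, scale(alpha, y))   # stretch y, then dot product

print(first, second, third)
```

All three printed values agree, which is exactly why the parentheses can be dropped.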
Ok, so I get here alpha_d times b, which means b is equal to alpha_d times b. And one has to be careful here, I think I showed this in the Q&A session: two solutions are possible if you have such an equation. Either alpha_d is one or b is zero. It's actually very frequent that one has something like this, so one shouldn't be too fast and just say alpha_d is one. That's not necessarily true. If b is zero, alpha_d can be anything and the equation holds. So one of the two is true. Now if alpha_d is 1, then we are done, because then x is a plus the sum of the alpha_i h_i, which is exactly the required form, so we have proven x is in H. What happens if b is equal to zero? I thought I wouldn't need that special case. If b is equal to zero, it means the dot product of w and a is zero, so a is orthogonal to w. Now w was chosen so that it's orthogonal to the d minus one vectors spanning the hyperplane. If my a is also orthogonal to w, it has to be a linear combination of these vectors spanning the hyperplane. And that just means that my hyperplane, if you think about the picture, goes through the origin. Usually this vector a is not such a linear combination. But I don't want to go into too much depth here: this is the case where H contains the origin, and I will skip it here because that's the easy case. In general H does not contain the origin.
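If you want to see the two definitions agree on actual numbers, here is a minimal sketch in 3D. The anchor point `a` and the spanning vectors `h1`, `h2` are made-up values, and using the cross product to get a normal vector is my choice for this illustration (it only works in 3D); the lecture just assumes some orthogonal w exists.

```python
def dot(x, y):
    """Dot product: the sum of x_i * y_i over all components."""
    return sum(xi * yi for xi, yi in zip(x, y))


def cross(u, v):
    """Cross product in 3D: a vector orthogonal to both u and v."""
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]


# Definition 1: anchor point a plus combinations of h1 and h2 (made-up values).
a = [1, 2, 3]
h1 = [1, 0, 1]
h2 = [0, 1, -1]

# Definition 2: normal vector w orthogonal to h1 and h2, offset b = w . a.
w = cross(h1, h2)
b = dot(w, a)


def point_on_plane(alpha1, alpha2):
    """x = a + alpha1 * h1 + alpha2 * h2, as in definition 1."""
    return [a[i] + alpha1 * h1[i] + alpha2 * h2[i] for i in range(3)]


# Every point built via definition 1 should satisfy w . x = b.
samples = [point_on_plane(s, t) for s in range(-2, 3) for t in range(-2, 3)]
all_on_h_prime = all(dot(w, x) == b for x in samples)

print(w, b, all_on_h_prime)
```

This only checks the direction "H is a subset of H prime" on a grid of sample points; it is a sanity check, not a proof. Here b is not zero, so this plane does not contain the origin.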
Let's go on. I think it doesn't get more complicated than that, that was the most complicated mathematics, now it gets a bit easier again. So far it was really just getting warmed up. If you had problems understanding it, don't worry, you will be able to understand the rest. The purpose was getting warmed up and to say that for the rest of the lecture, let me just circle this, we will always work with this very nice definition of a hyperplane. A hyperplane always has one dimension less. So you can't have a two-dimensional hyperplane in five-dimensional space. It's exactly one dimension less, which means it separates the space, right? That's what it does in any dimension. It cuts the space into two halves. That's the point of a hyperplane, which is the perfect transition to why it makes sense to use it for classifiers, and to computing the distance from a point to a hyperplane. With this nice definition you can compute the distance by just taking w times x minus b and dividing it by the length of w, and the sign tells you whether you are on this side or on that side. It can't get any nicer than this. And let's just prove this. That's also a very nice proof and also a not uncommon exam question. So let me draw, this is now my hyperplane, and my examples from now on will always be in 2D. Let's take a point x to the left, we could also take one to the right. And now the question is: what's the distance from x to H? First some very basic geometry. What's the distance of a point to a plane, a line in this case? It's the shortest distance, and that's always, let me use orange here, along the orthogonal direction, so it's this here, right? And that's a right angle.
So now this is a hyperplane, and this one is the normal vector, my w, and let's say it points in this direction as well. I think I want to draw this in blue, I have a very strict color scheme here. So this is the distance I'm interested in, and let me call it r. OK, so what is this r? Let me define one other thing, w0, that's just my normal vector divided by the length of the normal vector. So this is a unit vector pointing in the same direction as w. So now how can I write my x? Let me give this point a name also, the projection of x onto H, let me call it x0. Then I can write x as x0 plus r times the unit vector w0. I hope you agree with that. I'm starting from this point x0, I don't know what it is, I just give it a name. And now I have my unit vector pointing away from the hyperplane to the side where x lies, and I go r times along it, where r is the distance, that's the definition of the distance. Okay, now to prove something about the hyperplane you always multiply with w, because that's what the definition uses. So w times x is w times x0 plus r times the dot product of w and w0. I'm again doing the linearity thing here, we've already seen that. What's w times x0? I hope you are listening and not watching Mr. B's videos. What's w times x0? b, oh yes, it's b, because x0 lies on the plane, right? And also try to see how I do mathematics here, I do the obvious thing: I have my x and I want to say something about H, so let me just multiply it with w and see what happens.
So in a lot of mathematics you sometimes need a clever idea, but a lot of it is just going along with it. And now I write this and I see, okay, this looks like something I can simplify. Here I have w times w0, and I know what w0 is, so this is w times w divided by the length of w. And what's w times w? If I take the dot product of a vector x with itself, it is just the sum of the x_i squared, right? Which means it's the norm of the vector squared, because if you take the square root of the sum of squares you get the length. So here I have the length squared divided by the length, so it's just the length. So I'm just plugging in some obvious things. I get w times x equals b plus r times the length of w, which means the distance r is w times x minus b, divided by the length of w. Which is what I claimed here, right? And this is greater or equal to zero, because in my picture w points in this direction, towards x. If w had pointed in the other direction, there would have been a minus here. So let me write it in parentheses: x lies on the side of H where w points. I was again thinking and writing at the same time. You don't really have left and right in general, you have to define where you stand to define left and right. So what you can say, let me use the larger pointer here, is that the sign of this expression is positive if the point lies on the side of the hyperplane where w points, negative if it's on the other side, and zero if the point is right on the hyperplane.
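The formula we just derived fits in a few lines of code. This is only a sketch: the function name `signed_distance` and the example hyperplane with w = (3, 4) and b = 10 are made up for illustration.

```python
import math


def signed_distance(w, b, x):
    """(w . x - b) / |w|: positive on the side of H where w points,
    negative on the other side, zero for points on H itself."""
    wx = sum(wi * xi for wi, xi in zip(w, x))
    length_w = math.sqrt(sum(wi * wi for wi in w))
    return (wx - b) / length_w


# Made-up hyperplane in 2D: all x with 3*x1 + 4*x2 = 10, so |w| = 5.
w = [3.0, 4.0]
b = 10.0

print(signed_distance(w, b, [5.0, 5.0]))  # on the side where w points
print(signed_distance(w, b, [0.0, 0.0]))  # the origin, on the other side
print(signed_distance(w, b, [2.0, 1.0]))  # a point on the hyperplane
```

The actual distance is the absolute value of this number; the sign alone is what a linear classifier uses for its prediction.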
And everything that's written above here follows from that. Is there any question about this? Yes? I'm a bit confused. In the last thing you wrote, it's in normal parentheses, but in the written text on the slide you use a different kind. Are they the same, or is there a reason why? What exactly is your question, is what the same? This term that you have, this fraction, is that the same as the fraction you have up in your text? They use a different kind of parentheses. Okay, I've left out that last step, you're completely right to note this. On the slide I used the absolute value and here I just used parentheses. What I did was just one case, so let me say this and also write it. I drew the picture for the case where x lies on the side of the hyperplane where w points. In that case this thing here is positive, and since it's positive I could also have written the absolute value, that's the same thing for this case. And let me write it: the case where x lies on the other side can be proved analogously. If I did that case, I would have x as x0 minus r times w0, because my w is fixed, it still points in this direction, and x lies on the other side. So I have a minus here and I would get b minus w times x, the other way around. I won't do it now, but it works out so that in both cases the absolute value comes out. And you are right to note that the distance must not be negative, so I need the absolute value here. Yes please. Which vectors are orthogonal to w? All that we call x? Our H is defined via the dot product with w, so aren't those orthogonal to w? Oh, x is not orthogonal to w, it's just that the dot product of x with w is some fixed constant.
Only when b is zero does this mean x is orthogonal to w. It's hard to get an intuition for this. H is all vectors whose dot product with w is b. It is a line, we proved this on the previous slide, but its points are not orthogonal to w, that's not what it means. In general b is not equal to zero, that would be a very special case. What if b is equal to zero? It's great that you ask. Let's just draw that picture here. If this is the x-axis and this is the y-axis and I'm in two dimensions, this would be an example of a hyperplane going through the origin. And now my w goes like this. And now the elements of H are indeed exactly the vectors that are orthogonal to w, in that special case. So any vector on H, let me take this one, is orthogonal to w. But that's only true for a hyperplane that goes through the origin. It's not true in general, and in general b is not equal to zero. It's great that you asked this question. There are all these little questions to get the intuition right, and I think it needs time, but these are very good questions. Let me go on. The next part is easy, and then I think one more thing before the break, or maybe the break right away. How do you do this for more classes? Just very quickly: a hyperplane is limited to two classes. One way is to think about our movie genres: we have, say, five classes, and you just build a classifier for each of them. Comedy or not comedy, horror or not horror. You just do this separately. So now you have five answers. Which one do you take? It doesn't say it here. You could take the one which has the...
Which one do I take? Let me just go on. One option is to compare all pairs, is it comedy or horror, and just play a little tournament. Or you could extend the theory; Naive Bayes, for example, handles more than two classes. And for the variant with one classifier per class, it doesn't say it here and I should write it here, because I think that's the one which you should use for exercise sheet one: pick the class with the highest prediction. Next, what do you do when the data is not linearly separable? Let me just show you a very nice example, just so that you have seen it. Let's say my data is in R1, so this is the origin, here I have the points 1 and 2, and here I have minus 1 and minus 2. Let me write the labels: this has plus one, and this has plus one, and this has minus one, and this has label minus one. A separator in R1 would be a single point, and clearly I can't separate these with one. Now let me take the following transformation: each point x from R1 is transformed to (x, x squared) in R2. Let's just do that and draw it in 2D. The point 1 becomes (1, 1). 2 becomes the point (2, 4), so it's now up here. Minus 1 becomes (minus 1, 1) and minus 2 becomes (minus 2, 4). And now let me again write the labels. This was labeled plus 1, this was labeled plus 1, this was labeled minus 1, this was labeled minus 1. And behold, now I can linearly separate them, now that they are in 2D. In 1D this was not linearly separable. I apply some function, there are many functions which I could use here, I don't even need the square. And now in 2D I can find a hyperplane, for example this one.
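Here is a small sketch of this transformation trick. Which of the four points got which label on the board is not recoverable from what was said, so the labelling below (plus one for 1 and minus 1, minus one for 2 and minus 2) is an assumption, chosen so that the data is not separable in 1D. The separating line in 2D (the horizontal line at height 2.5) and the helper names are also just one possible choice.

```python
def phi(x):
    """The transformation from the example: x in R^1 goes to (x, x^2) in R^2."""
    return (x, x * x)


# Assumed labelling (hypothetical, see above): point -> label.
labelled = {1: +1, -1: +1, 2: -1, -2: -1}


def separable_in_1d(points):
    """Try every essentially different threshold t and both orientations."""
    xs = sorted(points)
    cuts = [xs[0] - 1] + [(p + q) / 2 for p, q in zip(xs, xs[1:])] + [xs[-1] + 1]
    for t in cuts:
        for side in (+1, -1):
            if all((side if x > t else -side) == y for x, y in points.items()):
                return True
    return False


# One possible separating hyperplane in 2D: w . z = b with w = (0, -1), b = -2.5,
# i.e. the horizontal line at height 2.5, with w pointing downwards.
w = (0.0, -1.0)
b = -2.5


def predict(x):
    """Transform x, then look at the sign of w . phi(x) - b."""
    z = phi(x)
    score = w[0] * z[0] + w[1] * z[1] - b
    return +1 if score >= 0 else -1


print(separable_in_1d(labelled))
print(all(predict(x) == y for x, y in labelled.items()))
```

With this labelling, no single threshold works in 1D, while after the transformation one line classifies all four points correctly.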
I could also use the horizontal one. We are not doing this here, but this is a very important technique, because you might wonder what the purpose of linear separators is when most data cannot be separated linearly. Well, by transforming into a suitable higher-dimensional space you can. Ok, let's look at Naive Bayes again. That should be quick because that's basically what you have done for the exercise sheet. You're looking a little tired, I think we should make a quick break and then we go on with this. So let's make a short break here for five minutes. I think it's worth it. See you in five minutes. So, now it will be a little simpler and then it will become a little more complicated again, but the hardest part is behind us. Let's look at Naive Bayes, what you did for the exercise sheet. Our words are now called v, big change, they used to be called w. The reason is that the normal vector which we have used all the time here is always called w in learning, because it's the weights, and I don't want to rename that vector because then we would be confused for other reasons. So words are now v, for things from the vocabulary. And what does Naive Bayes do? It computes the probability of a certain class for a certain document, and that was just the product of these learned word probabilities times the class probability, divided by this funny thing which we don't know and don't need because it's the same for all classes. And this we already saw, I already gave you that as a hint: if I have a document where a word occurs three times, then it's just that word's probability to the power of three, and the three is just the term frequency. And now I just very quickly go through what you already did for the last sheet, that I can use this to write it nicely in linear algebra.
Namely, and this is nothing new, just repetition, I can write the Naive Bayes probability like this, with the term frequencies in the exponents. Let's abbreviate here: instead of writing that the class variable has this value and the document variable has this value, you just write the values. That's very common when you read anything on deep learning. It's not quite correct, it's an abuse of notation, but it's usually clear what it means. So this is the probability of a class for a document, and I now take the logarithm, for various reasons which we learned in the last lecture, one of which is that now everything becomes linear. The tf is pulled in front and multiplied with the log probabilities, here I also have the log of the class probability, minus the log of this thing which we don't know and never need. And now we can write it with vectors, you also did that for the exercise sheet. I don't have the additional one here yet. My document is just written as a vector of term frequencies. My p-vector is just the log probabilities. This should look familiar to you. Then the log of the thing Naive Bayes computes is this dot product, plus the log of the class probability, minus this thing. There is an additional trick with adding a one to the document vector so that it's even simpler, just one dot product, but let's first continue with this. Now it's just two classes, because for linear classification you have just two classes. For class A it's d times the vector of log probabilities for class A, plus the log of the class probability, and for class B it's the same with the vector of log probabilities for class B. Now we want to know, and we will do it for our example in a second, then it will become very clear, which of these probabilities is larger, which is the same as asking which of the logs is larger. Then, haha, there is a log missing here, right?
I think there should be a log here, ln. So it's just d times the difference between these vectors plus the difference of the log class probabilities. The unknown term cancels out, we don't need to know it, we just want to know which one is larger. And we ask: is this greater or equal to zero? If yes, then A is the more likely class. Here I already took the ln of p_A divided by p_B. Just very quickly, we use that so much: ln x minus ln y is just ln of x over y, that's one of the laws of logarithms, which is what I used here. And the difference of the two log probability vectors I can write the same way, as ln of the quotients. So I have just one vector here, ln of the quotients of the two word distributions for the two classes, and one number, ln of the quotient of the two class probabilities. And now I just define this number as negative b and this vector as my w. It already looks suspiciously similar to what we had on the previous slides. So my document I call x, let me use the laser pointer, this I define as w, and this I define as minus b, and then it's just w times x minus b greater or equal zero for class A, or less than zero for the other class. So let's look at this for our toy example. Let's see whether we still get the probabilities right, it was so simple. p_A was the probability for class A. Both classes were equally likely, so both were one half. And then we had four word probabilities we learned. The word a in class A had probability two-thirds. So this was the word distribution for class A, two-thirds a, one-third b. And then we had the word distribution for class B, which was just the opposite. It doesn't have to be, in general these two are not related. Of course it was not a coincidence here, it was how I constructed the example to have nice numbers.
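The derivation that leads up to the toy example can be written down compactly. This is a sketch in LaTeX; the notation p_{vA} for the learned probability of word v in class A, and tf_v for the term frequency of v in the document d, is my shorthand and not necessarily what is on the slide.

```latex
\begin{align*}
\ln P(A \mid d) - \ln P(B \mid d)
  &= \sum_{v} \mathrm{tf}_v \,\bigl(\ln p_{vA} - \ln p_{vB}\bigr) + \ln p_A - \ln p_B \\
  &= \sum_{v} \mathrm{tf}_v \,\ln\frac{p_{vA}}{p_{vB}} + \ln\frac{p_A}{p_B} \\
  &= w \cdot x - b,
  \qquad
  w_v = \ln\frac{p_{vA}}{p_{vB}}, \quad
  x_v = \mathrm{tf}_v, \quad
  b = -\ln\frac{p_A}{p_B}.
\end{align*}
```

The term that is the same for both classes has already cancelled in the first line, and class A is predicted exactly when w times x minus b is greater or equal to zero.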
And now let's just do this. Let's go back, what did I say here: the w is this vector, it just contains the logarithms of the quotients of these probabilities. So let's do that. The first dimension is for word a: ln of the probability of a in class A divided by the probability of a in class B, that's how I defined it. That's two-thirds over one-third, so ln 2. The second dimension is for word b: I take the quotient of the probability of b in class A and the probability of b in class B. What's the quotient here? One-third over two-thirds is one half, yes, that's correct. And the logarithm of the reciprocal is just minus the logarithm, so that's minus ln 2, which means w is ln 2 times the vector (1, minus 1). Ok, and what's the b? The b is minus ln of the quotient of the two class probabilities. What's this? The quotient is 1, and ln of 1 is 0. So b is 0 here, so I actually have this special case, which wasn't clear when we did it in the last lecture. So what does it mean? I think it's instructive to have this view, so let's just draw it. Let's see, a perfectly straight line. We are now in two dimensions, and on this axis we have the number of a's and on this one the number of b's. Let's write every document as a two-dimensional vector of term frequencies, so I want a vector (tf_a, tf_b). Let me do the first one for you, it's so simple: two a's, one b. You tell me the next one.
(5, 2), I agree. Next one? (3, 5). Then (3, 2), (1, 3), see, this is fun, and (2, 4). Our largest value is 5, so let's draw ticks 1, 2, 3, 4, 5 on this axis, perfectly spaced, and also 1, 2, 3, 4, 5 on the other. And now let's draw the points. It's interesting, right, it's a picture which we didn't have when we did this last week with Naive Bayes. And that's always a great thing in mathematics, to do the same thing in a totally different terminology and world. When we did Naive Bayes we were talking about conditional probabilities, now we are talking about geometry and separating hyperplanes and points in 2D. So let's draw the point (2, 1), that's here, and it has class A. Let's write the class in the same color as above so as not to confuse anything. What's the next one? (5, 2), that's here, also A. Maybe it's a good idea to write the coordinates next to the points. Now (3, 5), that's up here, and that was class B. Where is (3, 2)? That's down here. And you can already figure out where the other points lie. (1, 3) is here and (2, 4) is up here, and these two are B points. So these are the B points, these are the A points. Now where does my hyperplane lie? Let me draw it in orange. The normal vector is ln 2, which is 0 point something, times (1, minus 1). What's the direction of (1, minus 1)? I think it goes like this, right? This is my w.
This is my W, and this thing goes through the origin because B is equal to zero, so how will it look? I think it should use these points here, 4, 5. Let me draw a perfectly straight line here. So that's my hyperplane, right? This looks like a symbol. How do I make it not look like a symbol? I'll simply write it next to it, I think that would be good enough. That's my hyperplane, and that's my W here. What? This doesn't go through the origin. Oh my, I'm so sorry. It should at least go through the origin. So now I have a perfectly straight line going through the origin, and we see that everything makes sense. So that's how I computed it. Now let's take an arbitrary document x and ask when x is in A. We just compute w times x. Let's factor out the ln 2, and then it's just the dot product of the vector 1, minus 1 with the vector tf_A, tf_B, which is just tf_A minus tf_B. So x is in A if and only if this is greater than 0, which is equivalent to tf_A greater than tf_B, which we already found last time by different means. And you can also see it here, right? The As are in this half, the hyperplane goes directly through the middle, and this is what Naive Bayes does for this example as a linear separator. There's a question or comment. I see now that the hyperplane will go through the middle every time if we have as many points in A as in B. That means that in Naive Bayes the hyperplane will ... I'm not sure, are you talking about whether this line goes through the origin? It's a good question, it's great to ask such a question to understand something in detail. Why does it go through the origin here? Well, what's the anchor point here?
You could take zero as an anchor point and the product, I mean the reason is that the B is zero here. And it's not something I proved but I should write it again because it's actually important for the intuition. So H contains, we mentioned it in this very, H contains origin and prove it yourself at home because the great exercise is equivalent with B equals to zero. We already had a question about this. And why is B equal to zero here? Well you can answer that question yourself I think from the formula. Let's go back one slide. Here is the definition of B. When is B equal to zero? Yeah, so B equal zero if and only if PA is equal to PB. So it's not that in naive Bayes in general B is equal to zero, just in our example the two classes had the same probability and if they would not have then my line would not be, which is also interesting, right, my line would then be shifted, it would not go through the origin. It's not that it would be tilted differently but it would not, maybe also but would not go through the origin. Yes please. And if then the line would have to go to the origin, it wouldn't look very well, right? It wouldn't separate the lines with origin. That's a very good comment, it's a really good comment, so let's assume, I mean we could shift, where do you want to shift everything, to the right? To the left, all points to the left is I think a little problematic because then you have to subtract something, shifting to the right is probably easier right? I mean you can't have negative things here because these are term frequencies, so are you talking about the geometric problem or about the original problem? If you shifted to the right there would still always be a plane, it would be a little bit flatter but also if you shift it to the left, right? But I think one, I will take just a second, I think one important comment is, there is absolutely no guarantee that naive Bayes finds a linear separator of your data. Just in this example it does. 
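The board example can be checked numerically. The following is a minimal sketch, not the lecture's own code: it takes the weight vector ln 2 times (1, -1) and b = 0 from the discussion above, and it assumes that the point (3, 2) belongs to class A, which the transcript does not state explicitly. The function names are made up for illustration.

```python
import math

# Weight vector from the board: ln 2 times (1, -1).
# The bias is 0 because both classes are equally likely in this example.
W = (math.log(2), -math.log(2))
B = 0.0

# Training points (tf_A, tf_B) with their classes.
# ASSUMPTION: (3, 2) is class A; the transcript does not say so explicitly.
POINTS = [((2, 1), "A"), ((5, 2), "A"), ((3, 2), "A"),
          ((3, 5), "B"), ((1, 3), "B"), ((2, 4), "B")]


def score(x, w=W, b=B):
    """Signed value w . x + b; its sign says on which side of the hyperplane x lies."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(x):
    """Class A if x lies strictly on the side that w points to, otherwise class B."""
    return "A" if score(x) > 0 else "B"


if __name__ == "__main__":
    for x, label in POINTS:
        print(x, "true:", label, "predicted:", classify(x))
```

With these six points every prediction matches the given class, which is the "just in this example it does" case. Any point with tf_A equal to tf_B gets score 0 and lies exactly on the hyperplane.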
It could be, very well will be with naive Bayes that your data is not linearly separable, it will still find a hyperplane and use that for prediction. And everything that's on the wrong side, even your training, even what you were given for training will be predicted wrongly. I think that's an important comment to make. Is that clear? So there is just no guarantee that naive Bayes separates the data and sometimes you can't. Yes? It just means that if you have more than two classes, the b is zero, if they fall off, you could distribute it out. Ah, now the question is if we have more than two classes then this whole picture doesn't work. The hyperplane linear separation only works for two classes. This whole viewing at this linear separation, I mean a hyperplane just divides space into two halves not into more halves. So what I showed here is just, I mean the heading should have been two class naive Bayes. For three class naive Bayes we have no geometric counterpart. I mean there is no, yeah it's just two class naive Bayes. Any other question before we, yeah this was, you already did it for the sheet, I mean, but I want to separate it because, yeah, the hyperplane definition, and this is also very interesting now to understand this in geometry, I mean, let's just go back the definition of, a hyperplane is just defined that way, so you have a W and a B here, because a hyperplane does not go through the origin, but wouldn't it be nicer if you wouldn't even have the B here, if it would always be zero. And now comes the trick how you can always achieve that. And you did it for the exercise sheet but without really understanding what it means for geometry. You can always add one dimension more and now you consider vectors not in the original dimensional space which was the number of words, but one more, and you just add a one here and add this class probability here. 
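The trick of adding one dimension can be sketched in a few lines. This is only an illustration under an assumed sign convention: the hyperplane value is taken as w times x plus b, so the bias is appended as the last weight (with the convention w times x minus b one would append minus b instead). The helper names and the numbers are hypothetical.

```python
def dot(u, v):
    """Plain dot product of two equally long sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def lift(x):
    """Lift a point one dimension up by appending a constant 1."""
    return tuple(x) + (1.0,)


def lift_weights(w, b):
    """Append the bias b as the last component of the weight vector."""
    return tuple(w) + (b,)


# Hypothetical numbers, chosen only to illustrate the identity.
w, b = (0.5, -2.0), 0.75
x = (3.0, 1.0)

original = dot(w, x) + b                   # with an explicit bias term
lifted = dot(lift_weights(w, b), lift(x))  # bias hidden inside the dot product

if __name__ == "__main__":
    print(original, lifted)
```

Both values agree, so in the lifted space the separating hyperplane is described by a weight vector alone and passes through the origin.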
And now if you just do the linear algebra, you take the dot product of the two, and you have exactly what you want, right? Before, you had this vector times this and then plus another bias term, and this term is always there. And now W is just this vector here, where I have the word distribution here and the class probability here. And what this means geometrically is: if you lift everything one dimension up, then you can always make it so that the hyperplane which you find goes through the origin, in one dimension higher. I don't have a picture of that. Do we get a picture for this? I mean the only thing we can imagine lifting one dimension up is from 2D into 3D. Does anyone have the intuition right away? I didn't think about it. So you have a line that doesn't go through the origin, and now you lift your data up to three dimensions, and now you find a plane separating the data which goes through the origin. Yeah, I guess it makes sense. You have a line which does not go through the origin for 2D points. Now you lift your 2D points to three dimensions, and how do you lift them? Well, nothing special, right? They all just lie in a single plane, because we add the same one to each of them. And now your hyperplane is just a plane which contains the original line and goes through the origin. I think the geometric intuition isn't very useful, but you can do it: by lifting one dimension up, your hyperplane always goes through the origin. And now it's even nicer, now you don't even have the b term. Okay, now perceptrons. That's very easy because I don't do any math, or very little math, and they are super old but interesting. It's just four slides. So far we have seen a bit of theory about linear classifiers, and Naive Bayes, which arrives at its formulas via Bayes' theorem. The perceptron is a linear classifier that iteratively computes the W.
So what we have seen here: by some probability theory we arrived at these formulas, let me just show them again. No, I think that's the best slide. This one here. That's what Naive Bayes does. Naive Bayes uses probability theory to compute this W. But now let's just compute such a W iteratively. That's what the perceptron does. And again we forget the B, we have lifted everything one dimension up, we just want to find a W. Like here. But we want to find it by different means. So what do we do? The perceptron is as old as 1958, here is the emoticon, 1958, super old, and it's still pretty much the basis of what's now behind the whole deep learning revolution. So basically most of the theory was already there more than 60 years ago, quite amazing. Today nobody uses the perceptron. It's mainly of historical interest, but it's also useful didactically because it's so simple and the real thing is pretty similar, which I will also show today. Here's the algorithm. We want to find a good W, so we just start by setting it to zero. We don't know what it is, including the B, the last dimension which we added. It's so simple it's unbelievable. Now we have our training objects. By training objects I mean these things. We have a nice picture here. These points here, we are given them as vectors now, one dimension up, so we would have a one added here at the end, and we know their labels. And now we just go through them, and we go through them many times. And what do we do? We just check: with the W which we currently have, which in the beginning is zero, does it give the right prediction or not? It should say positive or negative depending on the label. If it's not right we do something. And here is what we do. So if it's right we don't do anything, we say fine, our W is fine for this point, it predicts the right thing.
If it's wrong, and it's wrong in this way, it's negative but it should be positive, we just add the point to the weight vector. Very strange, why do we add the point? If it's greater or equal zero, but it should be negative, we subtract the point from the weight vector. That's it, that's the perceptron. You see a point, if its prediction is right, go to the next point, if it's wrong, you just add or subtract the point from the weight vector. Now your W changes. And note that it can change pretty wildly, right? You add and subtract the points, how can this even converge? And now either you repeat until all predictions are correct or you just say let me do this 1000 times or 1 million times. And note why that makes sense. In my example, how many objects did I have? 1, 2, 3, 6, yeah? Shouldn't I stop after 6, because now I have seen all the points? Well, I start with 0, and after I did this with all 6 points I have a different W, so it makes sense to start again, right? It's true for all the deep learning stuff as well. You have gone once through your training data, but now your W, your weights, which in a real neural network is something huge, will be different, so it makes sense to do the same thing again and again. And the one thing we can understand already for perceptrons is why on earth this strange update rule makes sense. My weight vector is not good, let me just add the point to the weight vector. I think it's important to be irritated at this point. It doesn't seem to make sense to add the point to the weight vector, it's strange. So let's just look at it. We have already seen this: we have called it positive in our proof of the distance, right?
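The perceptron rule as described so far can be sketched as follows. This is a minimal illustration and not the lecture's reference code: labels are encoded as +1 (positive) and -1 (negative), points are assumed to be lifted already (a constant 1 appended), and the data is the board example with class A as positive, where treating (3, 2) as class A is an assumption. All function names are made up.

```python
def dot(u, v):
    """Plain dot product of two equally long sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def train_perceptron(points, labels, max_epochs=1000):
    """Perceptron training; points are lifted vectors, labels are +1 or -1."""
    w = [0.0] * len(points[0])  # start with the zero vector
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            s = dot(w, x)
            if y == +1 and s < 0:
                # Negative but should be positive: add the point.
                w = [wi + xi for wi, xi in zip(w, x)]
                mistakes += 1
            elif y == -1 and s >= 0:
                # Greater or equal zero but should be negative: subtract the point.
                w = [wi - xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:  # all predictions correct, stop early
            break
    return w


def predict(w, x):
    """+1 if w . x is greater or equal zero, otherwise -1."""
    return +1 if dot(w, x) >= 0 else -1


# Board example, lifted by appending 1. ASSUMPTION: (3, 2) is class A.
POINTS = [(2, 1, 1), (5, 2, 1), (3, 2, 1), (3, 5, 1), (1, 3, 1), (2, 4, 1)]
LABELS = [+1, +1, +1, -1, -1, -1]

if __name__ == "__main__":
    w = train_perceptron(POINTS, LABELS)
    print("w =", w)
    print([predict(w, x) for x in POINTS])
```

Because this data is linearly separable, the loop stops after a few passes with every training point on the correct side. On data that is not separable it would simply run until `max_epochs` is reached.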
W points to the side of the hyperplane where the positive labels are, the ones with plus one. So let's just see it. Let's say we have an x, and the current value of w says negative, but we want it to be positive. And what our update rule then does, maybe I should have done it in writing, but now it's already written on the slide. So it's negative but it should be positive. Now this is the update we do, you can check, I don't show the previous slide again: we add the x to the weight vector. Which means if I now compute the dot product with the new vector, I get w times x plus x times x. And x times x is just, let me make sure I have enough space here, the squared length of x. It doesn't really matter what it is, but it's greater or equal to zero. Which means that by changing the weight vector in this way, the new product will be larger than the old one. And before it was negative, I wanted it to be positive, so I'm making a change in the right direction. That's all you need to understand at this point. It's negative, I want it to be positive, this makes a change in the right direction. There's a comment: the point which the W points towards changes, and then the hyperplane also changes. It changes, but why does it rotate? Let's draw it since you say it, and let's try to understand it. So now we are in the origin case, so we can always draw our hyperplane through the origin. See, I also have some deficit with drawing things that go through the origin, I'm sure there is a name for this. So this is the hyperplane, and my W points to this side, but my X goes to the other side. So where should my X be? It should be on that side, which means my X is somehow going here. That's now my x. Yes, so that's my x, and w times x will be negative, right? It's on the wrong side, w times x will be negative. And now I'm adding the X to the W. Maybe I shouldn't make it so long.
And now I see what you mean, so it's great that you said it. Let's make it a little bit shorter. And let's maybe also draw the W here. So that's my W. And I think you see it already. Let me also draw it. And my X, maybe my X goes like this, why not like this? And now if I add my X here, I get my new vector. That's now my W prime, and that's my old W. That's what you mean, right? Yes. And now the effect will of course be that the new hyperplane will be like this. And now X is on the right side. It doesn't have to be on the right side, I mean I'm just rotating it a little bit. That's really important: I'm just moving it. The product is negative, now it becomes larger, it goes towards positive. I could also not rotate it enough, but at least I'm rotating it in the right direction. And you could draw the same for the other case: if it's on the other wrong side, then I will do the opposite. So it's a great picture with the rotation. And why should this work? Now I see a sample, I'm rotating this way, that way, it's wiggling back and forth, and it's absolutely not clear why this should converge. We don't do that proof, but one can actually prove that it converges. So there's a theorem, we don't even look at it because that's really just of historical interest. But it's fascinating that you can prove it: if you can separate the data, then you can even specify after how many iterations it will converge. It will actually converge. So by wiggling back and forth, if you only go over the data often enough, you will find a W that separates them. And instead of doing this, now we do what you would really do nowadays: you would use logistic regression. This is a method which is still used, and which is kind of the simplest neural network you can have, and everything else is just the same but more complicated. So logistic regression is like the nucleus of deep learning, and it's great to look at it and try to understand it.
And it's six more slides and I will need half an hour I think for it, and we shouldn't rush, and we will make another break when my classifier senses that you are falling asleep or thinking of Mr Beast videos or secretly watching them. Let's start. The mathematics is truly beautiful, and most of it is quite simple as well. So, logistic regression and the sigmoid function, which is the basis of so much in deep learning. Here is the sigmoid function, and let me write it myself in nicer notation: sigma of t is one over one plus e to the minus t. And let's draw this function, because it is so important to understand. So let's draw it like this, and here is the x axis. Oh no, I'm sorry, I have to draw the right thing. I don't need the negative y part. So this is the origin here, and here is 1. The outcome, that should be clear, is always a value between 0 and 1. Actually it never reaches 0 and it never reaches 1, so I can even write an open interval here. And what does it do? Here I have 0.5, and it goes through there, and it will look like this: it comes from here and then goes up here, so it's kind of an S shape. So that's sigma of t, if this is t. So here are some properties. The first one we see here and we can easily check it; I don't prove it, you can look at it. If t goes to minus infinity, then minus t goes to infinity, so e to the minus t becomes infinitely large, and one over infinity is zero. Which means that for t to minus infinity this approaches zero, and actually pretty quickly because of the exponential. So it goes down pretty quickly here, and most of the things are happening here near t equals 0. What happens if t goes to infinity? Then e to the minus t goes to 0, 1 plus 0 is 1, 1 over 1 is 1, so it goes to 1.
What happens at exactly 0? It's 0.5. If you plug in t equals 0, e to the 0 is 1, so it's 1 over 2, 0.5. So it's this nice symmetric thing. That's the sigmoid function. And just the intuition for why you use it: you want probabilities and you have something that's any number, minus 500, plus 2, 0, whatever, and you want to turn it into a probability. You can use sigma because it will give you a number between 0 and 1, not even including 0 and 1. And what it does, and that's why it's important, is turning things away from 0.5. If your number is even a little positive, it will go towards 1 very quickly. So you're making things more extreme than they are. If your number is minus 3, sigma of minus 3 will almost be 0. So it's kind of pushing things away from the middle, and that's an important function. Here are two more nice properties, let's prove these. The proof of P1 we don't do, you just plug it in. I mean you don't really plug in infinity, you take the limit. So let's do the proof of P2, and this is certainly the kind of thing you could be asked in an exam. Sigma of minus t is one over one plus e to the minus minus t, so that's one over one plus e to the t. Now I want to get back to something with sigma of t again, so I multiply the numerator and the denominator by e to the minus t. When I do that, I get e to the minus t over one plus e to the minus t. So I did two things, I multiplied by e to the minus t and changed the order in the denominator: e to the t times e to the minus t gives 1, and 1 times e to the minus t gives this e to the minus t. And now it's almost there. In the numerator I just add 1 and subtract 1, so this is 1 plus e to the minus t minus 1, over 1 plus e to the minus t. The first part is just 1, and the second part is just sigma of t, so we get 1 minus sigma of t. And this P2 property we can see here, right? It essentially means that I have this rotational symmetry here.
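Written out, the steps of the P2 proof spoken above look as follows. This is a transcription of the argument into formulas, using only the definition sigma(t) = 1 / (1 + e^{-t}) from the lecture; the symbol names match the spoken ones.

```latex
\begin{align*}
\sigma(-t)
  &= \frac{1}{1 + e^{t}}
   && \text{definition, with } -(-t) = t \\
  &= \frac{e^{-t}}{e^{-t} + 1}
   && \text{multiply numerator and denominator by } e^{-t} \\
  &= \frac{(1 + e^{-t}) - 1}{1 + e^{-t}}
   && \text{add and subtract } 1 \text{ in the numerator} \\
  &= 1 - \frac{1}{1 + e^{-t}}
   = 1 - \sigma(t)
\end{align*}
```

For t = 0 this gives sigma(0) = 1 - sigma(0), so sigma(0) = 0.5, which matches the value read off the plot.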
This part is just this part rotated. This is exactly what P2 means. And P3 will be very useful in the following. It just says that the derivative can again be expressed in terms of the original function. Which is often the case when you work with the exponential function, because, as you know, the derivative of e to the x is just e to the x again. So let's just do it. Sigma prime of t first looks a bit scary. It's this function up here. I hope you know that the derivative of 1 over x is minus 1 over x squared. And then, by the chain rule, I have to take the inner derivative. So this is now minus one over this whole thing squared, one plus e to the minus t, squared, times the derivative of one plus e to the minus t. And the derivative of e to the minus t is e to the minus t with a minus, so minus e to the minus t. Is that correct? I hope so. The two minuses cancel out. So I already have one sigma of t here, one over one plus e to the minus t. And I have another factor here, which is e to the minus t over one plus e to the minus t, and this I've already seen in the proof of P2: e to the minus t over 1 plus e to the minus t is 1 minus sigma of t. So let me write it one line below, and then we are done: sigma prime of t is sigma of t times 1 minus sigma of t. This is certainly something you should do for yourself and see if you can prove it. It's not too hard. So here I was just using what I already had: this here and this here is the same thing, and I already proved that it's 1 minus sigma of t. This derivative also has an intuition, but it's a little more complicated. What's important here is that you can express the derivative in terms of the original function, which is nice.
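The three properties can also be checked numerically. The sketch below is only an illustration: the function names are made up, the two-branch formula in `sigmoid` is a standard way to avoid overflow for very negative t (it is algebraically the same function), and the derivative is compared against a central finite difference rather than proved.

```python
import math


def sigmoid(t):
    """sigma(t) = 1 / (1 + e^(-t)), evaluated without overflow for very negative t."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)  # for t < 0 use the equivalent form e^t / (1 + e^t)
    return e / (1.0 + e)


def sigmoid_prime(t):
    """P3: the derivative expressed via the function itself."""
    s = sigmoid(t)
    return s * (1.0 - s)


def numeric_derivative(f, t, h=1e-6):
    """Central finite difference, used here only as a sanity check."""
    return (f(t + h) - f(t - h)) / (2.0 * h)


if __name__ == "__main__":
    for t in (-3.0, -0.5, 0.0, 0.5, 3.0):
        print(t,
              sigmoid(t),                      # P1: always strictly between 0 and 1
              sigmoid(-t) + sigmoid(t),        # P2: should be 1
              sigmoid_prime(t),                # P3, closed form
              numeric_derivative(sigmoid, t))  # P3, numeric check
```

The derivative is largest at t = 0, where it equals 0.25, and it becomes tiny far away from 0, which fits the picture that most of the change happens near t equals 0.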
And let's see some terminology, then we make another break, and I hope some people still stay for the last 15 minutes or so. You need it for the exercise sheet, so I want it to be on the recording. I think we make a break now and then we do the rest, and I hope that some of you still stay. Thank you for your patience. So we resume in five minutes. So, last five slides. Can you hear me again now? Yeah, for the Zoom people, you are too far away to hear the sound waves from the room. So just some simple terminology. N is now the dimension of our input space, and it includes the bias term dimension, this one dimension which we added. So for our Naive Bayes example we would now be in 3D space, but that's not important. In the text world it's the size of our vocabulary plus one. But now we totally go away from that. That's just the deep learning view: we don't care where our points come from. That's really important, that's the transition which we made over this whole lecture. We started with text documents, there was always this intuition of documents, here these were already abstract documents, and now it's just points, points in N-dimensional space. Can you maybe close the door back there? I think we have enough air here now for fifteen more minutes. Now we are just in this abstract setting: points, and we want to find the separating hyperplane. So that's where we are now. The points you typically call X, and these are the training examples from which you want to learn. And they have labels, and you call them Y. In the general case there are K classes, but here we just have two classes, so a label is just 0 or 1. And we have n of them: n training examples and n labels which are either 0 or 1, call them x and y.
What I tell you now can also be generalized to more classes. Today we only do two classes because we only looked at two-class Naive Bayes, and only two-class Naive Bayes is a linear separator. We can still apply it to our original data, and we do that like I showed on the one slide. You just learn, completely independently from each other, to separate comedy from not comedy, horror from not horror, and you can just do that five times. It will be clear when you do the exercise sheet. That's just a dumb way to do multi-class classification if you only have binary classification. What's next? Our probabilistic model, and it's super simple, look at how simple and also how elegant it is. We want to find this W, you always want to find this W which somehow helps you to separate the data. And we want to take the dot product with X, because that's what we do, that is this linear thing, but that's not a probability, right? The dot product can be minus fifteen, zero, plus one million, anything. But we have this sigmoid, which maps any number to a probability between 0 and 1. That's what I explained just before the break. So we just say: given the weight vector, the probability that the label is 1 is the sigmoid of the dot product, and the probability that the label is the other thing, which we call 0, is just the opposite, one minus that. So that's a probability distribution. That's just the model of logistic regression. So you just turn the dot product into a probability using the sigmoid function. You could use any other such function as well, but this one has these nice properties which we will use in a second. And now what does Naive Bayes do in comparison?
What it does is, I don't want to go into depth here, maybe you can do that yourself, but if you remember how we derived it, let me just write that here: there was this p A and this vector for class A, and that was just the ln of p 1 A and so on up to ln of p, I don't know what my dimension here is, n, A. And then I get the log probability. This gave us ln of the probability of class C for document D, and in this case the class is A, how I have written it here. So if you want the actual probability, then you have to take the exponential again in the end, which means you get something like this for Naive Bayes. So instead of taking the sigmoid you take the exponential function here, and now you have these two things, one for each class. If you really want the probabilities, which we don't for Naive Bayes, you need some constant factors here so that this is a probability distribution. And what's behind this is computing the softmax. I'm just mentioning this in case you are familiar with some deep learning speak and wonder how this is related. But what's interesting is that both have this property of pushing things away from the middle towards 0 or 1. The sigmoid does this: if the value is even a little negative it will be pushed towards zero, and if it is positive towards one, and the exponential function also does this. Small differences here will be pushed apart very quickly. And that's also what the softmax does. If you have two values, softmax does not strictly pick the larger one, but it pushes the larger one towards 1 and the smaller one towards 0. So here is the update rule for logistic regression. Logistic regression is also iterative, but now we do it like this. This should be blue, but I leave it black for now. So what we want to do, and here is another super nice thing: where does this now come from?
This probability, well, if you look at it, it's now one single function, whereas before I wrote down separately what it is for each label. Let me just write down what it is. If y is equal to zero, then sigma to the power of zero is just one, so this factor vanishes, and it's just this part, right? Then it's just 1 minus sigma of w times x, to the power of 1. And if y is equal to 1, then 1 minus 1 is 0, so this factor vanishes, and it's just sigma of w times x. I now always draw the dot product with a fat dot so that you can distinguish it from the other product. And look how nice it is, this just turns into one function for any y. And now I can even plug in any y in between, and I can compute derivatives, which is also very nice. But the two border cases are exactly what I started with. And now we want to maximize this. So we have this probability and we want to find a good w, as usual, such that this is as large as possible. This is maximum likelihood. Right, this is the probability of the data we are seeing and we want to maximize it. As usual we don't maximize this directly because it is an ugly function, but we maximize the log of it, and here is the log of it. And now we could just do maximum likelihood estimation, take the derivative with respect to W and set it to zero, and you will find that it does not work. There is no closed-form solution for this. So even for this relatively simple model, I mean these are not the most complex expressions here, you arrive at something where you can't do closed form. So I want to find the maximum, the w which maximizes this, but we can't do it directly. And in that scenario what you do, and let me also draw a picture here, maybe that one. So you have a function. Let me just draw the function. And you want to find the maximum. Maybe the function goes like this. And you are here. And you want to find the maximum. So what you do is, you compute.
So this axis is the w, and now I'm just drawing this in one dimension, but the w is of course more complex, and you can do the same thing in multiple dimensions. I want to find the w where this function is the largest, that would be here. I'm here at some w and I don't know where the maximum is, so what I do is I compute the gradient, which in this one-dimensional case would be the derivative at this point. The derivative points to where it goes uphill, and actually to where it goes uphill the steepest. And that's where I go, and now I just go a step in that direction. And as you can see in that picture, I would at least arrive at this intermediate maximum here. I think it's not optimal when the microphone is too close to my mouth. So I would find this one here, I don't necessarily find the global optimum. That's always a problem with these methods. But at least I go in the right direction. So I can't compute it in closed form, but iteratively I go uphill, in the direction where the function becomes larger. And that's what we do now. So we just compute the gradient of this thing here, and let's do that because that's nice, and that's the last non-trivial mathematics and then we are done. So this is our thing here and we will compute the derivative of this. And this is now something which maybe you haven't done a lot before, so let me just mention some things. I want to take this thing here, this is the log likelihood function, and I want to take the derivative of y times ln sigma of w times x plus 1 minus y times ln of 1 minus sigma of w times x. And now I derive by a vector. I don't know how many of you have already derived by vectors, but the nice thing is that if you just take standard calculus and do it in higher dimensions, many of the things, of course you have to know which ones, just carry over.
For example, take the dot product. This one you could really just prove by writing down the definition of the dot product and then taking the derivatives with respect to each component of W. But if you do that, then the same thing comes out that you would expect from the one-dimensional case. Deriving the dot product W times X by W: if this were the ordinary product of numbers, the derivative would just be X, the constant factor here, right? It's just X. So let's not think about higher dimensions too much, let's just compute the derivative as if this were a normal product. And how would it look? So what's the derivative of ln sigma of x? It's actually fairly simple, and I've said that many times: mathematics is just a combination of very simple things. The derivative of ln is just one over this thing, and then I have to compute the inner derivative, so that's just this here, right? That's just the chain rule, which you all know of course. So let's just do that here, and it's just beautiful how everything will work out now. So that's now y times 1 over sigma of w times x, times the derivative of sigma, and that we computed earlier, so that's sigma of w times x times 1 minus sigma of w times x. I'm just plugging in this formula here, but with w times x as the argument. So it's the chain rule twice. Maybe I should write it up here so that it is clear what happens if I do it twice. If I compute the derivative of ln sigma of w times x, where now I even have something inside the sigma, then it's one over sigma of w times x, that's just from the ln, then I have to take the derivative of sigma at w times x, and then I have to take the derivative of w times x, which is just the x. Right, the chain rule is so simple. So I also have to write x here. And now I have to do the same thing for the other part.
This is 1 minus y times, now it's 1 over 1 minus sigma of w times x. And now I have to take the inner derivative of that one. The 1 goes to zero in the derivative, so it's minus the sigma prime of this, and I again plug in what we computed earlier: minus sigma of w times x times 1 minus sigma of w times x. And where do I write the x? The derivative of w times x is x again, so there should be an x here and here. And I cannot emphasize enough, this is like computing 3 plus 5 over 7 minus 4 squared. You have to concentrate, but it's nothing more. It's a fairly simple thing applied to a complex expression. And now magic happens: this cancels out and this cancels out here. Not much remains at the end. This is now y times 1 minus sigma of w times x, times x, and then, and this should be a minus here I think, minus 1 minus y times sigma of w times x, times x. And this is y minus sigma of w times x, times x. Let's just check whether it's true, I hope I didn't make a mistake. We computed with this as if y were any number, but actually we are in the training, where y is either 0 or 1, it's this class or this class. It looks weird, the two don't look the same, so let's just check the two and let's try to do it in our minds for y is zero and y is one. y is zero.
This is now zero, and here 1 minus y is one, so we have minus sigma of w x, times x. And here y is 0, so it's minus sigma of w x, times x. It's correct, right? By some magic it's correct. Let's do it for 1. If y is 1 then this falls away, and here I have 1 times 1 minus sigma, times x. And if I plug in 1 here, it's 1 minus sigma, times x. It's magic but it's true, right? So this super complicated derivative of this little monster here just becomes this: y minus sigma of w x, times x. I don't even need a case distinction or anything. So yeah, it's true; if you don't believe it, I'm convinced. And now comes, yeah, I think it's the last slide. This is what I explained on the slide before: we can't actually compute the optimum here, like we did for the maximum likelihood in the last lecture, but we can at least compute the derivative. This is the w I have now, and the next w, w prime, will just be a step in the direction of the gradient. So this orange thing here is the gradient of L with respect to w, where L is the likelihood of the data which I defined here. And now see what comes out. That's what I computed here by just taking the derivative; I just add it, so it's alpha times this thing: my label, which is either 0 or 1, minus sigma of w times x, and that times x. And now compare it to the perceptron rule. The perceptron is just this without the factor, right? It was just w plus x, so this factor here was one. And logistic regression, which has many advantages, and you could never have come up with this just by guessing, right? It just does this, and the alpha is just some constant. And the constant makes sense because you have to say how much do I go in this direction. That's just a learning rate, right? I have a gradient and now I say 0.1 times the gradient.
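As a sanity check on the derivation, here is a minimal sketch that compares the closed-form gradient, y minus sigma of w times x, times x, against a numerical derivative, and then takes one step with learning rate 0.1. The function names and the toy vectors are assumptions made up for illustration, they are not from the slides or the exercise sheet.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def log_likelihood(w, x, y):
    """The function from the slide for one example: y ln s + (1 - y) ln(1 - s)."""
    s = sigmoid(np.dot(w, x))
    return y * np.log(s) + (1 - y) * np.log(1 - s)


def gradient(w, x, y):
    """Closed form from the derivation: (y - sigma(w . x)) * x."""
    return (y - sigmoid(np.dot(w, x))) * x


def numerical_gradient(w, x, y, eps=1e-6):
    """Central finite differences, one component of w at a time."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (log_likelihood(w + step, x, y)
                   - log_likelihood(w - step, x, y)) / (2 * eps)
    return grad


# Toy values (made up).
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 3.0, -2.0])

# The closed form agrees with the numerical derivative for both labels.
for y in (0, 1):
    assert np.allclose(gradient(w, x, y), numerical_gradient(w, x, y), atol=1e-6)

# One gradient-ascent step: w' = w + alpha * gradient.
alpha = 0.1
w_next = w + alpha * gradient(w, x, 1)
```

Because the step goes uphill, the value of the function for this example is larger at `w_next` than at `w`.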
How big a step do I take in that direction before I say, ok, let's look at where I am now, let's look in which direction I take the next step? That's the learning rate. And almost everything I say here generalizes to all of learning, not just logistic regression. So these are all very general principles. So the perceptron, this super old thing, is just this without the theory and with just a one. And here I have this, which gives better results. And here is one more thing and then we are done, and I would be happy if you have some more questions, but I am done then. So this is now like the perceptron: I look at one example from my training data, I compute this, and then I update my w in that direction. What you would always do in practice is you take several examples at once and you make one big step, and it's actually trivial what you do, I've written it down here; you have to implement it, but understand it first. You just take all these things together. Let's say you take a batch, it's called a batch now, of 10: you just compute the direction for each of the 10, take the average, and you take that step. And that's different from doing the steps one after the other; if you did them one after the other, then after the first step you would already be somewhere else and do a slightly different step. And the big advantage is, look here, you have this w times x i, so that's a dot product, and if you have several x's, then this becomes a matrix-vector multiplication. So let's say you do this for 1000 samples at a time, then it's just a product between a matrix with your 1000 samples and the weight vector. So that's the one thing you compute, and the rest is trivial. So the update step is: one matrix-vector product gives you one number per sample, you compute these numbers and then you go, yeah, and here you have again, yeah, that's the average over your samples.
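The batched step can be sketched as follows. This is only an illustration under assumptions: the name `batch_step`, the array shapes and the toy data are invented here, and the code template of the exercise sheet may look different.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def batch_step(w, X, y, alpha):
    """One gradient-ascent step on a batch of m examples.

    X has shape (m, d), one row per example; y has shape (m,) with labels 0 or 1.
    """
    predictions = sigmoid(X @ w)        # one matrix-vector product for the whole batch
    errors = y - predictions            # one number per example
    gradient = X.T @ errors / len(y)    # average of (y_i - sigma(w . x_i)) * x_i
    return w + alpha * gradient


# Toy batch of four examples with two features each (made up).
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5], [1.0, -2.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])
w_start = np.zeros(2)
w_after = batch_step(w_start, X, y, alpha=0.1)
```

All four directions are computed at the same w and then averaged, which is exactly what makes this different from taking four single-example steps one after the other.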
So actually what you do in the end, and you will see it in your code when you do it, I'm afraid one can implement it without understanding too much, but of course try to understand why you do it. But in the end it will be very little code again. And this is called batching, and m is called the batch size, and so we have a number of parameters here, hyperparameters they are called. The learning rate: for the exercise sheet just play around with it, see what happens if I take 0.1 here, or something smaller, that's also what you do in learning. You have the batch size, smaller or larger batches. Oh, and the other hyperparameter, as I told you: you can go over your data once, over all the points, and that's called an epoch. I think it's not written on the slide but it's written on the exercise sheet, right. It's called an epoch in deep learning, and now you can go over the data again because now you have a different w, and that's the third hyperparameter. And you have these in all of deep learning: the learning rate; how often you go over your training data, the number of epochs; and how big a batch, how many sample points at once to make a step. And you will play around with this for the simplest of all learning methods. Sorry for the extra time, but I hope you enjoyed it anyway or learned something. Is there any question now? Yes please. Why does the gradient go up? It's a very good question. So if you remember, likelihood is, I'm asking, if you remember the coin toss example from the last lecture, what's the probability that I'm actually seeing this sequence which I'm seeing. And now I want to find the parameters so that this probability is largest. That was the likelihood. Maybe in the interest of time, I would like to go to that slide now, but maybe not, because people want to go home. So I want to maximize, maximum likelihood is maybe not so easy to understand.
I want to maximize, I want to find the parameters so that the probability for what I am seeing is the largest for that parameter. And I don't know that parameter. And in the last lecture we just computed it because the functions were simple enough. But here I can't compute it, so I want to know for which w, and here I just drew a function, is this likelihood, this L, the largest? So let me say it again, I want to find the w so that this is maximal. And now I just start with some w, I am here and I want to find this point here. And then I just compute the derivative, the gradient here, and go in that direction. Yeah, minus would go down. If I wanted to find the minimum then I would take minus. But I want to find the maximum here, because it's maximum likelihood. Exactly, that's the geometric intuition of what the derivative does, it goes upwards. Any other questions? Thank you for the question. So, have a nice evening, thank you and see you next week. Bye bye.

So welcome everybody to lecture 12, information retrieval in the winter semester 2022-2023, second to last lecture. And I will first say something about your experiences with the last exercise sheet, logistic regression, surprise, surprise. And then today we will start with a completely new topic, no more linear algebra. It's the last topic for which there will be an exercise sheet, and that's knowledge graphs and SPARQL. So it's just the beginning of a big topic, but I think it's pretty interesting and you will see what it's about. And for the exercise sheet, you probably know a little bit, or you should know a little bit, about the database world. The exercise sheet will be to translate queries from the knowledge graph world to the database world. And I will show you how it's done. But first about the last exercise sheet, here is what you said. It was an interesting lecture, it was too long, yes?
You are absolutely right and I'm sorry about that; the lectures are usually too long, but this one was way too long. Also because it's the second to last lecture with an exercise sheet, many people are now skipping the exercise sheet. You can of course do that, but you have to know it for the exam, also the exercise sheet. "So sorry, I could not make time for this sheet", many of you wrote something like that. "I like that out of focus parts were already implemented": we did that for all the exercise sheets, the boring stuff like reading in things and so on was just given to you, which is a lot of work if you have to do it yourself. Somebody said they heard logistic regression the fifth time already, several of you said something like this, third time Naive Bayes, fourth time logistic regression, but they never really understood it and now they got it. Okay, take it as a compliment, or maybe it's just the fifth time. "I feel like my linear algebra is not rusty anymore", "numpy still annoying". Epochs, yeah that's true, it was only defined on the exercise sheet; the other hyperparameters, like batch size and learning rate, were also defined on the slide. We already made a note about that. There was a problem with the code template, that it didn't satisfy the style requirements; it was only one line, but still several of you remarked on that. "Even the extended lecture time is overrun regularly", and that's true, the last lecture was too long, but also the other lectures are slightly longer than I announced in the beginning, and I would say something about this very important and complex topic in the last lecture and the next lecture, but not today. Somebody said that they grew up with somebody who is a close friend of Mr. Beast, and this close friend said that he was very anti-social, and when you research a bit about him, people indeed say that a lot.
So he is very successful, he knows how to make very popular videos, but he is a terrible boss, so if you work with him he doesn't listen, he tells you what to do, and if you don't do it his way, it's his way or the highway. So interesting, I mean there are many such figures in the history of science also, like Isaac Newton. He was also not the nicest of human beings. And it's a very interesting discussion how to deal with such people. Maybe they do something interesting but they are just terrible people, interesting trade-off. Is our brain inherently a neural network? So we were making our way to neural networks starting with linear algebra, now we landed at logistic regression, which is the simplest kind of neural network, the other stuff is basically the same, just more complicated, so are we inherently neural networks? A variety of opinions, of course a fascinating topic by itself. "Yes, but I don't think our brain uses gradient descent." Maybe it does, how does it optimize, how does it learn? "Our brain might be a neural network but not necessarily digitizable." That's true, so our brain, maybe it's very similar to something you could build with a machine, but you can't read out the state and copy it to some other device. "I don't think there is more than what is physically measurable." I mean there is always the idea that maybe there is something which is beyond physics, interesting debate. "Too many things in the body are saved in multiple weird ways." It's not only in the brain, a lot of intelligence is also in our gut, bowels, everywhere, cells across the body, so somehow distributed intelligence, and we are only beginning to understand this. "The brain is a lot more complex than a neural network." "I believe it's not just a neural network, but maybe that's just my brain trying to be special." That was also a very clever remark, I think. Yes, we humans want to be special. I also added my opinion, it's very interesting. So I love history, I read a lot about that.
If you look back in history for basically any topic X in history, 100 years ago, 1000 years ago, for example life, what is life, why are there things crawling around moving. Before people understood it, for example 200 years ago, no microscopes, you had no way to know about molecules or DNA or what happens at the microscopic or even smaller level. And when you are in that situation that you can't simply can't see deep enough or detailed enough then humans always fill in the gap with mystical stories. So they always did that and I think it's only natural to assume that we are still doing that today. So today the topic is consciousness for example. We don't know, we have no idea how it works, so we think it's something special. But in the past whenever we thought that, and at some point you could look closer, deeper, more detailed, and then you saw it, and then it was pretty simple and surprising. Which doesn't mean that it's boring, it's still fascinating how it works, right? How biology works at the molecular level, but it's pretty technical, so that's quite interesting. So, our topic for today, knowledge graphs. So what's a knowledge graph? So these first seven slides, they are just a completely new topic, I will just show you some examples. So a knowledge graph, very similar to a database, so now we are going to a different world, but one which is still very connected to search. When you search, you can also search in a database. And here's an example. So that's an example of a knowledge graph. We will see on the next slide, why graph. But in the simplest form, let me find my laser pointer here, that's just triples, so it's like very simple, the simplest form of sentence, a subject, a predicate and an object, and we are still in movie world here, like we did, where for so many sheets, Nicole Kidman acted in Eyes Wide Shut. Who knows any of these two movies? 
Wrong generation, okay, maybe I should update the movies, but they are very good movies, they are not so old actually. Eyes Wide Shut as you can see is a Stanley Kubrick movie, Burn After Reading is by the Coen brothers again. I recommend watching them. Brad Pitt acted in Burn After Reading so these are movie names. And let's look at the example a little more because we learn a few things about knowledge graphs here even by this simple example. For what follows it's very important that when you refer to this movie Eyes Wide Shut you always refer to it in the same way otherwise you don't know that for example Tom Cruise and Brad Pitt acted in the same movie if you call it slightly differently. So the names also here when you want to express that an individual acted in a movie you have to call it acted in by exactly that way. So that's why it says unique identifiers, if you want to express this relation always use this word. That's one thing. And you can see things can occur on the subject side or on the object side. For example Tom Cruise here occurs on the subject side of that he acted in this movie. Here he occurs on the object side of being married to someone else. Also this is not complete, also important. Many more people play in these act in these movies but it's just a selection. We do have the directors here, we don't know whether it's complete, it doesn't say, it only says Nicole Kidman married to Tom Cruise, not the other way around, so maybe some things are missing. But basically that's a knowledge graph in the form of triples. I think it's pretty easy to understand what it is. And now you can imagine a lot more knowledge being in that form. Why is it called a graph? So to save some time I've prepared this. This is the exact same information as a graph. So the things which occurred as subject or object, doesn't matter now, are now these things in blue. 
So we have this movie here, Burn After Reading, we have Eyes Wide Shut, another movie, here we have the people, and here we have arrows. Note that it's a directed graph; if we go back, right, it's Nicole Kidman acted in that film and not the film acted in her, so it has a direction. This guy is the director of this movie. Here, married to: there could be an arrow in the other direction, but we don't have it. So very naturally, if you have these triples you can also draw them as a graph. Yes please. [Student, partly inaudible:] In the previous slide there is an error, the last line should say director of Eyes Wide Shut. Oh, thank you for paying attention. I could say that I did it on purpose, but I didn't. You are so right. Did you know it or did you research it? Yeah, and that's kind of unlikely, right? Copy-paste error, thank you very much, very attentive. So why is it now saying that it doesn't know how to write this? Correct, he's the director of Eyes Wide Shut. Thank you, and let's go on. So what knowledge graphs are out there? And I will talk a lot more about Wikidata in the next lecture. I will not talk too much about it now, otherwise it will be repetitive. So for this kind of information, people have amassed huge knowledge graphs where you basically have all the information about the world. They are huge. We will see how huge in a second. Until 2018 the biggest one was Freebase. Nice wordplay, Freebase: it contains base like database, free because it was free, and freebase is also a drug, some form of crack cocaine, so a nice typical computer scientist joke.
Freebase was started by a company called Metaweb, nobody knows this company I think, which was bought by Google already back in 2010 in a very smart move. What they started in this company is now a major part of Google's infrastructure and data, and it was acquired for, I mean it's nothing for Google, 99 million back then it was maybe. It's a bit hard to find, but I think that was actually the sum they paid [inaudible]. But then it was discontinued, so Google absorbed it, it was open for a while and then it was not open any longer, and then Wikidata took over, which was a new Wikimedia project. Who knows Wikidata in the room here? Oh yeah, you should, because we had data sets from Wikidata. Yeah, we'll talk more about Wikidata. And so Wikidata was small in the beginning and then grew significantly over time. So the final size of Freebase was 3 billion of these triples, which I showed you three slides ago, on 60 million things. Nicole Kidman is like an entity, and a triple is that she acted in that movie. And Wikidata by now is 18 billion triples, so a lot has happened since then, on 87 million things. So the number of things does not grow so much, but the information about these things does. And this is done like Wikipedia, it's crowdsourcing, you can go there, add some triples, correct some things. And what we did for you: we provide an extract from Wikidata, simplified. In reality these knowledge graphs are pretty complex, we made it simple for you so that you can focus on the lecture. It's linked on the wiki, but let me just show you what the data set looks like. Data sets, Wikidata TSV, yeah, that's what it looks like. So it starts randomly with something, so here we have Japan has diplomatic relations with Serbia. Ok, it starts with all the countries, that's just how it starts, in some order. It's a pretty big file, you see a lot of triples here. Let's see, ok, there was something about the sun.
The sun is, lots of information about the sun. Let's see what else we have here. Is that all the information about the sun we have? Child astronomical body. Here are others. The sun is an instance of a G-type main-sequence star. Sun notation, solar symbol. Sun part of solar system. Sun present in work, Star Trek. Okay, that's interesting, one of the few triples about the sun is that it's present in the Star Trek movies. You get the idea, this is a pretty big file, that's what you will work with for the exercise sheet. And let's just look how many lines it has. That's just a very, very small sample of the whole of Wikidata: 38 million, so a tiny bit of the 18 billion of Wikidata, but still pretty big, pretty interesting stuff in it. That's what we give to you for the exercise sheet. Also, in case you have problems, I don't know, it depends on the machine you are working on, it should work, we also have a smaller version of the data set. So you should really work with the big one, but if it absolutely doesn't work, here is a smaller one with just 4.5 million triples. Ok, back to the slides. You may wonder, can I really cast all information into this triple format? Here's slightly more complex information. For example, these guys, they married on a particular day and the marriage ended, it was actually pretty long for show business, and they married in a particular place. So this is information connecting several entities: this person married this person at this time at this place. And you can also write this as triples, like this. So you can say Nicole Kidman married to, and now you have this intermediate object, which I called XYZ here, you can give it any name, it just has to be a unique name. And now for XYZ I have the person here, the start time of the marriage, the end time, the place of the marriage.
So I can think of it in the graph: if I go back to the graph, I could have here, instead of going directly to the other person, an intermediate node which is an information node, and now I can have all kinds of information attached to that node. But that's just for your curiosity, you don't need it for the exercise sheet. So for the data set it's simple, it's just arrows connecting two real things. Now of course it's the information retrieval lecture, so we want to search this data set. And the language of choice for this kind of data is SPARQL. And SPARQL is a wordplay on SQL, so it's full of computer science jokes here, because SPARQL is an acronym for SPARQL Protocol, it's a recursive, a self-containing acronym: SPARQL, SPARQL Protocol and RDF Query Language. So this data model of casting everything into triples is called RDF, but it's not important for the lecture today, just if you wonder why it says RDF here. And it's also no coincidence that it sounds like SQL, the query language for databases, because it's very similar. Here's an example query, and we will see it live on a database in a few minutes. Let's say that's one of our running examples. We are in movie world and we want people who are married and who acted together in at least one movie. And here is how you would express that in SPARQL. So it looks similar, how many of you know SQL? You should at least have heard of SQL somehow. So you also write triples here, don't bother too much about the semantics, but what you have here in the query is that you write triples where in some places you can write variables. So I'm looking for a person who acts in some film. I'm looking for another person, let me take the laser pointer, who acts in the same film. It doesn't say film one, film two here. So these are two persons who act in the same film, and this person is married to that other person.
So that's why SQL was invented, it's kind of very natural, you can almost read it. So it's like a high-level programming language. Then you have to say what you want to see in the end: I want the two persons and the film they acted together in. And here again, in real SPARQL things look a little bit more complicated, so we work with a simplified version, but the simplification is not so big. Actually you have these brackets here on that slide, and we drop them because we don't need them. So yeah, you can look at that, I will also talk about it a little more in the next lecture, for today it's very simple. Also, let me go back to that slide: in real data you usually don't have things which you can read, you will just have identifiers. It will say Q80 here, and here it will have a number, and here it will have some alphanumerical thing, and then you have to look up what they actually mean. So we have made it human-readable for you so that it is easier to work with. And now you can also view a query as a graph. So here is a picture, again we are interested in people who are married and who act in the same film. So you can also write it as a small graph, here you have it, where in some places, and this could also be the predicates, but here it's the nodes, you put variables. So you can think of this as a pattern, that's what it's called in SPARQL world, or as a template, and now in your huge knowledge graph you are looking for where this pattern fits: where do I have a person and another person, there is an arrow from one to the other, and from both there is an acted in to one film. And if we look at our original graph, it would match here, and you have to pay attention to the direction of the arrow; depending on that, the one person will be person one and the other will be person two. So that's the only place where this pattern would match in this picture.
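To make the idea of matching a pattern concrete, here is a small sketch that searches a handful of triples for the pattern from the example: person 1 acted in film, person 2 acted in film, person 1 married to person 2. The triples are a toy selection in the spirit of the slide and the function name is invented; a real SPARQL engine works very differently from these nested loops.

```python
# Toy knowledge graph as (subject, predicate, object) triples (illustrative selection).
triples = [
    ("Nicole Kidman", "acted in", "Eyes Wide Shut"),
    ("Tom Cruise", "acted in", "Eyes Wide Shut"),
    ("Brad Pitt", "acted in", "Burn After Reading"),
    ("Nicole Kidman", "married to", "Tom Cruise"),
    ("Stanley Kubrick", "director of", "Eyes Wide Shut"),
]


def match_married_costars(triples):
    """Return every assignment (person1, person2, film) that fits the pattern."""
    acted_in = [(s, o) for s, p, o in triples if p == "acted in"]
    married_to = [(s, o) for s, p, o in triples if p == "married to"]
    results = []
    for person1, person2 in married_to:       # the direction of the arrow matters
        for actor, film in acted_in:
            if actor == person1 and (person2, film) in acted_in:
                results.append((person1, person2, film))
    return results


matches = match_married_costars(triples)
```

Each element of `matches` is one assignment of the three variables. With these toy triples there is exactly one, with Nicole Kidman as person 1 because the married to arrow starts at her.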
Let's go back to this slide. And then the result, and we will see that in a second, is this: every possible assignment to these three variables that matches will be a row in my result. So one match here will be Nicole Kidman, Tom Cruise and Eyes Wide Shut. But there could be more. And we will play around with this query and a database in a second. Okay, that was just the introduction to what we are dealing with today. Any questions about this so far? It won't get very complicated today. I wonder where this funny noise comes from, maybe aliens. So databases: as I said, this looks related to databases, and actually you could just store it in a database, which is exactly what we will do today. So we will just store this in a database, no inverted index, but we will come back to the inverted index once again at the end of the lecture. And you will also do that for the exercise sheet. Now in database world, the query language is SQL. And I don't know how fluent your SQL is, but I will give you a crash course. So in case you missed everything in the database lecture, you will learn the most important stuff today. Because the basics are pretty simple. So what's a database? A database, for the purpose of this lecture, and it's pretty close to the truth, is just a collection of tables. So here are two example tables. And how do I cast a knowledge graph, oh it says knowledge base here, it's sometimes also called knowledge base; actually they changed that because knowledge base sounds so boring, so they changed it to knowledge graph, it's somehow more exciting, it's the same thing. So one way to cast this information into a database is by predicate. In our example we had acted in, which person acts in which film, who is married to whom, who directs which film, and a lot more. And you could have one table for each predicate; each table then has two columns, one for the subject, one for the object.
So for example for acted in you have these here. For married to I added some more, I paid attention to diversity, so we have all combinations here, male, female: Nicole Kidman and Tom Cruise, Ellen and Portia de Rossi, so these are showbiz people. And I actually researched this on the web. So it actually says Pythagoras here, but it's not that Pythagoras, it's not the famous one. Apparently a lot of people in old Greece were called that, maybe it was a name like John or Peter, I don't know. So it's not the one with the triangles. So apparently they married, I also didn't know that. That's one interesting thing about knowledge graphs, you learn so much, because there is so much information in them. So I now learned that there was a guy named Pythagoras who married the famous emperor Nero. And there was even a public ceremony, interesting. So that's why this row is correct, although I was skeptical at first when I saw it in the data, but it's correct. So here we just have two tables, and that's one way to cast this into database world. For the exercise sheet we will do it slightly differently, we will see it later in the lecture, because it's not a given how you cast this into tables; you could also have one big table for everything. And if you go back to what we had in the beginning, I mean this is already a table, right? Just put table lines around this, and this is the column for subject, this is the column for predicate, this is the column for object. That's the simplest way to put it in a table, and actually that's what you will do for the exercise sheet. But for now, and I did this deliberately, I will work with this other way so that you have to think a bit yourself. And it actually makes sense, both ways make sense. But we now work with this way. And for what I'm going to show now we just have these two tables, we ignore all other predicates. Ok, so now SQL.
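The two ways of casting triples into tables can be sketched with Python's built-in sqlite3 module. The table names (`acted_in`, `married_to`, `triples`), the column names and the rows are assumptions for illustration; the exercise sheet may use different names.

```python
import sqlite3

# A few toy facts (illustrative, not the real data set).
facts = [
    ("Nicole Kidman", "acted in", "Eyes Wide Shut"),
    ("Tom Cruise", "acted in", "Eyes Wide Shut"),
    ("Brad Pitt", "acted in", "Burn After Reading"),
    ("Nicole Kidman", "married to", "Tom Cruise"),
]

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk

# Layout 1: one two-column table per predicate.
conn.execute("CREATE TABLE acted_in (person TEXT, film TEXT)")
conn.execute("CREATE TABLE married_to (person_1 TEXT, person_2 TEXT)")
table_for = {"acted in": "acted_in", "married to": "married_to"}
for subject, predicate, obj in facts:
    conn.execute(f"INSERT INTO {table_for[predicate]} VALUES (?, ?)", (subject, obj))

# Layout 2: one big three-column table for all triples.
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", facts)

# The same question asked against both layouts: who acted in Eyes Wide Shut?
per_predicate = [row[0] for row in conn.execute(
    "SELECT person FROM acted_in WHERE film = ? ORDER BY person",
    ("Eyes Wide Shut",))]
single_table = [row[0] for row in conn.execute(
    "SELECT subject FROM triples WHERE predicate = ? AND object = ? ORDER BY subject",
    ("acted in", "Eyes Wide Shut"))]
```

With one table per predicate, the predicate is chosen by the table name. With a single table, it becomes an extra condition on the predicate column.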
And now we will do some live SQLing, and before we do that, let me skip these two slides. How do we work with a database? There's a very nice tool. Who knows SQLite, who has worked with SQLite? I mean there are these complicated databases, Oracle, and it takes days to set them up. You have to run a server, you have to get the access rights right. It's no fun. SQLite is a super, super lightweight tool, that's why it's called SQLite, to work with databases. You will see how super duper easy it is. And it's really good to know. So if you don't have it on your machine, that's how you can install it. There are two types of commands: the SQL commands from the SQL language, and then there are SQLite commands, they start with a dot, you will see it in a second. And you have two modes to start it: just like this, sqlite3, or with a file name, and then everything you do, like loading some data and putting it into a table, is written into a file so that it persists, so that when you start the next time and you write the same line it's already there; otherwise you have to start over every time. And here are some commands, this is just for your reference, let's just do this together. So let me show you, I have prepared two files. This one is just who acted in which movie, it is somehow sorted by popularity, so Game of Thrones is first here, Jonathan Pryce acted in Game of Thrones. So this is our two-column table in file form. And here we have married to, Julius Caesar is also in there, oh ok, that's a slightly different data set apparently, but also not small. So this is our married to, and that's all we have for now, plus another example which is very small to show something later. So now I do sqlite3, and whatever I do now I want it to be stored in a file, and I call it acted in and married to, these two things, and I call it db, but I can call it anything I like.
Now I am in SQLite world. It's like bash: I can clear the screen here, I have a history of things, so it's very convenient. The first thing I want to do is create a table. That was on the slides, but we don't have to look there. I want to create a table for my predicate acted in, and SQL is also very much like SPARQL, you can basically read it. I want a table acted_in, and the first column I want to call person, and it's text, everything is text for us today. And the second column is film, and it's also text, and you have to end with a semicolon. And now I can ask what tables I have right now: .schema, that's a SQLite command, and it tells me you have created this table. It's currently empty. Now I can do arrow up, because I want another table, and I want to call it married_to. I can name it any way I want, but I will name it in a meaningful way. Let me call the first column person_1 and the second column person_2 and create that table. If I do .schema now, I have two tables, and they are empty. Now I could do a first query: select star from, so show me all the rows from the first table. There is some autocompletion here if you type three letters. The result is nothing, it's empty, because I haven't done anything with the tables yet, I just created them. Now I want to read in a file. Very important: you first have to say what the separator is. It's a tab-separated file, so I have to say that, otherwise SQLite uses its default separator, which is not the tab. And how do you type a tab, this is very important knowledge. If I press Tab, I get tab completion. So in bash, in the Linux world, it's always Ctrl+V, like for paste, and then the Tab key. With Ctrl+V you can type a special character and you actually get the character, not the function of the character. So now I have told it: when you read something in or write something out, use tab as the separator.
I import like this: .import, the name of the file, that was acted_in.tsv, and the table to import it into. No semicolon, because that's not a SQL command, that's a SQLite command. I'm telling SQLite: import this. It takes a bit. And now I also import the other file, married_to.tsv, into the other table, and you will do something very similar for the exercise sheet. The order is file, then table. OK, so my schema hasn't changed, but I would expect that if I ask the same query now, I get the contents of acted_in, and I do. Now I could ask: give me only the persons from acted_in. Or give me only the films from acted_in, now I get all films. Give me only ten films: select film from acted_in limit 10. You can pretty much read it, right? It's ten films, OK. They are repeated, because it's just the second column, and many people act in the same film. I could also say, and you don't need that for the exercise sheet, give me ten distinct films. You see they are somehow ordered by popularity. So there are many more commands, of course, these are very rich, complex tools, but the basic stuff is super simple, as you just saw. I created a database, I can query it, that's very nice. Here are some more things. Indices, we will talk about this: create an index, this is about performance. You can ask a query like this, we will see more complex queries in a second. You can delete a table or an index, with or without a warning if it's not there. This is for your reference. I don't think you need anything more for the exercise sheet, but of course you can look it up in the documentation. Now let's go back to two example queries. For the first example query we are just interested in acted_in, and let's say we want all actors from the film Burn After Reading. Well, here's the query, but let's just do it together. The first thing to think about is: what table am I interested in? I'm interested in the acted_in table.
And what do I want? I want persons from that. And now I want a condition, where. This table has two columns, person and film, and the film should be Burn After Reading. Now I have to know how it's written, I think it's capitalized in the data, and I need a semicolon, and now I should get, yeah. I could also write person comma film, then I would have a two-column result, with the name of the film repeated in the second column, I don't know how many times. I think I could also write film twice here and then I would get it twice. It just does what I tell it, right? So this is a very simple SQL query. You see, I loaded data into a database, I can query it, it's really, really easy, that's why it's called SQLite. Let's copy the query. One more thing that's useful to know: how do I quit? With quit, but there was a dot before it. That's also hard with the debugger in C++, quitting the program is sometimes the hardest part. Exit doesn't work, exit with a dot, quit, help doesn't, yeah, you see, even Ctrl+C doesn't work. Oh, now it works, now I killed it, OK. Now the question is, since I killed it: if I go back in, do I have a corrupted database? You see, I still have my history here, and yes, I still have my data. Let me exit properly with .quit. What I can also do, instead of starting it in interactive mode: I write sqlite3, then the database file, if I have used one, and then the command. Now I have to be a bit careful with the quotes, let me see whether I can use these quotes here. Yeah, it also works. So I can just give my SQL command as another argument, and then I can run queries on the database with a single command line. I don't even have to go into the program.
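The interactive session so far (create a table, import a tab-separated file, run a few selects) can be replayed roughly as follows from Python. This is only a sketch: it uses an in-memory database instead of a file, and the three tab-separated lines are made up to stand in for the real acted_in file.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the lecture uses a file so the data persists
cur = conn.cursor()
cur.execute("CREATE TABLE acted_in(person TEXT, film TEXT)")

# Stand-in for the tab-separated file; these lines are invented for illustration.
tsv = (
    "Brad_Pitt\tBurn_After_Reading\n"
    "Tom_Cruise\tEyes_Wide_Shut\n"
    "Nicole_Kidman\tEyes_Wide_Shut\n"
)
rows = [tuple(line.split("\t")) for line in tsv.splitlines()]
cur.executemany("INSERT INTO acted_in VALUES (?, ?)", rows)

# Only the film column, at most ten rows; films repeat, one row per actor.
films = cur.execute("SELECT film FROM acted_in LIMIT 10").fetchall()

# The same, but each film only once.
distinct_films = cur.execute(
    "SELECT DISTINCT film FROM acted_in LIMIT 10"
).fetchall()

# All actors of one film; the string must match the stored value exactly.
actors = cur.execute(
    "SELECT person FROM acted_in WHERE film = 'Burn_After_Reading'"
).fetchall()
print(len(films), len(distinct_films), actors)
```

Every result comes back as a list of tuples, one tuple per row, which mirrors the point that the result of a query is always a table.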
So whenever you do database stuff and you don't need super deep, complex stuff, consider using SQLite, because it's so much fun to use. And as we will see in a second, there is also a very nice Python interface, very easy to use from within Python. You basically say import sqlite3, you say open the database, and then you can ask queries. So we already said that: you select rows from a table with certain properties, specified in the WHERE clause, and in the SELECT clause you say what should be displayed. The result is always a table. That's the case in the database world, and in SPARQL, in the RDF world, it's the same. Here's some more, oh, yes please. Yeah, that's a very good question, and Nataly and I discussed this. So these are my examples, and here I have an even shorter one, which is basically some triples from my very first slide. And here I don't have underscores. I don't need them, because my separator is a tab, so everything else is an OK character. We could have done it that way for the dataset which we give you for the exercise sheet, but we didn't, because then it's sometimes hard to see: is this a tab, is this a space? The short answer is it doesn't matter. You don't need the underscores, but just to help you distinguish, we didn't include spaces, we replaced all spaces by underscores. It plays no role whatsoever. It actually wouldn't work that way because... Oh yeah, absolutely. Oh yeah, exactly. If I did this now, I'm sorry, it wouldn't work. It's just a different string, right? Empty result. And by the way, I will talk about this more in the next lecture, that's a big problem when working with these knowledge graphs. Let's say I write it like this, because I don't know that it's capitalized: I also don't get anything. It's just exact string search.
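The point about exact string search can be checked with a small sketch. The single row and the helper name actors_of are made up for illustration; the behaviour shown is SQLite's default comparison for text, which distinguishes both case and spaces versus underscores.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE acted_in(person TEXT, film TEXT)")
conn.execute("INSERT INTO acted_in VALUES ('Brad_Pitt', 'Burn_After_Reading')")

def actors_of(film):
    """Return all persons whose film column equals the given string exactly."""
    query = "SELECT person FROM acted_in WHERE film = ?"
    return [row[0] for row in conn.execute(query, (film,))]

exact = actors_of("Burn_After_Reading")       # spelled as stored
with_spaces = actors_of("Burn After Reading")  # spaces instead of underscores
lowercase = actors_of("burn_after_reading")    # different capitalization
print(exact, with_spaces, lowercase)
```

Only the first call finds the row; the other two return an empty result without any error, which is exactly why a misspelled name in a query is easy to overlook.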
You can also do regular expression stuff, but then you have to write regular expressions. Otherwise you have to write it exactly the way it is in the database. But yeah, thanks for spotting that. I think on the slides it's probably consistent, right? Not quite, here I have it like this, maybe I'll just change it here. So here is a slightly more complex query. Ah OK, let's do it that way: let's now assume that we have everything in one big table. And now you have to pay attention. It's not a very complicated lecture today, but this is the one thing to understand about databases. I will explain it slowly, because it's maybe the most important single thing about databases. I deliberately have a very small example here with just five triples, not even all the triples from my initial example. It's just some people who acted in two of these movies, and not even all the ones I listed, and Nicole Kidman is married to Tom Cruise. Just these triples. Let me now read them into another database and show you something. So sqlite3 with example.db, which is empty now, I don't have anything here, right? Let me first create a table, let me call it all. It's a table with three columns now, and it's a generic table that could contain any knowledge graph: the first column is subject, the second column I call predicate, and the third one is object. OK, and I apparently did something wrong. What did I do wrong? All is a keyword. I should have tried this before. Let me just call it triples. On the slide it's called all, I didn't know that all was a keyword. And now we need the separator again, because we want to read the data from above. And now import the file example, the file comes first, into triples.
OK, and now I have my one table, and if I do a select star from triples, I should get all my triples. Now I have them in the database, and now think about it yourself. I hope I didn't go there yet, don't look at the slides. On this simple database I want all pairs of persons who are married, there is only one married couple here, and who acted together in the same movie. How do I get this? Let's maybe think about it for yourself first. How would we go about it? Select, we don't know yet what we select, from, we only have this one table, where. So I certainly want to say predicate is equal to married to, oh no, it's without the space here. Predicate is married_to, and person one, yeah, now what do I say? Person one plays in some movie in which the other person also plays, right? So how do I do this? I don't get any further if I just have the table once. How do I say: and person one plays in the same movie as person two? And now the one thing, and I will show it to you now: in databases I can write multiple tables here. So I can have this table once as t1 and have the same table here again as t2. Let me just do this. There will be a slide about it, but let's first see what it does. What I get now is the dot product of the two tables. Not the dot product, the cross product, sorry. I get every row of this table combined with every row of that table. So here I have row one combined with row one, row one combined with row two, and so on: five times row one, combined with every row. Then five times row two, combined with every row. Five times row three. So we have something like ten lines with Brad Pitt, because I have him here in both copies. So it's all pairwise combinations, 25 of them, and that's why I chose such a small example. So if I do a count here, that's also valid SQL, I should get 25, and I do.
And now how does this help me? Because whenever you do something in the database world, you have to specify a condition on a single line. So let's look at the line which is interesting for us. Let me ask that question: I'm claiming that one line here is interesting for answering our query. Which line is it, if you count them from 1 to 25? Is there a single line which contains all the information we need to answer our query, people who are married and acted together in the same movie? I think we need the table again, right? We need it one more time. I mean, this one comes close. It says Tom Cruise acted in, Nicole Kidman married to, but I would also need the information that Nicole Kidman played in that movie, too. So I think I need the table a third time. But how large is it now? Now it's very large. So let me filter this a little bit. OK, let me do the following: I just want two columns from each table. Let me do it like this: t1 person, t1 film. No, no, it's not person and film, they are called subject and object now. And t3 subject comma t3 object. OK, this is readable now. How many lines do I have? 125? Yeah, that's correct, it's 125. So let's not display all lines. With a where clause I can restrict, I can filter what I'm showing. So let me just show the lines which start with Nicole Kidman here. There are other lines, but I claim the line we are interested in is among them. I have not printed the predicates, I could have by writing t1 predicate here. So this is just Nicole Kidman, married to, Tom Cruise, Nicole Kidman played in Eyes Wide Shut, and played in Eyes Wide Shut. So let's find the line which contains the information. Ah, this is a good one, right?
I think this line contains all the information we need to answer our query. I didn't print out the married_to, to save some space. So this is a line from table 1, saying that she is married to him. This is a line from table 2, which is a copy of the table, saying that Tom Cruise played in that movie. And this is a line from table 3, which is again just a copy, saying that Nicole Kidman played in that movie. And whenever you wonder how to construct such a query: you have to make copies of a table and combine them such that the information you need is in a single line. And here it is. Now I just have to formulate the right condition so that I indeed get that line, and that's not too hard. So what do I want? I want... Ah, and now I understand, I'm sorry, I mixed something up, now I understand why I started with two tables. Actually, on the slide I was not talking about married to here. Here I just wanted all pairs of actors who acted in the same movie. I did two things at once, I'm sorry, but it doesn't matter. Maybe we first finish the more complicated one and then come back to the simpler example. So let's say the first table is the one with the married to, so now I want t2... Oh, sorry for the confusion, let's go back to the slightly simpler query here. Everything I said so far is correct, but the query here is just all pairs of actors who acted in the same movie. Let me just show the two copies again. So here we have two copies of my table and the cross product, and I can just check every line.
So I have an actor here who acted in some movie, and an actor here who acted in some movie, and I just have to check whether it's the same movie, and if yes, I output that pair. So what would be my where condition? You tell me which conditions I should write here. And I can combine them with and: I can have many conditions, condition so, and condition so, and condition so. Any suggestions? You can also write it in the chat. For the first condition: t1.object equals t2.object. Yeah, that's a very good idea. I want this object and this object to be the same, exactly. And I can even try it out, so what do I get now? OK, next: t1.person is not equal to t2.person. What? Oh, subject, yeah, not person. I confused myself, OK. Yeah, that's important, right: we only have acted in here, but that's kind of a coincidence, because our example is so small. I should also say that t1.predicate is acted_in, it already is, but only because I don't have any more data, and that t2.predicate is acted_in. Thank you for being more attentive than I am. So now we have these. And actually we are not interested in the movie, we are just interested in the people, so we just select t1 subject here and t2 subject here. Oh my, what did I do wrong? Oh my, maybe we should switch. OK, now we have it. And let's look at the original dataset again, let me just do a select star from the original table. So for Burn After Reading we have only one triple, so we don't have a pair of distinct people who played in Burn After Reading. And I have three actors in Eyes Wide Shut, and I get all pairwise combinations except the people with themselves. So how many should I get? Three times three minus three, right? There are nine combinations, three of them where the person occurs together with itself, so six. And I have all of them here.
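The counts from this part (25, 125 and 6) can be reproduced with a small sketch. The five triples below are an assumption that matches the description (one actor in Burn After Reading, three in Eyes Wide Shut, one marriage); in particular the third Eyes Wide Shut actor is a guess, since the lecture's file is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples(subject TEXT, predicate TEXT, object TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("Brad_Pitt", "acted_in", "Burn_After_Reading"),
    ("Tom_Cruise", "acted_in", "Eyes_Wide_Shut"),
    ("Nicole_Kidman", "acted_in", "Eyes_Wide_Shut"),
    ("Sydney_Pollack", "acted_in", "Eyes_Wide_Shut"),  # assumed third actor
    ("Nicole_Kidman", "married_to", "Tom_Cruise"),
])

# Two copies of the table: every row combined with every row, 5 * 5.
n_two = conn.execute(
    "SELECT COUNT(*) FROM triples AS t1, triples AS t2"
).fetchone()[0]

# Three copies: 5 * 5 * 5.
n_three = conn.execute(
    "SELECT COUNT(*) FROM triples AS t1, triples AS t2, triples AS t3"
).fetchone()[0]

# All pairs of distinct actors who acted in the same movie.
pairs = conn.execute("""
    SELECT t1.subject, t2.subject
    FROM triples AS t1, triples AS t2
    WHERE t1.predicate = 'acted_in' AND t2.predicate = 'acted_in'
      AND t1.object = t2.object
      AND t1.subject != t2.subject
""").fetchall()
print(n_two, n_three, len(pairs))
```

Each pair shows up in both orders, which is where 3 times 3 minus 3 equals 6 comes from.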
Sorry, we first went via the more complicated query, which we will come back to after the break, but this one is actually simpler. Is there any question about this? Because that's really the thing you have to understand about databases: you have to write the table here twice so that you can then express this condition. Is there any question about this? Yes please. That's a very good question. The question is about how I showed it here. We first took two copies of the table, let me go back to this, right, we had these 25 rows, and then I filtered from them. From how I explained it, and also from how I wrote the queries, it seems that this is also what the database engine, SQLite, does: it first creates this huge thing, which is large even for this very small example, and then filters it. The last four slides of today's lecture are exactly about how it's really done. Conceptually it's done like this, but you can get the same result differently, and the last part of the lecture is about that. To understand what a query means, you have to understand that, if you define the semantics, it does this in principle. But you can compute what you want more efficiently. Is there any other question? So I think what I have here on the slides is correct. It's just the cross product, and we will talk more about the cross product in a second. And also note, I wasn't able to call the table all because it's a reserved word. If you want copies of the same table, you can write AS: you write the same table again and just give it a different name. So this was triples one and triples two. So we already talked about some SQLite commands, and this here is what you need for the exercise sheet, and we do that after the break. So, five minutes break.
So the exercise sheet will be to write a Python program where you can input a SPARQL query, and it computes the corresponding SQL query and executes it on a database into which you imported the dataset which I showed you initially. Let me show you how it's done in principle. I will actually show you a lot, but it still takes some thinking, it's a nice exercise. In the following I will explain it for one table per predicate, and for the exercise sheet you will do it for one table for everything, so that you have some transfer to do and it's not just copying what I explained here. So we assume what I showed you earlier, these two tables, acted_in and married_to. We stored them in a file, so we don't have to do it again. So here is an example, and let's do the query together for that example, and then I hope you get the general idea. So let me quit here and start again with this database. This was how we started, two tables, and let me again do the select star from. These were bigger tables than in my small example. See, now the separator is the original one, it's just a pipe, fine too. Select star from acted_in, select star from married_to, these are my married couples. OK, and now this is my SPARQL query. It's our running example query: people who acted together in the same film and who are married to each other. And the question is: how do I do it in SQL? This is what I started earlier with the one big table, but now we do it for real with these two tables. And it's always good to start with the tables. So let me start writing here: we certainly need each table, married_to and acted_in. How many copies do we need of each? If you think about it, do you have an idea? One copy of each? OK, let's try one copy of each.
And it's good to think about it and see where it fails or does not fail. So let's write married_to, OK, here I write it with underscores again, fine. If I just want to use the table as is, I don't have to give it a name, but I can if I want to, so let me call it m. Comma, acted_in as, let me call that a. It's just a shorthand, I could also write acted_in. OK, now what are the conditions? What's one condition? Let's just start with some conditions, and in the end we probably also want to decide what to select. Yeah, you can also write it in the chat. It's not trivial, one has to think a little bit, but it's similar to the example we saw earlier, just a little more complicated. Another acted_in, sure. I'm like ChatGPT, I do what I'm being told: acted_in as a2. So now we have married_to, that table is called m, and two copies of acted_in, so it's a cross product of three, right? Now we have a line from married_to, a line from acted_in, and a line from acted_in again. That's what we have in one line now, in all combinations. Now you can say something about such a line. And what would that look like? You don't have to say all the conditions at once, we can add them one by one. What do we want? a1.film equals a2.film. So the same film in the two rows: somebody acted in this film, somebody acted in this film. OK, and I guess we want more. It could all be in the same line, newlines don't play any role for SQL. Other conditions? a1 and a2 not... what is not equal? OK, a1 dot, I think we called it person, right? a1.person is not a2.person. OK. What else? And? a1.person is m... So a1.person, the person from the first copy of acted_in, is person one from the marriage relation, m.person_1. Or the other way round, yes. But now what else? And?
And the person of the second copy of acted_in should be the second person, I guess. And it's a good observation: we could also have this and this, or the other way around. If you looked at the data carefully, you would see that usually the marriage is there in both directions, but not always. But to keep the query short, let's for now just assume that married_to contains the information in both directions. Are we done? Yeah, a person won't be married to themselves. You never know, these datasets contain all sorts of things, but yeah, let's assume that. So you don't really need the second line, I agree. Are we complete, or are we missing something? Well, what should be in the select? The SPARQL query just asks for the one person, the other person and the film. How do we write it? And actually we have a choice here, right? If we want the first person, we could write a1.person or m.person_1. By the condition here it's the same thing, that's very typical. So a1.person, comma, and note the syntactical difference: in SQL you have a comma, in SPARQL you don't. Then a2.person, comma, and then the film. Which one do we take? It doesn't matter, because they are the same. So let's just go with one of them, a1.film, similar to what you sometimes have in programming. And let's try that. And maybe, yeah, this remark that a person is not married to themselves. Yes, that's true, that's a very good question. But by these two conditions, if married_to doesn't contain anybody married to themselves, I am guaranteed that they are different, right? Because here it says this person is person one and this person is person two from one row of married_to, and these can't be the same. So it follows from these two conditions in a subtle way. But only if my table has this property. But let's see, we can check. Now let's memorize this and try to do it here. So we do select something from, and we needed two copies of acted_in.
It wasn't clear from the start, but if you try to do it with just married_to and one acted_in, you will see that you are missing something. There will be a slide which goes deeper into why we need two copies, it's very related to the exercise sheet. So we have married_to, let's do it the same way as on the slide. We have one copy of acted_in which we called a1, and another copy of acted_in which we called a2. Then we have where, and then we had a1.film is equal to a2.film, and a1.person is the first person of married_to, and a2.person is m.person_2. And these things after the dot are how we named our columns, right? And what did we want to select? We wanted a1.person, a2.person, a1.film, and we had a choice there, yep. Now let's see what happens. It takes a while, but then it works. There are actually quite a few, OK. And now let's make use of what I showed earlier: let's copy the query, I hope it's copied now, let's exit this and execute the query from the command line. It's the same thing, but now I'm just in the shell, in the bash console. It's loading the database, and I'm getting the result. Now I can count how many lines I have: 6,992. Let's see what happens if I add the condition that the two people are not the same. If married_to doesn't have persons married to themselves, then we should get the same number. Let's just check: and a1.person not equal a2.person. Who thinks it's the same number? I don't know, I think it's the same number. It is, so we don't need it. If the data were large enough, in Wikidata you have 18 billion triples, you would find any kind of nonsense you can think of. OK, and let's look at fewer rows so we can search them, let's see whether we find Nicole Kidman here. Yeah, we have Tom Cruise, Nicole Kidman, Stanley Kubrick, OK, Eyes Wide Shut, there it is. Let's see whether we also find them in the other direction?
No, only in this direction. So our married_to is not symmetric. I will not write it on the slides, but if you want to be more complete here, you would also write this and this or the other way round. No, no, you don't need it, right? Because it's not necessary to get the same result again in the other order. If the marriage were also there in the other direction, you would get all rows again with the first two columns swapped, which you may or may not want. So it's actually fine like this. As I said, we can actually look at married_to, let's look at it. It's also sorted somehow. Barack Obama is up here, let's see, here it's in both directions, right? That's also typical: for some it's symmetrical, you have both directions there, for some not. It's of course not nice, but apparently for Nicole Kidman we only had it in one direction. Yeah, OK, so this marriage is there only in this direction. Maybe there is a deeper meaning behind this direction which eludes me. OK, so that is that, and it worked. And this is what you should do for the exercise sheet, and it's actually very interesting, because you often read in papers that you can translate between the two languages, but how do you actually do it? Is this a super complicated algorithm? It is tricky, you have to think about it, but then it's surprisingly simple when you have the result. So let's see how that works, let's go back to the example. If we ignore this triple, and you look at this and at the SPARQL query, and you leave our thought process aside and just look at it syntactically, then you will see a few things. Here you have acted in twice, and that's no coincidence. You have two triples with acted in, and you have two copies of acted_in, and if you think about it, there's a reason for that.
It's just like this: if you had three triples with acted in here, you would have three copies of acted_in there. And now look, here it says acted in that film, and acted in the same film. It uses the same variable, and that is exactly the reason for this condition here, a1.film equals a2.film. So it's kind of dual: in SPARQL you use the same variable and say it implicitly, and in SQL you say, OK, this object and this object are the same. The fact that it is the same variable is expressed by this equality. And you have that twice more in the query. Here you have person one and person one: the subject of the first copy of acted in is the same variable as the subject of married to. That's this condition here, right? And similarly for person two and person two. So the same variable being used twice in SPARQL corresponds to an AND condition in SQL. And this you can generalize, that's what's written here. What would you do if you had another acted in with film? Now you have three copies of acted_in and you want to say the object is the same in all of them. How do you express that three things are all equal? You say x equal to y and y equal to z. And if you have ten things, you need nine conditions. This is what's written on the slide: if a variable occurs m times, you need m minus one equalities. And it also makes sense for the border case: if a variable occurs just once, you don't need any condition. You can check this when you do the exercise. If you have a very simple SPARQL query with just one triple and each variable mentioned once, you don't need any AND or WHERE condition, it just works.
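The correspondence described above can be sketched for the one-table-per-predicate setting. The SPARQL query in the comment is a paraphrase of the running example with assumed variable names, the sample rows are invented, and chain_equalities is a hypothetical helper that only illustrates the m minus one rule; it is not a full translator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE acted_in(person TEXT, film TEXT)")
conn.execute("CREATE TABLE married_to(person_1 TEXT, person_2 TEXT)")
conn.executemany("INSERT INTO acted_in VALUES (?, ?)", [
    ("Brad_Pitt", "Burn_After_Reading"),
    ("Tom_Cruise", "Eyes_Wide_Shut"),
    ("Nicole_Kidman", "Eyes_Wide_Shut"),
])
conn.execute("INSERT INTO married_to VALUES ('Nicole_Kidman', 'Tom_Cruise')")

# SPARQL (basic form, variable names assumed):
#   SELECT ?p1 ?p2 ?f WHERE {
#     ?p1 acted_in ?f .  ?p2 acted_in ?f .  ?p1 married_to ?p2 }
sql = """
    SELECT a1.person, a2.person, a1.film
    FROM married_to AS m, acted_in AS a1, acted_in AS a2  -- one copy per triple
    WHERE a1.film = a2.film        -- ?f occurs twice
      AND a1.person = m.person_1   -- ?p1 occurs twice
      AND a2.person = m.person_2   -- ?p2 occurs twice
"""
result = conn.execute(sql).fetchall()

def chain_equalities(columns):
    """A variable occurring in m columns yields m - 1 chained equalities."""
    return [f"{a} = {b}" for a, b in zip(columns, columns[1:])]

conds = chain_equalities(["a1.film", "a2.film", "a3.film"])
print(result, conds)
```

With three occurrences the helper yields two equalities, and with a single occurrence it yields none, which matches the border case mentioned above.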
So when you first look at it, it looks complicated. Here's the advice. I won't go into detail, it's there for you to look at when you do the sheet, but if you understood the example, you can generalize it. And it's surprising how easy it is once you've got it. There is also advice here on how to implement it. It's not a lot of code, it's a really nice, tricky coding exercise. And now we won't do this together, but just so you see it, that's how you do it in Python, and that's really easy. You do import sqlite3, the module is part of Python's standard library, so there is normally nothing to install with pip. Then you say: use this database from this file. And then, that's just database stuff, you always have a cursor. You don't execute directly on the database, you have a cursor into the database, and on that cursor you execute a SQL command. I think it's correct that you don't have the semicolon here, it's redundant. And here one can actually understand why there is a cursor. Look, you execute a query, and maybe it has 10 million rows as a result. The result is always a table. Maybe you want all ten million rows, then you can iterate over them, but what you usually want is: how many are there, give me the first ten, or something like that. So the cursor kind of says: OK, I'm now at this point in the result, what should I do? And here is the code for: actually, give me everything, and then you can just do a for loop. But you could also iterate, which is very meaningful: maybe you want to do something for every line and then do something else, so you want to go through it one by one and not load the whole result at once. That's one reason for the cursor. OK, you will see this, and if you have any problems, there's the forum. OK, now the final part, four slides about performance. But before we go into that, and it connects to a question you asked earlier: any question about this part?
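The cursor idea can be sketched as follows. This is not the code from the slide, just a minimal illustration with 100 generated rows: take the first ten rows of a result, then walk through the rest one row at a time instead of loading everything at once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # with a file name instead, the data persists
cur = conn.cursor()
cur.execute("CREATE TABLE acted_in(person TEXT, film TEXT)")
cur.executemany(
    "INSERT INTO acted_in VALUES (?, ?)",
    [(f"person_{i}", f"film_{i % 3}") for i in range(100)],  # generated rows
)

# No trailing semicolon needed when executing a single statement.
cur.execute("SELECT person, film FROM acted_in")

# The cursor remembers where we are in the result.
first_ten = cur.fetchmany(10)

# Continue from row 11, one row at a time.
remaining = 0
for row in cur:
    remaining += 1

print(len(first_ten), remaining)
conn.close()
```

Calling fetchall() instead would return the whole result as one list, which is convenient for small results and wasteful for very large ones.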
Maybe let's also look at the exercise sheet, so that you can... It basically says this again, with an example. And it's the simplest form of SPARQL, where it's just select some variables, some triples. SPARQL is way more complicated in reality, but this is just basic SPARQL. And then exercise two is: we give you a query in natural language. All European YouTubers and their birthplace and where they studied, if they studied. Interesting query. You have to look at the data, you have to see, OK, which predicates do I use, how do I formulate it. Then you formulate a SPARQL query, feed it into your program, and you get the result. And exercise 3 is: just think of any query yourself, one with many results. I have a question about where they studied: what happens if they didn't study? Then they are not in the list. We don't want YouTubers who didn't study. It's a very good question, because SPARQL, like SQL, also has a construct which says: if this is not there, also show me the row, with a null value or something like this. This is what inner and outer join are about in SQL. We will talk about joins in a second. And in SPARQL there is the keyword OPTIONAL. It's actually simple, but it would just complicate the exercise sheet. In this SPARQL query here, for example, I could put OPTIONAL around this triple, and then if it's not there I would still get a row in the result, but without this value. I could put OPTIONAL around everything. But you don't need that for the exercise sheet. OK, any other question about this part, SPARQL, SQL and the translation, before we talk about efficiency? And I'm very sorry that it looks like we are going to finish on time, but maybe something will crop up. Actually I'm optimistic. OK, let's go to performance. So there was a very good question asked earlier, and before I go to the question let me just repeat.
If I do something like this, where I have multiple tables here, all or some of which could be the same table, it doesn't matter, it's the cross product. Before I proceed: if n_i is the number of rows in table i, how many rows does the cross product have? Yeah? Number of rows to the power of k, yeah, that's good, but the tables don't have to be the same size. Each number of rows multiplied, yeah. It's called cross product for a reason, because if you look at the size it's just a normal product, exactly. So it's just the product of the sizes. Which means it's pretty big, right? Think about it: for the exercise sheet you get a table with 30 million rows, and for most queries you will need multiple copies, so you make a cross product of a table with 38 million rows with itself. And that's what you said: then it's the same table each time, so it's 38 million to the power of k, and that's just very big. So any engine that does something meaningful will not manifest this product, it will do something more efficient, and this is what I will show you now, and I will explain it by an example. Let me just draw an example, let's see whether I get this right. So let's have two tables, and let these be abstract tables now. Let me call the first one table X. Here I have one column, and these are some other columns, one or more. Oh no, this is wrong. Writing down here is hard, I'm sorry. Also, columns is hard to write.
And I don't really care what's in these other columns, but I will write something. Let's just say here I have A, then B, B and C, and let's say these things have an order, for a reason. And here I just have some x1, x2 and so on. That's data over several columns and I don't really care what it is. I hope that's clear; if not, please ask. This could be several columns. Also, this column which I wrote here is one column. It doesn't have to be the last column, I just made it the last column for this example. And now let's say I have another table Y here, and let me make this special column the first column here, and let me have A here, then C, C and D, and here I have some y1, y2, y3, y4. So again, I should write it again, that's one column here, and these are some other columns. Maybe I should give this column a name. Let me call it c, and here I also call it c, I can use the same name. And what I want to compute is SELECT * FROM X, Y WHERE X.c = Y.c. So I'm computing the cross product of the two tables and then I'm keeping only those rows where these two columns c are the same. They don't have to be called the same, I just call them the same. Before I do this, how many rows does the result have? What do you think? I want all combinations of a row here and a row here such that column c is the same. How many rows are there like this?
2, 3, 5, 2.5, Euler's number, 4, 6, wow, who has more? Let's do 4, interesting, OK. So the result looks like this, and let's say SELECT *. Let me, in slight abuse of notation, write other here. This could of course be more columns. I write X.c, X.other and Y.other, and I should note it here: that's a slight abuse of notation. You can't actually write this; if you have several columns you would have to write out X dot the name of each column and so on. Abuse of notation. And now let me actually do it algorithmically. So let's see what I get. Here I will get c, then X.other, and then Y.other, if I want them in that order. If I look at the first row from each, this with this, is that a match? Yeah, I think it's a match. So let's just write this one. This would be an A here, and x1 here, which could be several columns, and y1 here. That's how a join works. And now let me not do it the conceptual way, let me not actually look at all combinations and see for which ones this condition holds, but let me do it how it would actually be done. And what's written here says: if these two are sorted, then we can just use list intersection from lecture 3. And that's actually true. Here we have a sorted list, let's assume it's sorted, and actually if you create an index with a database, that's pretty much what it does: it will create something like a copy of the table sorted by that column. Of course it will not really create a copy; it will have a sorted column here, with references to the other data, because that might be very large and you don't want to copy it. But conceptually you have something like this: you built an index, so now I have that table ordered by that column, and that table also ordered by that column. And now I can do the zipper, and it's very nice that the zipper from the inverted index comes into play again. So now I'm here and here.
I start at the beginning, they are sorted, I check whether they are the same. They are the same, I output the result. I advance in both lists. So now I am here and here. They are not the same, so I advance in the list where I have the smaller one, that's how the zipper worked. So I am advancing here. B, C is not a match, I am still advancing. And now I have C, C, that's a match, right? And now you have something which we didn't do for the zipper, but which is easy to extend. So now I output x4, y2, it's the combination of these two rows. And here's another C, so that's also a match, right? So I also have C, again with x4, and y3. And that's it. Now I have exhausted this list. Actually, if I had three Cs here and two Cs here, I would get six rows. That's what you get in database world: I would get three times two. I don't have it in this example. So actually you need a mini nested for loop here if you have several here and several here, and these are the pointers from the zipper. That was our simple linear intersection algorithm. And if you remember the complexity: you always advance in one list, so it's linear in the number of rows, right? You look at each row only once, with the exception that if you have several here and several here, then you have to do something more. But this usually doesn't occur too often, because if it does, then you would also have a huge result. And you can prove it's linear in this plus this plus the result size. So that's how you would actually do it. There are also other ways, and this is called a join. This is called a join on column c. So I'm joining on this column. You can also join on multiple columns, it works the same, you just take the concatenation of the columns as the thing on which you join.
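The zipper-style join just described can be sketched in Python as follows. This is a minimal illustration, not what a database engine literally runs: the function name merge_join is made up, and each row is simplified to a pair (key, other), with the join column first in both tables, whereas on the blackboard it was the last column of X.

```python
def merge_join(x_rows, y_rows):
    """Join two lists of (key, other) pairs, both sorted by key.

    Returns (key, x_other, y_other) for every pair of rows with equal
    keys. Runs of equal keys on both sides produce all combinations.
    """
    result = []
    i = j = 0
    while i < len(x_rows) and j < len(y_rows):
        x_key, y_key = x_rows[i][0], y_rows[j][0]
        if x_key < y_key:
            i += 1  # advance in the list with the smaller key
        elif x_key > y_key:
            j += 1
        else:
            # Find the end of the run of equal keys in both lists.
            i_end = i
            while i_end < len(x_rows) and x_rows[i_end][0] == x_key:
                i_end += 1
            j_end = j
            while j_end < len(y_rows) and y_rows[j_end][0] == x_key:
                j_end += 1
            # The "mini nested for loop": all combinations of the runs.
            for _, x_other in x_rows[i:i_end]:
                for _, y_other in y_rows[j:j_end]:
                    result.append((x_key, x_other, y_other))
            i, j = i_end, j_end
    return result


# The two tables from the blackboard example.
X = [("A", "x1"), ("B", "x2"), ("B", "x3"), ("C", "x4")]
Y = [("A", "y1"), ("C", "y2"), ("C", "y3"), ("D", "y4")]
joined = merge_join(X, Y)

# The conceptual way, for comparison: filter the full cross product.
naive = [(xk, xo, yo) for xk, xo in X for yk, yo in Y if xk == yk]
```

Both ways give the same three rows here, but the cross-product filter looks at all 4 times 4 combinations, while the merge join advances through each list once and only does extra work for runs of equal keys.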
Are there any questions about this? I mean, that's almost the last thing I will explain. And for the exercise sheet, the question now is: how do I tell SQLite to do this? Do I have to tell it to use the zipper or something? No, it's more indirect in database world, it's already hidden here: you tell it to create an index on this column. That was on an earlier slide. And if you create an index and then have a condition which does something like this, it will do a join and it will be faster. And you can try it out for yourself when you play around with the data. Create indexes, don't create indexes, do the same operation. You will see a big runtime difference. Here's one more thing that's just interesting to know. There are multiple of these joins: here we just have one condition, and each condition is one join. Take, for example, the query which we had, acted in and married to, all pairs of actors who played in the same movie and are married. You can do it in two directions. Think about it: one way is that we first look at all pairs of actors, that was our SQL query number two, which I first messed up a little bit. Now I get all pairs of actors from the same film, and that's pretty large. And now I go through each pair and check, are they married, are they married. Most of them will not be married. I could also go in the other direction. I could first look at all married people. There are probably not so many couples, much fewer than these pairs, and then for each married couple I check which films they played in. Most couples don't play in any films. And then I check whether these lists intersect. So I could go about this in two ways, and that's like first the one join, then the other join. And that's something for you to play around with, if you wonder about a query, could I make it faster? And the question is how you influence the join order, but that's just for your curiosity.
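Creating an index and rerunning the same query can be tried out along these lines. This is only a sketch under assumptions: the table triples with columns subject, predicate, object, the index name object_index and the placeholder rows are made up, and on such a tiny table no runtime difference will be visible. The output of EXPLAIN QUERY PLAN is printed rather than interpreted, because its exact wording depends on the SQLite version.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# Hypothetical toy version of the triple table, with placeholder values.
cursor.execute(
    "CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
cursor.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("p1", "acted_in", "f1"),
        ("p2", "acted_in", "f1"),
        ("p3", "acted_in", "f2"),
        ("p1", "married_to", "p2"),
    ],
)

# Self-join: all pairs of people who acted in the same film.
query = """
    SELECT t0.subject, t1.subject
    FROM triples AS t0, triples AS t1
    WHERE t0.predicate = 'acted_in' AND t1.predicate = 'acted_in'
      AND t0.object = t1.object AND t0.subject < t1.subject
    ORDER BY t0.subject, t1.subject
"""

without_index = cursor.execute(query).fetchall()

# Tell the engine to keep the table accessible in sorted order by object.
cursor.execute("CREATE INDEX object_index ON triples(object)")
with_index = cursor.execute(query).fetchall()

# Ask SQLite how it intends to evaluate the query.
for step in cursor.execute("EXPLAIN QUERY PLAN " + query):
    print(step)

index_names = [name for (name,) in cursor.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index'")]
connection.close()
```

The result of the query is the same with and without the index; only the way the engine computes it changes. On the real data from the sheet, timing the same query before and after the CREATE INDEX statement is the experiment suggested above.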
It seems to me that SQLite just takes the join order as you write the query. That's of course the simplest way. I'm specifying this is the query we came up with earlier. With slightly different names here, act one, that's what we call A1 earlier. Three conditions, three joins. And I think SQLite, if you don't do anything special just executes them in that order. So if you change the order you can get very different running times. And of course databases are complex beasts, you can tell them please analyse this, try to figure out which join order, I mean how do you figure out the best join order? One way is to try out all of them and then take the one which is the fastest but then you have already tried out the slow one. So ways to do this is to do sampling, just do it a little bit and see oh this looks like it's going to take a long time, this looks like it's going to be faster and stuff like that. So there's a lot of database theory about this. So you're welcome to play around with this when you do the sheet, because you will execute queries and maybe you want to make them faster, but you don't have to. And I think that's it, there are some references at the end. So I'm sorry that we finished on time today, it won't happen again. But are there any questions? Yes, there's some. we can join this table sufficiently. But if we work with the shivel table, we still have information. So if we are joining, we have each information. If it was three times the same table, we have the same line three times. And this is still not necessary. I think that's what is still not necessary. So yeah, okay, we have the table three times, the big table. It's actually not that easy from what I explained, but what the engine will do, I mean one condition it will evaluate first. And this condition will just join two tables. So it will be in that situation. It will first evaluate the first condition which will be a join of two copies of the table. 
So you have this picture here, where you have one column and now you have the table sorted by that column and this table is just the same table also sorted by it. So it doesn't have to make the copy, that's a self-join then. I mean this could be the same table right? And if you, but it could be another column of the same table. I mean we had such conditions, maybe you want object of my one table equal subject of my one table. Then you want like two copies where one time it's sorted by the object, one time it's sorted by the subject. And that's what the engine will do internally. Then it will just go through the subject list in sorted order and the object list and find where it matches, find the equal ones. And from that it will produce a new table. And now comes the next condition and now it will join the new table with again a copy of the table. But it will never manifest the cross products, it will always do something like this internally. Ok, but it still would make sense before I ask for objects in the objects to check that in the first table I have to go to the first one, otherwise I am going to match the wrong Oh yeah, absolutely, and this could be another example, this is actually an easier operation, let's say I would have select just one copy where x.c is equal to b. I have my whole table and I just want the rows where the predicate is acted in. And this would be maybe the, yeah. And now, but again if it's sorted then I can quickly find the position where the acted in starts, they are all together and then I can just read them off. So that's actually, I didn't show that here. Just imagine this row here, this having many rows and I just want the Bs. And I can quickly find where the B is, I could do that with a binary search actually, I don't even have to go through it and then the Bs will be all together in, because I have a version of that table which is sorted by that column. 
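The lookup for a condition like predicate equal to something specific can be sketched with a binary search over a sorted column. This is an illustration only: the names build_index and lookup and the placeholder rows are made up, and a real engine stores its indexes differently. The index here is the sorted column plus references (row positions) into the original table, so the rows themselves are not copied.

```python
from bisect import bisect_left, bisect_right


def build_index(rows, column):
    """Return (sorted keys, row positions in that order) for one column."""
    order = sorted(range(len(rows)), key=lambda i: rows[i][column])
    keys = [rows[i][column] for i in order]
    return keys, order


def lookup(rows, index, value):
    """Return all rows whose indexed column equals value."""
    keys, order = index
    # Binary search for the start and the end of the run of equal keys.
    start = bisect_left(keys, value)
    end = bisect_right(keys, value)
    # All matching rows sit together in the sorted order.
    return [rows[i] for i in order[start:end]]


# Hypothetical toy triples with placeholder values.
rows = [
    ("p1", "acted_in", "f1"),
    ("p1", "married_to", "p2"),
    ("p2", "acted_in", "f1"),
]
predicate_index = build_index(rows, column=1)
acted_in_rows = lookup(rows, predicate_index, "acted_in")
```

Building the index costs a sort once; after that, each lookup needs two binary searches plus the time to read off the matching rows.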
So the conditions of the form predicate is equal to something specific are even easier to evaluate. I just take the table sorted by that column and find the position, and the matching rows will all be together. But actually, I could give a whole course only about this. This is database optimization and also SPARQL engine optimization, it's very interesting. Here this is just giving you a glimpse, so that it's clear that it's not actually this cross product being manifested, but that you can do it more efficiently. So I just gave you a glimpse, I didn't explain it in full detail. Any other question? So next time, in the last lecture, we will talk about the evaluation, about time management, about the exam, we will do some exam questions together, and I will tell you how it is to do projects or theses or other stuff at our chair. And I hope to see many of you there. So see you next time. Bye.
So, welcome everybody to lecture 13, Information Retrieval in the winter semester 22/23. Maybe we can close the door. I think it's a little bit loud. It's the last lecture today. But anyway, we will talk about your experiences with the very last exercise sheet. Exercise sheet 12, right? It's 12, not 11. I think so. Just occurred to me. We had 12 exercise sheets. Then the official evaluation of this course. This will be the first third of the lecture. Then info about the exam. This will be the second third of the lecture. And then how it is to work at our chair: projects, master's thesis, bachelor's thesis, maybe you want to do a PhD with us at some point. So no new contents today, no new exercise sheet and hence also no Q&A on Friday. So very quickly about this. This heading is also not right, this should be experiences with ES12. Copy-and-paste error, I'm sorry. So many of you skipped this sheet, this is not unusual for the last sheet or sheets, but most of those who did it found it very interesting, doable and not too much work.
It was a completely new topic after a lot of beautiful linear algebra. We now had beautiful database stuff and knowledge graphs. Very interesting. Always wanted to learn more about DB stuff. DB stuff can be boring, depends on who teaches it, but it can be a super exciting topic, databases. Learned a lot, never had to do much with databases so far, now you have. It was nice and not too stressful assignment. I had fun. The lack of linear algebra felt very good. Okay. It was trial and error for very long that I could not solve. So a few people had problems with it. Sorry about that. And a lot of you wrote something like this. Sadly, I couldn't do it because of reason. So different reasons, but I think it's just a lack of time accumulates towards the end of the semester. And thank you, tutor name deleted here for everything, really appreciated your feedback so the tutors did a great job in giving you feedback, personalized feedback to your sheet and many of you were grateful about that. Okay, so first third course evaluation here, the summary of the results. So 80 people registered for the exam, the deadline is over for a few weeks now, that's a little bit less than last year, but that's just within the, it's just a typical average number. 71 of the 80 participated in the evaluation, that's a great ratio, most of you computer science, a few math, embedded systems engineering and others. 53 nominations for the teaching award, thank you very much for this generosity despite the overtime, so I have a complex topic of time will be discussed on two slides. But I'm happy that you liked it anyway. So in the following summary of your feedback and as usual all the details, let me just check that this is true. On the, yeah, let me just go to the Wiki page which is here. So you can, if you go to the, yeah, there's a link here at the top in red. If you click on it, you can read like how we did it, like just the, yeah, when it was announced and so on. 
And then here you have all the free text comments. So as you can see, a lot of free text comments, very interesting read, like a novel. So feel free, so what you liked, what you didn't like, suggestions, how you evaluated the tutors and so on. So we will not read this together now, but feel free to read it. We always put it publicly on the web and it's anonymous of course. It also says there, if for some reason you see something and you say, ah, I don't want it to appear there, just drop us a line anonymously or not and we will delete it. Usually it does not happen. So first some of the numbers and I like to compare it either with the course from last year which was similar but not identical or with a department average. So learned a lot pretty much like in the last semester. All of you said that you learned a lot. 80 even said gave the maximal score. So that's great. Not much different to the last semester. I'm happy about that. Explained well, similar score. Level of contents, that's a very interesting question. Also please note that the sheet was simplified a lot. I don't know how many of you have noticed that compared to years ago the sheet has less question and is simplified. Okay, you are in the study commission where we decided this. The others didn't notice this, so a year ago the sheet like contained 100 questions, some of which were a bit strange. Now it's only 10 very good questions. Level of content. So how hard was it too hard? Was it so 50% said right on, 45% that's a high number said it's high, so rather high. Exercises, so here it was, that's interesting I think and I think it's somehow related. I think most of you telling from the free text comment like the exercises but maybe found them too much work, that's why we have this. But good, still most of you found it good or very good. And also note the comparison to last year is not fair because the question we changed. The question in the previous years was weird. 
Now it's a clear question, now it just asks how did you like the exercises, if there were exercises. Last year it was a weird question. Apparently you get better scores for weird questions. Quality overall close to 80-20. That's also a great score. Thank you very much for that feedback. How much work was it? And I'm sorry that's not the amount of work. That's how hard it was. So was it right on level or was it too hard? Now it's the amount of work relative to the ECTS. So here we have this distribution. So again about 50% said fine, right on 38% and I think this is okay to find it a little too much. This is not okay. So if this is a high number that's something to worry about and it's 20% this year and it used to be 10%. So I'm a bit surprised by that because if anything, we made it easier, not harder. So we reduced the work for several exercise sheets and compared to the year before, we took away a whole topic even and still the scores were not as high there. And it's also compared, I mean it's along the lines of the department average but certainly not better so maybe yeah this is something to discuss, think about or change something about. It's an interesting statistic. I mean let me just mention that we are super aware of this and we try to keep the workload reasonable. A lot of measures we have taken over the years, we try to provide all the boring stuff which also costs time like parsing something or... And also if you ever do real research or even if you do your master's thesis or bachelor's thesis, data sets you get are not so nice. We gave you the nicest, most beautiful, well-curated data sets ever and real data is much more messy and it's part of the work, actually a big part of the work to deal with messy data. I want to mention that I personally like the boring part and dealing with messy data and it's an important part of research. Research as you know is 99% perspiration or transpiration, 1% inspiration. So it's also important and I like it. 
It's very meditative to clean data or stuff like this. Materials and service, similar scores, 75%. 6%, I would be curious. I mean the materials I think we put a lot of work in that. But maybe not so much look at the scores for us, but if you compare it to the department average, it's way above average. So that's, I, that's, yeah. So some of my colleagues apparently are a bit, yeah, don't do a great job there. And it's similar for the service. So also very good scores and way above average for the department average, which says maybe more about the department average than about our lecture. But we do put in a lot of effort. That's how you follow the lecture. That's also what I expected. A fifth or so were here. One half, bit less, mainly via recordings than many of you mix, either sometimes live, sometimes recording or both. And a very few, I think this is also much higher if you look at department average for other lectures. If the lecture is not so worthwhile, people just look at the materials. But for our course, that's very few. Thank you very much to Frank who did all the technical support. Thank you, Frank. He's sitting here. He did a great job, the videos, everybody praised, who said something about it, praised the video quality. Alexander, who's maybe listening to it when he edits this, also thank you to him. He did the editing, so we have someone who we pay for the editing and I think it's an important part of the experience. So thank you very much, both of you. Tutors and assistants. Tutors were a mix of postdocs and PhD students. One of the tutors is also here, Johannes. Hello Johannes. He's with us today. So here are the names. Johannes, you see him on video. They did an excellent job. Your feedback was very positive. Let me mention some of you wrote, I don't know my tutor, he just gave me feedback or something like this. He just gave me corrections. 
It's actually a lot of work to give these corrections and I mean giving individual feedback to between 10 and 20 people. So the tutors and we had weekly meetings and so on. So a lot of work and they put a lot of effort in them. So big thank you also to the tutors. And please, if you liked it and you find it worthwhile, consider to give back and become a tutor yourself in the future for this course in the future or other courses we teach. So we would be happy to have you as tutor. There's a question in the chat. Does department mean the faculty? No. Department means computer science. So we have three teaching units, computer science, microsystems engineering and sustainable systems engineering, microsystems and sustainable systems, two different things. The averages here are just computer science because very different cultures, not so easy to compare scores across departments. Assistant was Natalie, you know, Natalie, she's with us every... Yeah, her job is different than the tutors and not so visible, but people in the second row were not like the front person like me here, but like doing all the job behind the scenes. It's like the most important job and she did a lot of work and a great job. Let me just list a few things. I will probably, it's not complete preparation of the exercise sheet and master solutions. It's something you have to do every week. Proofreading of the slides, attending to the forum. So if the answer speed, many of you praised it. That was Natalie, the weekly tutor meetings. She organized the video timestamps, which are super useful, checking for plagiarism, which is always a topic. So it's like many, many things which you have to keep in your mind and take care of. And if you do it perfectly, then nobody notices. So it's kind of an ungrateful job, but Natalie did a great job. So thank you very much. And yeah, thank you. So let me summarize here some praise and here's some PowerPoint animation. 
I think you can read it yourself, so a lot of things were mentioned. Positively fast replies, feedback at the beginning. So basically we added to this over the years and I'm happy. It's always nice when you notice the effort we put in and it matches. So there were a lot of these. I won't go into details. Read the feedback, the actual concrete feedback. It's really interesting. Criticism, not true to surprising the lectures were too long. Some of you said that not everybody, but several people, too much material, some sheets, not all of them are too much work. Math proofs should be at a lower pace that not a lot of people said that, but a few apparently. I mean I understand those with less background or more problems in math. And I want to distinguish between, there was not too much criticism and it was fair criticism and also suggestions, not like this was bad, but maybe you could do it like this. Here's an incomplete list. So there was this problem with students studying to become teachers. They don't have mathematics in their bachelor, which is a big problem because it's just expected that you had no at least basic mathematics. So some of you had conversations with some of you, and whoever, if you listen, I don't know if you listen, I know that you wrote something in your experiences for the last, second to last sheet, Logistic Regression, I will answer, I just didn't have time yet. I read all the feedbacks. Don't say that mathematical proofs or calculations, oh they should be proofs I think, are super simple, yeah I'm sorry I got that feedback in the past. Let me maybe briefly justify when I say that what I mean if you find this, I don't mean that if you find this hard you are stupid, That's of course not what I want to say. What I wanted to say when I say it's super simple. I mean there is math which is like more involved math and there is math where you do just calculations, right? Like doing a singular value decomposition or something. 
And there are like deep proofs where you sit for hours and wonder how it's done or you have no idea how it's done. So I think it's useful to distinguish between the simpler stuff and the more complex stuff. And I especially think it's important because many lecturers and also many scientific papers make the math appear super complicated when it's actually not right. You can give a super complicated proof where you could give a simple proof and so on. And we try to make everything as simple as possible in this lecture. Alternatives to NumPy, SciPy, we discussed this. Looks like it's still the standard so we'll probably keep it. There are always one or two who hate the 80 characters and can write a book about why it should be more. Come to me personally and discuss but there is a reason why the standard checkers still have it. There are really good reasons and just because you have a very wide screen doesn't mean, I mean, a lot of tools just assume that your lines are not very wide. Let me just remind you again of code reviews in GitHub side by side code that even if you have a wide screen, you have long lines, they break and you can't see anything anymore. It's a typical screen layout. You have some panel here. You have one window, another window and you want your code lines not to break in one of these vertical windows. And some more suggestions, also there are always some who would like more advanced stuff, I completely understand, but that's of course hard when most of you already find it a lot. And here's my slide about time management. I want to explain one thing just to tell you how difficult it is. It's not that I don't know how to manage my time or that I don't think about it. I come to the lecture and I just talk until it's over. Actually I've thought about this problem a lot over the years, over the 30 years of teaching I'm doing and I'm also experimented with it a lot. 
So it's not that I'm not thinking about it or I've kind of explored everything from this extreme to the other side. And I just want to explain there are two sides to this problem. So one side is if one looks at what you praised, and this is stuff which we also collected over the years from your feedback. Feedback in the beginning, that's nice. Many of you said that's nice. Just repeat a little bit here about what's in the, what was, what were your problems or not problems in the exercise sheet, giving intuition and examples before doing the math. All these little crash courses, I mean, I could just leave them out, right? But probability theory could just say linear algebra, matrix vector multiplication, you all know this, and then I've already lost 50% of you or more, right? So I always have these little crash courses in there which take sometimes 15 minutes, half an hour. But also don't make it too easy. That's also really important. There are some people who just want that. They know the basic stuff well and they want to know the deeper stuff. So I have given lectures in the past where it was much more on the basic side and then the people who wanted to learn more were disappointed, which I understand shouldn't happen. The live coding and live math takes time. It would be easier for me to just throw the code at you, right? I have it. I just, here it is. Look at this code. If you look at 20 lines of code, I mean, you don't understand anything. Here's the program, right? And then I explained something. Yeah, you get lost. It's important to develop these things, same as math. Don't rush. That's a common lecturing mistake. So you spend time and then, oh, now we have to get 20 slides in 10 minutes. And then you just do karaoke or you just talk three times as much. I have a funny story to tell. There was a conference where they had the ingenious idea to cut the talk time from 20 minutes before to 10 minutes. 
And then what happened was that people just gave their talks at twice the speed. I mean, you didn't understand anything in the 20-minute format either, because the talk quality was not so great, but now you for sure didn't understand anything. So if you just talk very fast and rush, you might as well leave it out. Nobody can follow. It doesn't make sense, but it happens very quickly when you want to keep to the time. Breaks: we had breaks, and we took the time for breaks even when we were late. And now, what's the other side? The other side is that the lecture should finish on time. Ideally, and that's the norm, in one and a half hours. And there are suggestions like, okay, if it's so much, why don't you split parts off to a second date? We could just split it into two lectures, two dates. But then the experience is that hardly anybody comes to the second date. We even see it for the Q&A session, right? It's there every Friday. We thought about this. We did something like this in the past: all the not absolutely essential stuff you move to a separate session, but then there are only ten people in that session. And so most people effectively miss it. So it's kind of an impossible problem, right? You want all that. And now here comes the crux of the problem, and it's also a problem in society in general, that's why I call it that below here. It's the human society problem. You also have it in politics, and I think it takes some time when you grow up as a human being to become aware of this. You as an individual know very well at some point, certainly when you are in your 20s, what's good for you, and you say, I don't need that feedback in the beginning. I don't need these crash courses. But yeah, live coding is great.
And please don't leave out the deeper stuff. So that's person A. Person B, oh, the feedback in the beginning is great. Please also repeat something from the last lecture. And the intuition part should be a little bit longer. And person C says, I need three breaks or this or that. You see? So there are all these individuals and it's perfectly understandable and everybody needs a different mix of these things. But the one person standing in front here, or we, who organized this lecture, we have to choose one mix for all, right? And the result, and that's like politics or democracy, then it's like equally bad for everybody. So you have to somehow go in the middle and then maybe there are some people, a few usually for whom it's just right, but then you have people on this side, it's a little bit too much, too little of this or of that. And yeah, you might argue, or many of you propose solutions, but I would argue, and I think it's true, I really think it's true, if you're long enough, you live long enough, it's kind of inevitable that you arrive at this truth. It's just impossible. You can't make, it's just a human society problem, right? You have all these individuals, what they want, but yeah... Also as a government, as a teacher, as whatever, you have to fix on one thing. You can't make it super individual, all the time. Although advanced cultures try to do that, right? They try to make it fair for everybody, but it's very hard sometimes, impossible. Always have this slide, so what will be improved for next years, just because we look at that when we do it next year. Here are some things I've written up. So yeah, I mean, this time management and too much problem, I'm considering splitting the whole course. Look, there's a lot of material which I've taught in the past, which I've left out, and there's also exciting new material one could add. 
And if one takes everything which I've ever included in this course and the new stuff which one could add, one could easily make two courses out of this. But then the question arises, how do I teach these two courses? And here your feedback would be valuable. So one option would be, I just change the subset of topics every year. I always call it information retrieval. This year I do a little more learning. Next year more knowledge graphs. I could give four or five lectures on knowledge graphs, very exciting lectures. Or one basic course and then one more advanced course. But then of course you have an interleaved them. And you have the problem that you have the basic course only every two years, the advanced course only every two years. Also not perfect. Or that's option three. It's probably the most work for us. I can't give two at once every at the same time, it's too much work. Have both every year and for one use the recording from last year. So this year I give advanced and then you can hear those who want to take the basic course, take it via recording. I guess it's hard to have a, does anybody want to say something now? I mean, you could if you wanted to or write something in the chat. Can we attend those two as well? Those two, so one of the two would be like this one or variant of this one. But it's an interesting question. I mean one thing I'm considering if next year I give a variant of this one, but then it will also contain lectures from this one, what happens to people who took this one? Maybe it's just less work for you then. That would be one option. So it's some stuff you don't have to do it again, you get it for free. each by its own because it has some more styles, massive review and it is an idea that one form, one and the other one, recording something like, something like a sub-name, like information to people, and for example, focused on learning techniques. Yeah, thank you for this suggestion. 
I think so what you suggest is have two courses which are just different and independently takeable by themselves. And I think it's pretty much what I meant by option one because if you do that, there is some stuff which you want to explain in both courses. I think web stuff, okay, some of you don't like web stuff, I would probably want to include something like the inverted index or evaluation measures or something. Statistical tests which I have left out, they belong into every course. So I think if I do that, even then there would be some overlap because totally disjoint I think doesn't, for that there's just some basics which you should have. But then even if, I mean, I think it's even close to the option I'm considering, because then if people take both of them, yes, there will be some repetition, but it will be just free lunch for them. Then they have less work if they both do both courses. They don't have to do that exercise sheet or something like that. I think that would be okay. If you do both, you have less than two lectures, a little bit less. So yeah, please feel free to write something in the forum about this or tell me by any other means. So I think that's an important topic because so much great material and limited time we have to find a solution or try to make it better. Yeah, there's these problems with the missing mathematics. We are painfully aware of this. So it's only a few people, but it doesn't matter. It's important. We would study to become teachers and in their bachelor they just have no mathematics lecture. Maybe one thing I could say here, just to show you how crazy politics is sometimes, which is sometimes due to people, but also sometimes due to the impossibility theory which I pointed out earlier. Why is there no mathematics in the teacher's bachelor? It's not really the teacher's bachelor. It's a bachelor where you study two subjects and teachers have to study two subjects. And so you only have half the ECTS per topic. 
So you have to leave something out. You can't have the full curriculum. And then you have, I don't know, I think it's the Kultusministerkonferenz, if they are listening, so something from the state, and they say this must be in the curriculum. And these are people, I think the average age is probably 90, I don't know. And I think they come from a certain walk of life, maybe not so technical, more humanities. These are terrible prejudices, but I think they're partly true. And so in the past, like 50 years ago, every computer science curriculum had to have theory, automata theory, right? So, I don't know if some of the aspiring teachers are here, you have to listen to automata theory, P versus NP and stuff. I mean, that's interesting, but if you ask me, in 2022 this is not essential, right? It's more essential that you know linear algebra, certainly. And the same thing for, what's the other one? Oh yeah, logic. That's also like a classical thing. Logic, yeah, it's good, propositional calculus and so on. It's also not essential, I would say. Having a basic mathematics lecture is much more important than having automata theory and logic. Yes please, you're one of the aspiring teachers? It's like, it's the morning of these topics, but you can't take math 2. Take the publishing across the system. Of math 2? You can take math 2, but you can't take math 1? You can. It's not so bad. Yeah, so we were actually aware of this when we designed this, but it was just an impossible situation, and what came out is not a good solution. So yeah, I'm sorry. We are aware of this, and we really have to improve this. Like this, because the fewer of you did this, we were in a terrible situation. I'm sorry. And here are some minor things. I just wrote them up so we don't forget them. Nothing to discuss here.
The Q&A session, Natalie complained and she was right that few people came and those who came were usually very quiet, one or two people who asked questions. And so the question is, should you have a session where a few people come and even fewer ask questions? But the same happens with lectures sometimes. I had lectures where I had an audience and they were quiet the whole semester and I was wondering if anybody is listening to me. But then in the final feedback everybody wrote their like it. I'm never sure what to make of that. You let me also give some feedback to you. I think you were a great audience. You were, I mean, I have a comparison over decades. You are a very pleasant audience, so interactive and present. And so thank you, it was a pleasant experience this semester. And yeah, and there are some suggestions from two slides ago which are easy to implement and the stuff which is easy to implement and doesn't make things worse, potentially better we just do it. Okay, that was that. If there is anything you want to say, you can say it or maybe later or you can just write it in the forum, any further suggestions or anything important which you think I missed here. And I think, yeah, we are all, nobody is super tired as far as I can see. So let's go on with this part and do a break. Oh, there was a question in the chat. Is the audience better that COVID is now over? Oh, thank you. That's a great question. Yes. That's a great question. I actually wrote it to my colleagues, not to you. So it was very tangible in the last two years that not everybody, but a significant fraction of students, maybe one fifth were really stressed. I think it was hard for everybody, but one fifth it was super hard. And you could see that, you could sense that in the atmosphere and also in the feedback. So especially the, all the bachelor lectures in the last semester, but also one year ago, got worse grades. 
And not because, I mean, maybe some lecturers also had problems, but not all of them. There were some lectures, same quality, high quality as usual, significantly worse grade. And you could just sense it from the feedback. Some people were pissed, stressed. If you are stressed, you also get angry and so on. And this has gotten better. And I attribute it to COVID, not so much to the virus, but everything it created and all the environment and atmosphere. So my feeling is, yes, it has improved significantly, atmospherically, and I'm happy about that. So the exam. And yeah, let's do some exam questions, some info, some exam questions, then we have a break. And then the last part. So the date is February 28th, Tuesday, easy to remember, it's the date of the lecture, but earlier. 11-2-13-XXYXX, we haven't decided on the exact duration, maybe two hours, maybe two and a half hours. This depends, we haven't done the questions yet, we look at the question and say, yeah, we don't want to induce time stress or anything, so we will just choose it so that you should be able to do the task without time stress. It's rooms, this room and the one below. I think it's a little bit narrow, but that's what the Prüfumstam, the examination office, assigned to us, because we have 80 people, which means 40 people. Yeah, 40 people writing an exam. You're sitting quite close, right? Maybe some of you like that. I don't like it, but that's what we got, so we have to work with it. But we will have surveillance, of course. There will be people from us sitting in every corner at the top, and so you can't. You should just, yeah, and there will be always one in the back and in the front for you. And anyway, most of you don't cheat, but I know that there are always a few percent who do. Don't do it. The exact assignment will be announced in time in the forum where you will sit here or below, but it will also be no problem if you just come 15 minutes early and just figure it out. 
Bring your student ID, passport, and pens in two colors, because sometimes there are tasks where it's useful to have two colors, and it's just easier for you too. Don't use red, that's our color. What else to bring? Okay, we did open book in the past, and I have a message here for you about that too. It has not been open book for some time now. But you can bring one page, both sides: you can write on both sides, and you can use a four-point font if you really like. You can't use microfiche. We discussed it yesterday. Is it also called microfiche in English? I'm not sure. It's this kind of film which libraries have, which you can then magnify, but you need a special device, and the device usually weighs 100 kilograms. As far as we know there is no portable microfiche reader. If you go to old libraries, they often have this stuff on photographic film in miniature format, and on such a page you can fit a whole library. So you're not allowed to use that, and anyway the device is too big. Why is it no longer open book? It's interesting that in the past many people said, yeah, open book, that's great. Open book means you can bring everything: lecture materials, all the books in the world. And I understand that people think, okay, that's great, I can bring everything, so I don't have to learn all these stupid details, and I understood this stuff anyway. But the problem is, if an exam is open book, we can't ask you simple things, right? That's obvious. We can't ask you something where you can just go to slide 12 of lecture 5, read it there, and more or less copy it. We can't ask simple questions, which means all the questions are a little bit harder. And the harder questions are what many struggle with. So an open book exam is actually harder. And really, that's the only reason. I think we do it for you, not for us. I wouldn't mind.
It's easier to pass a not open book exam for the ones who have problems. But you can bring that page and I think that's just, yeah, maybe you have some particular problem with it. You always forget that formula or the sigmoid function for some, I don't know, past trauma or something you can't remember this particular calculation. For that you have the page. Anyway, when you write the stuff on the page you will remember it without the page and it's also just placebo I think, a good placebo thing. It can be handwritten or a printout but it must be created by yourself. So if we find two identical copies of something that's not okay. So you must create it and it's also good for preparing. Electronic devices, sorry, not allowed, including micro fish readers. FICHE, it's F-I-C-H-E. There will be a sub forum for questions, answer speeds, response times may be slower in the semester break. And yeah, there's some who ask all their questions on the morning or night before the exam. It's too late. Don't do it. We might not be able to answer in any way. It's too late. If there are any questions, ask at any time. Just shout or talk or write it. Yeah. You can write on both sides. Oh, yeah, yeah. It won't take longer than 30, 30. Oh yeah, sorry if you have two exams on the same day, that's tough. But you have a big brain, so you have to... This is important, and I will give you, give you three types of questions. One is basic understanding and we will see an example of each and do it together and let's see whether I will pass. So I'm not sure I hope. The first one is basic understanding, which means either you state a basic concept of the lecture, a definition of algorithm, or you apply it for an example. You will see it in a second. Second one is coding. You did a lot of coding for the exercise. There will be coding tasks in the exam. Of course, it will not be 100 lines of codes. It will be individual functions, small functions, 10 lines, worst case 15 lines. 
And it will be written on the front page of the exam that your function should not have more lines. That is also there to help you: if you write more lines, you know that you're doing something wrong. So it's good to know that you can do it in a few lines. And type 3 is deeper understanding, not very deep. On numerous occasions in the lecture I said, just a second, look here, this could be an exam question: showing that the edit distance is symmetric, or something. You can prove it in one, two, three lines. It's not necessarily something we did in the lecture, maybe a variant, but it requires some thinking. But importantly, it's not something where you have to think for two hours, like for some exercise sheets. So it's stuff which you can do in an exam. Yes, please. Very good question. So the question is, I mean, you don't have a Python or NumPy manual in your head, and you know, ah, there's a function for computing the length of a string, but I don't know if it's len, length or strlen. It doesn't matter. Just write: I think it's called strlen; if it's called len, use that. That's fine, we won't subtract points for that. But if you write code which looks like you have never written Python code in your life, and you say this is pseudo-code or something, then we will subtract points. So that's if it's way off the Python syntax, but not for not exactly knowing the name of a function or an operator. If you are unsure, just write it, say that you're not sure, and say what you mean. That's important. Say: this is supposed to compute the length of the string, I know it exists, I just don't know what it's called. Then we will not subtract any points. Yeah. So, very good question, indentation. So we will have a ruler, and we will have a tolerance of one millimeter, and for each millimeter you're off... no, we don't.
I can just tell you from experience: we've been doing this for a long time and it was never a problem. So yeah, I understand the question. You think, wow, I've only coded on my screen, can you code on paper? It actually works very well. We also do it for the oral exams, it's the same thing, and it's no problem. So you can also do indentation there. It just works well, especially for short programs. And we will do examples of this in a second. Some general pieces of advice: don't learn things schematically, try to understand them. More advice about this on the next slide, that's really important. And this is important, and that's how I basically started every lecture. Every lecture started with a simple story, a simple example, and it was easy to understand. That's also important, but you don't pass the exam if you can only do this. You also need type 2 and type 3. It's not necessarily one third, one third, one third. It depends on the exam, but each type will be there. We will pay attention that there are also simpler ones to warm up and take away the fear, but it's not only type 1, and you don't pass if you leave the others out completely. Yes? Ah, good question. JavaScript, that's included in type 2. Yes, simple HTML and JavaScript you should know. But it's the same thing there. I mean, JavaScript is a huge language, HTML has I don't know how many tags. Just the simple stuff: html, head, body, h1 for the header, like what you did for the lecture. And again, if you are not sure, just write it down. You should know how HTML works, that it's an XML-like thing with these angle brackets, right? If you write something which looks like you have never seen HTML in your life, then we will subtract points. But if you are not sure whether the name of a tag is heading or h1, no subtraction. We just look at whether you understand it in principle. Same for JavaScript. But yeah, it belongs to the coding category.
And very important: the contents of the exercises are also relevant, and we will see that it pays off that you have done the exercises. Because many of you said, well, I've spent so much time on the exercises, I'm sure it will not help me in the exam. It will help you in the exam if you did the exercises. And this is also something which we couldn't do in an open book exam, because then you would just print out all the exercises: we would ask a question, and you would just copy it. So we will take care that you have an advantage if you did the exercises, because you should do the exercises, or have done them. Before we do three examples of tasks: check whether you really understood something. Super important. I said it several times and I say it again. I know that people do this. I think it's always best to show something by example, so let me just mimic it. It's a common way to learn, especially if you have problems with time or something like this. Look, you want to learn this. It's not important now what it says. And you say, okay, there will be a question about fuzzy search and the algorithm for ED. Okay, q, q, x and y. Okay, I understand. You look at the slide and you try to understand what's written. Okay, why is it greater or equal? Okay, I see. Minus q delta. Okay, I see, and you understand it. That's okay for the lecture, but not okay for learning, because you are conditioning yourself on this: if you see it in front of you, you can explain it. This is so wrong. Don't do this. What you have to do is study the slide or the slides and then put them away. And then: okay, let me see if I can reproduce the formula. And it's even good if you don't do it in the exact words or formalism of the slides. There are many roads leading to Rome.
You do it a little bit differently and afterwards you can check. But this is so important for learning that this simple three line proof, don't look at it and verify it. Try to understand it. Put it away and try to do it yourself. And not by learning it by heart very quickly in five minutes and then forgetting it again. Understand and many examples for this. Computing the distance from a point to a line. This was an exam question some time ago. Yeah, don't learn it by heart, but yeah, we did it in a certain way. We did it two lectures. Here's the point, here's the line. Just take the normal vector and the multiple of it leads to that point and try to figure it out. So try to learn the principle method and then develop the formalism yourself. That's how you learn. That's super important advice, I think. Yeah, that's what I just said. Doesn't work like this. And let me also show you, because it belongs to that. Yeah, actually I say it's, yeah, all our old exams, I know that the Fachshaft has a database and that's great, but I also, and they are secret channels where people sell old exams or for, I don't know, I think to make it fair for everybody we just put all our exams on our wiki. So all of them, which means a lot from past years, exams from previous semesters. So here you can do exams all day long. And it's a great exercise. And very important warning, we even put the solutions there and I have to extend my warning and it says here and yeah, this is the version of the exam. So first don't give this to other people. Don't use it when you do old exams for training. It's the worst way to learn. It's the same thing. Here if you task and then you have the solution, right? I mean if you learn like this just you read the question and see, ah, yeah, yeah, yeah, yeah. If you learn like this, you look at the question and the solution and you think you are learning, you are not learning. This is just for checking after you did it yourself. So that's why it's written here. 
Yeah. So just for checking. And it's not even important, because if you understood it, you know yourself that it's right. Okay, I think it's a good time to have a break now and then continue with fresh air and a fresh mind, with three example questions and then the introduction of our chair. So five minutes. So one example for each of the three types, and you just see how I think, how I would do the questions. It would be more interesting if it were random and live. So this is a type 1 question. Let's see. The text from the exam is a little bit longer, so some details are missing here. And maybe one remark in advance. If you are unsure about something, there are always people to ask, right? Just ask. And a typical error is to read too fast. Don't save time on that one. Read the question and see what is actually asked, and don't just start doing something and then realize only later, or even only afterwards, that it was something else. So the first one, and this is type 1, so it should be simpler. I have four documents. I have some words. Let me maybe write the words here. It's four different words. Let me write them in alphabetical order. Let's see if I can do that. So it's blah, blee, blow and blue. Okay, and I have to write down tf-idf scores. Let's first write the IDF. How many documents do I have? I have four documents, so N is 4. Blah occurs in two documents. It doesn't matter for the IDF how often it occurs in the first document. So let me write it here: IDF of blah is log2 of N divided by 2, that's log2 of 2, and that's 1. And blee occurs in one document, so IDF of blee is log2 of N over 1, that's log2 of 4, and that's 2. Blee is a rare word, so it should get a higher IDF. So that's a typical type 1 question, IDF.
I mean, you should really know what IDF is, it's such a basic thing, and then you just do it for an example. Nothing deeper, you just need to know how to compute it. Blow is again in two documents. Also important, and while I'm doing it, I'm telling you how I think and I'm also giving you advice: you don't have to compute it again, right? It occurs in two documents, so it's the same as blah. No need to do the computation again. You can just copy it. And you have a lot of these shortcuts: you can make yourself a lot of work, but you can also make it simpler. So take a minute to think, or maybe half a minute. Whenever you're doing a lot of work, and this was also advice for the exercise sheets, you're probably doing something wrong. You never have to write a lot for the exam. And blue is again in two documents, so it's 1. So these are my IDF scores. Okay. Now I can write down my term-document matrix, D1, D2, D3, D4. The entries are now just tf times idf. The terms are the rows, the documents are the columns. D1 just has blah three times, so tf is 3, and blah has IDF 1, so I would say that it's 3, 0, 0, 0. I hope that's correct. D2 has blee, blue and blow. It doesn't have blah, so it has 0, and then it has 1, 1, 1, but the blee actually gets multiplied by 2. So it's 0, 2, 1, 1. Of course, you could do an intermediate step here. I have to transpose it in my head. D3 is blah and blue. The IDF is 1, so it's just the tf score, so it's a 1 for blah and a 1 for blue, and 0, 0 here. And D4 is blue, blue, so it's a tf of 2 and an IDF of 1, so it's a 2 here and 0, 0, 0. So it's not hard, but you have to concentrate. You shouldn't make stupid mistakes, right? Which are easy to make, of course, under exam conditions. Because you have to think: okay, how many? And multiply this with that. So I hope it's correct. You can grade. I do the exam, you grade. Okay, let me just assume it's correct.
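As a cross-check of the type 1 computation above, here is a minimal Python sketch that builds the tf-idf term-document matrix with idf = log2(N / df). The function name tf_idf_matrix and the exact document contents are assumptions: the spoken description does not reliably distinguish blow from blue, so D3 and D4 are chosen here such that blah, blow and blue each occur in two documents and blee in one, as stated in the lecture.

```python
import math

import numpy as np

# Terms in alphabetical order (rows of the matrix).
terms = ["blah", "blee", "blow", "blue"]

# ASSUMED document contents, consistent with the stated document frequencies
# (blah: 2 documents, blee: 1, blow: 2, blue: 2).
docs = [
    ["blah", "blah", "blah"],  # D1
    ["blee", "blow", "blue"],  # D2
    ["blah", "blow"],          # D3 (assumption)
    ["blue", "blue"],          # D4
]


def tf_idf_matrix(terms, docs):
    """Return the term-document matrix with entries tf * log2(N / df)."""
    n = len(docs)
    matrix = np.zeros((len(terms), n))
    for i, term in enumerate(terms):
        # Document frequency: number of documents containing the term.
        df = sum(1 for doc in docs if term in doc)
        idf = math.log2(n / df) if df > 0 else 0.0
        for j, doc in enumerate(docs):
            matrix[i, j] = doc.count(term) * idf
    return matrix


A = tf_idf_matrix(terms, docs)
print(A)
```

The D1 column comes out as 3, 0, 0, 0 and the D2 column as 0, 2, 1, 1, which matches what was written on the board. The rows for blah and blee do not depend on the assumption about D3 and D4.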
Compute the dot product similarities for the query blah blee. There was more information in the exam, so let me just do it. Let me call this matrix A, that's what we called it in the lecture, and let me also write the query blah blee as a vector q. It has a 1 for blah in the same dimension, a 1 for blee, and 0, 0. So I can just compute the scores by taking q transposed times A, and this is nothing else than taking the dot products. We did this several times in the vector space model lecture and in the latent semantic indexing lecture. I'm just taking one dot product after the other, q with this column, q with that column, and by transposing it just boils down to a matrix multiplication. Maybe you can help me. So what are the scores? Let's do it together. First one? Three? I think that's correct. Second one? Two? Okay, I agree. Third one? One, yes. And the next one? Zero. Okay. And ten points. So that's type 1. Type 1: if you understood the basic stuff, you should be able to do it. You have to concentrate and read it correctly. You don't agree? Yeah. You have to concentrate. It's easy to make stupid mistakes, so better double-check. But apart from that, it's a relatively simple question. It's a type 1 question. There will be some of those. But you cannot pass the exam with only those. Now tension rises. Also for me. Task: okay, write a function top, q, U, S, oh my, this looks like latent semantic indexing. And of course we are so nice, I mean, we could of course use different letters here, X, P, L and M, but we use the letters from the lecture. I hear you talking; if you have a question, you can ask. So these are, it says here, the matrices from the singular value decomposition of the term-document matrix A. So let me just write that here, so I know that A is U times S times V.
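Going back to the type 1 question for a moment: the scoring step, q transposed times A, can be sketched in NumPy as follows. The rows for blah and blee follow the lecture; the rows for blow and blue are an assumption, but they do not influence this query because it has zeros in those dimensions.

```python
import numpy as np

# Rows: blah, blee, blow, blue. Columns: D1, D2, D3, D4.
A = np.array([
    [3.0, 0.0, 1.0, 0.0],  # blah
    [0.0, 2.0, 0.0, 0.0],  # blee
    [0.0, 1.0, 1.0, 0.0],  # blow (assumed placement)
    [0.0, 1.0, 0.0, 2.0],  # blue (assumed placement)
])

# Query "blah blee": a 1 in the blah and blee dimensions, 0 elsewhere.
q = np.array([1.0, 1.0, 0.0, 0.0])

# For a 1-D q, q @ A computes the dot product of q with every column of A,
# which is exactly q transposed times A.
scores = q @ A
print(scores)
```

With these values the scores are 3, 2, 1 and 0 for D1 to D4, the same numbers that were called out in the room.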
So this is my term-document matrix here, yeah? Maybe you also feel comfortable doing this: before you even think about it, you just write down the stuff which you know. So this is then m times r, where r is the rank of the matrix. I haven't even fully read the task, I just write it down. And I'm given a rank k of the approximation, which is less than or equal to r, typically strictly less, but it can also be the same. Okay, and now what I'm supposed to do: write a function that gives me the top-ranked document, so just the top-ranked one, for a given query. And that's also a nice one, because it's writing a function, but to write the function you also have to know how to do it, right? We had several ways of first doing LSI and then using it for document ranking. The simplest way, let me try that one, was this, where I take Uk times Sk times Vk, and that's Ak, the best rank-k approximation of A. And Uk, let me be a little more verbose, was the first k rows or columns, and that's something you have to know now. Is it the first k columns or the first k rows of U? Columns, I agree, it's the first k columns. And Sk is the upper k times k portion, that's a diagonal matrix. And Vk is the first k rows. That's easy, for V it's just the other way around. Okay, and the scores I get, well, that's easy, it even matches the first exercise. With q transposed times A you get the scores for the original matrix, and with q transposed times Ak I get the scores for this one, and then I just want the largest one. Okay, now let me write the Python code, def, and I think I will write it to the side so that I have a little more space. Yes, please. The SVN? Ah, there it is. Okay, we found the right letter. Thank you. Oh my, thank you. I hope everything I said so far made sense. Anyway, it was the SVD. So let's write the function. I write it here.
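Before the function itself, the slicing just described can be checked numerically. This is only a sketch under assumptions: the example matrix is the (partly assumed) tf-idf matrix from the type 1 question, and the decomposition comes from numpy.linalg.svd, which returns the singular values as a vector and the third factor already in the r times n orientation, so that A = U S V holds as written in the lecture.

```python
import numpy as np

# Example term-document matrix (rows for blow and blue are an assumption).
A = np.array([
    [3.0, 0.0, 1.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 2.0],
])

# Compact SVD. s is a vector of singular values, so build the diagonal S.
U, s, V = np.linalg.svd(A, full_matrices=False)
S = np.diag(s)

# Sanity check: the three factors reproduce A.
assert np.allclose(A, U @ S @ V)

k = 2
Uk = U[:, :k]   # first k columns of U
Sk = S[:k, :k]  # upper k x k portion of S
Vk = V[:k, :]   # first k rows of V

# Rank-k approximation of A.
Ak = Uk @ Sk @ Vk
print(Ak.shape, np.linalg.matrix_rank(Ak))
```

The shape of Ak is the same as that of A, while its rank is k. Only the slicing pattern matters here; the concrete matrix is interchangeable.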
And let's see, so it's def, let me write Python. So I am now a language model, ChatGPT, just writing something: top, q, U, and so on. Actually, the thinking process is a lot like ChatGPT sometimes. Colon. Okay, so first thing, I don't even know what I'm doing in the end, I will just compute these. And now it's important, as I said before, this has to be exactly 13.5 millimeters: U, k. For example, if you write a subscript index here like this, we will not subtract points, although you can't write subscripts in Python. So how do I take it? We already established that it's the first k columns. So I guess it's U, and this is the row index, that's a colon, and then colon k. I think that's one way to do it. So the first index is the row, then the column, so it's the first k columns. Now, if you put a semicolon, we also don't subtract points, if it somehow resembles Python. Sk is equal to S, and that should now be colon k, comma, colon k. You could also write the 0, but you can leave it out, I think. And Vk is equal to V, and then it's just the opposite of above: colon k, comma, colon. You tell me, you can grade live. So now I compute the scores. And I even write it like this: scores equal to q. Now, if you just write q T, that's not fair to write it like this. I mean, it has to look a little bit Pythonic. So I'm also not sure. I think it's transpose. Transposed? I don't know. Yeah, let me do it like in the exam. I don't know it now, not sure. Maybe no d in the end, yeah? We don't subtract for any of this. If you would write transpose of q, we would also not subtract points. But you should know that there is a function which transposes a vector, and you shouldn't leave it out unless you explain, okay, I'm assuming that my q has a certain direction. It's written in the exam what the direction is.
Okay, and now I multiply it with Ak, and Ak is just the dot of Uk, and then I take the dot product with Sk, and then I take the dot product with Vk. And yeah, the 80 character limit, of course very important. You see it here, here I would even need less. Now I have the scores and now I should return the index of the largest one. Now I'm also not 100% sure. Return, this is now a vector, and I think if you want the index of something, it's usually called argmax. Now what do I want? Is it scores dot argmax, or something dot argmax of scores? So I want the index of the maximum of scores and I know that there is a function argmax, maybe it's called numpy argmax of scores. Okay, let me just leave it open, it's real conditions here, and I didn't want to learn it by heart, to set an example. So maybe you can look it up, but it's not important what the exact syntax is. It's not that I write Python NumPy code every day these days. I'm not sure what the exact syntax is, but I know that there is an argmax function I want to apply to the vector, and it returns the index of the largest element, and this is what was asked for. So despite these insecurities, this would get 10 points. I hope, unless I made another mistake. Well, what do you think? Do you see a mistake here? Okay, that was a type 2 question. And also note that's rather typical. There is some coding involved. You have to know how you write stuff. But it's not only coding. You also have to understand what you're doing. You first have to figure out, okay, that's what I want to do, like always with coding, and then you do it. Yes? The argmax? Yeah, I don't know. I'm not sure. I think it's wrong, but I don't know how it's right. I don't think it's like this. I think it's something dot argmax of scores. So what is it? It's numpy.argmax of scores.
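For reference, the function developed on the board could look as follows in NumPy. This is a sketch under assumptions, not the official sample solution: q is assumed to be a 1-D array with one entry per term, S is assumed to be given as a diagonal matrix, and V is assumed to be given in the orientation where the first k rows are the ones to keep. The example matrix is made up, chosen so that the scores on the full matrix are 3, 2, 1, 0.

```python
import numpy as np


def top(q, U, S, V, k):
    """Return the index of the top-ranked document for query q under rank-k LSI."""
    U_k = U[:, :k]   # first k columns
    S_k = S[:k, :k]  # upper-left k x k block
    V_k = V[:k, :]   # first k rows
    A_k = U_k.dot(S_k).dot(V_k)
    # q is 1-D here, so q.dot(A_k) already computes "q transposed times A_k".
    scores = q.dot(A_k)
    # Both np.argmax(scores) and scores.argmax() return the index of the maximum.
    return int(np.argmax(scores))


# Hypothetical term-document matrix: rows are terms, columns are documents.
A = np.array([[2.0, 1.0, 0.0, 0.0],
              [1.0, 1.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])
q = np.array([1.0, 1.0, 0.0, 0.0])  # query containing the first two terms

U, s, V = np.linalg.svd(A, full_matrices=False)
S = np.diag(s)

print(top(q, U, S, V, k=4))  # full rank, so the same ranking as q times A; prints 0
```

With k equal to the rank of A, the approximation is exact and the result agrees with the plain dot product scores. For smaller k the top-ranked document can differ, which is the point of LSI.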
Oh, Natalie says both work. Okay, both work. I also had the feeling that somehow my photographic memory said it's something dot argmax of scores, and then the something must be a generic thing, NumPy. But the point is, it doesn't matter. These small syntax things don't matter. It's clear that I know that there is argmax, and it's not an exercise in remembering a manual. Okay. Last one. Yeah? Now, if you would have written transpose of q or transposed of q, it would have been fine too. But just ignoring it or writing q T as a superscript, that would be kind of not okay. You should at least say, okay, I know there is a function which does it, but I'm not sure whether it's a method of the class or takes an argument. These little details are not important. These are the things which, if you are sitting in front of a computer, you just look up in the manual. Yes? Yes, that's important here. That's also an important question. The question was, would the direction of q be given? Either it's given, then of course you have to consider it. If it's not given, you never have a disadvantage from information that's not given. If your solution is compatible with a plausible interpretation of the task, then you get full points, and you can discuss this with us. So we try to give full information. Sometimes we don't foresee everything. And if we say, yeah, that's a plausible interpretation, you will get the points. But when in doubt, just write it. Just spend this little extra time and say, I'm assuming q is in row format, and then if everything is correct under the condition you wrote down, and of course it has to be reasonable, then it's okay. So, last one. Yeah, okay. That's a good question. If the comments are not there, would we subtract points? It depends. The important thing is, we want to read your stuff and we want to see, yeah, you understood it. You just didn't remember some strange detail.
So sometimes if you write something, that just helps: oh yeah, that person obviously knows what argmax does, just isn't sure about the syntax. So you kind of have to convince us that you know this stuff. It's like, there's an exercise, 10 points, and it asks for a result and you just write 42. Yeah, it's the correct result, but you haven't convinced us. You should always give an explanation. It has to be clear how you arrived there. It also says this with all questions that ask for a result: don't only write the result, also how you got there. So in case of doubt, rather write these few more words than not. But I can't give a definite answer to the question of what happens if you omit them. It depends on whether we are convinced that you understood it.
So here's the hardest one. Oh my. Let's see. Given an array A with integer values in ascending order, we want to locate element x in A with galloping search. Okay. So I have my A and I have my element x and it's contained here. x is contained here and it's at some position, yeah, it's at position d. And let's say the array starts at position 0. Could also be 1. This is also something, if you make an assumption, you just write it down: I'm assuming the indices start at 0. So let me just do that and draw a proper arrow here. And you see, it's always good to start with a picture to just get clear what we are talking about. Always, yeah, don't think too fast, also not too slow, but okay, this is our setting here. We have an array, we have an element. And also notice, this is related to something we did in the lecture, but it's not something you can copy from the lecture directly. It's a small variation. I mean, in the lecture we searched a lot of elements, and here it's just one element which we search from the beginning.
So it's kind of the first step you would do in a galloping search when you have a smaller list, searching for the first element. And it's given to us: we know it's there and it's at position d, and now I have to show that it takes time O of log d. And now of course it's good if you know how this worked in principle. In principle it was like this: this jump here was, I think, at position j1, then we had a second one at position j2, and then at some point it jumped to j3. That's the exponential search. And let me call this first one j0. So let j0, ..., jk be the positions where the exponential search looks. And in this exponential search, ji is 2 to the i, and then I think it's minus 1 or something like this, so it's kind of exponential. Let me just assume it's like this, ji = 2^i - 1. So it's an exponential step. I'm starting at position zero, so I'm subtracting a one. So now what do I know? I know that just before the end, j_(k-1) is less than d. If it were d, I wouldn't have jumped further. So that's 2^(k-1) - 1 < d. I can just bring the minus 1 to the other side, that's d plus 1, and take the log: k - 1 < log2(d + 1). And actually I'm having k plus 1 steps here, from 0 to k, so k + 1 is less than log2(d + 1) + 2. So that's the number of steps. I'm a bit more verbose than maybe I would be next time. Number of steps of the exponential search: it's from 0 to k, so it's k plus 1, and this is O of log d. And then the binary search is... yeah? Oh yeah, but the equal sign is common notation. But you're right, why not?
It's standard in computer science, but it's more correct to write it like this. I agree. So, now we do a binary search in the part up to j3, maybe only from j2 to j3, so we need to bound jk. I mean, how do the jumps look like? They go, okay, it's 1, 2, 4, 8, 16, and with minus 1 it's 0, 1, 3, 7, 15. So jk is, I think, just twice the previous one, j_(k-1), plus one. And j_(k-1) was less than d, I stopped before d, so j_(k-1) is at most d - 1. So jk is at most 2d - 1, and this is less than 2d, right? I mean, that's the whole point of the exponential search. It goes beyond d, but at most by d. It's at most at 2d. And of course, if you would see this for the first time in the exam, this would be a little hard, it would definitely be too hard, but we have done this in the lecture. So the basic ideas of the proof, you should know them from the lecture, and here you are just applying them again in your own proof. So that's not an easy question. So, binary search in an array of size less than or equal to 2d. So now I'm searching maybe only from here, or even from the beginning, and that takes time O of log d. Yeah? Say it again? It's not exactly d. No, no, it's j_(k-1) or something like this, but we don't know that value. So it's related to d, but not exactly d. And what you need to know here: log2 of 2d is log2 of d, plus 1. So we also need some of the typical logarithm rules. Now there are always these questions which you may wonder about: what do I have to write down, what not, when is a proof a proof? I think, I mean, maybe you could write a little less and you would still get full points. For example, if you just have 2d and you say it's log d, we would accept it. I don't think we would subtract any points, given that there's enough other stuff to do.
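The argument above can be checked on a small example. The following is a minimal sketch of the single-element case, assuming positions start at 0, the exponential phase probes positions 2^i - 1 (0, 1, 3, 7, 15, ...) and x is contained in the array. The function name and the returned step counters are made up for illustration.

```python
def galloping_search(a, x):
    """Locate x in the ascending list a.

    Returns (position, exponential probes, binary search steps).
    """
    n = len(a)
    if n == 0:
        raise ValueError("x is not in a")
    # Exponential phase: probe j_i = 2^i - 1 until a[j_i] >= x.
    prev, i, probes = -1, 0, 0
    while True:
        j = min(2 ** i - 1, n - 1)  # do not jump past the end of the array
        probes += 1
        if a[j] >= x:
            break
        if j == n - 1:
            raise ValueError("x is not in a")
        prev, i = j, i + 1
    # Binary phase: x can only be in the range (prev, j].
    lo, hi, halvings = prev + 1, j, 0
    while lo < hi:
        mid = (lo + hi) // 2
        halvings += 1
        if a[mid] < x:
            lo = mid + 1
        else:
            hi = mid
    if a[lo] != x:
        raise ValueError("x is not in a")
    return lo, probes, halvings


a = list(range(0, 2000, 2))          # a[d] = 2 * d
print(galloping_search(a, 2 * 100))  # x sits at position d = 100; prints (100, 8, 6)
```

For d = 100, log2(d + 1) is about 6.66, so 8 probes is below the bound log2(d + 1) + 2 from the proof, and the 6 binary search steps are also at most log2(d + 1).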
But why not write it down that log2 of 2d is log2 of d, plus 1, just to show that you know that the 2d doesn't change the order of magnitude. But if you write too little, if you just write one line and you say, look, I knew it, that will not convince us. And in case of doubt, you can ask during the exam. Of course, you can't ask, is this proof correct, but if you have a concrete question like, is this enough, then we can say, yeah, it's okay. So yeah, these are the harder parts, there will be some of those too. I think these were good examples for the three types of questions. Any question about that before I go to the final part? It will take 20 minutes or so. And as I said, do the old exams. That's a great preparation, just do a lot of exams. One comment: the old exams, most of them or even all of them, were still in open book format, except I think the last one, I'm not sure. So the questions are a little bit harder there, but still they are good questions. Yes? No, we also changed that. So it will no longer be something out of something, but there will be something about every topic. But the questions will be simpler. This also sounds tempting, that you only have to do something out of something, but I think this is better for most. There will be a question about every topic. Sometimes we have two lectures and we declare them as one topic, then it's 10 topics. For example, the web search thing was two lectures, that's like one topic. Vector space model and inverted index is also one topic. So we have ten of these. And then for each of them you will get one question. Or two. So it's ten topics, 100 points in total. Not exactly 10 points for each, maybe there's a little bit less about this, a little more about that, but each topic will get a question. Yes? So yeah, let me just check myself. I announced this in the first lecture.
So this is the only thing where we deviate from what we said, but we discussed it yesterday, and I think it's completely uncritical. So it's 100 points and 50 is the passing grade. So it's just linear until 100. So we will not do this part like this, but we realized that this contains no information anyway, right? We have subtasks anyway, and how we group the subtasks makes no difference. So we will probably group them in five tasks of 20 points each, with subtasks, but it doesn't matter. It makes no difference for you whatsoever. The important information is that there will be a question on every topic, and we will try to make it not too hard. Any other questions? And please, there's the forum, you can ask questions all the time, there is still time, three weeks. Okay, I hope that was not too frightening, maybe the last task a little bit.
So six more slides. Maybe you want to do a project with us, or a thesis with us, bachelor or master. You are all welcome. How do we work? This will be mainly demos and a little bit of meta talk, but mainly demos. We solve practically relevant problems, non-trivial of course, so we will see a lot of topics, and I think I want this search here smaller. Yeah, so you will see these topics in a second. We make software, very important, and all this stuff should look familiar from the lecture. So it's not just that we think about something. We also think, but then we write software that is useful and used by people. We like that, or at least I do. It requires an effort to write good software. It's not just that when it runs, it's good. You want good software, good documentation, a nice user interface, the whole package. And of course theory, like in the lecture, this should sound familiar, but we use theory as a tool. We don't really do theory for the sake of theory, which is also fine.
Some people like to do that, but we usually use it as a means to an end. But important: if you don't do theory, you're just hacking around. We saw that in the lecture a lot. Sometimes you have to see, okay, now I have to do some math to get it right. The only difference to this course really is that for our projects we use Git. You didn't ask it this semester, but actually for the course SVN is easier. I mean, Git is just more powerful, but there are also more ways to break everything. It's just more complicated and doesn't have some features which we use for the course. And Docker, in case you know it. Supervision is similar as in the lecture. So if you work with us, you will have a very good infrastructure and support, and an inspiring environment, because everybody does this kind of stuff, as I explained. Apart from that, you have a lot of freedom in what and how you work. Sounds good? Freedom? Let me just mention it: it's a double-edged sword. For some it's great. They say, okay, give me something interesting and then I want to walk alone. Some are also afraid of it. So of course you will get guidance and help, but if you want a lot of guidance, I'm not sure if we are the right group. Some people really want to be taken by the hand from step to step, and we are more the freedom-loving people. So yeah, sorry. I think our group is a great fit for people who like to solve problems, and code, and get stuff done. You see something and you say, ah, that's exciting, let me solve it. And then in the end I have something, and I will show you examples in a second. And it's important, that's why I have this little star here: it's not important that problems are super hard. It's more important that what comes out in the end is useful and nice. So I come from theoretical computer science.
So every field of science has its own yardstick, its own currency or criteria. In theoretical computer science, you always have to prove that you are the smartest, right? It's important that the problem was hard. If you have a nice proof but it was easy, then the reviewers would say, yeah, that was easy. That's not how I think. I mean, it can be easy and useful and nice. Sometimes things are not super hard but still useful. Actually, a lot of stuff, if you look back historically, is not necessarily super hard. So hardness is not the most important thing. And yeah, it's a great fit if you are happy when the end result is useful. So we do stuff where in the end you say, wow, I like it and I can show it to my family, my grandparents. And a good yardstick is this: I will show you now a number of demos, and if you say, yeah, I like this kind of work, this is something I can imagine myself doing, then you will probably like working in our group. But I want to emphasize, it's not the only way to do science. Different groups, different people like different things.
Very briefly, machine learning. We are using that too. You also saw it in the lecture. We are not so much the people who do everything that's fashionable. That's also one way to do science, doing what's fashionable right now. I've been in the research business for 30 years. I've seen a lot of hypes coming and going. I never jumped onto them. I don't care, basically. But sometimes things are really good, right? Machine learning, for some problems, is just the way to go. And then, of course, we use it. And you have seen a few: latent semantic indexing, Naive Bayes, logistic regression, just the basic ones. Of course, we also do more advanced stuff. Deep learning, you will see some. Yeah, deep learning is pretty different. In this lecture, we did mostly algorithmic stuff, also a bit of learning, not really deep learning.
In deep learning, it's not so much about algorithms, but about understanding what you are doing. And just one word here: what happens a lot in deep learning is that people use some library, and then they plug together some networks, and then they run it on some data and say, here are my deep learning results. But you should understand what you are doing. Why are you using this component, this dropout, this many layers, and why are you using this nonlinear function? But there's little theory and a lot of heuristics. And that also fits this last statement: it's not that now that there's deep learning, you don't need good old algorithms anymore. You still need to sort numbers and have priority queues and q-gram indices. You still need those. And actually, that's also fun. I mean, for a while, and I think maybe people are realizing it now, everybody thought, okay, I have to do this learning thing. It's not always the greatest fun in the world to build neural networks without understanding them. It's a lot of magic, black magic. And then you tune it, you change this parameter, something else happens, something happens which you don't understand. Algorithms are more deterministic. You are more in control. So they are also a lot of fun, right? You write the data structure, you know how it's represented in the computer, you know much better what's happening. So both are interesting, but good old algorithmics is also a lot of fun and very rewarding.
Just a quick overview, we have the following groups. Frank Hutter, he does hyperparameter optimization. You always have hyperparameters, learning rate, batch size and so on. These magical values are super important: if you pick them wrong, bad results, if you pick them right, great results, and he learns this automatically. That's the heading for his work. Thomas Brox, computer vision, not graphics, vision. It's also all deep learning there.
Abhinav Valada, successor of Wolfram Burgard, that's robotics. Bernhard Nebel, he retired, he did foundations of AI. Joschka Boedecker is doing neurorobotics and fast cars, BMW, self-driving. And we are doing natural language mostly. So everything with text, which makes sense, right, if you saw the lecture. So nice coverage here, and of course we also collaborate.
So here's, that's almost the last slide, some projects and demos. I mean, I could give a whole lecture about each of them, so I will just very quickly flash some at you and see whether you like them. Okay. Here we have traffic, and let's just speed this up a little bit. That's just worldwide traffic, but public transit. So buses, subway, and so on. Which city is this? Any idea? Hmm? What, Freiburg? Paris. I think it's Paris. Yeah, let's zoom out, like a quiz. Oh, yeah, there it was. Yeah, it's Paris. I can just do fast forward here. So that's by one of your tutors, a PhD project. A lot of algorithmic problems here. For example, this line here. Oh, now I clicked it away. This line is actually computed and was not in the data. How do you know where the vehicle is actually going? The lines are not necessarily the straight line between points. And this is also an efficiency problem, because you have the data for the whole world here. I can go over here and zoom into New York and see the same thing. And there's also real-time data. You also have this kind of thing for planes, and this is like the counterpart for public transit, quite popular among people. There are some related problems here. Let me go to octi. Yeah, this is about drawing maps more schematically. So you have the Freiburg map and maybe you want it octilinear, which means, you see a lot of these plans. People used to draw these by hand. Here the problem is to draw them automatically. But maybe you also want them, let's see what else we have here, hexalinear, okay.
Ah, orthoradial also looks nice. And this is done automatically here. What else do we have? London, orthoradial. Okay, not bad. Yeah, and it's an interesting problem. There's a lot of mathematics in it, but also aesthetics, quite typical for us, right? It's not that you just do this because you have to compute something. I mean, it should be similar to this, but you have constraints now, right? It should be in circles and then you have these axes, and you have to define what it means that that's similar to this. You have to compute it and you have to minimize crossings and so on. Yeah, so I'm just giving you glimpses of the work. But you see, it's always stuff which you can show to your grandparents and they will understand. Very important.
Search on knowledge graphs. Okay. It was already open in another tab. Doesn't matter. Okay, Wikidata. We talked about Wikidata, I don't have to show it to you again. Wikidata, almost 18 billion triples, right? These triples, we used them in the last exercise sheet. So let's see, 18 billion, that's a lot, right? That's a terabyte of data if you unpack it. And now let's try a SPARQL query, and what we have here is a nice auto-completion, that's some work of ours. And let's just look at people's first names. So I don't know how first name is called in Wikidata. I think I mentioned it in the last lecture that these things always have these alphanumerical names, but they also have labels. So first name. A first name please, and you see, the auto-completion tells me I'm on the right track, a first name that's not too rare and not too common. No first name? Hannah. Okay, boring, but let's do Hannah with an H. Okay, the real spelling is on top. That's great. Good to know. Person. Okay, let's look where these people are born. Place of birth. Oh, it's even called place of birth. Okay, place of birth.
And I'm sure Wikidata knows the coordinates of the place, yeah, it does: coordinate location. Okay. You see, the auto-completion helps me. It's super fast, on 18 billion triples. This is one of our bigger projects in the group, and there's a lot of work in there. And now, I think I can add the names here automatically, and let me just do select person, place of birth, coordinate location. Let's see. Okay, so I have 137 Hannahs with an H, and let's look where they are born on a map. There we are. We have the Hannahs of the world. They are, okay, in the UK. Actually, I never launched that query, I think. So it's a popular name in the UK. Also in Germany, but more on the East Coast than on the West Coast, yeah. So let's go back. You saw a lot of things here. I could show a lot more about this: it's very fast, it's a lot of data, you have the auto-completion, the visualization of result sets, and so on. And this does not only work on Wikidata. For example here, you have all the protein data in the world, every protein known to mankind or womankind, the gene which encodes it. Just look at the amount of data, it's 108 billion triples. That's more than the number of web pages. That's huge. It runs on a single machine with our software. That's quite impressive. And also here, who knows OpenStreetMap? Very quickly, just OpenStreetMap. Yeah, let's go there. I mean, it's just map data. I think I have to go here, right? It's just like Google Maps or something like this, but open source, Wikipedia style, Wikidata style. It's amazing, right? People have basically done this, a lot of people walking around and saying, here's a house, here's a tree. And the detail is amazing, right? It even says this tree blooms at this time of the year, it has this leaf type and all this stuff.
And not just in Freiburg, it's all over the world. Of course the coverage varies. And you can also turn this into triples, of course. Let's just do one thing here. Give me any region of the world. It could be a Bundesland, a state, a city, something not too large, not too small, so that it's interesting. Could be a bigger city, could be a state. On the whole planet. Let's see if it works. It's live. What? Vatican City. Wow. Okay, let's see. All the streets in the Vatican. 503. Okay. It's a small one, but yeah, it looks correct, right? It's all the streets in the Vatican. Okay, let's take something a little bit bigger, but that was a good one. What, Manhattan? Okay, Manhattan, New York County. Okay, here we have two. Now I'm a bit, which one is it? I don't know. New York County, Manhattan. Maybe the second one. Okay, six... oh my. This is a lot. How do I display so many? Okay, looks good, right? So you see another visualization thing. Here I have a lot of streets, and this is shown. Let's take one bigger one. Maybe not a whole country, but maybe a state of Germany. Maybe one with holes. Geometries can have holes. Which state of Germany has holes in it? Brandenburg. Yeah, for example. Brandenburg has Berlin in the middle, right? Let's see. All the streets in Brandenburg, isn't this fun? Your grandparents will love it. So it's 582,000, 583,000 streets. How do you display them on a map? And it just goes like this, Brandenburg, right? You get a heat map, you zoom in, you get the streets. And of course, there's so much work behind this, data structures, algorithms, the search engine, right? But in the end, you can just show it in one minute. So we do a lot of stuff like this.
Question answering. Okay. This is Aqqu. This is something Natalie has worked on. So here I can ask questions. SPARQL is nice, but it would be even nicer if you could just ask questions. So, which question do I ask? Who is the, I don't know, husband of... Okay, I typed something, and what you see here again: it helps you while typing the question, who's the husband of. And now maybe I was looking for Angela Merkel, and I don't have to finish the question. I get it here, I get the question mark, and I get the two husbands, the one she's married to now and the one where she got the name from. So it's question answering with auto-completion. And behind that, let me see if this works. Does it work? I don't think it works. If we could see it here, what's actually happening is that it's using learning methods. How do you find the right answer? Oh, it actually indicates it here. That's like a SPARQL query. It's like Angela Merkel, spouse, and then question mark husband, right? So it's trying to find the right SPARQL query for this question, and it's doing that by generating a lot of SPARQL queries, which are possible interpretations of the question, and then it ranks them. So this is the top candidate here. And what I don't have to type fully is, what is spoken in, what's a country with many languages, Switzerland maybe. Yeah, so you see, what is spoken in, and it shows the SPARQL query here: Switzerland, official language, question mark. And here I have other candidates: languages spoken. Tourist attraction, that's not the right query for this one, so it's ranked third here, and so on. So yeah, question answering, a big topic in computer science.
Tokenization repair and spelling correction. Okay. That's a nice one. That's something Sebastian is working on, also one of your tutors. Let me do it live here.
I'm typing, this is a sentence where I am introducing a lot of typos. Okay. Yeah. And I can add more. So I can add space errors here and more things, and it still corrects it. So it's doing it live, it's doing it fast. That's the point of the live demo, and it's getting it. Usually spell checkers are not very good at that, if I put a space in between everything. There are also some examples here. It also gets hard for humans then. If I do this, no spaces in between, it gets hard, right? Now I even put spelling errors in between. Even humans have difficulties. I mean, at some point it will not work anymore, if I introduce too many. Yeah, you see. So now it gets problems. But yeah, that's a classical deep learning problem. So this is the input, and the output is just the corrected text, even if you have white space errors. That's different from the usual learning stuff because of tokenization. Usually deep networks take the words as input, we also did that in the lecture. Here you take the characters as input.
And that's maybe the last one. I will drop the last one. It's also something Natalie works on: entity linking, a very nice topic which we used to have in the lecture as well, but we took it out for reasons of time. So these are some entity linkers. Let me just select some, the baseline. This is one by Facebook, this is one and this is one. And let me select some benchmarks here. You will see in a second how it works. And let me select one method, maybe the baseline, on some benchmark. This is 50 sentences and you see here what it is. So it's a sentence and you're supposed to find entities from a knowledge base in the sentence. So this mention of Apple is Apple Inc., the company, and the green tells you, so this one says what it was in the benchmark, this is what the entity linker detected, and it got it right. It could also have said Apple the fruit. Stanford, Stanford University. Steve, it got it wrong.
The real Steve Jobs was meant here, but the entity linker said Steve Smith, and so on. And what you see here, again: the visualization is very important. What you see in a lot of papers is just the numbers. And numbers are good, but they can be misleading. You don't really know what the entity linker did, right? So to know that, you need to look at the individual sentences. And now a lot of things can happen here. Let's look at another linker maybe. Maybe this one here. Yeah, now you see blue things. That's where the linker said, yeah, I know that it's an entity, but I'm not sure what it is, and so on. We can do a lot more things. We can also create a table here. Okay, that's just... Is it because I selected one benchmark? Oh, it's too few benchmarks. I need more benchmarks. Let me take more benchmarks here. Now I can create a graph, and all from a web app. So you also have web app stuff here. Super useful if you want to play around with this: which linker is how good compared to other linkers, what are the problems. A lot of very useful work to dive deep into. And text extraction is similar, but I will skip it now in the interest of time.
Last slide, very quickly. I will give Algorithms and Data Structures in the summer semester. After the semester is before the semester. If you want to become a tutor, maybe drop us a line. And yeah, we talked about information retrieval, several variants, let's see. And there are projects or theses anytime. There's an entry on the wiki with detailed instructions. Just read it. You can start anytime. The only constraint is: if you tell us, then you should also start right away. So don't say, I want to do it in three months. You don't have to talk to me three months before. We meet, we explain to you what the topic is, and then you start. And we always have a lot of topics. And please, if you get no response, it's never personal, just ask again and again and again.
Maybe not every minute, but at reasonable intervals. Yeah, this is a mistake. I think that's the last advice. You got a lot of it. I like to give advice. Never take it personally if you don't get a response from busy people. It's not personal. It's just that for some reason the email got lost in hundreds of mails. It's never personal, and don't be afraid to ask again. It's not impolite unless you do it once per minute. But if you do it again and again, it's not impolite. It's the right way to do it. Of course it helps if you were in one of our lectures and showed commitment. If you say, I'm really excited about your topics and I want to do a project, and then we see, oh, the exercise sheets were not done or something like this, then maybe we're not convinced that you are excited about our topics. Okay, that's it from my side. Any questions or anything from your side? And in good tradition, we are in standard time again, but this will be solved maybe next year. Yes, please. What are the conditions for external projects? Yeah, external projects. We get a lot of requests. People want to do it at a company or so. It's important that it somehow matches our expertise. I mean, if it's totally unrelated to stuff we are doing, it doesn't work. And even more important, it has to be scientific somehow. I mean, I understand it, companies sometimes just want you to program some interface or something, and it can be very useful for them. We also do user interfaces, but they are usually connected to something scientific. But if it's just something the company needs, then you can't just say, and this is my bachelor thesis. So you have to convince us that there is scientific content. And there's a section on the wiki which says exactly what you have to write. And this convincing us is not an essay, it's like half a page or so. It's four points listed there. Any other questions? And you can ask questions later on the forum and so on.
So I hope to see some of you in projects, theses or other contexts, and yeah, but now you will first do the exam. Good luck and see you there. Bye bye. Thanks.
Welcome everybody to the new semester and to lecture one, Databases and Information Systems and also Information Retrieval. We will talk a little bit about this in a minute. So it's two lectures. It's the information retrieval lecture, but it also counts as that one. You can only take it as either of the two, obviously, not as both at the same time. So today, half of the lecture will be about organizational stuff. And you can also ask questions, of course, but there will also be some contents already. I don't know if people here have taken the information retrieval course, I have a poll in a second, but then this will look familiar because we will start with building a very simple search engine. That's a part of information systems. And the exercise sheet will be to implement that search engine. We will start it together today, and then you will implement it in the exercise sheet. So let me maybe just for starters show three demos of the kinds of things we will see in this course. So my first demo, and it's all stuff which is somehow related, so made in Freiburg or made by us. This is DBLP, that's a computer science bibliography where you can just search something and then you get results like publications, authors, conferences with that name. So here I typed databases, or I could type information retrieval. So this is work done by us a long time ago, but it's still in use, and I just wanted to show you, I can actually log in here on that machine. Let's just see, if I look at the log. So right now it's used a lot. This is like one of the most important sites for searching publications. So actually if I search something here, I should see it. Is it there? Let's search something new. Let's search databases again. Let's maybe search something else.
Oh yeah, there it was, you see. So it's a live engine, and you see it's very fast. You see everything is milliseconds here. So that's when you build a search engine. People can type something all over the world, they get results, you operate the server. That's demo number one. So this is how we will start, building something like this traditional search engine made in Freiburg. Wikidata, who in the room knows Wikidata? We will talk more about Wikidata. Wikidata, sister project of Wikipedia. Wikidata, nobody? Not a single person in the room? Okay. Interesting. Let's type only Freiburg here. University of Freiburg. Now I get a page. That's not a Wikipedia page, it's all the structured data, triples. We will talk about that in a few lectures. So it's like what you have in the info boxes on Wikipedia, also on Google. And you saw here, if I type something, I get a list of entities for which I have pages. If I make a mistake, it doesn't work. No match found. There will actually be an exercise about this, exercises six and seven, where you will make this better and you will be able to do this. Here's another demo about knowledge graphs. Let me just show you this. So this is now Wikidata, and you can search in all the structured data from the Wikimedia project. 19 billion facts. And I need a first name please from the room, a not so common and not so rare first name, so that it's interesting. Any first name? Olaf. Olaf, okay. Why not? Let's take Olaf. There we have Olaf, with F or with V? F, the real Olaf, okay. So Wikidata knows 561 people named Olaf, and here it shows us their birthplaces, and we can look at this on a map. So Olaf, yeah. Okay, maybe let's also try the other Olaf, because we also saw it there. But what would we expect for the other Olaf? Okay, there it is.
Okay, there are fewer of those. So the one with a V, you see it's more popular in other areas. So this is also something made by us, and you will also learn how this works in principle. And then, since it's the lecture about databases and information systems, also information retrieval, we will also talk about large language models. We'll talk more about learning stuff in the end. This is OpenAI, ChatGPT. Has anybody heard about this in the room? Okay, you're laughing. I just asked ChatGPT the same question. I want the coordinates of all birthplaces of people in Wikidata with first name Olaf. Just a SPARQL query please, important, and some professional advice here. Let's see what it does. Certainly. If you'd like to, and here comes the SPARQL query. So how does that work? So magically, not only can it answer everything, it can also give me a SPARQL query, and let's just see if it works. So it knows SPARQL, it understood my question, it writes a query. This is the query it came up with. Let me execute it. Okay. Apparently not the best query. I have an idea why. It's not quite correct, I think. Interesting. I always do this live, so always something else happens. How do I do this please? Last two triples don't look good in the query. Please use, I don't know, STRSTARTS maybe. So you can talk to this thing, let's just see. Here's the modified query, okay. Always willing to please. So that's a SQL-like query here. Let's see if that works. If it doesn't work, we just move on. That one works. Now I get the Olafs. Yeah, same result via ChatGPT. Okay, not bad. Not yet made in Freiburg, but we are working on it.
So what are the research topics? And this is also an overview of what we will learn in this course. So you saw this very first engine, that was 10 million or so records.
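For illustration, here is one way the query requested in the ChatGPT demo could look. It is not the query ChatGPT produced in the lecture, which is not shown in the transcript, and it matches the exact label of the given name instead of using STRSTARTS. The Wikidata identifiers are the standard ones: P31/Q5 (instance of human), P735 (given name), P19 (place of birth), P625 (coordinate location). The Python wrapper `birthplace_query` is a hypothetical helper that only builds the query string and does not send it anywhere.

```python
def birthplace_query(first_name: str, lang: str = "en") -> str:
    """Build a SPARQL query for the birthplace coordinates of people with a given first name.

    Assumption: first_name is trusted input; no escaping of quotes is done here.
    """
    return f"""
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?person ?place ?coordinates WHERE {{
  ?person wdt:P31 wd:Q5 .                   # instance of: human
  ?person wdt:P735 ?name .                  # given name
  ?name rdfs:label "{first_name}"@{lang} .  # the name item is labeled with the first name
  ?person wdt:P19 ?place .                  # place of birth
  ?place wdt:P625 ?coordinates .            # coordinate location
}}
""".strip()


print(birthplace_query("Olaf"))
```

The printed query can be pasted into a SPARQL endpoint for Wikidata. Changing the argument to "Olav" gives the query for the other spelling.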
You need to do something, you need some pre-computation so that it's fast. Everything you saw was pretty fast. Ranking is important. You search something, you get a lot of results, you want the most relevant ones first. Database stuff, knowledge graph stuff, this is what you have seen. Wikidata, structured data, we will talk about that pretty soon in this lecture. This was when you typed in Wikidata: maybe you mistyped, you want to find entities from a list, and you still want to find them when you make a typing mistake. Everything we have seen had a web app. When you go to Google, right, it's a part of information systems to have some front end, so we will learn about that. Although some people always dispute whether that's part of this topic, and I very much think it's part of that topic. Every information system has a web interface. And then in the second part of the lecture, we will move to the wonderful world of linear algebra. So we will explain how most of the things you have seen here can also be cast as matrices and vectors. Quantum physics also does that. Everything, the whole world, physics, the universe, and also artificial intelligence, databases and information systems, is just matrices and vectors, and it's very magical, and that will be the second part of the course. So that's a rough overview, and maybe now is the time to launch this poll where I will ask two questions. And let's just see, are you getting it? Do you get the question? While you are answering, let me just explain a few things. So for reasons I'm not going into, the databases and information systems course will be held by me every two years now, and then by another colleague, but this year I'm doing it. But we are also doing the information retrieval lecture, and this counts as both. Either this one or that one. It's your choice.
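One common form of the pre-computation and ranking mentioned in the overview above is an inverted index with a simple score. The sketch below is an assumption about what such a mini search engine could look like, not the course's reference solution: the records are invented, and the score is just the number of occurrences of the query words in a record.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(records: list[str]) -> dict[str, dict[int, int]]:
    """Pre-compute, for each word, the records containing it and how often."""
    index: dict[str, dict[int, int]] = defaultdict(dict)
    for record_id, text in enumerate(records):
        for word in tokenize(text):
            index[word][record_id] = index[word].get(record_id, 0) + 1
    return index


def search(index: dict[str, dict[int, int]], query: str, k: int = 3) -> list[int]:
    """Return the ids of the top-k records, ranked by total count of query words."""
    scores: dict[int, int] = defaultdict(int)
    for word in tokenize(query):
        for record_id, count in index.get(word, {}).items():
            scores[record_id] += count
    # Highest score first; ties are broken by the smaller record id.
    return sorted(scores, key=lambda r: (-scores[r], r))[:k]


records = [
    "Introduction to information retrieval",
    "Database systems and information systems",
    "Retrieval of information in information retrieval systems",
    "Linear algebra for beginners",
]
index = build_index(records)
print(search(index, "information retrieval"))
```

At query time only the lists for the query words are touched, not all records, which is why the lookup stays fast as the collection grows.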
You don't even have to make that choice now, unless you have already heard one of the lectures, and that's what I want to check here, because this lecture is completely different from the databases and information systems lecture so far. So if you have heard that in the past, you can hear this again, and you should just get credits for information retrieval then, I guess, because it's just a different course. But it's very similar to the information retrieval lecture. It's not the same. Maybe not very similar. Yeah, I think it's very similar. So this probably doesn't work if you have heard information retrieval before. Okay, it's stabilizing. Let me just show it. You should see the results now. Okay, so yeah, there are a number of people who have heard that before. Now, I didn't ask why you are here, whether you want to hear it again, or you didn't pass, or you just think it's new. But we also have 14 people who have heard information retrieval before. So this is very similar then. I don't have a decision for you now. Maybe you just stay for today, but if you have heard it before, you will recognize the first lecture. It's very similar to the first lecture from previous years. You're welcome to stay, but I don't know whether you can really take that again, because it's so similar. Okay, but we have this result now, thank you. So let me just move on. Or are there, apart from these aspects which I just named, any other questions about this information retrieval and databases two-in-one thing? Or is what I just said clear enough? I also already wrote a post on the forum. Anyway, if you have questions, we are sitting here for a longer time.
So first part, very quickly, about organization and style. So today we started at the normal time. In the future, we will start early and go a little bit later, just so that we have enough time and can make one or two breaks. It just proved to be a good idea.
It's never a good idea to rush in a lecture. And if you really have a problem to come at this time, the first part is always a little bit of organizational stuff, talking about the last sheet. So if you really can only come at 14:15, that still works. The contents will only start then. If you absolutely have to leave early, there will be a recording and everything. But I think for most people that's not a problem. There are 13 lectures altogether, but we have 15 slots because we have no Feiertag, no public holiday, this semester. But anyway, we will skip two of them, we will just create our personal holidays, which is a good idea for a number of reasons. 13 lectures is enough for one semester. We have some buffer then if something happens, and also we always go a little bit over time, which is just good, not to rush, to have a little more time, and this just compensates for this. I think that's a good deal. The alternative would be to have another date in the week where we have a second lecture, but nobody wants so many appointments. So all lectures are recorded and live streamed right now. There's a question in the chat: what if we sign up for both exams? You can't sign up for both exams. Or maybe you can, but something terrible will happen. You shouldn't, for obvious reasons. You can't get credits for two courses by listening to one. I think that's obvious. We have our esteemed video editor, he's doing a great job. Our videos are produced quite professionally. Alexander does this. All the course material is on the wiki. I've shown you the wiki page briefly. It's here. It's also on the first exercise sheet, the link here. Yeah, so just so that you have seen this page: once you have seen it, you will recognize it when you see it again. Everything is on there, linked, and we also use Subversion. I will talk more about that at the end of the lecture. Everything is also in this repository, and you always get the newest material by just saying update, give me everything, right? That's on our server.
Such a versioning system just means everything is on our server, also the stuff you submit for the exercise sheet, you can upload it there. You get the newest stuff from us, and getting the newest stuff is just typing svn update on the command line. Oh, by the way, now that I think of this, many of you are probably using a Windows machine. I also have Windows on my laptop. SVN actually has a very nice Windows front end, which is called TortoiseSVN. Thank you for the ads. TortoiseSVN. So then you just have it in your Explorer window, but you can also use it on the command line. It's a simple versioning tool, and I have a few slides about this in the end. Everything is there except for the recordings, because they are big. They will be on the wiki, also in two formats: just watching them on YouTube, or you can also download a compressed MP4. Thank you Frank, our administrator.
Exercise sheets. The exercise sheets are important. There's one sheet per lecture, so the lecture and the sheet always go together. Nothing for the last lecture, where we will talk about the evaluation and the outlook and I will present work at our chair, so it's 12 sheets. Deadline is always 12 noon before the next lecture, so two hours before the next lecture starts. This year the exercises are voluntary. This is new. In the past you always had to reach 50% of the points on the exercise sheets. We don't do it that way this year. You do not need 50%. You need 0% or more of the points to get your Studienleistung. And I have it again on another slide. So you don't have to do the exercise sheets. You can do them. Of course they're very nice sheets, but you don't have to. You can also work in groups if you like. But then only one of you should commit solutions. And how exactly we do that, we will tell you. For now you can just start working, and we will provide the details later.
Still, if you plagiarize... I mean, humans are just strange, and even with these rules some of you will copy stuff. Don't do it. It's forbidden and will be punished, and pretty severely. We've been doing this for a number of years now, and we have been too nice in the past, it just happened so much. If you copy, it doesn't happen by coincidence. You do it on purpose, which means one time is enough and there will be consequences. So please don't do it. Now you wonder, why do I even say it? Well, we have tutors who give you feedback, who look at your stuff. And if we get the same thing three times, then three tutors who are paid for this, who spend their precious time on this, spend time to give feedback on three times the same thing, three different people, and that's not okay. So in any case, copying stuff is totally meaningless under these rules, right, so there's absolutely no reason to do it. Yeah, don't do it. Yes? It doesn't count as plagiarism, yeah, exactly. It counts for everyone, but I mean, there is no counting: you will get points just so that you know how well you did. But the final points don't mean anything, they are not necessary to pass anything, there is no requirement on the number of points. But we will still give you points so that you know, oh, this is how well I did, so that you have an idea for the exam. And we will actually be as strict as we are in the exam. So you get a kind of feeling: okay, with that performance I would get all points in the exam, or I would get no points in the exam, and so on. But the points are not needed for anything. That's what I meant to write here, and that's new. In the past, you always needed 50% of the points, and this is a negation here. You do not. So now points are not required?
Yeah, they are not required for anything, but still you get them as feedback so that you know, okay, I get full points for this, or I didn't, because people are insecure about whether what they did is sufficient. And please ask questions anytime. You can just start talking. There will be no time slots for tutorials. Maybe you saw it on HISinOne. For a good reason. People always say they want tutorials. We did it in 100 variants already. It's always the same. Only few people come, and those who come just sit there. Somehow expecting that something magical happens, they listen. You just don't learn that way. People will write in the feedback, also for this course, it would have been nice if there were tutorials. We just don't do it anymore. It's meaningless. Life is complicated and so many things have two sides. I don't think this has two sides. It just doesn't make sense. There will be no weekly Q&A either. We also did that. It also doesn't make sense. People just come and listen, don't ask questions. It goes in one ear and comes out of the other. But of course we will offer something valuable. I mean, if you submit an exercise sheet and you ask for feedback, you will get feedback, and you will get personal feedback. That's very valuable. Somebody, a human being, will look at your sheet and over the weeks will tell you, yeah, here you could improve something, individualized feedback. I think that's very important for learning. It's very important, it's also written on the sheet, that you say what you want. Let me just briefly show that here on the first exercise sheet. It's written here in the end and also in our rules. Make sure to add a statement asking for feedback. So you will always submit, together with your solution for the exercise sheet, a text file. We will talk more about that towards the end of the lecture. It's just Markdown, so that you can add some simple markup there.
And you should always add a statement about what feedback you want from the tutor, because some people are not really interested. They don't want so much feedback. Others want feedback. Some people want feedback only on certain things. And again, to make it meaningful and efficient, just write it in there. I submitted something, you don't have to say so much about this because I didn't really put effort into this, or whatever. But here I would really like to know, is this good? So there's a human being on the other side, just tell them what you want, what you need. And if you don't write anything about this, you don't get feedback. And you maybe wonder why this is so strict, but even if there are fewer people in the room, there are 200 people participating, and it's so easy with so many people to generate a lot of meaningless work. So wherever it's meaningful, we are very happy to put in the effort, but not where it's meaningless and just costs people time. And I can tell you from experience that that happens a lot when you have a large audience. I think it's not so easy for you to understand, because each of you is just one individual. You see things from your perspective, but we have you as a group. We have 200 people. And so that's why I'm saying all this. And here it is again: you have to write it, and it's very easy for you. You spend time on the exercise sheet. Now just write the sentence about what you want. So it's basically a clear negotiation, a conversation with your tutor, who will be very helpful of course. There's also a forum, that's also important. Most of you already saw it when you registered on our systems. You can just ask questions. And also, we have had that for some time now, because we don't have tutorials, you can make individual appointments. For the first week we don't do that, this will start next week. We have a tool where you can just book.
Very easy, you don't even have to write an email, for those of you who have social phobia or something like this. Very important that you have a tool where you can just click, and then you get a 10 minute or 20 minute slot or so where you can talk with a human being and ask them questions, and you get individual feedback.
Style of the lectures, that's also important. So I will provide motivation and basics. We will do live coding together, examples, but it's really the basics, the intuition, why are we doing this, and very important, the inspiration. Oh, that's interesting, I want to learn more. But that's really all, right? I could give you all the details, I could do mathematical proofs for hours, I love it. There are whole YouTube channels about this and they are great, but you don't learn a thing. I can't emphasize this enough. You watch a YouTube channel with someone doing a mathematical proof, and I like to watch those too. It's very nice, very entertaining. You don't learn one thing. You just forget it. It's just so important. You only learn stuff when you do it yourself. You just don't learn stuff by listening. It's important to listen, but then listen as, okay, ah, that's what it's about. Now I got the intuition, the basic ideas, and I'm really motivated now to go deeper. But that part you have to do, otherwise it doesn't work. We always have one topic per lecture, so it's not that I talk, at some point I stop talking, and then I continue talking in the next lecture. It's always self-contained lectures. That's also why we go a little bit over time in each lecture, because you just need a little more than one and a half hours. That's just what experience shows. After two hours it becomes too much. Somewhere between one and a half and two hours seems to be the sweet spot if you don't want to rush.
And rushing is also a problem. These YouTube videos are always so very entertaining, but you can't even think that fast. So all the materials you need, lectures, slides, sheets: everything that you need for the sheets is on the slides, except maybe some manual or reference stuff. But it's not that you have to read other books or anything to be able to do the sheets or understand the material. There are pointers at the end towards literature if you're interested, but it's not necessary. Yeah, I added this, our, I would say, no bullshit approach. So we are very straightforward, honest and also direct. We are very nice people and we invest a lot of work in this course. Criticism should be written with a small C, it's welcome. But it's important that you also do your part of the work, right? It goes both ways. There are always some people, again human beings, and they are often the loudest. They do very little and they have a lot to complain about. And then we are not so nice, yeah. I don't like that. That's also a problem in society, right? It's the people who invest little, but then they somehow disturb the atmosphere. Everything is not good. Everything should be done this way or that way. So I think it's very important in a lecture like this, and in general in life, that it goes both ways. You put in your part of the work, and we have different roles here, and then we are very happy to put in our part of the work, and then we are also nice. We're not always nice if you don't. Yeah.
So what you should learn in this course: two kinds of understanding are important. You should of course understand the concepts. Every lecture is about something, and you should understand it in depth, not just superficially, but you should also understand how to do things in practice. That's in all my lectures. If you have been in other lectures by us before, it's always like this.
It's important to understand the theory in depth, but then also apply it in practice. I say that because at the university, you often have A and not B. So you understand everything in theory, but you have no idea how to apply it. And in other parts of the world you have a lot of B, you learn this is how it's done, but you don't really understand what's behind it. So this is very much an A plus B lecture. And I would say both are equally important, both to the same degree.
Master solutions. After the deadline for each sheet, the master solutions are published. And, that's also important, they are only for your use, now and in the future. So it's also a serious offense to somehow publish them or pass them on to others. They are strictly only for your use. And again, master solutions, if you have put in the work and done the sheet, then of course they can be very instructive. Let me just see how they did it. Oh, I see. Oh, it could have been done better in that way. What absolutely doesn't make sense is: you don't do the exercise sheet, you just look at it, and then you look at the master solution and say, yeah, yeah, that's also how I would have done it. Now I understood it. I'm saying that because a lot of people learn that way. They're looking at stuff and then just checking whether it makes sense, and if it makes sense, then they think they have learned it. You don't learn that way. So if you have put in the work, your own solution, then looking at the master solution makes sense.
Amount of work. So six ECTS. For a long time now, almost all courses at this faculty have six ECTS, which makes it very easy to have a choice, for example like with these two lectures here now. So that's about 180 working hours. Here are three options for the exercise sheets. Option A is zero hours per exercise sheet, if you don't do them. That's one option. It wasn't an option in past years, because there you needed to reach 50% of the points.
So that's option A. Zero hours. It's not an approximation. It's exact. We did experiments. Option B is you do them. That of course depends on your personal knowledge and experience and how fast your brain is and so on. Eight hours, I think, is a good average time if you do an exercise sheet. By the way, you are welcome to do some exercise sheets and not others. Given the system in this course, where it's not a requirement, that's totally okay. And you don't have to tell us before. You don't have to say, I'm going to do all the sheets. You can just decide on a sheet by sheet basis, depending on your personal circumstances at that time. This is also important: if you lack basic prerequisites, and this will apply to one-fourth of the people in the room, this will be a lot more work. And let me say something very important. By basic prerequisites I mainly mean mathematics. I know from experience, from 30 years of teaching experience in computer science, that a lot of people have problems with math, and I'm not talking about advanced mathematics, I'm talking about the simplest mathematics. I know that's a problem, that's okay, but you should be aware of it. The same goes for programming. Some people are good at it, some are just good at it in theory, some have experience, and this makes a big difference for this course. If you have problems here, right, if you have problems with absolute basics, then more advanced stuff just takes you much more time. And here's an important statement. You don't have to raise your hand, but just so that you know: if that applies to you, if you have problems with math or with programming, you're still very welcome to do this course. But you absolutely have to be prepared and willing to spend twice the time. This course is well suited in the sense that I will always give little crash courses. Look, maybe you didn't understand this in the past. Here's how it works.
It will also help those who have already heard this before. But if you haven't learned these concepts before as you should, then you should learn them now. And this will cost additional time, and a lot of time. So you should reserve like 16 or 20 hours and drop some other lectures. I'm just saying this because that's how it is. And it affects a lot of people. So if you are lacking the prerequisites, then you just have to make a choice, not now, but maybe in one or two weeks. Okay, I'm one of these people, that's okay, am I willing to invest like 20 hours per week? And then use this to fill my gaps in math understanding or programming. That's okay, but you have to make a decision. What doesn't make sense is to not appreciate that you have these gaps, expect that it takes eight hours or less, and complain. That happens, and that's not good. That just creates bad atmosphere on both sides. But you're very welcome.
Okay, Studienleistung. For all the courses, or most of the courses, you have a Prüfungsleistung, the exam in the end, that will be the next slide, and a Studienleistung. So the Studienleistung you basically get for free. You have to register on our own course system, that's on the first exercise sheet, and for the exam, and then you get the Studienleistung. This is not urgent yet. The deadline for this is sometime after Christmas. So you don't have to make up your mind now. So no technical requirements except these two things, which you will manage to do. That's what we talked about earlier. You do get points for the sheets, because I think when you do a sheet, you want to know how well you did. You don't just want well done, or nice that you have submitted something. It's good that you get feedback, qualitative feedback. Look here, this was not so good. Here's something that could be done better. But also quantitative feedback, like you get it in the exam. Here's an important disclaimer.
So we have five tutors this time, and it's a new situation because we have the information retrieval people and the people who are interested in Databases and Information Systems, and altogether I think it's 300 people. I think 300 will not stay, but maybe 200. So it's more like a basic lecture. That's just a lot of people and we can't hire so many tutors. So I expect that probably a lot of you will submit for the first sheet, but then some people will submit here, some there. I hope that five tutors can correct about 100 submissions, not more. If for some reason you all decide this semester, now that it's voluntary, that you all want to submit (when it's a requirement, you don't submit; you never know how human beings work), then we have to see what we do. So if we get 200 submissions every week, we have a problem. I don't think it will happen, but let's see. And then we have the exam. The date will be fixed in the second half. I don't know if we have a say in this or if the Prüfungsamt does it; maybe they will fix it. It might be that they fix it earlier. I don't know. And it will be four tasks with 25 points each. Actually, don't take my word for this, but it's also not important. I mean, the point is you get 100 points and then there's just a grading scheme: with 50 points you have passed, here you get 1.0. And just so that you know it beforehand, this exam will test all of the following three. There was this earlier slide: basic understanding, of course, there will always be simpler questions like, did you understand it at all? There will be deeper understanding, of course, and also practical stuff. You will have to write code in the exam as well. And I can tell you right at the beginning, I mean, it's kind of obvious, but just so that it's said: superficial understanding, so just producing words which somehow have something to do with the topic, just the basics, will not be enough to pass the exam.
So you need to have a certain level in all three of these, because the proportion will also be roughly one third, one third, one third. I mean, it's kind of clear, but I just wanted to say it. And over the years, we have also come to design our exams in such a way that we see whether you actually understood something. It's a common practice, if you have a superficial understanding, that you just write words which are somehow related to the question, a bit like a bad language model, and then expect to get points. We don't do our exams that way. Actually it's not easy, but it can be done: you ask a question that you can only answer if you understood it. By deeper I don't mean super deep, so I don't want to instill any fear in you. It's not that you have complicated high-IQ questions in the exam. There are questions like, write down the definition of this, just some very basic questions, and questions where it's just: did you really understand it? They are not super hard questions, but did you understand it? Just to be sure, my work during the semester will not be counted for the exam? Oh yeah, so the question was whether any of the points will be counted towards the exam. No, it's not even allowed by law. So it's just for you to practice and to get feedback early on, but the final mark will only be the exam, and that's also the law. It's not allowed to take points from the exercises or from class participation; the exam, that's the final grade. Yes, please. If I already did Information Retrieval with you, would it still make sense for me to attend this lecture, or should I take something else? Yeah, that was the question. You came a little bit late maybe, because there was this poll in the beginning where I asked this and said this. We haven't 100% decided yet. So this is, where do I have this? Can you see it here? So there are these 14 people who did hear Information Retrieval.
So let me say it again: this lecture is very different from Databases and Information Systems in the past. That was a classical and a bit old-style lecture. It's very different from that one, but quite similar to Information Retrieval. So probably it doesn't make sense, but we have to think about it. We first wanted to know how many people there actually are. So you are one of these 14, right? You heard it before. Yes. So sorry, I can't give you a definite answer now. We have to think about it. Any other questions about this? This was the organizational part. We will make a short break now and then go to the contents. Okay, there's another question. Is it voluntary because there are not enough tutors? No, there are more reasons for making it voluntary, but that was one reason. Will the exam be completely different compared to the database exams of previous years? Completely, yes, completely different, because the contents of the course are completely different. So, I don't know if we have anybody here, you don't have to say it, who just has to take the exam again for Databases and Information Systems. I'm sorry, it will be new content. If that's a problem, maybe we can talk. I think it affects only very few people. Since you said that you only do it every two years, does this mean (I'm just reading questions from the chat, but it's an interesting one), does this mean that next year it will again be the old database lecture material? No. This is the new Databases and Information Systems lecture, and it will be held by me and someone else in turns, but with very similar material. So maybe Joschka Burdecker will do it next year with these slides or similar slides. But then in the years where I don't give it, I will do an advanced lecture. So in the past, I would give Information Retrieval every year. Now I'm doing this lecture here in odd years, and in the even years something like a follow-up with more advanced stuff, because students have always asked about that.
Okay, we will have a second half if you have more questions. Let's now make a break for five minutes and reduce the CO2, which is now at twice the Eocene level, 2,400. So five-minute break, see you again in five minutes. Okay, let's continue with the content. So as I said, the first lecture is about building a simple search engine. Let me just explain. It's very simple, and the exercise is mainly a getting-used-to exercise: programming, understanding something simple, registering. It's just a getting-used-to-everything exercise. So for the first exercise sheet, let me just show you. Here we are on a machine in our rooms here, where we'll do the programming. I have two windows here, this one and this one. In one I will write the code and in the other I will compile the code. Okay, I'm here, and let me just show you the data set for the first exercise sheet, which is also linked on the wiki. Let me show you here if I go back. It's the data set for ES1, and it has information about over 100,000 movies. Let me just briefly show it to you. It looks like this. So you have movies here, they are ordered by IMDb score, starting with The Shawshank Redemption, the all-time number one, then The Dark Knight, Inception. So it's actually ordered one, two, three, four, five, six. You can ask yourself how many of the first 10 movies you have seen. And I think I actually have a poll about this. Do I have it? Let me see. Yeah, let me just ask this. I think it's still there from back then. Let me just launch that top 10 poll. How many of these movies have you seen? So you see the list? There it is. How many have you seen of the first 10? One, two, three, four, five, six, seven, eight, nine, ten. Up to Fellowship of the Ring. Okay, that's your data set. So just movie titles with descriptions, rather long descriptions.
And while you are taking part in the poll: now we have a keyword query like Pacino mafia. Maybe you're interested in mafia movies with Al Pacino, and that's your keyword query. For today, I will only explain it for two keywords; for the exercise sheet, you do it for any number of keywords. And now we just want all the movie descriptions which contain all of the keywords, so both Pacino and mafia, anywhere in the description. And for the exercise sheet, you just return three documents, any three. If there are 100 hits, just return any three. So we don't do ranking today. You should note, and you can exploit that, that these are already ranked somehow. So if you take the first three from this list in this order, you will already get some ranking, because the input movies are ordered by IMDb score. Oh, the poll has stabilized already. Let's see what the results are. The results are: none, okay, there are two people. Wow, interesting. All ten of them, believe it or not: 11. Let me just check for myself. One, two, three, four, five, six, seven, eight, nine, 10. I'm also 10 out of 10, just so you know. Okay, interesting. I think it's a very important part of education that you have watched these movies. So how do we do this keyword search? Given a keyword query, and it's always good when you have a research question or a programming question to ask: what's the simplest solution? The simplest solution is that I just go over all the documents. Let me just do that here, and I use something like grep. So I could do the following. Let's just see how large this file is. It's not super large, 50 megabytes. It's the first exercise sheet; we will work with larger files later on. Let me just do a Unix-style grep. It just gives me all lines which contain a match for a certain regular expression. And I could just say, I don't know, I want Pacino mafia. Now I have to do a strange thing, because this will search for the string Pacino, then any number of characters, then mafia.
And now I also want it the other way around, so it's a bit strange to do this with a regular expression. But that's how grep works. If you haven't heard about grep, it's not important. You can look it up afterwards. It's just a command-line tool to find all matches in a text file. And let's see, okay, this didn't work, probably because I did it case-sensitively and Pacino is usually written with a capital letter, yeah. And you see, it's actually quite fast, right? How long did it take? I mean, this was just two keywords. It took, okay, between one tenth and one twentieth of a second. And let's just look at the movie titles first. Okay, and you already see interesting things happening. We have Godfather one, oh, and for some reason I have an echo, but that's not by me, right? Does somebody? So where are Godfather two and three? Why do we have an echo now? Does anybody have an idea why we have an echo now? Did anybody do something? If something changed, you have to ask yourself what changed in the last few minutes. Yeah, so where are Godfather two and three if I search for mafia and Pacino? You can figure that out. So it's not so bad actually, just grep, right? That's a very simple approach. You just take your whole input and grep it. Grep processes around one gigabyte per second and we just have a 50 megabyte collection. So it's not a bad approach, just good to know, right? Just for your curiosity, the current web has around 60 billion pages or maybe more. It's actually hard to know. The companies don't tell us. There's a website for this called World Wide Web Size. It actually estimates the size of the web using an interesting approach based on Zipf's law, which we will actually talk about in this lecture. So basically what they do, I can briefly say it because it's interesting, let's just do something like this: the. So now I'm searching for the word the.
And Google actually tells me how many documents contain the word "the": 25 billion. And I can approximate, and that has something to do with Zipf's law, how frequent "the" usually is in documents. And knowing how frequent it usually is, and how many documents Google tells me it has with that word, tells me something about the size of their index. So that's the idea. It's just an estimate, but a good one, I think. So we don't do it the grep way. We want to do it in a way that also works for bigger collections or when you don't have grep, and the standard way to do it is the inverted index. The inverted index is a very simple data structure. It's something you pre-compute from your data, and we will program it in a very simple version in a second. For each word that occurs at least once, for example Pacino (and you lowercase the words, you normalize them), you store the list of document IDs. Every document gets a number, just the line number for example. Pacino occurs in documents number 13, 57, 61, and you make sure that this list is sorted. Same for every other word. Mafia occurs in documents 5, 23, 57. And now you can already see how that helps with answering keyword queries. If I want to know in which documents, which means lines in my file, which means movie descriptions, I have both Pacino and mafia, well, I just look at these lists. Where do I have the same ID? 57 is in both of them, so 57 is a match. So once I have these lists, I can compute it just from these lists of numbers, without actually going to the text. And that's also how every search engine like Google will do it in principle. Of course there's a lot of advanced stuff on top of that, but that's always the basic principle. It's the basic data structure and everybody should know it. And we will see later, in the second half of the lecture, that it is also strongly related to linear algebra and matrices, in a not super magical but interesting way. So these things are called inverted lists.
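To make the data structure concrete, here is a minimal sketch of the two inverted lists from the example, stored as a Python dictionary. The function name `documents_with_both` is made up for illustration, and it uses Python sets for brevity rather than the list-based intersection that the lecture develops afterwards.

```python
# Toy inverted index with the IDs from the lecture example.
# Keys are lowercased words, values are sorted lists of document IDs.
inverted_lists = {
    "pacino": [13, 57, 61],
    "mafia": [5, 23, 57],
}


def documents_with_both(index, word1, word2):
    """Return the sorted IDs that occur in the inverted lists of both words.

    Hypothetical helper: it uses sets for brevity, not the sorted-list
    algorithm explained later in the lecture.
    """
    common = set(index.get(word1, [])) & set(index.get(word2, []))
    return sorted(common)


print(documents_with_both(inverted_lists, "pacino", "mafia"))  # [57]
```

The point of the sketch is only that the answer comes from the lists of numbers alone; the movie descriptions are never touched at query time.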
Maybe let me very briefly explain to you why they are called inverted lists. I mean, it's a strange name, right? If you look at this, what do you have? And let's look at it with line numbers. So now I have, for document one, the words in that document. My data is just this information about movies, and here I have this data in a particular order: for each document, the words in that document. And here I have, for each word, the documents that contain it. So in this sense it's inverted. And, yeah, we tried to figure this out before. It went away at first. I don't think it's the hair, but maybe if it's too far away. So you are seeing me; if you see a reason why this happens, just as a background process, try to figure out what the reason for this sound effect is. And it shouldn't be too far away, because then it's too far. So that's why they're called inverted lists. And important, and that's different from how I will do it today when we code live: for the exercise sheet, each inverted list should contain a particular ID only once. So even if Pacino is mentioned three times in a document, maybe 57, then 57 should be here only once. It's a tiny complication, you have to pay attention to it, and it's easy to do something wrong there. So pay attention. It's easy to fix, but you have to take care of it. Alternatively, you could also do the following: instead of having just a document ID, you could have a pair of document ID and how many times the word occurs there. That's actually related to lecture two: not just saying it's in there, but also saying how often it is in there. What could it be? It's a bit annoying, right? What could it be? Maybe it's also the cable. Maybe I just shouldn't move. Or not talk, that's also a solution. So, query processing for one keyword. Maybe it's also my brain.
If your query consists of only a single keyword, then with what we have just precomputed, say you just want Pacino, well, there you have the result. Right, you have precomputed it. So if you have precomputed this and somebody types Pacino and wants three documents, you just say 13, 57, 61. You can do it in zero time. It's already there. And to say this again, this is already ordered by popularity, so the lower line numbers are kind of the more popular movies, so these are even the most popular Pacino movies, if you give the numbers in the order in which you saw them in the file, which we will do; it's the natural way to do it. How do you do it for two keywords? Well, here's a simple algorithm. You have to implement it. It's called zipper, for reasons that will be clear in a second. So let me just call this list here L1. So now I have two sorted lists of integers. It's important that they are sorted, otherwise this algorithm will not be efficient. So, the zipper algorithm. Now I'm explaining an algorithm to you, a very simple algorithm, but it's actually very important in practice. You will have a variable i for the first list and a variable j for the second list, which are both set to zero, the beginning of the list, at the beginning. And I think I will just write this on top here. So this is my i and this is my variable j. Initially, let me just write it here, i and j are zero. And now you just look at the elements at i and j and you compare them. If they are equal, then you write that ID to the result, because I want to find the intersection, so I want to find documents which contain both words. And if they are not equal, you just advance. So now I just advance here, I have 57, no, no, that's not correct, what I did here. I made a mistake in my explanation. You don't advance in both lists.
I missed the most important piece in this algorithm. The most important piece is that you advance in the list with the smaller number. So where you have the five, you advance. Let me just write it here: advance in the list with the smaller ID. That's the whole algorithm, and you can think about why it's correct. So now I'm here at 23, and I compare 13 to 23. They are not equal, so I don't write anything to the result. I advance in the list with the smaller value, which is now 13. Now my i is here, my j is here. 57 and 23, they are not equal. I advance in the list with the smaller ID, which is here. 57 and 57, now they are equal. I write 57 to my result. Now I can advance in both lists. 61 and 63, they are not equal. 61 is smaller, so I advance in this list. 114 and 63 are not equal, so I don't write anything to the result. 63 is smaller, so I advance in this list. Now I have two equal values again, and I advance in both lists, and so on. Maybe I have more common elements; the lists are longer here. That's the algorithm. And it's called zipper, like the zipper on your clothing, because you go through these lists like this. Not necessarily in a strictly interleaving fashion; it really depends on the values here. You always compare where you are currently at, and then you advance in the list with the smaller value. And this is correct, think about it yourself, because both lists are sorted; otherwise it would not be correct. And this you should implement. We will not implement it today. That's your task for the exercise sheet. And yes, to the question in the chat: this is a merge join. We will come to databases in lecture 3, and sorted list intersection is exactly what in the database world (you may not understand what it means now, but let me just say it) is called a merge join.
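The walk-through above can be sketched in a few lines of Python. This is only one possible way to write it, not the reference solution for the exercise sheet; the function name `intersect` is an assumption, and the second list is extended with 114 to mirror the point in the walk-through where two equal values show up again.

```python
def intersect(list1, list2):
    """Intersect two sorted lists of IDs with the zipper algorithm.

    Both inputs must be sorted in ascending order, otherwise the
    result is not guaranteed to be correct.
    """
    result = []
    i, j = 0, 0  # current positions in list1 and list2
    while i < len(list1) and j < len(list2):
        if list1[i] == list2[j]:
            # Same ID in both lists: it belongs to the intersection.
            result.append(list1[i])
            i += 1
            j += 1
        elif list1[i] < list2[j]:
            # Advance in the list with the smaller ID.
            i += 1
        else:
            j += 1
    return result


# Lists from the walk-through (the trailing 114 in the second list is assumed).
print(intersect([13, 57, 61, 114], [5, 23, 57, 63, 114]))  # [57, 114]
```

Each step advances at least one of the two positions, so the loop runs at most len(list1) + len(list2) times.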
When you join two tables on a column they have in common and these columns are sorted, that's the same thing. And actually, here we did intersection, which means we are only interested in those IDs which occur in both lists. Maybe we want all IDs from both lists in sorted order; that's called merging. Then we would like as a result 5, 13, 23, 57, 57, 61, 63, so the union of the two in sorted order. That's what we will do in the next exercise sheet, because maybe you also want documents which contain only one of the words. For the exercise sheet you should actually be able to process more than two keywords. So this is the intersection algorithm for two lists. Now what do you do if you have three, four, five lists? Well, for the exercise sheet you can make your life very simple: you just do it pairwise. So if you have lists L1 through Lk, you just start with the first two, you intersect them, and then you intersect that result with the third one, and so on. So you just have to implement intersecting two lists and use it iteratively to intersect k lists. There are more intelligent ways to do this. You don't need to implement them, but maybe you want to implement them. They're not so complicated. For those of you who are a bit more experienced or just like the challenge: one thing is, if you think about it, for this pairwise heuristic you can intersect the lists in any order; the final result is the same. It makes sense to start with the shorter ones. So just order the lists by length. Why? Well, if you have the shorter ones here, then this L12 will already be very short, right? So the result becomes small early on. You just spend less work if you start with the smaller ones. So that's an easy optimization; you don't have to implement it. And the other, more important optimization is that this very same algorithm here can be generalized to k lists.
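As a rough illustration of the pairwise approach with the shortest-first ordering, here is a sketch in Python. The names `intersect` and `intersect_all` are chosen for this example only, and the two-list routine is repeated so that the block runs on its own.

```python
def intersect(list1, list2):
    """Zipper intersection of two sorted lists of IDs."""
    result = []
    i, j = 0, 0
    while i < len(list1) and j < len(list2):
        if list1[i] == list2[j]:
            result.append(list1[i])
            i += 1
            j += 1
        elif list1[i] < list2[j]:
            i += 1
        else:
            j += 1
    return result


def intersect_all(lists):
    """Intersect k sorted lists pairwise, starting with the shortest ones."""
    if not lists:
        return []
    # Shorter lists first, so the intermediate result becomes small early on.
    ordered = sorted(lists, key=len)
    result = list(ordered[0])
    for other in ordered[1:]:
        result = intersect(result, other)
        if not result:
            break  # the intersection is already empty, no more work needed
    return result


print(intersect_all([[1, 3, 5, 7, 9], [3, 7], [2, 3, 4, 7, 8]]))  # [3, 7]
```

The early exit on an empty intermediate result is an extra convenience that the lecture does not mention; the ordering by length is the optimization described above.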
So just imagine I have three lists here. Now I have three pointers i, j and k, and you do the exact same thing: you always advance in the list with the smallest value. But now you always have three values to look at, or if you have k lists, k values, and you need a priority queue for that. So if you are interested in this stuff, look it up, or maybe you figure it out yourself. You're welcome to program it; you don't have to for the sheet. You can just do the simple pairwise iteration. The running time of the k-way variant, by the way, is log k times the total length of the lists. We may come back to this in a later lecture, maybe not. So before we go to the coding: you somehow have to break the text into words. That's actually a problem that's harder than it seems. We can do something very simple, and we will see it in the code in a second: just take the longest sequences of word characters. So we just take the regular characters until we meet a character which is not a word character, like a space here. So we just break this up like: the, matrix, is. You will see it in a second; we will use a regular expression. So it's conceptually simple and, yeah, we will just take the very basic set of characters. So if you have some funny character in there, it will be considered as breaking the word, and the search for that character will not work, but it's good enough for exercise sheet one. Yeah, here's how reality looks, and it's actually very important, and later we have a lecture about this. Because whatever you do, and we have seen several examples, search engines, databases and also programs like ChatGPT, it's super important that you can support all the languages and all the characters and stuff like this. This is a haiku, by the way. If anybody here can read it, please tell us. I actually know what it means. It's about monkeys and rain and winter, as you can see.
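The simple word splitting described above can be tried out directly. The sketch below assumes that "word characters" means the letters a to z in either case, which is what the live coding later uses; the function name `tokenize` and the sample sentences are made up for illustration.

```python
import re


def tokenize(text):
    """Split text into lowercased words, where a word is a maximal
    sequence of the letters a-z or A-Z (an assumption for this sketch)."""
    return [word.lower() for word in re.findall(r"[A-Za-z]+", text)]


print(tokenize("The Matrix is a film."))  # ['the', 'matrix', 'is', 'a', 'film']

# A character outside a-z breaks the word, so a search for the full
# word "gemüse" would not find this document.
print(tokenize("Gemüse"))  # ['gem', 'se']

# For comparison: \w matches Unicode letters (and digits and underscores)
# for Python 3 string patterns, so the umlaut stays inside the word.
print(re.findall(r"\w+", "Gemüse"))  # ['Gemüse']
```

Whether `\w+` is the right choice for a real search engine is a separate question; it is shown here only to make the effect of the character set visible.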
In German, you have these long compound words, like the one for party organization committee. Some other languages also do this; they just fuse nouns into one word. So should this really be only one word? What if I search for party? Maybe I also want to find this. And then you have funny characters. And if you do something wrong, the funny characters become other funny characters and you don't find anything. So this is actually an O umlaut: österreichische Gemüsebrühe. And this is a very important topic. We have a part of a lecture about getting this right. And actually, one very small side story: when Google became big around the 2000s, there were other search engines before. One of the reasons they became so successful is that they got these stupid details right. This is so important in practice, I can't stress it enough. There were engines before; they just wouldn't work in Germany because they didn't deal properly with characters like the German umlauts, the o with two dots, and these funny things. Now you have a whole country which is not using your engine because you didn't bother to care about their letters or language. And that was, yeah, unbelievable, but Google just paid attention to this. The official story is different, but actually a major reason for their success is that they took care of all the details. And encoding letters, that's a very important detail, which is why we have half a lecture about it; it's so important. How do we construct an inverted index? Well, I think we will go right to the coding now. Let's code together now, let's build an inverted index, let's write some Python code now. Yeah, let's see, we have to keep this open, otherwise we will suffocate. Okay, let's start to write some code.
And while I write the code, you pay attention, so it's our joint responsibility that this code will be correct and compiles. So it's Python. It's a bit loud, right? Let's try, but if we close the door, it's either suffocation or noise. Let's try for a little longer, because the CO2 is... So the Eocene was an epoch in Earth's history, right, 50 million years ago; that's when CO2 was three times as high. Maybe you don't understand what I mean when I say Eocene; it's Eozän in German. 50 million years ago the temperature was like 20 degrees higher and the CO2 in the atmosphere was like 1,200; right now we have 2,200 in here. So it's twice that. In the Eocene it was pretty hot, with jungles at the poles and so on. So, let's see, we start with a comment here: a simple inverted index as explained in lecture one. Okay, and now let's start. Maybe I will explain some basic Python stuff on the side, but not everything. So this is now the constructor: create an empty inverted index. You always have to write these self things. So we have our inverted lists, and let me use types in Python, and let me just quickly go back to the slide where I had it. This is what I want. For every word I have a list of integers. So I have a dictionary where the key is the word and the value is a list of integers. Right, I have words and lists of integers. That's what I have. So let me write it like this: dict. Is this correct? And I have strings and I have lists of integers. And initially that's empty: I have no words and no lists. And if you see a mistake, if you spot it, just shout it out so that I can correct it.
It's your responsibility that it compiles without errors when we are done. Okay, now we have an empty inverted index, and now we want to build it from a file. Actually, I've shown you the file from the exercise sheet, but let's look at a smaller one. I haven't copied it yet, let me just do this. Where do I find it? Internal, internal. We have an example file, I think. Yeah, that's also linked on the wiki, right? But it's not here. Is it already in public, or how do I get the example file? Maybe I should just download it. Is it on the wiki, Sebastian? It's in public templates, okay. Oh, it was here. Ah, I think I have to, just give me a second, I think I have to update. That's the stuff which you get to do the exercise sheet. I will explain it in a second. Okay, let me just copy it here. It will be linked on the wiki. Sheet one, example.tsv. Okay, now I should have it. Just ignore the last minute. Just remove it from your memory. It served no instructive purpose. So we have this input file. It's just a test file. We will always have unit tests in our code so that you can check that you did things correctly. And you should know that this is of the same structure as the real file, the big one. But a big file is not good for testing. So you have a title, and the titles are just doc1, doc2, doc3. Then you have a description of a movie: a movie movie, a film, film movie, stupid descriptions. And then, by the way, this is for future exercise sheets, there's also more information here. These things here in the middle are actually not spaces, it's a tab character. So if I go here with ga in Neovim, that's a zero nine. So that's a tab, it's not spaces. So it's tab-separated. And this is, I think, the number of votes from IMDb. This is the IMDb score, at least in the real file, and this is the number of Wikipedia or Wikimedia articles about the movie. And these are just there if you want to play around with ranking; it's for lecture 2.
For this lecture you can ignore it; you can also use it. So we want to build the index from a file. Let's see, you have to help me because my Python might be a little rusty. So: build inverted index from given file. Okay, so what do we do? We have to read the file, I guess, right? So let's write: with open file name as file. Okay, we open the file. Now we iterate over all lines. Now we have a line. Now let's split it by tabs. Let's see how we split it. The first thing is the title, then comes the description, and the rest we don't care about. I think that's the way to say "don't care" in Python. We want to split it by tabs and, yeah, we only care about the first two columns; the rest we don't have to split up. I think this will say: split by tab, but into at most three things. This will be the first column, the title, everything until the first tab, like Shawshank Redemption; this will be the description, the long thing; and then the rest will be in this one, which is just Python's way of saying don't care, some temporary variable. Okay, now what do I do? Let's split up the line into words, and I think we need the regex module from Python. Let's just see, and you tell me when something is wrong. I think it's findall, if I want all matches. So okay, we said these are our word characters, A to Z, and then plus. So this is a regular expression that matches a maximal sequence of these characters. So it would match matrix, it would match the, and so on. If I wrote star, it could also be an empty match. Plus says one or more. Okay, and you have to tell me if something is wrong. This should just find all matches of this regular expression, which means it should find all the words in a line and return them here. Okay, let me just move this up. Oh, and before I continue, maybe let's write a unit test. We will often give you unit tests, so doctests in Python. You can write the test right in the comment.
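The line-splitting step described above might look as follows. This is a sketch under assumptions: the helper name `parse_line` and the sample line are invented here, and the lecture does not show the exact call, although `str.split` with a `maxsplit` of 2 is the standard way to get at most three parts.

```python
def parse_line(line):
    """Split one tab-separated line into title and description.

    maxsplit=2 yields at most three parts: the title, the description,
    and everything else (still containing tabs) as one leftover string.
    Assumes the line has at least three tab-separated columns.
    """
    title, desc, _ = line.split("\t", 2)  # "_" marks a value we ignore
    return title, desc


# Hypothetical sample line in the same shape as the example file.
sample = "doc1\ta movie movie\t7.5\t1000\t3\n"
print(parse_line(sample))  # ('doc1', 'a movie movie')
```

A line with fewer than three columns would make the unpacking fail with a ValueError, so a more defensive version would check the number of parts first.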
So let's just see, what do we want? If I create an inverted index, I just start with these three greater-than signs, which just say: okay, now comes code for a doctest. So just execute this code and see if the result is as it should be; you will see it in a second. So I'm just calling this build from file, and now I'm taking the example TSV there. So let's just see what I get if I build an inverted index from this example file. Now think about it. And now I want the inverted lists. Now the problem is that the inverted lists are a dictionary, and the order of the words in the dictionary is not defined in Python. So let me just take the key-value pairs, I think you can do that in Python, and then let's sort that. That will be sorted by words now. Now I get pairs of a word and the inverted list for that word. And now, without these three characters in the beginning, I can write down the expected result. So what's the lexicographically smallest word in these movie descriptions? And let's just take the descriptions. We don't index the title in this code. If you see these three movie descriptions, it's just the second column, what's the lexicographically smallest word? Hmm? Movie, film, and we will lowercase the stuff. A, I think, a. Okay, so the first inverted list is, let's try it like this. If I give the documents IDs which correspond to the line numbers, in which documents does a occur? What? One and two, okay. So this should be the inverted list, this is what I expect. So I have a list for a, and it should be the documents one and two. What's the lexicographically next one of the words that occur? Film, I also think so. And film has which inverted list? Two, yeah, okay. So this is how you write a unit test. You don't write the code and then take the output from the code and paste it here; you write what you expect there first. So there's only one other word, which is movie.
And now I will do what you will not do in the exercise sheet: when a word occurs multiple times in the document, I will have the document ID multiple times. That's what you should not do. In the exercise sheet, there should only be one one, but I will do it like this now for simplicity, so that you also have something to think about for the exercise sheet. So one one. Movie occurs in document one, and again in document one. Yeah, let's just see. And the reason I did the sorted here is because, yeah, with a dictionary it's not clear in which order it stores the words. So, what do I do now? For each word, for word in words. Okay, first I think I need the record ID. Somehow I need to keep track of where I am, the line number. So let's just increase this whenever I read a line, and let's just go up here again. So I'm reading the first line and now my record ID is 1. So now I have a word and I'm at record 1. So what do I do? Well, I take the inverted list of this word, so I've already seen this word, and now I can just append the record ID. I think that's what I need, right? And the nice thing is, because I'm going through the documents in order anyway, so for example I'm here, I'm at the word dark, and I'm in document 12, maybe I've already seen dark before, now I just go to the inverted list of dark and append 12, this line number, this record ID. And that's how I build the list. And maybe I should lowercase my word here. So I lowercase this. Any questions about this? I think that's almost it, but not quite. Any questions about this line, this code, will it work, will it not work? Probably not. Okay, that's a probabilistic statement about this code. Interesting. Our world, our universe is also probabilistic, so it fits in that respect. Why probably? There are two moments I'm not sure about. Okay, what are you not sure about?
Okay, you have doubts about this. I'm also not sure, but I think you can. That's Python, in Java you would have to write 20 lines, and in Python it's one line. The lists are not initialized. Yeah, that's true I think. Initially we don't have any lists, right? It's just empty. And now when I see a word for the first time, this will probably not work, right? Here I'm saying give me the list for that word. No, I don't think it works that way. Let me do it this way, word lower. And so if the word is not in the inverted lists, we should first create this list. I think you are absolutely right. So if we see the word for the first time, create an empty list. Okay, now here we append the record ID to the list. And just note, that's very nice, it's automatically sorted because we are going through the record IDs in ascending order, right? So whatever we append will automatically be no smaller than the stuff already in the list. So we have something in the chat. Shouldn't the line in findall be replaced with desc since we only want to search the descriptions? I completely agree. It should be desc here. Okay, and you should also code like that, not just mindless coding and then compiling until you have no more syntax errors. The goal should be that your code compiles on the first try. So, will this compile and will it work? Any other comments? It's our one big achievement in this lecture. Yes? And if the word is repeated? But that's on purpose, we want that. I wrote it in the unit test. For the exercise sheet, you shouldn't do it. You should check. Here it's okay. So if it occurs twice, I have the same record ID twice. That's why I wrote it in the unit test. Yeah? Well, always say the line number, please, so that I... 35. So you are saying we are missing numbers.
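The code that was just assembled step by step can be sketched roughly as follows. This is a minimal sketch and not the course's master solution: the function name `build_inverted_lists`, the pattern `[A-Za-z]+` and the two sample lines are assumptions made for illustration. As in the live coding, a record ID is appended once per occurrence, so it can appear several times in a list.

```python
import re


def build_inverted_lists(lines):
    """Map each lowercased word to the record ids of the descriptions containing it.

    Each line is expected to be tab-separated: title, description, rest.
    As in the lecture, a record id is repeated if the word occurs several
    times in the same description.
    """
    inverted_lists = {}
    for record_id, line in enumerate(lines, start=1):
        # Split into at most three fields; only the description is indexed.
        _title, desc, *_rest = line.rstrip("\n").split("\t", 2)
        # Assumed word pattern: maximal runs of ASCII letters.
        for word in re.findall(r"[A-Za-z]+", desc):
            word = word.lower()
            if word not in inverted_lists:
                # First occurrence of this word: create its list.
                inverted_lists[word] = []
            # Record ids arrive in ascending order, so each list stays sorted.
            inverted_lists[word].append(record_id)
    return inverted_lists


# Hypothetical two-line sample, not the course's example file.
sample = [
    "First\tA movie about a movie\t2001",
    "Second\tA film\t2002",
]
index = build_inverted_lists(sample)
```

Sorting `index.items()` before comparing, as in the doctest from the lecture, makes the check independent of the order in which the dictionary stores its keys.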
Yeah, that's true. We're missing, okay, that's a fair comment, but for simplicity we are doing it that way because right now it's also written on the exercise sheet. But you're completely right. With this very simple regex we are missing things like this year, 1977, we can't search it. That's true. It's absolutely true. And also I see one thing we should include, no, we did include the regex thing. Okay, maybe just to look at it again, and I'm already starting. First thing, we just ask our style checker. flake8 just checks whether I write it in proper Python style. Wow, we did not make a single mistake. That's pretty good. Now, yeah, if I would have two blank lines here, I would get some, yeah, it says too many blank lines. It's actually quite picky. Okay, should we? Let's see. Now I will just execute the doctests, which means it will just execute this code. It will just call the code, build an inverted index, use the example file, and check if this is correct. Wow, that's amazing. No checkstyle error and the test was right. Let's just check that it actually executed the test by making the test wrong. Okay, now I deliberately gave it a wrong test. So it said I expected this one three, which is wrong. It got this. So congratulations to you too. We had no compilation error, no checkstyle error, the style was perfect on the first go. So that's the goal for you. And don't write a lot of code at once, write small pieces and then it should just compile. So that's, okay, great. Now we are almost done. As you can see. Yeah, let's code this together. Now one more thing. I promised that we would talk about Zipf's law. Now that we have these lists, we can do one thing very easily. Let's just write a small main program here. So if I'm calling this as a main program, I think that's the way you do it in Python.
Ah, now I should do, oh I have to, argparse. Okay, I'm not sure whether I can do the argument parsing correctly. argparse, what do I write for argparse? I'm not using it, who knows it? I want to parse the command line arguments. I just want to call this like this now. Let me just clarify this without the doctests. And I want to call it on my example file. Now I want to parse this command line argument. How do I do it? There is a list of arguments. Yeah, there is one argument. Parse the command line arguments, okay. Oh I have to go here, okay. Oh, there it came. Oh, it just came magically. Okay, argument parser. I think I should, argument parser, hmm, add argument, okay, wow. File name, file name is good. And I think here I could also add a description, build an inverted index. Build an inverted index, add argument file name, file to build index from. Yeah, I think that's reasonable. And now I have to parse it somehow, right? How do I parse it? Nobody knows here how, ah. Ah yeah, there it is, okay, thank you. Okay, now I have it and now I want to call my name. And again. And now I just want to output how frequently each word occurs. That's what Zipf's law is about. So I just go over, it's getting loud again, the inverted lists. And now I want to print in a format, I want to print the word, a tabulator and the length of the inverted list. len of the inverted list of that word. Let's see if it works. No, it's not printf in Python. Is this correct? So this is just parsing the... Well, our goal is to compile it on the first pass. It's not correct? I can't use complex... I think I can, I'm pretty sure about that one. You see another problem? Let's see, again it just worked. Let's see, if I don't have the argument here, it will tell me, okay, right. So we are really good, we're really good, wow. So now I have the inverted lists, so it tells me A occurs twice in this example, movie three times, one, two, three, film once. Okay.
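The argument parsing and the frequency output from this part of the session might look roughly like this. It is a sketch under assumptions: the helper names `parse_args` and `format_frequencies` are made up for illustration, while `argparse.ArgumentParser`, `add_argument` and `parse_args` are the standard-library calls.

```python
import argparse


def parse_args(argv=None):
    """Parse the single positional argument: the file to build the index from."""
    parser = argparse.ArgumentParser(description="Build an inverted index.")
    parser.add_argument("file_name", help="file to build index from")
    # With argv=None, argparse reads the arguments from sys.argv.
    return parser.parse_args(argv)


def format_frequencies(inverted_lists):
    """Return one 'word<TAB>count' string per word in the index."""
    return [f"{word}\t{len(ids)}" for word, ids in inverted_lists.items()]


# In a script, these would typically be combined under
#     if __name__ == "__main__":
# by building the index from parse_args().file_name and printing each
# line returned by format_frequencies.
```

If the positional argument is missing, argparse prints a usage message and exits, which matches what was observed when the program was called without a file name.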
And now the nice thing in computer science is, if you have written a program that works for an example file, then it also works for the big file, right? It's not that you have to write a new program for it to run on a 50 megabyte file. I don't know if you have ever realized this, but that's the nice thing about computer science. Then you can just run it on one million files. Now it will just, yeah, and we will output this, and now let's do the following. Let's just sort this by the second column and now we will get the word frequencies. Most frequent word first. What will be the top three most frequent words in our collection? I'm looking for the apostrophe and I don't find it. Ah, there it is. This is tab separated, this is why I'm doing this. Minus k 2,2 n r (-k2,2nr), this is sorting numerically in reverse order. Just taking the output and sorting by the second column in reverse order. What will be the most frequent word? What? The? Let's see. Yeah, the, and, film. Film, who would have thought. Film is the third most frequent. So these are the word frequencies in this collection. And now let's do one final thing. We are almost done. Party. Let's just write this into a file. Word frequencies.tsv, it will take a while. So it's actually a big file already. Okay, now I have it in a file, word frequencies. And now let's just plot these numbers here, these frequencies. One, two, three, the most frequent one, the second most frequent one. Let's do this with gnuplot. So let's say plot, let's just plot this. I can just give it this file, frequencies.tsv. I only want the second column, that's I think how I do it. And this pause minus one is so that the window doesn't disappear again right away, and I think it should be this one. Okay, let's just plot the initial part, maybe the first hundred most frequent. Okay, there we have it.
This says the most frequent word occurs like, that's the number we have seen, 300,000 times, then the second most frequent one, the third most frequent one. So what you see here is just the frequencies ordered by how frequent. That's why it goes down. But not only does it go down, it goes down in this very nice way. And that was just movie descriptions. And that's Zipf's law. Zipf's law just says take whatever you like and you just look at this frequency distribution and it always looks like this. And the question is what's like this. That's what Zipf observed and it's kind of easy to observe nowadays, but in his time, which was, yeah, a hundred years ago, that was a bit harder to observe because you didn't have computer programs. And actually it's this function, it's like a hyperbola where you have this. And how can you check? Let's say f_n is C times n to the minus alpha. So that's Zipf's law. Zipf's law says it follows somehow this law where we have some constant here, proportional means there is some constant, and then it's n. So n is 1 for the most frequent one, f_2 is the second most frequent one, f_3 is the third most frequent one, and we have some parameter alpha here. Let's just take the logarithm on both sides. Log f_n is equal to log C minus alpha times log n. If I don't plot f_n over n but log f_n over log n, what should I see? It's a log-log plot. So what we saw on the plot was f_n on the y-axis and n on the x-axis. And now if we put log n on the x-axis and log f_n on the y-axis, you said it? A line, you said a line. Yes, we should see a line, right? Which line, which slope? Down, yeah down, because it's minus alpha. And we can even read off the values from the line, right? The slope will be the alpha, the negative slope, and the C here, this hidden constant, it's also written here. Let's just do that. Frank, we are almost finished. In gnuplot, you don't need it for the exercise sheet, but it's just good to set.
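The step from the law to the straight line is: if f_n = C * n^(-alpha), then log f_n = log C - alpha * log n, so on a log-log plot the slope is -alpha and the intercept is log C. As a hedged illustration of "reading off" these values, the sketch below fits that line by ordinary least squares. The function name `fit_zipf` and the synthetic data are assumptions for illustration; the lecture itself only inspects the plot by eye.

```python
import math


def fit_zipf(frequencies):
    """Least-squares fit of log f_n = log C - alpha * log n.

    `frequencies` must be positive counts sorted in descending order,
    so that frequencies[0] is f_1. Needs at least two values.
    Returns (alpha, C).
    """
    xs = [math.log(n) for n in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Slope and intercept of the best-fitting line through (log n, log f_n).
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


# Synthetic frequencies that follow the law exactly with alpha = 1, C = 1000.
synthetic = [1000 / n for n in range(1, 51)]
alpha, c = fit_zipf(synthetic)
```

On real word counts the points are only approximately on a line, so the fitted alpha depends on which range of ranks is included.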
I can just say plot this in a log scale, which means on the x and y axis take the logarithm, on both axes. And now I only took the first hundred, let's take all of them, not just the first. Yeah. So that's Zipf's law. So it's not a perfect line, but it's pretty line-ish. At the end you have some funny artifacts, you can think about why. It kind of proves that that's the law. And the nice thing about this law is, you can put in anything there, any text, but also other things, where you just count frequencies. And it always looks like this. It's kind of like the Gaussian, the normal distribution. You see it everywhere. You take anything in nature, you always get this distribution. Okay, so this part I will skip. This is about how you submit stuff. Sebastian, by the way, let me introduce Sebastian. Can you very briefly stand up please, Sebastian? So that's Sebastian. Thank you. He's the assistant for this course, so he will do a lot of very, thank you, very valuable work behind the scenes, and he promised, which I think is great, to record a short video where he will just show and explain how these course systems work, how you submit to our versioning system, how you register, just the whole thing which you need to do for exercise sheet one. So there is a forum, let me at least show it to you once. It looks like this here. So that's our system, so that you have seen it once. Okay, I'm registering here. I'm already registered. Okay, I don't know what's going on here. So there's a forum where you can ask questions. Here is Sebastian's name again. There is this repository, it will be in the video how you use it. The slides are just for reference. That's what I just said. Sebastian will record the video. Whenever you submit something, the code will be automatically checked. It's called continuous integration. You also have it on GitHub. We have our own system.
You submit something, it will be checked for compile errors, whether the tests run through and so on. And that's it. Here are some references. As I said, you don't really need them, just if you are interested. That's it. The exercise sheet is on the wiki. Are there any more questions right now or in the chat which I didn't see? So, yes please. Ah yeah, thank you. Thank you for asking this. Actually I should have, at the top of the exercise sheet, do you have one more minute, please? There will be two weeks where we have no lecture, so the overtime is okay. That's why we're doing this. Here in red are these rules. You should absolutely read them. It's on the exercise sheet. You must read them. And the first one is about the programming language. So we exclusively use Python. We used to allow other programming languages, but really it's much more work and nobody really does it. In the past, maybe one person. So in this semester we just say everybody use Python. If you absolutely, for a very good reason, want to use another programming language, talk to us, it might be okay. But most people want to use that anyway. And also it's version 3.10 or newer, for a reason which I skip now, but we will communicate it afterwards. Just read this through carefully. So it's written on the sheet. Any other question right now? Okay, so that's it for today. Have fun with the first sheet and see you next week. Bye bye.

Welcome everybody to lecture two, databases and information systems in the winter semester 23-24. This course can also be taken as information retrieval and indeed is by some of you. So we will be talking today about your experiences with the first exercise sheet, which was a very simple implementation of the inverted index, the basic data structure behind all search engines.
I will say something about your tutor feedback time slots for those of you who do the exercises. And today we will talk about ranking and you will see what that is in a second. And the exercise sheet will be to implement it, and we will also talk about this, and there will also be a small competition. So first some excerpts from your feedback. So most of you liked it. As usual, programming was a little rusty for some, but you're getting back into this. Here are some excerpts: very nice sheet. All the contents needed were well explained, good and welcoming introduction. Sheet was kept simple. Yes, it was deliberately kept simple. Think it was a pretty cool sheet, not too hard, not too easy. Lecture was easy to understand and follow. I'm looking forward to the rest of it. I've not coded in Python in quite some time. Quite a few people wrote that, but in a positive sense, like I'm getting used to it again. After a long break, need to get used to coding again. Some people also did the optional parts, which is great, like the highlighting. Of course, that takes time. I invested more time than I like to admit. We were surprised by this: there are over 300 registrations for this course and around 80 people, I think, submitted a sheet. That's interesting, less than we expected for the first sheet, but interesting, yeah. And also interesting, usually there are about 20% of people who say for the first sheet that it's too hard or everything. We didn't get any such comment this time. I guess those people just didn't submit a sheet, in case they are here. So the sheet was about a very simple search engine for our movie collection, and let me just show you the master solution here, which is also linked on the wiki. So this now just reads the, and maybe before I show the master solution it's a good idea to show the data file again. This was the input file, it was just a collection of movies.
Let me even put, and it was ordered by IMDb rank, so about one here, we have the line numbers, 150,000 movies, title, description and then some additional information which you don't need for this sheet, maybe for later sheets. So that was the input and this is what you were supposed to write, a program where you can just give this input file or any input file as an argument and then you can type your keyword query. So here were some examples that worked well and let's just very briefly go through these examples, positive and negative, and why they worked or didn't work well. Godfather, what would you expect? The three Godfather movies, and you get the three Godfather movies. Godfather one, two and three, Al Pacino, Francis Ford Coppola movie. Why does it work? Godfather is not super specific. There will be other movies containing it, but the movies which you expect are the most popular ones. So it works. Tarantino is another example. You expect movies by Quentin Tarantino, you only get movies by Quentin Tarantino. Why? Because this word is so specific, if it matches it's the director. Zombie, so I deliberately chose examples which are different in kind. What you expect is probably zombie movies, and you get zombie movies, because if somebody mentions the word zombie it's probably a zombie movie. Lord rings, it's also interesting, let's type it, lord rings, and you get Lord of the Rings. That's a bit different from the other one. Lord is very unspecific. Rings is also very unspecific. But it's a very popular movie, so you get the expected movies first. So this one, romantic comedy, let's try that. This is an example of a query that didn't work very well. It does sometimes, so you have movies where it says romantic comedy, but first is American Beauty, which is kind of a comedy. It has comedy here, romantic here, so there are some romantic elements, some comedy elements. It's not really a romantic comedy. So here it mattered that the words are not close together.
Titanic, you would expect the movie Titanic first of course, but you don't get it first because there are other movies which are more popular which mention Titanic. So many interesting things you have to consider. And that was the whole point of the exercise. You think, oh that should work, that's a good idea. And when you try it, you see, oh it doesn't work. Interesting. So that's very typical for this kind of research. You have some ideas, some expectation, then you try it and then you see all kinds of effects why it doesn't work. 2006 film, you would expect films from 2006, but you don't get them, or if you even just type 2006, you get zero hits because we didn't index any numbers. Very simple, but of course important. And that's the last one, Spider-Man. You would expect Spider-Man movies, but you don't get the Spider-Man movies because the Spider-Man movies, I think, contain it with a hyphen. And now you again get movies which just refer to Spider-Man and are not about Spider-Man. So many different effects why it works or doesn't work. So that's an interesting slide I guess, and of course there are many more different phenomena. Okay, before we move on to the content, one more organizational thing: you will get a tutor. You get feedback if you have submitted sheet number one and asked for feedback. And it's very nice, you wrote very diverse things. Some people said I'm fine, just let me know how many points this is worth, others said please give me detailed feedback, others wrote give me feedback on this please, not on this. That's very good for us because then we can focus our effort on where it's needed. So if you asked for feedback, you will find it in your repository in a file feedbacktutor.md and you get it by doing svn update, then you get the latest version of the repository from us. How do you know when you should do svn update?
Well, you could do it every minute. Actually, I'm not sure. I think some tutors, or maybe we will ask our tutors to just write you an email or something. Did we do that in the past, Sebastian? Maybe that's a good idea, right? That you get a message, your feedback is there, and then you know, I do svn update and I get it. We introduced this I think two years ago. So we have the forum. We don't have tutorials, for the reasons I explained in detail in the last lecture. I will not repeat it. But for some people the forum is not enough. They need kind of individual help for all kinds of reasons. And we realized that, that's why we have a tool which I briefly show you here where you can just book time slots with your tutor. So here, one tutor is missing because he just became a father, but he will be with us again next week. So let's say Patrick Brosi is your tutor, and you will see this in Daphne with your tutor. So you have some slots here, and it's up to the tutor to say which time slots they have. They are short time slots. If you need more time, then you can ask for that in this short meeting. You just write your email. So a very easy, very low barrier way to get a short meeting and ask all kinds of questions. And the link is on the wiki. So who is your tutor? That assignment will be made hopefully today. Sometimes there are technical problems, then it will happen by tomorrow. And since it's totally up to you who submits and who doesn't submit, you can make up your mind every week again from scratch. There may be imbalances, so some tutors might end up with 20 students, another one has only five students, so we will shuffle around students, which means that your tutor may change over the course of the semester. It might, might not, just so that you know. Any questions about that before I move on to the contents? Or any questions about anything? No questions, I continue. So today is about ranking, so let's first say why we talk about ranking.
So, a very simple thing to understand, but let me just say it although it's obvious. When I type 2006 here, what does it mean? What do I search for? I have a search desire in the back of my mind and I express it in keywords. So maybe now I want to find movies from the year 2006 but I didn't write that. So you have to distinguish between the search desire and the query, and that's what makes search hard always, for all information systems, also in databases and other systems. So there is what you actually want and then what you type. And so naturally, and we have seen examples for that, you type your query and then you get stuff where you say, yeah, this is what I was looking for. And you get stuff where you say, this is not what I was looking for. And the important word here, which is used in this context, is relevant. So there is stuff which is what you expected and stuff which you didn't expect. And you already saw this, and exercise two is all about this relevance. Naturally, you search and you want the most relevant ones first. And sometimes it works, sometimes it doesn't. And this is especially important for web search where, I mean, if you just get all documents which contain your query words, I mean, let's just, yeah, I don't know, we type Star Wars here or whatever, these words are probably mentioned in a lot of documents. You have over one billion documents containing these words. You want the ones you're actually looking for first. And this sounds so obvious, but all the search engines before Google did this wrong, basically. You would type a query and you would get some super irrelevant document first which does contain the word but it's not what you are looking for or what anybody would be looking for. So not such an easy problem. And why is it not so easy? Because you have to somehow measure what relevant means. And relevant is something which is in the mind of the person who is searching. So kind of a hard problem.
And that problem we are trying to solve today. So how do we solve it? Here's the basic idea. Let's go back to our inverted lists. Inverted list, university, in blue. I have the IDs of the documents containing university. And now I also have a score, and right now just take this score for granted. In a second I will explain how to compute these scores. And this score should somehow say, this document 127, yeah, it's really about university. The word university is here for a reason. Here for 53, 0.2, it's a low score. Yeah, the document mentions the word university but it's not really about university. So we have seen this in our examples, maybe let's do Titanic again to make this clear. So the first movie is Lord of the Rings and it mentions Titanic. So we would give Titanic a low score in this document because yes, it contains Titanic, but it's not really about the movie Titanic or about the Titanic. Whereas in the movie description for Titanic, Titanic should get a high score. And we do this per word. So that's the basic idea behind basically all search engines. So now we still have a problem, how do we compute these scores? But before we talk about that, so now I have a keyword query like before, and now I just aggregate the scores. Note a new thing here. Now we are merging. Last time we were intersecting. Document 17 only contains university and not Freiburg. We still have it in the result list and we just take only the score from that. Let's take another one, 53. 53 occurs in both, contains both university and Freiburg, and then we just sum up the scores. So aggregate just means you have several scores.
In this case, let me just write this, in this case sum. It doesn't have to be sum, it could also be a different form, you could also take the maximum or something like that, just saying sum is one way to aggregate. One hundred twenty-seven contains both words with a high score, so it gets 1.5, 0.8 plus 0.7. And then we sort this. And now one hundred twenty-seven is first, it's the one with the highest score. Why? Because it contains both of the words with a high score. One other important thing to note and understand: 17 contains only university, not Freiburg, but with a decently high score, 0.5. It comes before 53, which contains both of the words, but with a low score. 0.2 plus 0.1 is smaller than 0.5. So these things can happen and do happen. So that's important to understand. For some reason these things are called postings. These entries in the list are now more complex, document ID and score. They could contain even more information, like where in the documents these words occur and so on. Any questions about this? So that's the basic stuff we are dealing with today. We have these inverted lists, document IDs and now also scores. Now let's see, okay first of all, that's just a side remark, not really needed for the exercise sheet, but I did want to mention it. You're doing the merging here and now you have to sort it. So that's sorting, how do you sort? Well, sorting takes time n log n. It's not the topic of this lecture. Where n is the number of results in your list. n can be very large for a web search engine. But in a typical engine, let's say you have these one billion hits for Star Wars, you only want the top 10 or the top 3. So actually what you need is a partial sort and I just wanted to mention this. So it's enough. Assume that this list here has 1 billion entries and you only want the top 3. And maybe you want to implement it for the exercise sheet, you don't have to.
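The merge-and-aggregate step for the university and Freiburg example can be sketched as follows. The scores are the ones from the slide as read out above; the function name `merge_and_rank` is an assumption, and for brevity this sketch accumulates scores in a dictionary instead of doing a linear merge of the two sorted lists.

```python
def merge_and_rank(posting_lists):
    """Merge posting lists of (doc_id, score) pairs, summing scores per document.

    Returns (doc_id, total_score) pairs sorted by descending total score.
    """
    totals = {}
    for postings in posting_lists:
        for doc_id, score in postings:
            # A document found in only one list keeps just that one score.
            totals[doc_id] = totals.get(doc_id, 0.0) + score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Postings from the example: (document id, score).
university = [(17, 0.5), (53, 0.2), (127, 0.8)]
freiburg = [(53, 0.1), (127, 0.7)]
ranking = merge_and_rank([university, freiburg])
```

Document 127 comes first with about 1.5, then 17 with 0.5, then 53 with about 0.3, so a document matching only one word can outrank one that matches both.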
I just wanted to mention that partial sort, so you have a long list, you want to sort it but you're only interested in the top k, the k largest elements. You don't need to sort the whole sequence. This can be done much, much faster. N log N, that's a lot when N is large. This is N, so we have to look at all the elements once. You have the n, but then you have k log k, which is very small. So it's essentially linear. And for those who are interested, I'm just mentioning this on the side. You can do it with k rounds of heap sort. Heap uses a binary heap. You learn that in algorithms and data structures. You can also do it with quick sort where you don't recurse when it's outside the range of results you're interested in. You can modify many sorting algorithms so that it just gives you the top K. You don't have to do this but maybe you have spare energy, you're interested in this, feel free to do it. And most programming languages, except Python, which is not about efficiency, have functions for this in their library. So, how do we compute back to the scores, meaningful scores? So where do these numbers come from? They are obviously very important, kind of reduced our initial problem to computing these numbers. And I already explained it, this number should somehow reflect, so how do I manage this, that Titanic gets a low score here? How do I do this? That's what we want. And the problem is, it's kind of subjective or kind of you have to, it looks like you have to understand, right? I have to understand, oh this document is really about Lord of the Rings and not about Titanic. So Titanic should get a low score. So we need artificial general intelligence, it looks like. And it has been looking like that for 70 years. And today we will not yet do artificial general intelligence. We will do it in the next lecture or in a later lecture. We will do it in a very simple way which has been the way it has been done for decades now. There's a shift now. 
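As a hedged illustration of the top-k idea: Python's standard library does provide a heap-based helper for this, `heapq.nlargest`, which keeps only k candidates at a time instead of sorting the whole list. The wrapper name `top_k` and the sample data are assumptions for illustration.

```python
import heapq


def top_k(scored_docs, k):
    """Return the k (doc_id, score) pairs with the highest score, best first."""
    # nlargest only maintains a heap of size k, so the full list of n
    # results is scanned once and never completely sorted.
    return heapq.nlargest(k, scored_docs, key=lambda item: item[1])


# Hypothetical result list of (document id, score) pairs.
results = [(17, 0.5), (53, 0.3), (127, 1.5), (8, 0.9)]
best_two = top_k(results, 2)
```

For small k this takes roughly n log k steps, which is close to linear in n, in the same spirit as the n plus k log k bound mentioned above.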
And here's a very basic idea, which is still used a lot, term frequency. Let's just look at how frequently the word occurs. It's a very simple proxy heuristic. You say, oh, this is a long document. It only contains Titanic once. Probably Titanic is not so important for this document. I just count how often the word occurs in the document. What was another movie? I don't know, let's look for James Bond. Yeah, look at this, Skyfall, James Bond, James Bond, James Bond, Bond, James Bond. It's mentioned many times, so if a document is really about a certain word, the word occurs there many times. Kind of an obvious idea. Here's a problem with that simple approach. And we are going to solve that problem on the next slide. Let's assume our query is University of Freiburg. Here are our posting lists, document IDs and scores. In blue I have the document IDs and here I just highlighted two documents, document 57 and 123. Document 57 contains university five times, the word of, the very important word of, 14 times and the word Freiburg three times. And here we have another document with these numbers. Now let's do the math as I explained it. What do we get? 57 gets a score of 22 and 123 gets a score of 26, which means this will be ranked higher, right? Because the sum of these numbers is higher. That's not good, right? So why did it get higher? It contains the word of very frequently, which really says nothing, but the words which somehow carry meaning, university and Freiburg, it contains them less often. So looking at this, we would say 57 should be ranked higher because university and Freiburg are more important than of. So that's the problem. And some of you have already observed that if you type a word like film or and or something in your query, they kind of distort the results. And you see, you go deeper and deeper and you find all these other problems which look hard to solve. Now how do we tell that university is an important word and of is not important?
That also sounds subjective. Actually there's a nice and also simple way to do this. It's called document frequency. And here it is. We just count in how many documents a word occurs. Let's just put some numbers here, and maybe you recognize them as powers of two. You will see in a second why I did this, it's just for the sake of the example. University occurs in 16,384 documents. Which power of two is this? Let's start with an easier one: 1024. Which power of two is this? Ten, I agree. Let's go back to 16,384. That's two to the 14. What about this one, the one for of? It's half of one million, that should give you a hint: it's two to the 19. This will be on the exam, things like this. You should really know your powers of two, and you don't need a calculator for that. 1000 is around two to the 10, one million is around two to the 20. Okay, I hear some mumbling. Was that my echo? That's an interesting effect. Ah, maybe it's someone from Zoom. Okay, we have these powers of two. They don't have to be powers of two, that's just for the example. Now, the inverse document frequency is just computed as the logarithm base two of N over DF, where N is the number of documents in the collection. Let's assume we have a collection with one million documents, which is about two to the 20, so as N we take the next power of two. And then we have these numbers. Let's just verify this.
So the IDF of university is log 2 of, and that's just this formula, two to the 20 divided by two to the 14. You should absolutely know how this works, and there will be a question about this in the exam. That's 20 minus 14, so it's two to the 6, just the law of powers. And the log 2 of two to the 6 is 6. I won't do it for the other two, these are the numbers you get. Now let's look at these numbers and understand them. Freiburg is very rare, it occurs in just around a thousand documents out of a million, which means its IDF is rather large, because the DF is in the denominator. The intuition is: if this word is there, it means something. It's relatively rare overall, so it's a significant word. Of occurs in half of the documents, half of the one million. So we have log 2 of 2, the 2 meaning that half of the one million documents contain it, and the IDF is 1. That's a very low score, which says of occurs everywhere, it's not very meaningful. University is somewhere in the middle. That's the idea behind IDF. And why do we take the log 2? We don't have to. We would have the same effect, higher if more specific, without the log 2. But taking the log is a very common thing to do. Look at the differences in these numbers without the log 2. They're enormous, right? N over DF is 2 for of and 1024 for Freiburg, that's a factor of about 500, which is too big. We want a difference but we don't want it to be that large. By taking the log we get the numbers 6, 1, 10. There's a range here, but the differences are not super large, and that's what we want. And now let's just apply it and see that it works.
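The IDF values from this example can be checked with a few lines of Python. This is only a small sketch: the dictionary `df` and the function name `idf` are made up for illustration, and the numbers are the power-of-two document frequencies used in the lecture example.

```python
import math

# Document frequencies from the lecture example (powers of two for easy arithmetic).
N = 2 ** 20  # collection size, roughly one million documents
df = {"university": 2 ** 14, "of": 2 ** 19, "freiburg": 2 ** 10}


def idf(term: str) -> float:
    """Inverse document frequency: log2(N / df)."""
    return math.log2(N / df[term])


idf_values = {term: idf(term) for term in df}
print(idf_values)  # university: 6.0, of: 1.0, freiburg: 10.0
```

Without the logarithm the ratios N / df would be 64, 2 and 1024, which is the much wider range the lecture wants to avoid.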
That's our earlier example, I won't explain it again, it should look familiar. Here the situation was that 123 was ranked higher because it contained the word of so often. Now let's look at the lists with TF-IDF scores and quickly do the math. What was the IDF of university from the previous slide? Six, that's correct. So that was medium. The IDF of of was one, very low. And the IDF of Freiburg was exactly 10. So let's write it down. The 30 here is just five times six. The 12 here is two times six, so the TF times the IDF. The 14 here is 14 times 1. The 30 here is 3 times 10: because Freiburg is so rare overall, the 3 gets multiplied by 10. And here we have 23 times 1, and here 1 times 10. Now we do the math again and just sum it up: 57 gets 30 plus 14 plus 30, which is 74, and 123 gets 12 plus 23 plus 10, which is 45. We get a much higher score for 57 because, yes, of occurs very frequently in 123, but the other two words, which have a higher IDF, get more weight now. So 57 will now be ranked before 123. So that's TF-IDF, a very important formula, used as a heuristic in a lot of settings, not only in information retrieval. The principle is very simple, and it's important to understand and remember it. Whenever you have objects and you somehow want to give them scores, think about TF-IDF: how often does this occur here, and how often does it occur overall? Here are some problems with TF-IDF in practice, and we will solve them now. The IDF part is fine, although it can also be improved; the TF part has some problems. Let me start with an example, let's go back to James Bond. This description mentions James Bond several times. Now this movie description is pretty long. If it were twice as long and went on like this, it would probably contain the words twice as often, right?
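Going back to the University of Freiburg example, the two rankings can be compared side by side in Python. This is a sketch for checking the arithmetic only; the dictionaries and function names are illustrative, and the term frequencies and IDF values are the ones from the example.

```python
# Term frequencies of the two highlighted documents for the query words.
tf = {
    57: {"university": 5, "of": 14, "freiburg": 3},
    123: {"university": 2, "of": 23, "freiburg": 1},
}
# IDF values from the example with N = 2^20.
idf = {"university": 6, "of": 1, "freiburg": 10}


def tf_score(doc_id: int) -> int:
    """Plain term-frequency score: sum of the raw counts."""
    return sum(tf[doc_id].values())


def tfidf_score(doc_id: int) -> int:
    """TF-IDF score: each count is weighted by the IDF of its word."""
    return sum(count * idf[word] for word, count in tf[doc_id].items())


for doc_id in tf:
    print(doc_id, tf_score(doc_id), tfidf_score(doc_id))
```

With plain term frequencies document 123 wins (26 against 22); with TF-IDF document 57 wins (74 against 45).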
But that would only be because it's twice as long. Some documents are shorter, some are longer, and longer documents will contain more words. It's kind of unfair to say that if this document were twice the length, it would be twice as much about James Bond, just because it contains the words twice as often. Just because it's longer, it's not more about the word. That's written on these slides. You can look at this yourself at home, I just gave you the intuition. I will now tell you how to address this. So here's a formula which takes into account that longer documents tend to have more words. This is a very famous formula which is still used today, BM25. It's called BM25 because people were for decades just trying out formulas, BM1, 2, 3, and so on. It's like the programming language C, which is the successor of B. That's just what computer scientists do, they give things funny names. So the formula is again TF times IDF. This should look familiar, that's the IDF part from before, but the TF part is modified, and how it's modified I will explain now. We take TF as we had it before, the term frequency, and we multiply it by k plus 1 divided by k times alpha plus TF, and we call the result TF star. Alpha is this expression on the slide, where DL is the length of the document and AVDL is the average length of the documents in my collection. So DL over AVDL says, for example, this document is twice as large as the average, or half the average size. Okay, this is a strange formula, and in the next 10 minutes or so we will try to understand it. Somebody says here's a formula that works well, so it looks magical or strange or whatever. One way to understand the formula is to look at its parameters, and there are standard settings, like when you use BM25, use these settings.
And it will be part of the exercise sheet to play around with these settings and see what works how well for our movie data set. But let's first understand two extreme settings and see that this formula makes some sense. Let me take the pen here. Let's understand this alpha. If b is equal to zero, what is alpha? Yeah, it's just one. Alpha is a factor which takes into account how long the document is compared to the average, but there's a factor b here which says how much we should take that into account. That's a very common thing to do: you weight with b and one minus b. If b is very small or zero, then I say, okay, ignore this correction by document length, and I just get one. If b is one, then the correction gets full weight. So in the case b equals zero, alpha is just one, and our formula for TF star, let's write this down because we will need it often in the following, is just TF times k plus one divided by k plus TF. And now let's look at what we have if k is also zero. Then k plus one is one and k plus TF is just TF, so TF star is just TF divided by TF, which is one. But we should be careful here, this is not true the way I wrote it. Is the formula even defined? It's not really defined if TF is zero, right? Let me write that here. The k parameter is greater than or equal to zero, the b parameter is between zero and one, and if TF is zero, TF star is just defined to be zero as well. It's kind of a corner case. But if TF is not zero, then this is true: TF divided by TF, which is just one. That's pretty simple. This kind of mathematics you will certainly also get in the exam, just computing with formulas.
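The modified term frequency can be written down as a small Python function. This is a sketch under one assumption: the transcript does not spell out alpha, so the code uses the usual BM25 length normalization, alpha = 1 - b + b * DL / AVDL, which matches the description above (alpha is 1 for b = 0, and the length ratio gets full weight for b = 1). The function name `bm25_tf` is made up for illustration.

```python
def bm25_tf(tf: int, k: float, b: float, dl: float, avdl: float) -> float:
    """Modified term frequency tf* = tf * (k + 1) / (k * alpha + tf).

    Assumption: alpha = 1 - b + b * dl / avdl (the usual BM25 form).
    Parameters: k >= 0 and 0 <= b <= 1.
    """
    if tf == 0:
        # Corner case from the lecture: tf* is defined to be 0 if tf is 0.
        return 0.0
    alpha = 1 - b + b * dl / avdl
    return tf * (k + 1) / (k * alpha + tf)


# With b = 0 the document length is ignored, and with k = 0 as well,
# tf* is 1 for every word that occurs at all (the binary model).
print(bm25_tf(tf=5, k=0, b=0, dl=500, avdl=100))  # 1.0
print(bm25_tf(tf=0, k=0, b=0, dl=500, avdl=100))  # 0.0
```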
We have a little bit more on the next slide. So I just plug in k equals zero and I get TF over TF, which means that no matter how often the word occurs in the document, TF star is just one. That is just the binary model from your first exercise sheet, where you didn't care how often the word occurred in the document. Now let's look at another extreme case, when you pick k very large. Let's write it down again from the previous slide. TF star, and b will be equal to zero all the time in the following, is TF times k plus one divided by k plus TF. Now k is in the numerator and in the denominator, so what happens if k goes to infinity? Let's just divide by k in the numerator and the denominator: we get TF times 1 plus 1 over k, divided by 1 plus TF over k. Now let's take the limit. If k goes to infinity, then 1 over k goes to zero and TF over k goes to zero, so this just becomes TF. So we have two extreme cases here, and now we already understand a bit better what the k and the b do. The b, if it's zero, and we will take it to be zero here and on the following slide, basically ignores the correction for document length. And the k says how much I should take into account that a word occurs more often. For k equal to zero, no matter how often it occurs, TF star will always be one, and for very large k it will be the full term frequency, so if the word occurs twice as often, TF star will be twice the value. And here are some more properties which are important to understand. Let's just prove them. Okay, I can see where I can write my formulas. In the following, let's also work with b equal to zero here.
If b equals zero, let me just write that again, then TF star is TF times k plus one divided by k plus TF. That was the formula. The first property is that TF star is zero if and only if TF is zero. The factor k plus one divided by k plus TF is always non-zero, right? So since this part is always non-zero, TF star is zero exactly if TF is zero. So we have that property. The second one: TF star increases as TF increases. These are typical exam questions, and you just have to do it yourself at home to really understand it and to be able to reproduce it. We have the TF twice here, so let's rewrite the formula so that we have it only once. I just divide by TF in the numerator and the denominator. Then at the top I have k plus one, and at the bottom I have k divided by TF plus one. Now I have the TF only once. If TF increases, then k divided by TF decreases, but it's in the denominator, so the whole expression increases. So by having the TF only once, I see that increasing TF increases TF star, because the TF is inverted twice here. So that also proves this. And now let's also do the limit. What happens with TF star if the term frequency becomes higher and higher? For that this formula will also be useful: k plus one divided by k divided by TF plus one. If TF goes to infinity here, what does the denominator become? Yeah, k over TF goes to zero, so it becomes one. So what will be the result? K plus one, yes.
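Written out as formulas, the three properties look as follows. This is just a compact restatement of the board work, assuming b = 0 (so alpha = 1) and k > 0; for k = 0 the value is constant 1 for every tf > 0.

```latex
% Assumes b = 0 (so \alpha = 1) and k > 0; tf^* is defined to be 0 for tf = 0.
\begin{align*}
tf^* &= tf \cdot \frac{k + 1}{k + tf} = \frac{k + 1}{\frac{k}{tf} + 1}
  && \text{(divide numerator and denominator by } tf > 0\text{)} \\
tf^* = 0 &\iff tf = 0
  && \text{(the factor } \tfrac{k + 1}{k + tf} \text{ is never zero)} \\
tf \uparrow \;&\Longrightarrow\; \frac{k}{tf} \downarrow \;\Longrightarrow\; tf^* \uparrow
  && \text{(} tf \text{ appears only once, in the denominator of the denominator)} \\
\lim_{tf \to \infty} tf^* &= \frac{k + 1}{0 + 1} = k + 1
\end{align*}
```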
And this gives you a good intuition of what the k parameter does. The k parameter tells you: no matter how often my word occurs in the document, TF star will be at most k plus one, and if the word occurs very, very often, it will approach that. So that's what the k parameter does. And that kind of explains the point of the whole formula. If you see this formula, you wonder, okay, why that formula? Why not another formula? There's so much literature about this, but really the best explanation for this formula is that you have these three very natural properties. You have TF and you want to modify it to some variant of TF. Certainly, if the word does not occur at all, this should also be zero. That's very natural, and we have that property, we saw it here. If the word occurs more often, then this modified measure should also increase; we have that property. And as TF goes to infinity, it shouldn't grow to infinity, it should approach some fixed limit. It also has this property. And if you want these three properties, that's kind of the simplest formula which does this. If I tell you to come up with a formula which has these three properties, you will come up with that formula, or with something more complicated which doesn't help. So that's BM25. You will implement it for the exercise sheet, and I strongly recommend that you do. And there are popular exam questions about this formula. Okay, this part I already explained, this was just the role of the alpha. With the alpha you can say how much you want to weight the length of the document relative to the average. Now, how do you implement it? Think about the first exercise sheet, where you didn't do any term frequencies yet. So first you have to compute the inverted lists with the TF scores. Let me just go back to a slide where I see TF scores, maybe here. How do I compute this five?
Well, we already did that, kind of, because we just appended things to the inverted list, right? Let's say the word is university. I'm in document 57 and I see university for the first time, so I append 57 to that inverted list. Now I see it for the second time, and in the code I wrote in the first lecture, I just appended it again. I appended 57 five times. Instead you can just have a counter, which says, oh, 57 again, counter two, 57 again, three. So you have a counter, and if you see the word again in the same document, you increase the counter. So with the same implementation as in the very first lecture you get the term frequencies. We already did that implicitly: we didn't have a counter, but we had the document ID 57 five times. You didn't need it for the first exercise sheet, but now you will do it. While you are doing that, you will process one document after the other, so it's also very easy to just remember the size of each document. And when I say size of the document, it actually doesn't matter what you count. You could count the number of characters or the number of words, make your choice. I think for the sheet we recommend the number of words, but it doesn't really matter. Just remember that this movie description has, I don't know, 237 words. While you are doing the pass, that's easy to remember. And when you have done it for all the documents, then you also know the average document length. So you just compute that. Now you have that.
So now you have your inverted lists with the term frequencies, and you know the document lengths and the average document length. And now you can make a second pass over your inverted lists, where you compute this number, the BM25 score. Now you have all the information for that number. So first you compute the inverted lists with TF scores, and then you make a pass over them. This is important to understand, and I recommend that you do the exercise, then you see that it actually works. Let's just check that in the second pass I have all the information to compute this formula. TF, yes, that's what I have computed in my first pass. K is just a parameter. B is just a parameter. The document lengths, I just said it, you remembered those. The average document length, you can compute it at the end of the first pass. N is the number of documents. And what's DF? Well, it's just the length of the inverted list, right? Let me go back to the slide with this number here, Freiburg 1024. This means Freiburg occurs in 1024 documents. This is exactly the length of the inverted list for Freiburg. Not of this example list here, but of the full list. The list is exactly as long as the number of documents which contain the word. And this is also what you have after the first pass: you just take the length of your inverted list, and then you know that Freiburg occurs in so many documents. So I really recommend that you implement it. Even if you don't submit it, start implementing it, because you could easily do way too much work here by doing many more passes over your data or wondering how you should compute this. It's actually very simple. You do one pass where you get all this information, then you do another pass. Very efficient. So that's nice. We have a break in a few minutes.
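The two passes just described could look roughly like this in Python. This is a sketch, not the master solution of the exercise sheet: the function name `build_bm25_index`, the in-memory list of documents, the simple tokenization, and the form of alpha (1 - b + b * DL / AVDL, the usual BM25 normalization) are all assumptions made for illustration.

```python
import math
import re
from collections import defaultdict


def build_bm25_index(docs: list[str], k: float, b: float) -> dict:
    """Build inverted lists of (doc_id, BM25 score) pairs in two passes."""
    # Pass 1: term frequencies per word and document, plus document lengths.
    tf_lists = defaultdict(dict)  # word -> {doc_id: tf}
    doc_lengths = {}
    for doc_id, text in enumerate(docs, start=1):
        words = re.findall(r"[a-z0-9]+", text.lower())  # assumed tokenization
        doc_lengths[doc_id] = len(words)  # length counted in words
        for word in words:
            # The counter: increase it whenever the word occurs again.
            tf_lists[word][doc_id] = tf_lists[word].get(doc_id, 0) + 1
    n = len(docs)
    avdl = sum(doc_lengths.values()) / n

    # Pass 2: everything needed for the BM25 score is known now.
    index = {}
    for word, postings in tf_lists.items():
        df = len(postings)  # document frequency = length of the inverted list
        idf = math.log2(n / df)
        index[word] = []
        for doc_id, tf in postings.items():
            alpha = 1 - b + b * doc_lengths[doc_id] / avdl  # assumed form
            tf_star = tf * (k + 1) / (k * alpha + tf)
            index[word].append((doc_id, tf_star * idf))
    return index


# The values of k and b here are arbitrary, not the recommended settings.
example = build_bm25_index(["james bond bond", "titanic", "bond titanic ship"], k=1.5, b=0.5)
print(example["bond"])
```

Each inverted list stays sorted by document ID because documents are processed in order, which is what the list intersection or merging later relies on.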
How could this be refined further? Let me briefly show you the exercise sheet at this point. Here's the exercise sheet. So that's what I just explained. You take your own code from the first exercise sheet, or you can always take the master solution and build on that. Now just modify this build from file method in the way I just explained: first compute the TF scores and then the BM25 scores. Then, I hope you realize that we are doing something different now. In the first lecture we did intersect, so we only got documents which contain all the words. Now we do merging, so we also consider documents that contain just some of the words. The exact same algorithm, this zipper algorithm which goes through the lists, can also do merging with a very small modification. And then sorting. I also explained that you can just use sort, or, if you have free capacity, implement partial sort. So that's the first part of the exercise sheet, and then you should play around with k and b and see how well they work. We will talk about the second exercise later, when we talk about the second part of the lecture. There are other ways to modify the score. You could also take the popularity of a movie into account, which you kind of already did for the first sheet, just by the fact that the more popular movies are at the top. What's not on the slide, I was expecting it there, but you probably don't want to do that for the sheet anyway, is how close the words occur to each other. You could also use information about the positions in a document. Here I just mentioned taking the popularity into account. If you want to play around with it, go ahead, but you don't have to for the exercise sheet. Here are other methods which are beyond the scope of this lecture and this course. What the big search engines of course do is use click-through data.
People search for something, they click maybe on link number three, and now Google knows: okay, not the first hit, but the third hit was obviously a good match, and they remember that. And then there is learning to rank, which I just very briefly list here, where you build a machine learning model to do the ranking, and we will do stuff like that in the second part of the lecture. It's just so that you know that there's also other stuff. There's a question in the chat, which I'll leave for later, because it will be part of the second exercise, sort of, and it somehow belongs to the second part. Any questions right now before we make a break? Okay, five minute break. See you again in five minutes. I'm assuming that people on Zoom can hear me again. We continue. So, the lecture is called ranking and evaluation. Why evaluation? I should say a word about this, I guess. You have these formulas: why this formula, why not another formula? You can come up with so many heuristics, and then for this formula, which k should I take, which b should I take? You somehow have to make objective how well it works, and that's what evaluation is about. It's of course very important for any such framework. To evaluate something you need a ground truth, the ideal, what's true. And this is what a ground truth looks like for our setting, and we will have this in other settings too in future lectures. For example, we have a query, matrix movies, and then somebody went and did the work and said: in this collection, these are the documents relevant for matrix movies. Let's say there are four matrix movies, 10, 582, 877, and 1003. So this is now the ground truth. If a search engine returns these four results, it's perfect.
So you call this the ground truth, like the perfect result, and a set of such queries with their ground truth you call a benchmark, because it allows you to evaluate the performance, in this case the quality, of such a system. Bastian and previous assistants have built such a collection for you, so let's look at it. You will see there are two collections, test and train. I will talk more about this later. Let's look at the training set, movies benchmark train, and it looks like this. It's actually a small file, but it's a lot of work. Let's look at films shot in Spain. So we have the query films shot in Spain, and here we have the alleged ground truth: these films were all shot in Spain, and not only that, these are all the films shot in Spain in our collection. Let me show it like this, so that I see the whole list. Another one: films where Dwayne Johnson was in the cast, and here I have the list for that. So this is what a ground truth looks like. A small file, but a lot of work. This is actually a very important part of research. It can really advance a field if you have a high quality benchmark, but obviously it's a lot of work, because you have to go through your collection, maybe it's a huge collection, and find all the relevant documents. How do you even do that? That's a different question. Now let's just assume we have such a benchmark, a ground truth. How do we evaluate our BM25 function, or whatever function, with certain parameters? That's what this part of the lecture is about. Let's start with some simple measures. Precision, precision at k, and let's just do it for an example. That's always the setting you have: I have my query, I have the ground truth, and now I have the ranked list of results.
Maybe it occurs to you already now that this is a bit strange, but let's just take it for granted for the moment: the ground truth is a set, the result is a ranked list. Why isn't the result also a set, or why isn't the ground truth also a ranked list? For now, let's just take it as it is. The ground truth is a set, these are the relevant documents, and the search engine always returns a ranked list. Maybe you want to think about this if you have spare capacity. And now we want to say, okay, the search engine produced this result, how good is it with respect to this ground truth? All the following slides will introduce different measures which somehow measure this. So let's first write on top of each document whether it's relevant. This one here is relevant. 566 is not. This one is relevant and this one is also relevant, and 37 and 17, these are not relevant. So let me just write it like this: not relevant, not relevant, not relevant. P at 1 just says: when I look at the first one document, how many of these are relevant? And it's 100%. If I look at the top two ranked documents, which percentage of those is relevant? 50%, yeah, you got it. Let's continue. How many of the top three are relevant? What's the percentage? 33%, yeah. It's one third, which is 33 point something percent. What's P at 4? Yeah, it's 50% again. So it's not that the P at something only goes down, right? Now I have a relevant one again here, and now it's two out of four which are relevant.
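Precision at k is short enough to write down directly. In this sketch the ground truth is the set of four matrix movies from above; the ranked list is only illustrative, since the transcript does not give the exact order on the slide, but it has the same pattern of relevant and not relevant results (relevant, not, not, relevant, relevant). The function name is made up.

```python
def precision_at_k(ranked: list[int], relevant: set[int], k: int) -> float:
    """Fraction of the top-k ranked documents that are in the ground truth."""
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant) / k


relevant = {10, 582, 877, 1003}      # ground truth: a set
ranked = [582, 566, 37, 10, 877, 17]  # result: a ranked list (illustrative order)

for k in range(1, 5):
    print(k, precision_at_k(ranked, relevant, k))
```

The values for k = 1 to 4 are 1, 1/2, 1/3 and 1/2, so the precision can go up again when another relevant document appears.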
So maybe to clarify, let me quickly write it like this, I think it's a bit clearer. This is one out of one, 100%. This is one out of two, 50%. This is one out of three, let's write 33%. This is two out of four, 50%. And the fifth one? Three out of five, it's 60%, exactly. And then there's a special measure called P at R, where R is just the number of relevant documents. So let's assume we have only three relevant documents overall here; then this one here is also called the P at R. Okay, here are some more measures which are based on the previous measure. Let's take the same list again, and this time let me not write relevant or not relevant but just mark them. This was relevant. Which else? It should be the same as before. 10 is relevant, and 877 is also somewhere in my list. So I'm looking for all the relevant ones, and this is not relevant and this is not relevant. I will just explain these measures by examples, and then at home you can try to really understand them, but these are very simple measures. Now I look at where the relevant documents are in my list. So I just write down these ranks. This is the top ranked document, so one, two, three, four, five, and let's just say this last one is at position 40. So the first relevant one is at position one, that's why R1 is one. The second relevant one in my list is at position four, that's why R2 is four. The third relevant one is at position five, so R3 is five. And the fourth relevant one is at position 40, so R4 is 40. And those are all the relevant ones, so I have k equal to 4.
And now I just compute the precision from the previous slide at each of these ranks. You can compute the precision at every rank; I compute it at 1, 4, 5 and 40, and you tell me what it is. What's the precision at 1 for this list? 100%, that's 100% correct. What's the precision at 4? 50, yeah, 2 out of 4, that's correct. What's the precision at 5? 60, we already had that. What's the precision at 40? 10%, that's also correct, it's four out of 40. Let's just write that on top. So this was one out of one, this was two out of four, this was three out of five, and this was four out of forty. So you look at where your relevant documents are and compute the precision there. Here's my second relevant document, and until here the precision is 50%. This tends to go down, but it doesn't have to go down. And then the average precision is just the average of these four values. In this case, let's write it down, that's 100% plus 50% plus 60% plus 10%, divided by 4. The sum is 220%, divided by 4, which is? 55%, yes. One question here is: what if my search engine doesn't return all the documents, it only returns documents which contain the keywords of the query? Maybe a relevant document was not even in the list. Then we just say the P value for it is zero, which is the same as saying it's very, very far back in the list. If this document here is not at rank 40 but at rank 4 million, then this number will be basically zero. Now, this so far was for a single query. In the benchmark I showed you, we have 28 queries.
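The average precision from this example can be reproduced with a short Python function. This is a sketch: the function name is made up, and the filler document IDs are invented just to place the four relevant documents at ranks 1, 4, 5 and 40.

```python
def average_precision(ranked: list[int], relevant: set[int]) -> float:
    """Average of the precisions at the ranks of the relevant documents.

    A relevant document that is missing from the ranked list contributes a
    precision of 0, because we divide by the number of all relevant documents.
    """
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant)


relevant = {10, 582, 877, 1003}
# 40 results with invented filler IDs; relevant ones at ranks 1, 4, 5 and 40.
ranked = list(range(2001, 2041))
ranked[0], ranked[3], ranked[4], ranked[39] = 582, 10, 877, 1003

print(average_precision(ranked, relevant))       # about 0.55
print(average_precision(ranked[:39], relevant))  # about 0.525, fourth one missing
```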
Now I can compute, for example, precision at 5 or average precision or whatever for this query, this query, this query. If I want to compute it for the whole benchmark, I just average it. Now here we have a measure for a single query which is itself an average of precisions, that's why it's called average precision. And now I want to compute the average over all queries in my benchmark. So it's like the average average precision: the average precision for each query, and then the average over all queries. And because there are two averages here, somebody had the ingenious idea of calling the second one mean. So MAP, mean average precision, which is a very famous measure, is just the average of the average precisions. Mean and average are really synonyms. They just use the word mean for the average over all queries, to distinguish it from the average within a single query. So these are just the means, the averages over the whole benchmark. And thus you get a single number which tells you how well your search engine did for this benchmark. That is of course dangerous, because single numbers measuring something complex are always dangerous. You should implement this for the exercise sheet. And all this stuff is very typical for exam questions. You can ask all kinds of interesting, math-y questions about this. So here are two more measures, and we can take our time, because the rest of the lecture will basically be these two. So far these were very simple measures, right? I mean, precision, that's pretty simple. Now come two measures which are a bit more involved, and you will understand why. But this one is also pretty easy, actually. Let me see, is this one slide or two slides? Ah, there's an example, okay. Here's a setting that you often have.
You have a query, and so far we just had relevant or not relevant, but very frequently you say: that's kind of relevant, and this is very relevant. So you have shades of relevance. In this case I have three shades of relevance, and you will see an example in a second. Here's the formula, and I will just explain it by an example on the next slide, and then we go back to the formula on the previous slide. So here we have a search engine. Of course the search engine does not know what's relevant or not relevant, and it returned these results: a relevant document at the top, the second document it returned was not relevant, the third was very relevant, the fourth was relevant, and the fifth was not relevant. And now let's look at what the formula on the previous slide says. Now I go back like this. The formula does the following: it computes this sum here, and for now we will just compute it without understanding what it means. So I sum up the relevance values, which are 0, 1 or 2 depending on how relevant the document is, and I divide each one by the log of the position plus 1. Let's just do it; by doing the example it will become clear. I do it up to position five; you always have to say up to which position you add things up. So this is the first relevance value, which is one, over the log of the position plus one, and for the first position that is log 2 of 2. And maybe to clarify this, let me write this number here in green, to make clear that it's indeed that number. The second one I don't really have to add because it's a zero, but I still write it here so that it's clear: zero over log 2 of 3. The number in the logarithm now just goes up. Now comes a two.
I know it's not the same green, but I hope you can see the connection anyway. This is now two over log 2 of 4, plus another one, that's this relevance value one here, over, oh no, this is not correct, over log 2 of 5, plus, and let's also write the last one, zero over log 2 of 6. And maybe someone of you can just compute this sum. This is not something you can do in your head, because logarithms of non-powers of two tend to be funny numbers. So can someone compute it? Actually we can simplify it a little bit. The first term is 1 over log 2 of 2, which is 1. So maybe let's do this here. I need a little bit of space there, so I should erase this. So this is equal to 1, plus, log 2 of 4 is 2 and 2 over 2 is 1, so 1 plus 1, plus 1 over log 2 of 5. So this should be something like two point something. Who can tell me what one over log 2 of 5 is? Let's see. Okay, it's 0.43 something. You will understand in a second why we're doing it like this. Now we have a little problem here. We get a number, 2.43, and the question is: is this good or bad? Usually you want a measure where you know the range: if it's bad it's zero and if it's good it's one, or the other way around. Here we don't really know the range. So what we do is also compute the best possible version of this. And then it will also become clearer why we have the logarithm in the denominator. So what would be the best ranking here? Which one should be at the top in the best case? The very relevant one. So in the best case it should look like this: I have the very relevant one here, the two relevant ones next, and then the not relevant ones. And now let's compute that number: this becomes 2 over something plus 1 over something plus 1 over something, and I leave out the zeros because they don't add anything.
So that's now 2 over log 2 of 2, plus 1 over log 2 of 3, plus 1 over log 2 of 4. And you see, without even computing it, that the denominator becomes larger and larger, so each term counts less and less. That's why it's called discounted. Now you understand the "discounted". If I get a very relevant one further back in the list, yes, there is a two in the numerator, but it gets divided by a larger number, so the contribution gets smaller. It doesn't get smaller very fast because of the logarithm, but it does get smaller. Whereas a 2 at the beginning is really good, because log 2 of 2 is 1, while log 2 of 4 is already 2, so there it gets divided by 2. Being at the front is much better. So that's the idea of discounted: yes, it's good to have a relevant document, but it's much better to have it at the front, and a very relevant one at the front is really good. So it's actually a simple idea. So here I have two over one, that's two, plus one over log 2 of 3, which we have to google, plus one over log 2 of 4, and log 2 of 4 is 2, so that's one half. Let's just google it. So that's 2.5 plus, oh, it's 1 over, no, that was wrong. One over log 2 of 3 is 0.63 something. So this whole thing is 3.13. And now, oh my, I knew this was coming, okay. That's very bad. What do I do now? I think I will just write it again. You have to remember 0.63. So this is two plus one over log 2 of 3 plus 0.5, which is 3.13 and so on. And now we know: this is the value we got, this is the best value, so let's just divide this by that. And then I get a number between zero and one. Let me not write the exact terms but just the approximations. So this is 2.43 something divided by 3.13 something, which is equal to, let's compute it here, so the denominator was 2 plus 1 over log 2 of 3 plus 0.5, right? I think so. Yeah, so apparently 0.7763, which is 78%.
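The whole computation fits in a short Python sketch. The function names are made up for illustration, and the ideal ranking is obtained by sorting the relevance values of the returned list itself, as in the example above; if relevant documents were missing from the list, the ideal ranking would have to be built from the full ground truth instead.

```python
import math


def dcg(gains, k):
    """Discounted cumulative gain at k: sum of gain_i / log2(i + 1)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))


def ndcg(gains, k):
    """DCG at k divided by the DCG at k of the best possible ordering."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0


# Relevance values from the example:
# relevant, not relevant, very relevant, relevant, not relevant.
gains = [1, 0, 2, 1, 0]
print(f"DCG@5  = {dcg(gains, 5):.4f}")
print(f"iDCG@5 = {dcg(sorted(gains, reverse=True), 5):.4f}")
print(f"nDCG@5 = {ndcg(gains, 5):.4f}")
```

The three printed values correspond to the 2.43, the 3.13 and the 0.7763 from the board.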
So, yeah, you get the idea. So this is not bad: some relevant ones at the top, but the very relevant one at position three could have been better. The optimal value of this measure would have been 3.13, and we reached 78% of the optimum. That's a measure you will also find very frequently. 2.43, people in the chat are helping with the computation. Any question about discounted cumulative gain? Is it clear enough? Questions about these measures are super typical in the exam. You will certainly get questions about this. And let me just explain: there might of course be questions of the kind here's an example, do it, but one could also ask: is the DCG at five always bounded by one? No, it's not bounded by one. It can be larger than one, right? But the NDCG, the normalized version, that's a number between zero and one. So there are some things to understand here, and you can ask simple math questions about it. And here's another one that's also nice. It's the last measure we do today, so just concentrate one last time. It's also very important, and the motivation is the following. Competitions are very frequent; they have always been very frequent and they still are. Now here for the exercise sheets, Sebastian and others spent a lot of time creating this benchmark. That's a lot of work. Take a query like films shot in Spain: how do you do it? There are different ways to do it, but you somehow have to go through the collection and find all the films shot in Spain, and this is a collection with 100-something thousand documents. If you have a collection with billions of documents, you can't do that; you can't have a complete ground truth. That's very typical, very normal. But still you want to do a competition, you want to compare systems. So you have the situation where you have relevance judgments for some documents but not for others.
For most of the documents you just don't know whether they are relevant or not. So what you do is this: you have participants in your competition, and they have their search engines, hopefully good search engines, and each of the search engines will give you results, maybe 100 results. Maybe you have 10 people participating in your competition. You look at which results their search engines return, you take, let's say, the top 10 results from each search engine, and you judge just those. You go over those manually and say relevant, not relevant, relevant, not relevant. Now at least you get some relevance judgments. This approach is called pooling, but this is just the motivation for this measure. And now again, and you will see this a lot with all these measures, you get a formula thrown at you. Here's the formula: 1 over R, times the sum over the documents r in RR of 1 minus N_r over the minimum of R and N. You have seen it for NDCG, you have seen it for BM25: there's a formula, and when you see it for the first time, you have no idea what it means. And the whole point of this lecture is that you can actually understand very well what it means, and we will do that with the example in a second. So let's not spend too much time on this slide and formula; it's just there for your reference. Let's maybe look at the entities that are used in this formula, because they can be understood well. Some of the documents were looked at, and R is just the set of documents where somebody looked at them and said: that's relevant. That's easy to understand. So we have a set of relevant documents, and we have a set of non-relevant documents, where somebody looked at them and said: this is not relevant. It was returned by one of the engines in the competition, and it's not relevant.
Now we are talking about a particular engine. As with the previous measure, we always want to judge a certain result list, and that's where RR comes in. This is what our search engine returned, and we will talk about it in a second. So here's an example. It's always best to understand these things with an example, but then you also have to understand them in the abstract. So our search engine returned this, and now we have to say how good it is, and we are in the following situation. For some documents we know they are not relevant, they were judged; for some we know they're relevant, so we have no grading here; and for some we just don't know. Maybe relevant, maybe not, nobody judged them, because of what I just explained. And we also know, so this is what our search engine returned, that there's one more relevant document which it did not return. That's not good, but that's just how it is, and this of course should give a penalty. And there are 10 more non-relevant documents which it also did not return. So that's what we know, and now let's look at the numbers. You don't have to remember the formula, I will just explain it again; it's actually very simple and also intuitive. So RR is just the set of relevant documents which we returned, which in this case means number three and number eight. Then we have R, the number of relevant documents overall that were judged. For this particular query, three relevant documents were found. There may be more, but only so many were judged, so this is three. How many non-relevant documents were found? Usually there are more. Also in this example: one, two, three, plus ten more, so N is thirteen. For some reason you take the minimum of the two; you can't understand this at this point.
Let's just take it for granted that we should compute the minimum, and it's three. These two numbers I didn't talk about yet, but they are also very easy to understand in this context. Take N_r for number 3. You can't really compute the precision now, right? Before we said: let's look at the top three, how many of them are relevant? Well, here you don't know for every document whether it's relevant or not. But what you can do is count how many non-relevant ones come before each relevant one. This you can do, and this is what we do here. So here's a relevant one, and one not relevant one comes before it, which is already not good, right? You shouldn't have not relevant ones coming before relevant ones. And that's just what this counts: how many non-relevant ones come before it. So it's actually very simple, and this is just one. And you see on the previous slide that what we actually compute is one minus this divided by this minimum thing here, which is 3, so we get 1 minus 1 over 3, and that is 2 over 3. Now let's do the same for number 8. How many non-relevant documents come before number 8? Three, yes. And now we have N_r for number 8 divided by this minimum, that's 3 over 3, and 1 minus that is zero. You may wonder why the minimum; I have a slide on this and will explain it in a second. So even if you didn't understand everything yet, you can understand a few things. We look at each relevant document and give each one a score. If you have a relevant document and there's no non-relevant one before it, it gets a score of one, right? We don't have this here. The first relevant one has one non-relevant one before it, which is not that bad, so we give it a score of two over three. And here we have one with three non-relevant documents before it, and there are only three relevant documents overall. That already gets a score of zero, and it only gets worse when you go down.
And now B-Pref is just this: you look at all the relevant ones and take the average of these scores. That's it. So it's two thirds plus zero. But you also take into account those which are not even in your list, which is similar to what we did earlier. There's a third one which is not in the list, and being not in the list is like being very far down, with all the non-relevant ones coming before it, so that's plus zero. Divided by three, which is, is there anything else on this slide? No, it's two over nine. Two over nine, let's see your mental arithmetic skills. What percentage is two over nine? What? 28? 22, I agree. One over nine is 11%, right? So 22%, 22.2 and so on. And I think by this example a few things already became clearer. Let's go back now; this formula has already lost some of its terror. Now you can also understand what I wrote here, and that's the typical way to understand something: take an example, try to understand the example, then go back to the abstract. A score based on how many judged non-relevant documents come before it: that's this part, right? For each relevant document we compute a score based on how many non-relevant ones come before it. That's what we did here. For this one, one non-relevant one came before it, so it got a score of two over three. For this one, three came before it, so it got a score of zero. Those are the scores you compute for all the relevant ones in your list, and then you divide by the number of relevant documents, so it's just the average of these scores. It's actually a very simple formula once you have understood it, but when you see it for the first time, it's like: what? And now there are two questions, and I have a slide on this. I think it's easy to understand, modulo two questions. Why do we have the minimum here? Why don't we just divide by N? That would be more natural, right? You have the number of not relevant ones coming before it. Why not just divide by N?
I mean there are at most 13 non-relevant documents. If I have just one coming before it, why not 1 over 13 rather than 1 over 3? And the second question: this sum is over RR, so over just the two relevant documents in the list in our example, but I divide by three. Why not divide by two? Let me only hint at the answers here; you absolutely have to think about it yourself at home to really understand it, and this slide gives you, I think, everything you need to be able to think about it. Why this number and not this number? Here you also have typical exam questions hidden, which show you the kind of math you're supposed to understand. It's not very deep, but it's still math. First, when you divide by the minimum of two things and the result is supposed to be a number in the range 0 to 1, you should verify that the numerator is indeed at most R and at most N, so at most both of them, and I leave that to you. At most N is obvious, because it counts the number of non-relevant documents coming before that document, and there are only so many non-relevant ones; at most R holds just by the way it's defined. You can see it two slides earlier. So if you divide this number by the minimum, it's between 0 and 1, which is what you want. And then you take 1 minus this, so that 1 becomes good and 0 becomes bad. And then note what taking the minimum with R does. I have only three relevant documents, and as soon as I have three non-relevant ones coming before one of them, I already say: okay, that's bad. That's just the decision this measure makes, right? So this minimum with R is making it harder. This would be an exam question, for example: does the minimum with R make it harder or easier to get a good value? It makes it harder, because now just three not relevant ones are enough and you already get a score of zero. Which kind of makes sense, right?
I have three not relevant ones coming before a relevant one; that's bad, I don't want that, given that I haven't even found all the relevant ones. So taking the minimum makes it harder to achieve a good B-Pref score. It's just a decision by the makers of this measure. And then why 1 over R? That's actually easy. It's basically what we did here. I have these two relevant ones in our list. One got a score of 2 over 3, one a score of 0. Now I could just take the average of these two: two thirds plus zero, divided by two, is one third. But there's a third one which my engine didn't even return. It's not fair not to take this into account. I should handle it as if it were at the very bottom of my list. And if you think about it, this is exactly, let me find the AP slide, it's not so easy to find, but I will manage, yeah, it's exactly what we did here, right? Take this example: my search engine just returns one document, and it's relevant. But there are four relevant documents. I shouldn't give it a score of 100%, right? Yes, it returned one relevant document at number one, but the other three it just completely forgot; it didn't even return them. So this engine shouldn't get a 100% score. It will get a 25% score. It got the one document right, that's the 100%, and the others are nowhere to be seen, so it's 100% plus zero plus zero plus zero, divided by four. I should punish that, and that's what all these measures do. If a relevant document is not even in my list, it counts as zero. And that's why I have the plus zero, divided by three, here. Not being in the list is like being very, very far down, and this measure also does that. And given all that, I think this formula is actually pretty easy to understand.
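Put together, the measure is the average over all R judged relevant documents of 1 - N_r / min(R, N), where relevant documents that were not returned count as zero. The following Python sketch follows that description. The document ids and the exact positions of the non-relevant and unjudged results are invented so that the numbers match the example (relevant results at ranks 3 and 8, with one and three judged non-relevant results before them); the count N_r is capped so that a score can never drop below zero.

```python
def bpref(result_list, relevant, nonrelevant):
    """B-Pref of a ranked list, given the judged relevant / non-relevant sets.

    Unjudged documents are simply skipped. Relevant documents that are not
    in result_list contribute a score of 0 (they still count in R).
    """
    R, N = len(relevant), len(nonrelevant)
    if R == 0:
        return 0.0
    cap = min(R, N)
    total = 0.0
    nonrel_seen = 0  # judged non-relevant documents seen so far
    for doc in result_list:
        if doc in nonrelevant:
            nonrel_seen += 1
        elif doc in relevant:
            # Cap the count so that the score never becomes negative.
            total += 1.0 if cap == 0 else 1 - min(nonrel_seen, cap) / cap
    return total / R


# Hypothetical ids: d1..d8 are the returned results in ranked order.
results = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]
relevant = {"d3", "d8", "missing_relevant"}                   # R = 3
nonrelevant = {"d2", "d4", "d5"} | {f"n{i}" for i in range(10)}  # N = 13

print(f"B-Pref = {bpref(results, relevant, nonrelevant):.3f}")  # 2/9
```

Judging `d6` as non-relevant as well leaves the result unchanged, because the count for `d8` is capped at three.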
I mean, you just compute this score per relevant document, which just counts how many non-relevant ones come before it, and you take the average of those over all relevant ones. It's actually a simple measure, but when you see it for the first time, it looks crazy. There's a good question: what would happen if number six was also relevant? Wouldn't the score become negative? That's a good question. It absolutely shouldn't become negative. Here's a detail that I only hinted at in one of the sentences I said. N_r is capped at R; it doesn't become worse than that. It says here: docs from the R top-ranked non-relevant ones. So I kind of stop counting. Yeah, I know that you meant not relevant. So it's a very valid question. Let's assume number six is also not relevant. Now I would have four non-relevant ones before number eight, so this would be four. Does this mean the score now becomes negative? No, because I only count up to R. So if I have three relevant ones I just count not, not, not; now it's already zero, and it doesn't become less. And that's what's written here: docs from the R top-ranked non-relevant ones. I only consider the first three non-relevant ones, and the only reason for that is so that the score doesn't become negative. It's actually just for that reason. It's a very good question. But yeah, that's the way to understand these formulas: really understand everything, ask why is that so, and then you see, okay, that's just the reason for that. Some general remarks, then back to the exercise sheet, and then we are done. Now I think it's time to go to the exercise sheet once more and see what the second part of the exercise sheet will be. So you implement BM25 in the way we explained in the first part.
Now you get these benchmarks and you should compute: okay, my search engine with this BM25, with this choice of parameters, maybe with some other heuristics I implemented as well, how good is it on these benchmarks? So you should compute precision at K and average precision. So P at K and AP from the first part of the lecture; no discounted cumulative gain here, just the simple measures. And for that you should write an evaluation script. That's very typical in research: you have a benchmark, and you have to write an evaluation script which actually computes these measures. And now comes the third exercise, which kind of mimics a competition. Now you have your code and you have your benchmark, and now you want to improve your result. And that was the first question in the chat today: how do I pick B and K? Now you play around: let me make B a little smaller, a little larger, and the same for K. You also have some heuristics; maybe you say, let me take the popularity into account, and then you see whether it helps or not. And for that, it's very important that we have two data sets here. One is called train and one is called test. And now listen carefully, because that's super important, not only for this kind of queries and documents but for every competition of this kind. Let's look at them first. This is how train looks like: 28 queries with their relevant documents. And this is how test looks like: also 28 queries with their ground truth. But they are different queries, of a similar kind but different. Here the first one is James Bond movies with Daniel Craig. Here we have movies with music composed by Howard Shore. Different queries. Why is that important? Well, what you could do, and it's kind of trivial, is take one of these sets and work for hours or days just to improve the result on this set. And think about it, it's trivial.
Actually, one way you could do it is to write in your code: if query equals movies with music composed by Howard Shore, then output 10, 11, 13. That would be the crassest form of overfitting: if it's this query, then output this. Now you have a perfect score. You just implemented a program that outputs the ground truth. You could even just read it from the benchmark file. I mean, it doesn't make sense, right? You get a perfect score on your training set, but it will only work for your training set. So that's called overfitting. Now this was a crass example of overfitting. Let's just leave it open until we're done. Or maybe close it, I think we won't suffocate. It's actually below what it was at the beginning. Just five more minutes. Oh yeah, that's the reason, thank you. So that's the crass form of overfitting, but the very, very important message here is that you unconsciously do something similar when you look at this set and say: ah, now I understand, I should pick B larger. You think you understand something about the general problem, and you're playing around with your parameters and putting an if in your code here or there. You're doing 100 things which you think are good for the problem in general, but actually you're overfitting to the training set, because you only look at these queries all the time. Now I understand, when the query contains Leo DiCaprio I should do this. So you think you're doing something general, but maybe it's not so general. Maybe you are tuned too much to that training data set. And how do you solve that problem? How do you deal with that bias? Well, that's why you have two sets. You can do all that work on the training data set. Here you can do as much overfitting as you like, and then after you're done you say: okay, now I'm done. This is my code, I take B equals 0.54, I take K equals 1.31.
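The protocol itself can be written down in a few lines. The sketch below is only an illustration of the procedure: `toy_evaluate` is a hypothetical stand-in for "run the search engine with these parameters on a benchmark and compute the measure", and all numbers in it are invented. The point is the order of the steps: every comparison between parameter settings uses the train set, and the test set is touched exactly once at the end.

```python
import itertools


def tune(evaluate, benchmark, b_values, k_values):
    """Grid search: return the (b, k) pair with the best score on `benchmark`."""
    return max(
        itertools.product(b_values, k_values),
        key=lambda bk: evaluate(benchmark, b=bk[0], k=bk[1]),
    )


def toy_evaluate(benchmark, b, k):
    """Hypothetical stand-in for running the engine and computing e.g. MAP.

    The made-up quadratic only exists so that this sketch runs on its own.
    """
    peak_b, peak_k = benchmark["peak"]
    return 1.0 - (b - peak_b) ** 2 - (k - peak_k) ** 2


# Invented toy benchmarks, not the course data.
train = {"peak": (0.5, 1.25)}
test = {"peak": (0.25, 1.5)}

b_values = [0.0, 0.25, 0.5, 0.75, 1.0]
k_values = [0.5, 0.75, 1.0, 1.25, 1.5, 1.75]

# Step 1: all tuning happens on the train set only.
best_b, best_k = tune(toy_evaluate, train, b_values, k_values)
train_score = toy_evaluate(train, best_b, best_k)

# Step 2: a single, final evaluation on the test set.
test_score = toy_evaluate(test, best_b, best_k)
print(f"b={best_b}, k={best_k}: train={train_score:.3f}, test={test_score:.3f}")
```

In this toy setup the tuned parameters score perfectly on train and somewhat lower on test, which is the typical picture when the parameters have been fitted to the training queries.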
Now I'm done, and now it's evaluated on this other data set. That's the point, and that's how it's done in all the competitions. You do all your training and everything on this data set, and then it's evaluated on that one. And of course, if you have overfitted a lot, it will be good here but bad there. What counts is the result on the test set. But you have the opportunity to train on something meaningful. So that's why we do that. Now of course you can cheat, but it's stupid if you do; that's not the point of the exercise. In real competitions you don't get the test data set, of course. You only get the training set, then you submit your system, and the people from the competition run it on the test set, or they have set up a system which does this. Here we give you both data sets, but we ask you to do all your tuning only on the train data set. Do not even look at the test data set; forget about its existence. And then when you're done with the tuning, once, just once, run it on the test set and see how good it is. So that's the way to do it, and you learn an important lesson here about how it's done in these competitions. Okay. And now the final point, which I announced earlier, and then we're done. So that was the exercise sheet. Let me just go back. It's the final thought, but it's an interesting one. Isn't it weird that the ground truth is a set, but our search engine returns ranked lists? This is weird, right? If you think about it, at least I find it weird. So the ground truth is a set: here are the five relevant documents. And this gives rise to a number of questions. Why does our search engine not also output a set? Why does the search engine not say: here are the five relevant movies? Why does it output a ranked list? Does anybody want to say something? At least these are questions to think about.
Why is the ground truth a set, but you output a ranked list? Well, the point is, if you wanted to output a set, Google would have to say: okay, internally I do this ranked list thing, but I'm supposed to return a set, so let me just cut off here and give the user the first six. Well, that's dangerous. Maybe the seventh one was the good one. Why not give users the ranked list and empower them to scroll further down? Most of them won't, but that's the reason why search engines return ranked lists: otherwise they would have to cut it off. Okay, so it's natural to have ranked lists. So why isn't our ground truth also a ranked list? Our ground truth here was five relevant documents. Well, if you ask for the Matrix movies, it's kind of hard to put them in an order. You can say these are the four Matrix movies, and maybe you can say this is the best one or the first one, but you don't really have an order. The reason that you have a ranking in your result list is what I just explained, but the relevant documents often don't have an order. It's just a set. And here's the third one. Why don't you just evaluate the scores? Like, here's my ground truth, this movie should get this score, and your search engine also computes scores, and now you compare how well the scores match. Well, that also doesn't work, because the scores really don't mean that much. We only compute scores to rank the documents. Let me go back one more time. The absolute values of the scores, this 30 here or this value, don't mean much.
The only reason we compute these scores is so that we can rank the documents later. There are problems, called regression problems, where the goal is actually to compute the score, but not here. That's it for today. So is there any question at this point? Okay, there's one question. Yes, yeah sure. Oh yeah, absolutely. It's free. Submission of exercises is free. You can decide from week to week whether you want to submit something or not. There's no need to communicate your strategy or anything. You can just decide that on a weekly basis. And you are welcome to. Any other question? Okay, so have fun with the sheet. See you again next week. Bye-bye.
Welcome everybody to lecture three of database and information systems, on database basics today, winter semester 23-24. This course can also be taken as information retrieval. We will first talk about exercise sheet 2, ranking and evaluation, and say something about the exercise sheets and the exam. Then you will see that this is a brand new lecture. I've never given it before. We have spent a lot of time preparing it, at least 20 hours, probably much more. I hope you like it. I think it came out very well, but let's see. The course is database and information systems. I know a lot about databases but haven't given a lecture about them before. We will start with the very basics, but you will see that already the basics are quite interesting: tables, a simple system design, and some basics on the query language. And then we will talk about the exercise sheet. First, your experiences with the last exercise sheet. The first two lectures were about search engines, so how Google works in principle. Now we are going to databases, so you may wonder: is there any connection? Right now there is no connection, but already in the next lecture and later you will see that it's all connected, everything is connected. So, excerpts from your feedback.
You found it interesting, and those who did it found it significantly more work. Here are some quotes: lecture was well explained; overall good task; the lecture really did help to understand it; great lecture again; once again very nice to follow; about twice as time consuming as the previous sheet. I mean, the first sheet was really just getting started. There was discussion, some people even wrote a factor of three. And one conclusion was: man, this is difficult. This was not referring, I think, to the sheet as such, but to tuning parameters, right? You had this training and test benchmark, you had these parameters B and K of BM25, and just getting good results, we will see it on the next slide, is hard. But that's typical research work. Problems understanding average precision: there was a post in the forum, there was a slight mistake in the explanation that led to some confusion. That can happen of course, but it was clarified on the forum. The 80 character line limit is annoying. Yeah, but it's standard. This comes up every semester; at least one person says it. It's not coming from us, right? Let me spend one minute on this, because it's so typical and says a lot. I understand that this comment comes. And why am I spending this one minute? It's a typical instance of my view as an individual versus the world as a whole. You say: I have this wide screen, I have 320 characters, why am I limited to 80 characters? Why do I get the sound effects? Oh, okay. So you say: I have this wide screen, why 80 characters? Well, other people don't have this wide screen, and then there are all kinds of tools which show code in a particular way. I haven't prepared this now, but a typical view on GitHub if you work on big projects is that you have two versions of the code side by side, the new code from when you open a pull request and the old code, in two columns.
That's just one example, and then if your code has very long lines and they break, it's totally unreadable. But it doesn't really matter, the point is, it's a standard, everybody uses it, and this is very hard for humans. You start, you see only your perspective, and you say, this is annoying, can't we just drop this? If something is a standard in a very big community, there are probably very good reasons for it. But it's interesting that every semester people fight with this. It just says something about how humans work. I like that we are building something more complicated step by step. Yes, it will become even more complicated. We suffered physically and mentally. I hope not due to the exercise sheet. I was not sure what this was referring to. Please give us nice feedback. Yes, we will. Unfortunately, I did not have enough time to finish this sheet. Several people wrote that. Thank you for being so honest. And I will come back to this in two slides. Just very quickly, the results. There were these two parameters. I'm not repeating the explanations from the last lecture, just saying there was this BM25 formula with a parameter b and a parameter k. The parameter b tunes how much you take into account when one document is very long and the other is very short. Of course the long one contains words more often. Somehow you should adjust for this. For our collection, which was movie descriptions, it turned out that a little adjustment, not a large adjustment, a little but also not zero, gave the best results. Over the years we have often done this benchmark, and this turned out to be the best setting. Small values of k, and understand this again, or try it when you do the exercise again: a value of k equal to 1 corresponds to tf star, so the adjusted term frequency, being in the range one to two. Understand that.
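To see the range one to two numerically, here is a minimal sketch. It assumes the usual BM25 form tf* = tf · (k + 1) / (k · (1 − b + b · DL / AVDL) + tf); the function name `bm25_tf` and the sample term frequencies are made up for illustration and are not taken from the exercise sheet.

```python
def bm25_tf(tf, k, b, dl, avdl):
    """Adjusted term frequency tf*, under the assumed standard BM25 form."""
    if tf == 0:
        return 0.0
    return tf * (k + 1) / (k * (1 - b + b * dl / avdl) + tf)


# With k = 1 and b = 0 (no length adjustment), tf* simplifies to 2 * tf / (1 + tf).
sample_tfs = (1, 2, 3, 15, 1000)
values = [bm25_tf(tf, k=1, b=0, dl=100, avdl=100) for tf in sample_tfs]

for tf, value in zip(sample_tfs, values):
    # tf = 1 gives exactly 1; larger tf gets closer and closer to 2.
    print(tf, round(value, 3))
```

With these settings a word that occurs once gets the value 1, and no matter how often it occurs the value stays below 2.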
So if the original tf, which just counts how often a word occurs in a document, is not zero, then it's 1, 2, 3, it's an integer, it can also be 15 if the word occurs 15 times. With this formula with k = 1 it means: if it's there once, the value will be 1. If it's there very many times, it will approach 2, and otherwise it's somewhere in between. So at least we see that this formula makes some sense. A nonzero value of b gave the best results and a value of k which was larger than 0. We also discussed that in the lecture: if you set it to zero, you just get the normal formula. So baseline results, these were the results. Sebastian did the solution. We always solve the sheets beforehand so that we see that everything works, and he deliberately did not put a lot of work into getting great results so that you could do better. These were the results he got. These were, I think, the best results, at least when I looked at the table, and you see there was room for improvement. But you also see something here which was kind of an important message of this sheet. These are typical, even good results. I mean, precision at three is, you look at the top three documents. So how many of these are relevant? Ideally, of course, you want 100%: all the top three are relevant. We're pretty far away from that. Here, for average precision, you average over all relevant documents. You really also care, if there are 17 relevant documents in your ground truth, where the position of the 17th one is. And here we get values below 50%. But this is typical. This is a common measure. It's a very hard measure. You won't see a benchmark or a method which gets you close to 100%. Think about it. You have a big corpus. For the exercise sheet it was just 100,000 plus documents; for web search it's billions. You have, say, 17 relevant documents in our corpus, and among these 100,000 plus you want the 17, all 17, at the top and only those. That's a very, very hard, extremely hard problem.
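The two measures can be sketched in a few lines of Python. This assumes the common definitions: precision at k is the fraction of the top k results that are relevant, and average precision is the mean, over all relevant documents, of the precision at the rank where each one appears, with relevant documents that are missing from the result list counting as 0. The document ids below are invented.

```python
def precision_at_k(result_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    top_k = result_ids[:k]
    return sum(1 for doc in top_k if doc in relevant_ids) / k


def average_precision(result_ids, relevant_ids):
    """Mean of P@r over the ranks r of all relevant documents.

    Relevant documents that do not appear in the result list contribute 0.
    """
    hits = 0
    total = 0.0
    for rank, doc in enumerate(result_ids, start=1):
        if doc in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids)


results = [10, 4, 7, 1, 9]   # ranked result list (hypothetical ids)
relevant = {10, 7, 3}        # ground truth (hypothetical ids)

p3 = precision_at_k(results, relevant, 3)
ap = average_precision(results, relevant)
print(round(p3, 3), round(ap, 3))
```

In this toy case two of the top three results are relevant, but the average precision is pulled down further because one relevant document is never returned at all, which is why the measure is so hard.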
So these are typical scores. And here's another very important remark before we go to the actual content. So we were surprised. This semester it's voluntary, you don't have to submit. We expect that in the end, I think, 180 people or so will write the exam, or maybe 160, just estimating from the number of people who signed up. That is always more, over 300, and it's about half who then write the exam. And for the first sheet we only got, how many, 60 or 80 submissions, something in that range. Now it's down to 40, which means 25% of people are submitting. Most of you are not submitting. Of course it's up to you, those are the rules this semester, but let me tell you a few things. And that's why I'm coming back to this comment. I mean, I'm happy that you are so honest, but already now, even of the 40 people who submitted, half wrote, I don't have time, I didn't finish. It's the second week of the semester and already people are not having time. I see a problem here and I want to spend a minute on this. So for the exam, you eventually need this stuff anyway. In particular, writing code, of course not 100 lines like for the exercise sheet, but 10 lines, is an important part of the exam. Just look back at the exams from previous years. And especially with coding, but actually with everything, also with mathematics, only practice makes perfect. If two weeks before the exam you start, oh, I need to write Python code, let me just look at a little bit of Python or write a few programs: too late. Only practice. Practice means regular practice. It's like this with everything. So if you start practicing your code two weeks before the exam, it's too late. And I mean, we have been doing exams for decades and we always see code by students, also in oral exams, where it's absolutely clear this person has not coded in a while. Please don't be one of these people, right?
I mean, use the exercises, we put so much work into the exercises. We have five tutors, more than enough, I think, so use the opportunity to practice. Get individual feedback. You also don't get that if you do it before the exam. And I think something is wrong if already now you don't have time to do the exercise sheets. I mean, then you should just cancel this lecture, or cancel one of the other lectures. So please take this opportunity. And yeah, this is obvious, right? Only when you work on something do you realize whether you actually understood it. These lectures are entertaining, but just listening to them, I can't say that often enough, you don't learn anything by just listening. It's 98% entertainment and it goes into one ear and out of the other. You have to put it to use, and a typical symptom is that shortly before the exam we get all these detailed questions: on slide so and so, what does this actually mean? People even find small mistakes, obvious mistakes, and nobody found them in the second week of the lecture. Why? Because you haven't really paid attention. You haven't really thought it through. I mean, this should happen now, ideally in the lecture or after the lecture. Something is wrong if we have 160 people participating and even small errors on the slides get pointed out only shortly before the exam. Will we have to write pseudocode or Python code in the exam? Yeah, very good question. It has to be Python code of course. You are not allowed to write pseudocode. You see it in the exams. That both are the same, that's not true. Of course, since you are asking that, we don't expect you to write Python code which runs out of the box, although I would argue for a 10 line function it's not that hard. So if you miss a small syntactic thing, you don't get points subtracted for that.
But if you write code which looks like you haven't really written Python code in a long time, there will be a subtraction for that. So you should know Python. Okay, that is that. So that's three minutes over the time I reserved for this. So please take this to heart, what I just said. And now let's start with databases. Databases are a fascinating topic. And we start with tables. Databases are all about storing data in tables. I think the lecture is relatively lightweight, but even so, please try to pay attention and see if you really understand it. The structure of a table is defined by its schema, and here we have an example table, and we will work with that in a second. So that's what databases are about. You have a lot of tables like this, and we keep with our movie theme. Let's look at the components of the table. You have the headings here. That's called the schema. The schema has two components. One is just a name. The names are title, year, score, and of course they mean something. And, we will see a formal definition on the next slide, the domain. So what's supposed to be here? Title, that's strings, movie titles. The domain doesn't have to be defined very narrowly, it's just: from which set do the values come? So here it's all strings. Year, integer. Score, it's a floating point number. We will see it on the next slide and we have more in the following. And then we have the contents of the table and it's just rows. And there's nothing hierarchical here. Let me just get this clear right from the beginning. It's not a table like in HTML where you can split a cell vertically, horizontally or anything like that. It's just tables like this, the simplest form of tables. Columns are fixed and then you have rows and the rows, and that's also important, it's written here.
Once you define a table, the columns are fixed. You can't add another column later, but you can change the number of rows. You can insert rows, delete rows, and the table can also have zero rows, and then it's an empty table. So all of databases is somehow based on this model. It's a very simple model, it's just a formalization of what I just said. It's a table with k columns, and I think it's easy enough to understand, just putting it in a little bit more mathematical way. You have the column names, which is just a tuple. Why a tuple? Because the order matters. First column, second column, third column. Then you have the domains, one per column. Let me just see, yeah, here's an example. Let me put the example first. So C, these are the column names: title, year, score. The domains: all strings for title, all integers for year, all floats for score. And then this is a set of tuples. Set is important. We have two slides about this. This is a set, so the rows do not have an order. The order doesn't mean anything. The columns have an order. I have to know that this one is actually the title thing, right? And it's a multi-set. What does that mean? We will see in a second. And this is not blue. This cannot stand. We have to correct this immediately, otherwise we cannot continue with this lecture. So please do ask a question if anything is unclear. I mean, it's simple enough, but still minor things might be unclear. So that's our table just written as a set of tuples. Same thing as we have seen before. And why does it say set here? Right, the order of the rows does not matter. And it says multi-set. That's important, it's actually very important, which is why I have two slides on this. So a set contains each element at most once. That's a set: 1, 3, 7. Each element from the domain is either in it or not. This set has three elements. This is a multi-set. In a multi-set I can have any one element in there multiple times.
So the seven is contained three times in here. That's a multi-set. That's why it's called multi. You have a notion of multiple occurrences and you also know how often an element occurs. So this multi-set has size five. Which means, just going back to the table, I don't have an example here, you could have this line, the second line, The Dark Knight, 2008, 9.0, repeated two more times. And then it's there three times. And you also have the information that it's there three times. This you can do. And it's also important to understand that in the original, when you do database theory, you don't have that. You have sets, and the original paper is in the references, from 1969. So databases are a very fundamental thing. Modern databases allow multi-sets by default and so we will also do that in our definition, and there's a very simple reason. Understand that if you can insert or delete rows and you want to disallow duplicates, this costs time and space. You somehow have to check, build a hash set or something. Okay, these are the rows I have. Now comes a new row, is it already in there? So it's easier if you don't have to check this. If you want to insert the same row twice, fine, just do it. That's why multi-sets are the default. However, most systems will allow you, when you create the table, to say something like: please check for uniqueness. And then when you insert the same row twice, the database will ring an alarm, say error. But that costs time, so it has to be requested explicitly. Also important, I mean it's simple but important, neither of these has an order. So let me go to this slide again here. This table, the columns have an order, the rows do not. Of course when I show them I have to give them an order. It makes absolutely no difference semantically when I swap Inception with The Shawshank Redemption. Same table. You just have to understand this.
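A small sketch of these points, using Python's `collections.Counter` as a stand-in for a multi-set and the standard `sqlite3` module for the database part. The table name `unique_movies` and the `UNIQUE` constraint over all three columns are illustrative choices, and the Inception values are filled in for the example rather than quoted from the slide.

```python
import sqlite3
from collections import Counter

# A multi-set remembers how often each element occurs.
multiset = Counter([1, 3, 7, 7, 7])
size = sum(multiset.values())   # five elements in total
count_of_seven = multiset[7]    # the 7 is contained three times

# The same rows listed in a different order describe the same table.
rows_a = [("Inception", 2010, 8.8), ("The Dark Knight", 2008, 9.0)]
rows_b = list(reversed(rows_a))
same_table = Counter(rows_a) == Counter(rows_b)

conn = sqlite3.connect(":memory:")
row = ("The Dark Knight", 2008, 9.0)

# Default behaviour: duplicate rows are simply stored.
conn.execute("CREATE TABLE movies (title TEXT, year INTEGER, score REAL)")
for _ in range(3):
    conn.execute("INSERT INTO movies VALUES (?, ?, ?)", row)
num_copies = conn.execute("SELECT COUNT(*) FROM movies").fetchone()[0]

# Explicit uniqueness check: the second identical insert raises an error.
conn.execute(
    "CREATE TABLE unique_movies "
    "(title TEXT, year INTEGER, score REAL, UNIQUE (title, year, score))"
)
conn.execute("INSERT INTO unique_movies VALUES (?, ?, ?)", row)
try:
    conn.execute("INSERT INTO unique_movies VALUES (?, ?, ?)", row)
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
conn.close()

print(size, count_of_seven, same_table, num_copies, duplicate_rejected)
```

The uniqueness check is only performed for the table where it was asked for, which matches the point that it costs time and therefore is not the default.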
Okay, so even if you insert them in a particular order, say I insert them in the order I've just shown you, the database does not have to keep or remember that order in any way. So there's no way to say, give me the first row I inserted. You just can't do that. It's simple but you have to understand this stuff. And actually it's important because database efficiency is a big topic, not today, in the next lecture we will talk about efficiency. A database can make things more efficient when it's allowed to reorder rows, and databases do that. I will come back to this question in the chat in a second. However, if you need an order, maybe it's important for you that the database remembers the order, then simply have an additional column with a counter. Right, you can just do that. Let me just go back to that slide again, where is it? I can just have a counter column here at the beginning and just remember one, two, three, four. And database systems even have a way to specify for this column, I don't even have to give the value, just increment by one whenever I have a new row. But only do that when you need it. I mean, then you have to store that column, and that costs time and space. So there's a question. Regarding the schema, what about ALTER TABLE? ALTER TABLE, okay, that's a good point. Where did I say this? With the fixed columns. It's a good point. Does ALTER TABLE exist? Or did you just make it up? Because I've never used ALTER TABLE in my life. Yes, there's ALTER TABLE, okay. It has all kinds of problems, but we'll talk about that more. Thank you for pointing it out. Alter as in to change, altering something, ALTER TABLE. Let's not talk more about this, but it has problems. You shouldn't do it, basically. Changing the schema of a database is terrible. But thanks for pointing it out.
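The counter column can be sketched with the standard `sqlite3` module. In SQLite, a column declared `INTEGER PRIMARY KEY` is filled in automatically when no value is given; the column name `id` is an illustrative choice, and years and scores not quoted in the lecture are filled in for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The id column acts as the counter: SQLite assigns 1, 2, 3, ... on insert
# when no value is given for an INTEGER PRIMARY KEY column.
conn.execute(
    "CREATE TABLE movies "
    "(id INTEGER PRIMARY KEY, title TEXT, year INTEGER, score REAL)"
)
rows = [
    ("The Shawshank Redemption", 1994, 9.3),
    ("The Dark Knight", 2008, 9.0),
    ("Inception", 2010, 8.8),
    ("Fight Club", 1999, 8.8),
]
conn.executemany("INSERT INTO movies (title, year, score) VALUES (?, ?, ?)", rows)

# Without ORDER BY the order of the result rows is not guaranteed,
# so the insertion order is recovered explicitly via the counter.
ordered = conn.execute("SELECT id, title FROM movies ORDER BY id").fetchall()
first_inserted = ordered[0][1]
conn.close()

print(ordered)
```

The extra column has to be stored and maintained, so it is only worth adding when the insertion order actually matters.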
So where are we? We are here. Multiple tables. So that concludes the tables part of the lecture. It's very simple. You have tables. We have seen some things which we need to remember. Of course in a database you don't have one table, you have multiple tables, and all of the rest of the lecture will be about that. Tables can share columns. So here I had the title of a movie, maybe I have the title also in another table. Nothing wrong with that, and as we will see, it's actually important, because if the tables all have totally different columns, you can't do much with them. You need to combine them. They need to have something in common. And this is what much of the exercise will be about. If you have many tables: how to store your data in tables, which tables to choose, how to pick the columns. That's actually challenging and much of the first exercise is about it, so-called database design. And I have a whole section about this in a second. And then of course you have queries. So you have your tables, and all your tables and the data in there can be very large. Now I want to ask something. I want to extract data from these tables. And there's a query language, it's called SQL, structured query language. Tables are what is also called structured data. Basic SQL queries are really easy. I don't go back to the table, you have seen it so often now, the title, year, score table. I can say: from this table, from movies, we called it movies, select just the columns title and score, so forget year, and only take those rows where the score is greater than or equal to 8.0. And I forgot a semicolon here, which is a deadly sin in SQL, but I realized it. And you can see, that's why I'm showing it already here, basic SQL you can just read like a sentence. Even if you have never seen SQL before, you can understand what this does. Of course, SQL is Turing complete. You can do everything with it. You can use it as a programming language.
So the complete language is very complex, but simple SQL queries are very easy to understand and write. There are very many dialects. So basically every database system has its own version of SQL, but the core functionality, so a query like this, will work exactly like this with every database system. Today I will not give you a full introduction to SQL, I will not even give you a formal definition, but we will just learn SQL by example. And that will be enough for the exercise sheet. There are also just a few example queries there. In the next lecture we will be more rigorous about this. But today is the introduction. Okay, enough of the theory. Now let's look at a concrete engine. So you need systems implementing this. A system implementing this stuff has this nice name, which you will find a lot in the literature: relational database management system, RDBMS. So you should remember this, RDBMS. This is a very common abbreviation, not acronym. Is it an acronym? No, okay. So there are very many RDBMSs on the market, commercial as well as free, open source as well as closed source. Maybe you have heard of some of them: MySQL, Postgres, many of them have SQL in their name, Microsoft, SAP, the big German company, has their own database system, Oracle. Many companies have their own system because they rely on database technology. And we will use SQLite 3 in the lecture. Actually, setting up a database system can be pretty hard. Not SQLite 3, I will show it to you in a second. It's fully functional, so it can do everything, really everything you expect. And it's super easy to use, so that's why it's called Lite, and 3 because it's version three. It's not the most efficient system, but that doesn't matter here. Actually many, many other frameworks which use a database use SQLite 3 in the background, because it's so easy to use and very comfortable.
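Since SQLite also ships with Python as the standard `sqlite3` module, the basic query from above can be tried without installing anything. This is only a sketch: the row "Some Mediocre Movie" is hypothetical and was added so that the `WHERE` condition has something to filter out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, year INTEGER, score REAL)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        ("The Shawshank Redemption", 1994, 9.3),
        ("The Dark Knight", 2008, 9.0),
        ("Some Mediocre Movie", 2000, 5.5),  # hypothetical row, not from the lecture
    ],
)

# Keep only the columns title and score, and only rows with score >= 8.0.
query = "SELECT title, score FROM movies WHERE score >= 8.0;"
result = conn.execute(query).fetchall()
conn.close()

for title, score in result:
    print(title, score)
```

The query string itself is plain SQL and should work unchanged in other database systems; only the Python wrapper around it is specific to this sketch.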
And we will now provide a crash course. And as you can see, this section is just four slides. This was already the first slide. It's very easy to use, a basic database system which is very nice, and you will also use it for the exercise sheet. So you can just install it like that. Now I'm not sure, is it installed on our Linux image, Sebastian? Maybe not, right? But you can install it yourself. So on Debian or Ubuntu you use apt, if you have a Mac, you use brew, if you have CentOS, you use yum or whatever. There are two modes to start it, simple enough. sqlite3, then you get into interactive mode. So I can just type sqlite3 here on the command line. Now I'm in SQLite, and it's like how you can also use Python interactively. And now, I think the hardest part of SQLite 3 is getting out of SQLite 3. So if I do this, this doesn't work. Oh, it worked. Oh wow, this is new. So if I do quit, it doesn't work. Okay, exit doesn't either. So you see it's not so easy. Control-C doesn't work. But it's .quit. Now you know it. Whatever you do, maybe in the system you will create tables and stuff, you will ask queries, if you want to record this, you call it with a file name, and that's then the database, and it's stored in that file. And if that file exists and you call it like this, then you already have all your tables in there. We will see that in a second. We will work with SQLite quite a bit in the following. Now important, there are two types of commands. One is the SQLite commands, they start with a dot, so we've already seen the most important one here, .quit. That's not SQL, that's SQLite, so it's from this particular system. These commands start with a dot.
And then there's SQL. So the SQLite commands, those of the database system, start with a dot and have no semicolon at the end, and SQL commands don't start with a dot but have a semicolon at the end. So that's really confusing but that's how it is. Okay, so that's already the second slide, and here are some useful commands and we will put them into practice in a second. So somehow, if you read data from a file, you have to say, and let's just start right away. I use NeoVim, let's have a file, movies.tsv. This is also something you should do for the exercise, and maybe let me use that occasion to already jump to the exercise sheet. So here's the exercise sheet, see if it works. Okay, this is Geir Heim. So for the exercise sheet you have to work with data, and it's part of the exercise sheet to gather the data, because you have to come up yourself with how to put that data into tables. So we can't really give you any data, because if we give you the data, we are already giving something away about how you should do it. So you should really start by looking up the data yourself. We will keep with the movie theme. We have selected four movies for you. Fargo, Kramer vs. Kramer, Three Billboards Outside Ebbing, Missouri, and Titanic. Who knows all four of these movies? What? Who knows three out of four of these movies? Three? Okay, who has watched two out of four of these movies? Who has watched none of these movies? None, okay, and one out of four? Okay, that's most, one. Which one is it, Titanic? Interesting, wow. What are you doing in your free time? Anyway, I think half of my education is from movies. Okay, you can click on these links. They are not underlined. You can also click on this link here. We don't underline them. If you go there, you get to the IMDb movie page. So here we have, yeah, I don't want to buy a car right now, here we have Fargo.
Fargo is a great movie and you have all the information you need. So that's the first part of the sheet: just go to the pages of these four movies and then gather some data. The IMDb score, the year, who directed it, who produced it. Not all the cast, that's too much work. So this is some work, but it shouldn't be too much work, which is why we kept it low: four movies. Three actors, take the most prominent ones, at least one male, one female, including the character played. So here you see the plot. So it's Frances McDormand, she played Marge Gunderson, police officer. So that's the kind of data you should gather. And the Oscars won by the movie. Now Oscars are synonymous with Academy Awards. There are very many of them. Again, for the sake of keeping this simple, just restrict to these categories: best actor, best actress, supporting actor, supporting actress, best director and best picture. Best picture is what the producer gets. So you start with that information and then you enter it, and that's already the second exercise, into TSV files, tab-separated value files, which we will now do for our movie data. And here I already gave you the data in a table, so where is it? Where is my table? Oh, it's up there, here. That was this table. But you have to think for yourself how to design that table, which is why we are not giving you the table. This is just here for explanation. So how does it look as tab-separated values? So I have title, year, score. Title. Now to type a tab in this editor, it's Control-V and then tab, because if I just do a tab it's a command, right? So let me just delete this again. So I do Control-V, you don't see it, tab. Now I get it. Tab character, then I get the year, and then I get the score. And now the first movie is The Shawshank Redemption. Okay, you see some magic here. The score is almost right, 9.3. Second movie is The Dark Knight. Okay, this was wrong here. The Dark Knight, 2008, 9.0. Not bad. See the magic. Inception is the third one.
Not bad, right? For those people criticizing large language models, they should try this. It's just magic. Fight Club, man. I mean, I haven't even typed the first letter. This looks like this thing knows my slide, right? I mean, this is just amazing. I just wanted to show this to you in case you have friends who say, yeah, these language models, they are not that good. So now we have our table, it's a very simple table, it's there, and now we wanted to learn some SQLite. Let's go back to that slide. And now, yeah, let's just do it interactively for now. So I just do the following: sqlite3, and, why not, store everything in a database file. So this is initially empty. Okay, I think I have to explain something first. Now I have to import that table into the database, this TSV file. I somehow have to read it, and we have seen a command here. It works like this. Okay, I think I can do it: .import movies.tsv movies, yeah. And now it's in there, so it did it. It can just read the TSV file and make a table out of it. If I type .schema, okay, that's interesting, it shows me the schema of the table. And this thing here, we will learn it in a second, that's already some SQL. So it shows me it's movies here, title, year, score. It's all text, we will see more about this in a second. I can write a simple SQL select statement here. Select star, so everything, from movies. This is about the simplest SQL query. Select everything from a table. Without ever having seen SQL before, it should be clear what this does. Select everything from that table, and this is now our table in the database. So you see, really simple to use. Now we already know how to quit. I leave it, and now I have a movies.db here, and you see it has some content.
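What the import does can be imitated in Python with the standard `csv` and `sqlite3` modules. This is a sketch of the idea, not of how SQLite implements it: the header line supplies the column names, every column is created as text, and because a database file is used, the table is still there after closing and reopening. The file locations are temporary paths created just for this example.

```python
import csv
import os
import sqlite3
import tempfile

workdir = tempfile.mkdtemp()
tsv_path = os.path.join(workdir, "movies.tsv")
db_path = os.path.join(workdir, "movies.db")

# Write the TSV file: a header line, then one movie per line.
with open(tsv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["title", "year", "score"])
    writer.writerow(["The Shawshank Redemption", 1994, 9.3])
    writer.writerow(["The Dark Knight", 2008, 9.0])
    writer.writerow(["Inception", 2010, 8.8])
    writer.writerow(["Fight Club", 1999, 8.8])

# Read it back: first line is the header, the rest are rows of strings.
with open(tsv_path, newline="", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="\t")
    header = next(reader)
    rows = list(reader)

# Create a table with one TEXT column per header field and insert the rows.
conn = sqlite3.connect(db_path)
columns = ", ".join(f'"{name}" TEXT' for name in header)
placeholders = ", ".join("?" for _ in header)
conn.execute(f"CREATE TABLE movies ({columns})")
conn.executemany(f"INSERT INTO movies VALUES ({placeholders})", rows)
conn.commit()
conn.close()

# Reopen the file: the table and its rows are still there.
conn = sqlite3.connect(db_path)
stored = conn.execute("SELECT * FROM movies").fetchall()
conn.close()

print(stored)
```

Note that year and score come back as strings here, which mirrors the observation that the imported table is all text.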
So it has content because I have called it like this, with a file name. If I had called it without an argument and I left the program, the data would be gone. Now if I call it again, it knows the table, it even remembers my commands, arrow up, arrow down, and it's just in there. So simple enough, right? That's SQLite 3, you will work with this, and it really is a great program, couldn't be any simpler. Okay, back to the slides. So you can import, you can read something. We will also do that, I think I will show it to you right now. Let me remove my movies.db again, my database. And let me write a file movies.sql where I just do what I just did, and I write it in a file. The import of movies.tsv into movies, and then let me write that query. Thank you. I don't have to write anything. Select star from movies. That's what I just wrote. And if I want SQLite to execute these commands, I can do it like this, for example: sqlite3 with movies.sql as input. So this reads the commands from standard input. I can also do it like this: this will just output the commands in movies.sql and pipe them into sqlite3. I don't have to give sqlite3 any arguments. Now it will not store it in a database file. It will just process these commands. So now it just imported the data and executed the commands. And now you already understood a lot about how SQLite works. That's basically all you have to understand. How you create a table, how you can import data, how you can ask SQL queries. So you can just have a file with commands, then you can pipe it. That's also how you should do it for the exercise sheet. Pipe it into the program and then you get results. Really nice. Just a random selection here. There are so many commands of course.
There is .timer on if you want timing information. There is help of course. Let me just show it to you very quickly so that you have seen it once. .help gives you help. You can also do .help on a particular command. Now confusingly you don't write the dot there, so it's .import, but .help import. Now you get information about how that works. So it's also very nice, you have the documentation inside the program. SQL commands, this is how you create a table. And we will work more with that in the following. You write CREATE TABLE. Actually we have seen CREATE TABLE IF NOT EXISTS. We don't need that here. Without it, you get an error message if that table already exists, so that you do not overwrite it, because this, I think, would create an empty table. And here we see a few things. I haven't talked about them yet. I will now. This is of course the name of the table. These are the names of the columns. And here we have the data type. This is exactly the domain from our earlier definition, right? And you can understand it by just reading the name. This is text, any string. This is integer and this is floating-point number. That's just how you do it, I think not only in SQLite 3 but in all database systems. These are very common names. And of course they are different from how you do it in programming languages, because every language needs to have their own names, just to maximize confusion. Launch a SQL query. Yeah, it's just another example here. I will not try this out. It's also obvious from this table: select all columns. I could have also written star here because it's all columns, but I can also list them, separated by commas.
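The table creation just described can be sketched with Python's `sqlite3` module. The type name `REAL` for the floating-point column is an assumption about the slide; `TEXT` and `INTEGER` are the common names mentioned above. The sketch also shows the difference between a plain `CREATE TABLE` and `CREATE TABLE IF NOT EXISTS` on a table that is already there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Table name, column names, and one data type (domain) per column.
create = "CREATE TABLE movies (title TEXT, year INTEGER, score REAL);"
conn.execute(create)
conn.execute("INSERT INTO movies VALUES ('The Dark Knight', 2008, 9.0);")

# A second plain CREATE TABLE with the same name is an error.
try:
    conn.execute(create)
    second_create_failed = False
except sqlite3.OperationalError:
    second_create_failed = True

# With IF NOT EXISTS there is no error, and the existing rows stay untouched.
conn.execute(
    "CREATE TABLE IF NOT EXISTS movies (title TEXT, year INTEGER, score REAL);"
)

# Listing all columns explicitly gives the same result as SELECT *.
listed = conn.execute("SELECT title, year, score FROM movies;").fetchall()
starred = conn.execute("SELECT * FROM movies;").fetchall()
conn.close()

print(second_create_failed, listed)
```

In both cases the existing table is protected: the plain statement refuses with an error, and the `IF NOT EXISTS` variant silently does nothing.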
So as I say, we will do it by example here. You learn this: they are comma-separated, otherwise you will get an error. And then you can also specify some conditions here with WHERE. If you do that, you only get the movies where the score is greater than or equal to 8.0. You can also delete the table, with IF EXISTS. Without that, you will get an error message if it does not exist. On the last slide, but you can also just Google, I mean, it's not the job of the lecture to provide you a manual or something, there's the SQLite page and there's also a page, which I think is the one linked, it's called lang.html. Yeah, here you have SQL as understood by SQLite 3. Oh, here we have ALTER TABLE by the way. It's the first one. VACUUM, so if you need to clean your house, you can use that one. So it's not that many commands. Okay, that's the SQLite part. And now, because we're just getting started, let's start with the design part. That's the main part of the exercise sheet. And as I said, this is a brand new lecture. It's so funny when you think about things: sometimes you have thoughts in your mind, and in your mind it's just one thought, but when you work out that thought you realize, oh, that's a big thought. It's a lot of things. Database design, in my mind it was one slide, now it's 11 slides, because actually there is a lot about database design, and it's the main part of the exercise sheet. Let's see what I mean. Three Billboards Outside Ebbing, Missouri is one movie? Yes, yes, it's one movie. It's a very good movie. Three Billboards Outside Ebbing, Missouri is one movie. That's the name of the movie and there's a comma. We deliberately included a movie with a comma, just so that you know that there are movies with a comma in the title. So we have seen one table, movie. Oh, it's called movies, right? I think it's just wrong here, it's movies. I think it was called movies, yeah.
And it's important to note, let me maybe switch not to the previous slide, but just to the table here. Oh, I don't get, ah, I have to type it just enough times with the right frequency. Okay, never mind. So here's our table and actually let me just show it more nicely. Columns minus s. That's I think a way to show, no, column, it's called column, not columns. Yes. So this is how you show a table nicely on the command line. You do column minus s, this is a tab character here, minus t, and then you see it. So that was our table, and note that every movie has exactly one year, exactly one score. This is why this is kind of the simplest thing you have in a database. Every movie has exactly one year, exactly one score. It's like functional. You give me a movie, I give you the year. You can't have two scores. But that's not typical. Most data is not of that kind. Now let's talk about actors. Who played in that movie? Well, there's not just one actor for a movie. There are many actors for each movie. But how do we do that? What do I do now? Do I have an additional column actor here? And how do I do it? And now let's go through it, and I deliberately act this out in the following by just asking: what are the first things that come to mind, and are they good or bad? And this is really the central question. It's what's often done wrong. For example, HISinOne, which you all know, I'm sure, has a terrible database design, mostly because it's hard to do database design. And let me also say that this is so important, it's the first thing you do, right? You start creating a database, and the first thing you have to do before you even start asking queries, you need a design, you need to decide how do I put my data into tables, and then you are stuck with it. Yes, there is altering tables, but you should not do it, and you're stuck with it.
Once you have a certain design and then later you realize it wasn't that good, too late. So it's a very important problem, which is also why half of the sheet is about it. So what are the first things that come to mind? And these are solutions. I mean, I didn't put solution in quotes because it is a solution. Let's just have an additional column and let's call it cast. Cast is actually, I should have called it, yeah, let's, we come back to this. Oh yeah, I already said this is a terrible solution. Okay, I should have changed the order of this. Let me change the order of this before I, you haven't seen this. Just forget it, just delete it from your memory and look away, just look away. So, slide, bam. So here we have a table. Is this a good solution? Just look at the table and think about it. You haven't seen the red thing, the last minute was erased from your mind. I need a pendulum. The last minute has been erased from your mind. This doesn't look so bad. And actually that's important. People do it like this. People do this stuff. They say, yeah, it's a database. And the database of course allows you to do this stuff. So here I have the actors, Shawshank Redemption, or let's maybe take another one, I don't know, Inception: Leonardo DiCaprio, Elliot Page, Marion Cotillard, and other actors, separated by commas. Okay, it's a terrible solution. Why? Let's just look at one query: do The Dark Knight and The Shawshank Redemption have an actor in common?
Now I have to take these two strings and somehow parse them, separate them by comma, put them in a hash map, stuff which actually the database is supposed to do, I now have to do myself, check whether they have one in common, and maybe there's an actor which has a comma in their name. I'm sure that there are movies which have a comma in their name, so probably also an actor, so that's bad. Also, let's say I want more information, like Morgan Freeman or what did we have? Oh, I don't have the... I don't know. Christian Bale plays Batman here. I want the information that he played Batman. How do I do this? Do I put it in parentheses? A new column, also with a comma separated list, and then there's a correspondence? So okay, that's not a good idea. So let's just do it like this, and again I have a look away. It's a new lecture so these things can't all be perfect. Just look away, you haven't seen this. So is this a good solution? Let's just repeat, I mean it's tables, I can only have one thing in each row, so just repeat it. Shawshank Redemption, 1994, 9.3, Morgan Freeman. Shawshank Redemption, 1994, 9.3, Tim Robbins. One row each. If a movie has 27 actors, I have 27 rows. Actually, not that bad. And actually, if you do queries later, you will get exactly such tables. So you might say, that's a stupid idea. It's actually a good idea, but you shouldn't create tables like that. If you create tables like that, just think about typing it, but it's also very redundant, right? I mean, I have the title now 27 times, and the database has to store it 27 times. Maybe it uses compression, but it's not a good idea. So there are some comments in the chat. No, because, oh yeah, that was just... Better a new table just for actors and link it, okay, you're already looking into the future. Yes. Exactly.
Yeah, you are just thinking with us, so in the chat there's already some discussion on how to do it, but that's not too bad. I have a problem understanding you, either you have to speak up or someone has to close the... It's actually bad too because you could have two actors with the same name and then you have no idea. Oh yeah, that's another very good point. What if I have two actors called Morgan Freeman, then I have a problem. Thank you. So we have two problems. We are repeating information, and we might have two actors, or two movies, with the same name. Okay, let's look at the third solution. And again, just look away. Just look away. Copy effects to slide. So I have a separate table where I just have this information. So now I don't repeat the year and score information, I just say okay, the movie, but here I have some repetition. So I say okay, this movie has this actor, this movie has this actor, this movie has this actor. And the year and score information is in the other table. And I would create it like this. So whenever I specify a schema now, I would just write the command which you would write in a database management system. This creates, and this is also wrong here because cast is missing. I have to specify the name of the table, and the semicolon is also missing. These are just normal things when you create new slides. So that's much better. And also note, if you wanted now additional information about the role, so for example Christian Bale is playing Batman in The Dark Knight, we could just have an additional column here with role, Batman, it's very easy now, very natural. So that looks good. But there's still a problem and you named it. You named the problem.
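As a sketch of this third solution, the snippet below uses Python's built-in sqlite3 module with a separate table that has one row per movie and actor plus a role column. The table name movie_cast is a placeholder chosen here to avoid any clash with SQL's cast operator, and only the Batman role comes from the lecture; the other roles are left empty. With this layout, the earlier question whether two movies have an actor in common becomes a single query instead of string parsing.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# One row per (movie, actor) pair; movie_cast is a placeholder table name.
con.execute("CREATE TABLE movie_cast (title TEXT, actor TEXT, role TEXT);")
con.executemany(
    "INSERT INTO movie_cast VALUES (?, ?, ?);",
    [
        ("The Shawshank Redemption", "Morgan Freeman", None),
        ("The Shawshank Redemption", "Tim Robbins", None),
        ("The Dark Knight", "Christian Bale", "Batman"),
        ("The Dark Knight", "Morgan Freeman", None),
        ("The Dark Knight", "Maggie Gyllenhaal", None),
    ],
)

# Actors appearing in both movies: the database does the set intersection.
common = con.execute(
    """
    SELECT actor FROM movie_cast WHERE title = 'The Dark Knight'
    INTERSECT
    SELECT actor FROM movie_cast WHERE title = 'The Shawshank Redemption';
    """
).fetchall()

# The extra column makes role information a plain lookup.
bale_roles = con.execute(
    "SELECT role FROM movie_cast WHERE actor = 'Christian Bale';"
).fetchall()
```

This sketch still identifies movies and actors by name, so it keeps the ambiguity problem that the lecture turns to next.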
So let's just go to the next slide. Here it says, this guy plays in this movie. She plays in this movie, she plays in this movie. Well, the title looks like it's unique but it's not. There are three movies called All Quiet on the Western Front. So one from 1930, it's a very popular movie, it's after a popular book about the First World War. And there's a remake and there's another remake. And actually here's the Wikipedia page. This is why half of Wikipedia pages are disambiguation pages, right? It says here, yes, okay, yes, I will donate money but not right now. See, there are three movies here. So if you just say All Quiet on the Western Front, it's not clear which one, not now, which one you mean. And there's even a song, there's a book. So that's not unique. Actor names, oh yeah, I found one for you, Anne Hathaway. There's actually at least two Anne Hathaways. I mean, there are many Anne Hathaways in the world, but there are two actresses. The one which you, do you know Anne Hathaway at least? Yeah, okay. I see some nodding. And there's one born 1556. She was also an actor and she was the wife of William Shakespeare. Okay, let's see, yeah, Anne Hathaway, wife of Shakespeare, it's even written there. And there's Anne Hathaway, the American actress and singer. So, very important to understand, these are simple concepts, but you have to understand them. Names are good for human communication. I say, do you know Anne Hathaway? You say, yeah. And you think of the Anne Hathaway. You don't think probably of William Shakespeare's wife. If you are into history or something, you will probably ask which one, or you will think it's that one. And the URLs of Wikipedia pages are actually a very good example, right? So if I go to here, that's Anne Hathaway, the Wikipedia page is just called Anne Hathaway.
And if I go here, it doesn't say Anne Hathaway 2, it says Anne Hathaway, in parentheses, wife of Shakespeare. So that's the solution Wikipedia has found for that problem: if there are many of them, one of them gets the name without further notice and the others get the name with something in parentheses. So that's also a way we could have done that, right? Making names unique by adding something. You encounter that a lot. Yes, cities for example. Berlin, there are probably 17 Berlins or 27 in the world. So what do we do, and you already said it in the chat, we have IDs. So for example for the table for movies, yeah, I just add an ID for everything. I have a separate table with persons and I have a separate table with the cast, and I will just show you an example here. So this is how we will do it. And I mean I could have just shown this to you, but there's already significant thought and understanding of all the problems behind it. So this is really, really important. That's maybe the most important non-trivial thing to understand in this lecture: why this is not only a good, but really the canonical and the best design for this kind of data. So I now have a table for movies. I'm not saying which one is which because it's obvious. Table for movies, like before, but I have something for the ID. What do I take as ID? I mean here I have one row per movie. I could also have taken one, two, three, four, but I took the IMDb movie ID. IMDb has, let's just, where do we have it here, Fargo, right here in the URL, maybe a bit small for you, that's the IMDb ID. So actually every system will have their own IDs. Wikipedia pages have their own IDs and we have seen Wikidata. Let's just look at Anne Hathaway here. Wikidata is for all languages together. They also have IDs, of the form Q386301. IMDb has their IDs, they start with tt.
And Wikipedia articles use text, with the parenthesis stuff to make it unambiguous. So we have the IDs here. Now we have a table with persons. It could be any persons, they do not have to be actors. We can easily add more information here. I added birth date, and here I took Wikidata IDs. For the exercise sheet it's up to you which IDs you take. And now I have a third table, and this table is now already hard to read for us, but that's the way to go. What's this information? Well, this says Q48377, which is Morgan Freeman, acted in the movie 172241, which is The Shawshank Redemption. Now I got rid of all the problems I named. And you see here, one thing remains, but that is inevitable. This you also need to understand. I do have repetition here. It's not like the movie is just there once and then I have all the actors comma separated, or their IDs. I'm repeating the movie here. If The Shawshank Redemption has 27 actors in the cast, I will have 27 rows with the same movie ID. Which by the way is why compression is very important for database efficiency. But I don't think we will have that in this course because we have so many other things. Usually I have a lecture only about compression. You have these rows with runs of the same number, and compression is just super important when you store this. Okay, any more questions on the, let's see, yeah? Did we just discuss the five normal forms? Come again. Did we just discuss the five normal forms in the database? The normal forms, what do you mean? When we normalize the database to simpler tables like this. So what exactly is your question? Is it the same as normalization? Okay, let's see what's defined as database normalization. So what's the difference? It's a good question. What's the difference between normalization and design? Yeah, this is now a question about terminology.
I always use database design for this. You now ask what's normalization. Let's maybe take this offline. It's a question about terminology. It's a good question. So right now I don't understand the subtle, I mean we can also search database design and see what it, yeah. So it kind of looks similar, right? So I'm not sure now what's the difference. And maybe there isn't a difference, maybe it's used synonymously. Okay, and now I think we will make a break in a second. And before we do the break, let's go to the exercise sheet so that you understand. And please do the sheet. It's a very nice sheet, it's simple and you learn a lot. And just so that you understand why we give the exercise the way we give it. Now exercise two is: come up with something like this yourself, and now the data is more complex for the exercise sheet, right? You have movies, you have the year, you have the score, you have who directed the movie. The movie can have more than one director, it can have more than one producer, actors, you have the Oscars, you have who won an Oscar for which movie. And what Oscar is it? There could be more information, but that's it. And the question is, how do you organize this into tables? One, two, three, four, five tables. And if we had given you the data, we would have done the job for you, but it's your job to think. And do this exercise, because this could also be a task in the exam where you have to come up with a design yourself, and you just have to do it once to understand why it's non-trivial. So that's why we let you gather the data yourself, and it's so little data, so you just have to type it into TSV files yourself and then come up with your database schema. Okay, and I think that's a good point to make a break and we resume in five minutes. So this was like the most important slide. Let me just say it again. I could have just given this to you and you would have said yeah.
And I don't repeat it again. I just repeat the meta statement that there's a lot of thought and reasoning behind putting the data in tables like this. This is something you should absolutely understand, and I recommend doing the exercise sheet. And it's non-trivial, you already see it by looking at the tables. Okay, and here comes another thing, primary key and foreign key. And I think it's best understood, so are you back listening to me? So we have these IDs here, and the IDs are called keys in the database world. So there's always a table introducing the ID, which is this table for movies here. I mean there's no order of the tables, but this table is kind of introducing it. So I have a line for every movie. So this column is called the primary key of this table because it defines the IDs. Here I also have a primary key, it defines the IDs for persons. And just note, I can use the same column name in different tables, nothing wrong with that. And here I am using the IDs, and I don't have to call the columns the same way. Actually I can't call them the same way. I can't call them ID and ID. I mean I have to give them different names, obviously, in the same table. So here I call it movie ID. But I can tell the database: look, these IDs here are actually the IDs defined there. So this is the primary key, and this is the foreign key referencing that primary key. That's simple enough once you have understood this design, and you can tell that to, yeah, that's what I just explained and that's how you specify it. It's just like before, the only difference here, let me underline it, that's the only difference to before. I just write at the end, when I'm creating that table, this here is primary key, and now the database knows, okay, that's the table introducing that ID. I do the same for persons. So compared to what we had before, that's the new thing.
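A minimal sketch of the primary key declaration just described, again with Python's built-in sqlite3 module. The ID values m1 and the second movie are placeholders for illustration, not real IMDb IDs or lecture data. The point is only that, once a column is declared as primary key, the database itself rejects a second row with the same ID.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Same table as before; the only new thing is PRIMARY KEY on the id column.
con.execute(
    "CREATE TABLE movies (id TEXT PRIMARY KEY, title TEXT, year INTEGER, score REAL);"
)
con.execute("INSERT INTO movies VALUES ('m1', 'The Shawshank Redemption', 1994, 9.3);")

# A second row with the same ID violates the uniqueness of the primary key.
try:
    con.execute("INSERT INTO movies VALUES ('m1', 'Some Other Movie', 2000, 5.0);")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = con.execute("SELECT COUNT(*) FROM movies;").fetchone()[0]
```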
And now if I have a table cast, that's what's new compared to before. I'm saying, okay, here's an ID. And actually that refers to an ID defined in a different table. So that's just the way how you specify this in SQL. And then you have the advantage that the database can do checks. I mean you can omit this, it's still correct, but if you do that and now you insert something that references an ID and that ID is not there, right? That could easily happen now here. I enter something here with an ID that's not in that table and it will say, no such ID there. And also if I insert another movie here with the same ID as one that's already there, then it will say, no, the ID has to be unique. So that's why typically you will do that. And since keys are something that's succinct, often integers, the efficiency overhead here is okay. But it's important for the integrity of your database. Right? If something goes wrong here, everything goes wrong. If you have the same ID twice, or here you reference things which don't exist, probably you have mistyped, something is wrong. So that's a very important integrity check. Okay, and I think, yeah, let's just do that maybe as a preparation for the last part. So that's the second part. I have just prepared these movies here. So now I have my movies file here. It looks like this, and maybe let's look at them in file form. So that's the tabular, okay, that's this movie. Then we have persons, that should be now exactly the table from the slide. So we have just four, Morgan Freeman, Tim Robbins, Christian Bale and Maggie Gyllenhaal. And then we have cast. Oh, I called it roles here because cast is actually a keyword in SQL. That's why the name is a bit unfortunate. So that's exactly, I hope, the data from the... Okay, now let's just, I would say we just read this into our...
and let's just, yeah, let's maybe do the following. Let me read the files here into that file for a second, just so we see them. No, that was not the one. So let me just do the following and you will see in a second why I'm doing this. Movies.tsv, so that's the movies.tsv file. Then persons.tsv. Let me just read it into the file so that we see it. And let me also do cast, roles.tsv. You see it here, when I type cast, it's highlighted because it's a keyword in SQL. So it's not a good idea to call this table cast. That's why I called it roles.tsv. Okay, so these are my files. I will delete them in a second, it's just so that we can see them. And now I, oh, I wanted Neovim, okay. Now I want to create the tables for them, so let me create the tables for them. And you see I get, that's also amazing, right? So I'm showing this to you, I mean I could deactivate it, but actually I'm showing it to you on purpose. Let's go to the slide, where are the slides? Yeah, let's look at that one. That was what I wrote here on the slide. I'm sorry. ID text primary key, title text, year integer, score real. Ah, one thing is different. What's different? ID is integer. Yeah, on the slide ID is integer. And I'm saying ID is text because I'm using text IDs here. I mean that's okay, I could also say this is an integer. I'm making it a little more inefficient, and actually I don't have a space here. And, yeah, so that's my line. So that's creating this table. Now I want a table for per..., okay. See, it learned that I'm using text here. I think that's right, birth date. Okay, for date, actually I want date. That was not quite correct. And roles. Okay, it's doing it a little bit differently. So here I have movie ID, text, references movies ID. There are different ways to say this in SQL. Let me just... Okay. References, and I don't want the primary key. Okay, that was on there. I thought about whether I should show you this or not.
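The three create table commands with a references clause, and the integrity check on foreign keys, can be sketched as follows with Python's built-in sqlite3 module. The IDs m1, p1 and p999 are placeholders, not the IMDb or Wikidata IDs from the slide, and the birth date is left empty. One detail worth knowing for SQLite specifically: foreign keys are only enforced after PRAGMA foreign_keys = ON has been set on the connection.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# SQLite checks foreign keys only if this is switched on for the connection.
con.execute("PRAGMA foreign_keys = ON;")

con.execute("CREATE TABLE movies (id TEXT PRIMARY KEY, title TEXT, year INTEGER, score REAL);")
con.execute("CREATE TABLE persons (id TEXT PRIMARY KEY, name TEXT, birth_date TEXT);")

# Each column of roles refers to an ID defined in another table.
con.execute(
    """
    CREATE TABLE roles (
        movie_id  TEXT REFERENCES movies(id),
        person_id TEXT REFERENCES persons(id)
    );
    """
)

con.execute("INSERT INTO movies VALUES ('m1', 'The Shawshank Redemption', 1994, 9.3);")
con.execute("INSERT INTO persons VALUES ('p1', 'Morgan Freeman', NULL);")
con.execute("INSERT INTO roles VALUES ('m1', 'p1');")  # both IDs exist: accepted

# A row referencing a person ID that is not in persons is rejected.
try:
    con.execute("INSERT INTO roles VALUES ('m1', 'p999');")
    dangling_rejected = False
except sqlite3.IntegrityError:
    dangling_rejected = True

role_rows = con.execute("SELECT COUNT(*) FROM roles;").fetchone()[0]
```

Without the PRAGMA line the second insert into roles would be accepted silently, which is exactly the kind of mistyped ID the check is meant to catch.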
It's pretty amazing that it's completing these lines, right? That's pretty amazing. I mean, I just have the tables here and it has to understand quite a bit, right? It has to understand what this data means, that's why I pasted it here. I thought about whether I should, I'm showing it to you because maybe you're also using this or now you want to use it. It's a very useful tool but it's super dangerous for learning, right, because you saw, it does it, but it's doing these small things differently and sometimes also wrong. It's making mistakes. And if you just use this while you're still learning, you are not learning, you don't recognize the mistakes. It's a gigantic tool. It's amazing, it's magic, but it's very dangerous when you're just learning. So be very careful, don't use it, maybe use it, but then be very careful, that's very important. Okay, now we want to read the files. So we have created the schema and now we want to read the files. Okay, I don't even have to, not bad, right? But that was like boilerplate almost. Okay, and now I think we can delete these files. Not bad. So, and now we could just for fun have a select query here. Yeah, why not? Select, movies, thank you. And now I can just do the same thing as before, SQLite, pipe this. Now, this is also what you will do for the exercise sheet, just for your data, there's more data there. I pipe these commands, right? Let me just show them to you again. This is just create a table, import the data, ask the query, you can do this all in one. And just note how simple this is, right? It couldn't be any simpler. So, except that it's wrong. What did I do wrong? I have an idea. Something with separators, yeah, exactly. It was on a previous slide. The default separator, I'm not sure what it is. Either it's white space or a comma, but there should be a, yeah, exactly.
These are tab separated files, so I should say, you can also write the tab separator like this. And you see, I have to type this and now it has seen it often enough. So what's the question in the chat: what is the auto completion tool called and how can I get it? Really, you want to get it? It's Copilot, that's Copilot and that's Neovim. But you shouldn't use it. Only, it's not so easy to install. Actually, I think that's a good safety measure, making it hard to install. So you need to pass a test before installing it. But I think others, I don't know, VS Code or something I think comes with Copilot. Yeah. Exactly. So let's try again. Yeah, now it works. So now just understand what I did. It just defined the separator to be tab, tabulator, it created the tables, it imported these things, it executed the query. I mean, this is so nice, right? It couldn't be any nicer. You have this powerful software and this now also works with tables with millions of lines or anything. It just works. And here I have the table, all movies. Very, very nice. So here's one thing to think about. The relationship between tables. There are actually four kinds of relationship: one to one, one to n, n to one and n to m. And let me just look at one of them and then we just quickly do it for three table pairs. One to n, that's a common one, means if you look at the one table, then each ID there can occur multiple times in the other table, but when you look at the other table, then each ID there can only occur once in this table. So this is also called one to many, maybe I should write that here because these are also common names, and that's not only in databases. That's one to many, right? An ID here can occur multiple times here, but not multiple times here. And this is called many to many. And this is called many to one and this is called one to one.
Okay, and now let's go back to the slide where we had these three tables. And now the first question is, when this table is A and this table is B, this table to this table. So we have IDs here. Is this to this one to one, one to many, many to one, or many to many? One to many. One to many, that is correct. It's one to many because each ID here can occur multiple times here, but here I have each ID only once. What about this to this? Same. Same, one to many, that's correct. Each person here, and what does it mean? Each movie can have many actors. This is also one to many, because an actor can play in multiple movies. What about this to this? Now these are not directly related, but they are related via this one. Yeah, that's many to many, that's correct, for the reasons I just said. Every movie can have multiple actors, every actor here can act in multiple movies. So that's just an aside, let me just write it here on the slide. So one to one is rather rare. So this is movies to cast, so I call it cast here. That was one to n, one to many. This was also one to n, and this was n to m, many to many. Yeah, okay. And this is typical, understand this: when you have the table with the primary key and relate it to the table with the foreign key, that is almost by definition one to n, it could also be one to one, because the primary key is unique. Okay, last part, SQL basics. So let's quickly go to the exercise sheet. The third part of the exercise, now you have done the hard part, you have done the design, and hopefully you have done a good design just like I did now. And now you want to ask some SQL queries, and here we just give you three examples and you have to find the right SQL query. Which year was Titanic released and what's its rating? Who directed Fargo? Which actress won Oscars for which roles in which movies and in which categories?
And the interesting thing you will see here is, this looks like, oh, that's a simple SQL query, but now your data, excuse me, is in four tables or five or I don't know how many, and you will see that even very simple questions now can have pretty complex SQL queries, and you will learn something while doing it. So I highly recommend that you do it because it will be very useful for understanding the... Yes, but you shouldn't use it. So, that's the last part of the exercise and you will learn a little bit of SQL. So here's the basic form of a SQL query. We have already seen some. I need to rest my voice for three seconds before I continue speaking. We have already seen some SQL queries. SQL is a very complex language. There are a lot more constructs, but the basic SQL queries often have these three components. Select, and then you have a comma separated list of column names: I want these columns in my result. From, and then a comma separated list of tables. So far we have only seen one table, you can actually have multiple tables here, always comma separated. And then where, you specify an expression, a Boolean expression, which says I want these rows in my result. Score greater equal 8.0. We have seen that. That's a Boolean expression. You can evaluate it on a row. You look at the row. Score greater equal 8.0? If yes, true. If no, false. If true, the row is included. So that's the basic form of a SQL query. I think just by seeing one example, you already understand a lot. And for today, that's exactly how I'm going to do it. It's actually very hard to define such a query language formally, so I won't do it today. We will do it more formally in the next lecture. So there we will define the semantics more rigorously. Today we will just do it by example. And in the following, that's really all the rest of this lecture, we will understand this query here. And it's interesting. It's interesting and not trivial.
Because look what I'm doing now. Now I'm saying from persons, movies, cast. I have three tables here. What does that mean? This SQL query, I would say, is now not so easy to understand if you have never seen this before. From three tables, and then where this ID from, okay, this from that, I mean it's kind of clear what it does, but why does it do what it does, we will understand that now. And let's go in order. Let's start with from, then where, then select, because you always need from, you don't need where, and you can have a star here, select everything. So let's start with the one which you always need, which is, and there's again a serious mistake on this slide, namely the semicolon is missing. A missing semicolon is the end of the universe. Where are the white spaces? Oh, oh yeah. It's terrible, because I made some changes. Thank you. Thank you for paying attention. So let's start with the from clause. That's the most complicated one actually. And now pay attention one more time because it's important. It's I think the most important thing to understand about SQL and database queries now. So here's an example query. No where clause, and I'm selecting everything from three tables now. What does it mean to select from three tables? If you have never seen this before, what does it mean to... and let's look at my tables here. Let's just do that: for tsv in star dot tsv. Let's just look at them here, echo the table name, show the table, and let's do it nicely so that I don't have to switch between the slides. And you also learn some things. You learn how to write for loops in Bash, how to not write for loops in Bash. That's because there's a do missing here. No such file, yeah, because the TSV files have a, you have to tell me when I'm making mistakes. So, there we go. That's our tables, so some basic Bash scripting. Everybody should know that, just like you should know all the movies I'm talking about here.
Okay, so what does it mean to do select star from these three tables? What does it mean? Any ideas? If you already know it, I think you shouldn't say anything. If you don't know it, what does it mean? I mean, that's a query you can type. I mean, it's not a query you can type because the semicolon is missing, but I won't edit it now, because then you will see the rest of the slide. What does it mean, select star from these three tables? Any ideas? Yeah? Select each row from every table. Yeah, select each row. This I think is very natural, that you get the three tables back to back. But that would be hard, right? Because the result is one table and these have different columns, you can't concatenate them, right? It doesn't work. Because you always need one table where every row has the same columns. So concatenation doesn't work. Yeah? Yes? One large table that has all the columns, and for like the first row of the first table you have that like three times repeated and it's combined with the... Okay, now you are talking about repeating things. So you already have an idea. It goes in the right direction. If you think of tables like sets. Tables like? Like sets. Sets, that you learn in mathematics. Very good. The Cartesian product of the, like... Okay, very good. You named a lot of very good ideas. So if you think of tables like sets, which is how we defined them, and now we have the product of sets, you said Cartesian product. So maybe you know some mathematics already, sets, Cartesian product, and that goes in the right direction. Actually that is exactly right. But it's kind of not obvious if you have never seen this before.
And let me very quickly go to slide 28, to our definition, which you maybe forgot in the meantime: a table is a set. And now we have three sets and we want to combine them somehow. Let's go back, we are here. And actually what the query does is, if I let these be my tables, k tables, so I can have any number of tables, then this query computes the cross product. It's also called the Cartesian product, that's also right. Why is it called Cartesian? What does Cartesian mean? Yeah, René Descartes, it's after René Descartes, who said funny things like I think, therefore I am, and such things. And he also computed products of sets, apparently. So, cross product: what is the cross product of sets? I have two slides on this because maybe you have seen this before, maybe not. Intuitively it's all combinations. You have three sets and now I take all combinations. And before I go back to that slide in a second, let's just understand it for sets. Let's just do it by example. The formal definition is for when you work through these slides yourself. Here I have two sets. That's like the simplest example maybe. So the result is also a set, and now it's a set of tuples. And I give you the first one and then you continue. You are also language models, in case you didn't know. We are all language models. That's my strong hypothesis. No difference. Your brain works just the same. What's next? What? A2, okay. That's one way to do it. I want to do it in this order: B1. Sorry, that's stupid, I have to write it a little bit smaller. Okay. I start with A1 and now I vary the first component, so B1. What's next in this schema? What does your auto-completion say? A2, yes. A2, B2, exactly.
And now I have A3, B3. Exactly. So it's all combinations of an element from the first set with an element from the second set. And there are two ways I could have varied here. You can think of it like a binary representation or a p-adic representation: this here is like the most significant digit and this is the least significant digit, right? I could have also written it without brackets or commas, just like this: A1, B1, A2, B2, A3, B3. And note one thing, and maybe let's go back to that slide to show it. Check this at home, I'm just saying it here. The size of the cross product is the product of the sizes of the individual sets, and the number of columns, okay, you will see that in a second. But let's just write it down here. There's another example here. What's the size of S1? How many elements in S1? Two. What's the size of S2? Three. What's the size of the cross product? What? Two times three, yeah, six. You could just count. What's important to understand, and that's also why it's called product: the size of the Cartesian or cross product is the product of the sizes. And you should understand this. For sure there will be a question about this in the exam. You have to understand this mathematically. It's not hard, but you have to understand it. This next example we don't do by hand. I have written it down for you here, and let me just write down the sizes as well. You can check in the meantime. So here I have a set of size two, multiplied with a set of size three, multiplied with another set of size three, of course they can have the same sizes. What's the size of the cross product, without counting? 18, yeah, it's 2 times 3 times 3. And understand why that is so. There's something to understand here. Don't just learn that you have to multiply.
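The size rule stated above can be checked with a few lines of Python. This is only a sketch: the element names of the first two sets follow the board example (A, B and 1, 2, 3), while the third set `s3` is a made-up placeholder, since the transcript only gives its size. Note that `itertools.product` varies the last component fastest, which is a different enumeration order from the one used on the board; as sets, the results are the same.

```python
from itertools import product

s1 = ["A", "B"]        # size 2, as in the board example
s2 = [1, 2, 3]         # size 3, as in the board example
s3 = ["x", "y", "z"]   # hypothetical third set; only its size (3) is from the lecture

# Cross product of two sets: every element of s1 paired with every element of s2.
pairs = list(product(s1, s2))

# Cross product of three sets: every combination of one element from each set.
triples = list(product(s1, s2, s3))

print(pairs)
print(len(pairs), len(triples))
```

The reason the sizes multiply: for each of the 2 choices from the first set there are 3 independent choices from the second, and for each of those pairs there are again 3 choices from the third.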
Understand why you have to multiply it. This is the cross product for sets, and now we do the cross product for tables, and that's actually very similar. And this is really the one thing you have to understand about SQL and databases. And yeah, this was actually a lot of work because it's hard to show this on a slide. Now I just take a reduced form of the tables on the previous slide. I omitted birth date, I omitted year and score, not because I have to, but because otherwise I can't show you the cross product on a slide. This is merely for presentation reasons. And now it's the same. This is a set with two elements, two rows here. This is two rows, and this is three rows. And now pay attention, this is the cross product. That's how it looks. And this is like the central slide for understanding SQL queries. This is when I do a cross product of three tables. And this is not so easy to understand. So what we see here is, this here comes from the table movies. This here is the part from the table persons. And I had to do some magic here to make it fit, but I think it's quite readable, right? And this here is cast, which in my code I called roles because cast is a reserved keyword. So conceptually it's easy. I have every combination of a row from the movies table with a row from the persons table with a row from the cast table, right? And I had two, two and three rows, so this should be one, two, three, four, five, six, seven, eight, nine, 10, 11, 12, because it's two times two times three. Any questions about this? This you really have to understand to understand SQL. And it can't be any other way. And now that you understood that ... I'm sorry. The idea was a wiggle and funny sounds. Now I have a different sound, of course. So let's actually do that by going here and doing select star from movies, persons, roles. These are now the full tables, and what was on the slide were the mini tables, right?
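The cross product of the three mini tables can be reproduced with SQLite from Python. This is a sketch under assumptions: the column names (`id`, `title`, `name`, `movie_id`, `person_id`) and the ID values (`m1`, `p1`, ...) are placeholders chosen for illustration; the table names, the movie titles and the actor names are the ones mentioned in the lecture.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Column names and ID values are assumptions for this sketch.
    CREATE TABLE movies  (id TEXT, title TEXT);
    CREATE TABLE persons (id TEXT, name TEXT);
    CREATE TABLE roles   (movie_id TEXT, person_id TEXT);
    INSERT INTO movies  VALUES ('m1', 'The Shawshank Redemption'),
                               ('m2', 'The Dark Knight');
    INSERT INTO persons VALUES ('p1', 'Morgan Freeman'),
                               ('p2', 'Maggie Gyllenhaal');
    INSERT INTO roles   VALUES ('m1', 'p1'), ('m2', 'p1'), ('m2', 'p2');
""")

# No WHERE clause: the result is the full cross product of the three tables.
rows = con.execute("SELECT * FROM movies, persons, roles;").fetchall()

for row in rows:
    print(row)
print(len(rows), "rows,", len(rows[0]), "columns")
```

Each result row is the concatenation of one row from each table, so the column counts add up (2 + 2 + 2) while the row counts multiply (2 times 2 times 3).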
Select star from ... let's just look at this, maybe it was a bit too fast, I do the usual thing here. Select everything from movies, persons, roles. How many rows will the result have? 80? 120? Four times four is 16 times ... no, four times five is 20, times four is 80. So it's 80. Let's see whether it's 80. I could also do it interactively, but let's do it like this, more fun. Okay, let's pipe it into less with line numbers. What did I do? I'm surprised that ... oh no, no, no, wait. Why is it 150? The headings? I saw the headings included myself. The headings were included, yeah. Oh, I see. Do you know how to exclude the headings when importing the data? The TSV headings? We don't know, so we have to ... okay, let's quickly do that. Sorry. Yeah, because I specified the schema in the database file. So now it's 80. And now look, this is a bit hard to see, so let's maybe do it like this. This is now all combinations, it's just what we saw on the slides, but with more rows. And you see, even for this simple example, it gets huge. So it's actually doing that. It's exactly what you're seeing here. Okay, now the rest is simple. I say it now and you say yeah, yeah, but this you have to understand deeply: why it has to be done like that, why that's the right way to do it, what it means. You have to spend time yourself. Do the exercise, it's the only way to learn it. Now the rest is simple once this is clear. The where clause. I said I do it by example, and this is not quite correct here. I said selecting all columns here. Now I just do the same thing as before. Our three tables here.
Here I call it cast, and now I have a condition, which is just a Boolean expression which I can evaluate on each single row, right? Now for each row I just check what's written in the where clause. Is it true or is it false? And we will do that in the following. And one thing to note here, I said I introduce SQL by example: now I prefix the column names by the table. That's how you usually do it, table name dot column name. Why do I have to do this? Well, here I have ID, for example, twice, right? Which ID do I mean? I have to say movies.id, then it's that one, or persons.id. Each of these columns comes from a certain table. So that's simple enough to understand. That's why you write this. Actually, SQL allows you to drop the prefix when the name is unique. So for example, title, I don't have to write movies.title, I can just write title because there's just one title. SQL will allow that. For ID, it will not allow that. Name, that's also unique. You don't have to call this column person name, right? Because it's in the table persons. You don't have to call this person ID here, it's in table persons. That's why I chose that naming. But in that table here, I have to call them movie ID and person ID because I can't call them both ID. Okay, and now what this query does is really simple. You have the cross product here, or Cartesian product, and now I go over each row and just check this condition. And now remember this condition, and let's do it together for the table. So for each row from this big table, we just check: cast person ID is persons ID, cast movie ID is movies ID. And let's just do it together, and before we do it, you can already warm yourself up. So this here was the table movies. This here was the table persons. And this was the table cast. And I don't have space to repeat the condition.
And now I want to select all those rows where this ID is equal to movie ID and this ID is equal to person ID. And now let's play this game, and you tell me for each row: first row, yes or no? Yes. Yes, okay, yes. So let's mark that. Oh no, that's wrong. Eraser. I need some highlighter here. Let's take the highlighter, maybe not. Okay, yes. How do we do it? Maybe like this, this one we take. Second one? No. Third one? Fourth one? Fifth one? Yes? No, I hear no. This one? Yes, okay. So now we are doing what the database does. And look, it's kind of irregular. You have to understand this. This one? This one? This one? This one? No? This one? No? This one? I heard there must be one more yes here. The last one. Oh yeah, the last one. Okay. I won't explain this now. You have to understand this yourself. So now we did what the database did. We just went through it and checked the condition, but what we get now is who acted in which movie, right? We get Morgan Freeman in The Shawshank Redemption, yes. Morgan Freeman played in The Dark Knight, yes. Maggie Gyllenhaal played in The Dark Knight, yes. So we get these three, and there's no pattern here or something, we really had to check. So we get this result now. And I'm not explaining this now because this is something you have to understand yourself. I just showed you how conceptually the database does it, but you have to understand why this does what it should do on the level of what we want, right? Why does it give us the actors for each movie? It's not trivial. So now we get exactly the cast for our tiny little example. And we even have many to many here, right? You see it now. I said this earlier: this actor plays in two movies, Shawshank and The Dark Knight, and The Dark Knight, so the Batman movie, has two actors here.
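The row-by-row check played through above corresponds to adding a where clause to the cross product query. The sketch below runs it in SQLite from Python; as before, the column names and the ID values are assumptions made for illustration, and the cast table is called `roles` as in the lecturer's code.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Column names and ID values are assumptions for this sketch.
    CREATE TABLE movies  (id TEXT, title TEXT);
    CREATE TABLE persons (id TEXT, name TEXT);
    CREATE TABLE roles   (movie_id TEXT, person_id TEXT);
    INSERT INTO movies  VALUES ('m1', 'The Shawshank Redemption'),
                               ('m2', 'The Dark Knight');
    INSERT INTO persons VALUES ('p1', 'Morgan Freeman'),
                               ('p2', 'Maggie Gyllenhaal');
    INSERT INTO roles   VALUES ('m1', 'p1'), ('m2', 'p1'), ('m2', 'p2');
""")

# Keep only those rows of the cross product where the IDs in the roles part
# match the IDs in the persons part and in the movies part.
query = """
    SELECT * FROM movies, persons, roles
    WHERE roles.person_id = persons.id AND roles.movie_id = movies.id;
"""
kept = con.execute(query).fetchall()

for row in kept:
    print(row)
```

Of the 12 rows of the cross product, only the 3 rows survive in which the movie, the person and the role actually belong together. The prefixes `persons.id` and `movies.id` are needed here because the bare name `id` occurs in two tables.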
And now the last one is really simple, the select clause. Same thing here, now I select some columns, and this slide also summarizes this last part of the lecture. So what does a select from where query do? It takes all the rows from the cross product according to the from. Then for each of these rows, it checks the condition of the where clause. We have just done that. And from the result table, it just selects the columns which you write here. So if we go here, the select clause just says: select this column here, title, and this column here, name, and then I get this table. That's what the select does. So the where and the select are really simple. The from part is hard. It does this cross product thing. And the where selects rows, the select selects columns. And this is my final result here. So now I have the cast for each movie. And you see I do have the repetitions here, like in my solution two in the beginning. Yes? So a database would have to go through every single Cartesian combination? That's why there are two more slides. It's a great question now. Absolutely great question. It's like you foresaw what's on the slide: explanation versus what an RDBMS actually does. The semicolon is missing. Thank you for spotting mistakes, not one week before the exam, but now, this is how it should be. Oh no, now I accidentally discarded my ... I will add them in the end, I'm sorry. But it doesn't matter. It was just the last ones I added. So, the previous slide: was the explanation I gave you wrong? No, it's absolutely right. The semicolon is part of the dot dot dot here. But that is not how a database actually computes the results. Why not? Well, because look, even for these super baby tables, I get this huge product of sizes, that's just huge, right? It can't be iterating over a cross product of many tables. This would be way too expensive.
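The three conceptual steps summarized above (from builds the cross product, where keeps rows, select keeps columns) can be written out directly in plain Python. This is a sketch of the explanation only, not of what a database engine does internally, which is exactly the point made at the end of the paragraph. The ID values and the column positions are assumptions for illustration.

```python
from itertools import product

# Mini tables as lists of rows; ID values are placeholders for this sketch.
movies = [("m1", "The Shawshank Redemption"), ("m2", "The Dark Knight")]  # (id, title)
persons = [("p1", "Morgan Freeman"), ("p2", "Maggie Gyllenhaal")]         # (id, name)
roles = [("m1", "p1"), ("m2", "p1"), ("m2", "p2")]                        # (movie_id, person_id)

# FROM movies, persons, roles: all combinations, each one flattened into a single row.
# Resulting column positions: 0 movies.id, 1 title, 2 persons.id, 3 name,
# 4 roles.movie_id, 5 roles.person_id.
cross = [m + p + r for m, p, r in product(movies, persons, roles)]

# WHERE roles.person_id = persons.id AND roles.movie_id = movies.id
kept = [row for row in cross if row[5] == row[2] and row[4] == row[0]]

# SELECT title, name
result = [(row[1], row[3]) for row in kept]

print(len(cross), len(kept))
for title, name in result:
    print(title, "|", name)
```

Even here the intermediate list `cross` is four times as large as the final result, and the ratio grows quickly with the table sizes, which is why real systems avoid materializing it when the where clause allows something smarter.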
But, and this is really important to understand, when this is my query, then that is the result. I mean, that's just the correct result for that query. The database can't do anything different, it has to produce these rows. You shouldn't ask that query, that's a query you shouldn't ask a database, it will just be huge. But when that is the query, the result is the cross product. So a database actually computes the cross product. And that's what I wrote here: when you don't have a where clause, it has to compute the cross product and it will. We just saw it. But very often you have these conditions where this column is equal to this column, this column is equal to that column. That's called an equijoin. You don't have to understand this now. We will see it in the next lecture, and then you can do something much more efficient, and that's key to efficient databases. But that's not the topic of today. That's the last slide. The point of today's lecture and also of the exercise sheet is that you don't have to care how the database does it. Does it really compute this cross product? Does it do it any other way? The point was to understand these basic concepts: how to create tables properly, how to use SQLite 3, and how to ask simple SQL queries. Just going back to the sheet, right? Now you have your design, it might be a little bit different for every one of you, and now you have to think: okay, how do I ask SQL queries for this? And really, if you want to understand databases, you should do the exercise. It's a super fun exercise. We spent a lot of time on it. Here's the original paper, some Wikipedia article, SQLite 3, I showed that to you earlier. Are there any questions about anything today? No more questions. Frank comes in, so we are done. Thank you and have fun with the sheet.
Bye bye.

Welcome everybody to lecture four, databases and information systems in the winter semester 23-24. As I was saying, we are having technical problems on all kinds of fronts, but yeah, I hope we will handle them. So, overview of today's lecture. I will say something about ... this is already the first mistake here, the third exercise sheet was I think database design. Yeah, but that's completely in line with the happenings of the last hour. And the content today is about more database stuff. We will talk about it as we go along. Right now the exercise sheet is not yet uploaded on Daphne because there are connection problems, microphone problems, all kinds of problems. But let's just go on: experiences with exercise sheet three. This is also wrong. So, a very nice lecture, easy sheet. Data collection was a bit tedious. Some of you said, many of you already attended the old course, so in the feedback you said, yeah, I watched this, but it was still very interesting. Here are some quotes. Lecture was a really great introduction to databases. Databases used to scare me, but the way you teach this is actually fun. I also think databases are a really super interesting topic. Very good sheet, was fun to think about the design. Feel like I'm learning a lot by doing each of the exercises. For all of you who are not doing the exercises, do them. Really easy, but also good to get into SQLite 3, so that was the simple, easy to use database system we used. Normally database systems are not easy to use. Several of you said that it was a little tedious to gather the information. I can't show you the exercise sheet again at the moment, but the first exercise was to gather the data yourself. Just to clarify again why we did it like that. By the way, is the sound too loud? Is it louder than usual? It's not too loud. It's just too loud for my ears. Okay, please complain if we should turn it down.
I think it's a little louder than usual, but fine with me. So the reason we didn't give you the information already was because you had to come up with a schema yourself, how to put it into tables, and if we had given you the information, we would have already solved part of the task. And if some of you found that a little tedious, well, that's I think part of this work. In our group we work a lot with data, and when you do something with data, these tedious tasks are always part of the work. And sometimes a big part of the work. I actually like it. It's very meditative. Some of you didn't seem to like it. And I think it wasn't a lot of work here. It was data about four movies. I wrote R scripts and used Excel to tidy the data. I was surprised by that comment, but that's maybe the typical computer science attitude. You could do it manually in five minutes, but why not write complicated scripts to do it automatically in three hours? But it's more fun. Okay, you certainly didn't need R scripts and Excel for data about four movies. Sad to see how sexist the film industry is. I also realized that when looking at all the data in preparation of this sheet. You see it a lot, it's clearly male dominated, which it shouldn't be. I mean, it's film. It's a bit louder, but not too loud. Yeah, somebody wrote that, but we turned it down now. No, no, that other comment was: somebody used over 50% females to counterbalance this. So we started late, but that was a short organizational part, so we are back in time again. We have a lot of slides today, but I think many of them are very easy and quick to go through. It's again a completely new lecture, so you see all these small little mistakes. That's typical, it's an extreme amount of work to prepare a completely new lecture from nothing.
So this is not from a textbook and not from previous stuff we have done. It's completely new, but let's see how it goes. The first part is a bit of a recap of the last part of the last lecture, but in a new way. So yeah, let's just see. Lecture three was mainly about database design. You have data, how do you put it into tables? Surprisingly non-trivial. But we also learned a little bit of SQL, otherwise it's no fun if you cannot query your data. And we had these queries of the form select from where, that's what the typical SQL query looks like. And we kind of learned it by example. It was also part of the sheet to run some queries, find some queries. But we didn't really bother with how a system actually executes such queries. And how it executes the queries, that's today's topic. So what does the database engine do internally if you give it such a query? We will see lots of examples in the following. It implements operations, and what operations are, we will see in the following. Then it translates the SQL query to a sequence of operations. This is called query planning, which is also part of the title of this lecture. And then it executes the sequence of operations. So that's very high level, you will see now what this means. And we start again with three operations which we kind of already saw in the last lecture, but now we have some new names: projection, selection, Cartesian product. Maybe I quickly write this down, we called this cross product in the last lecture, but it's the same. Cross product, it's absolutely the same, but in databases you typically call it Cartesian product. We already established that Cartesian comes from René Descartes.
So, these operations we kind of already introduced informally in the last lecture, so this will be a bit of a repetition, but today we will define them mathematically. Each operation takes some input, and the output is always a single table. And before we provide the definitions, let's first repeat this slide from the last lecture. It's our formal definition of a table. You will see many examples again in the lecture, but you know what a table is: columns, rows. So you have a sequence of column names, we always call that C. You have a sequence of domain names, so the values in every column come from some domain, which can be the strings, the integers, the real numbers. And then our rows are a multiset of tuples. Multiset because the exact same row can occur multiple times. So it's like a set, the order doesn't mean anything, there is no order, but you can have the same item multiple times. And we talked about multisets in the last lecture. This should be three here and not four. As I said, I think we will have a lot of these small little mistakes. And when we say tuple here, that's the mathematical way of saying it, if you have something with k components. Tuple or row, I will use them synonymously. But then you understand why the set of tuples is called R, because the tuples are the rows. Okay, and here's some basic notation. It will just make our work simpler in the following. The animation was missing on that slide, but I think that doesn't matter. So if I have two tuples, for example the column names, or even two rows, and I just write them side by side like this, this isn't supposed to mean some nested thing, it's just a concatenation of the two. There's not really a standard notation for concatenating two tuples, so we will just write it like this.
And then, if I have a row or a tuple like this and I call it r, I just use the programming notation with square brackets if I want to denote the value in a particular column. We will have that several times. And we will see Boolean values quite a bit in the following. We could just write true and false, but these are strings, not really mathematical values. In mathematics, you have these funny symbols here to denote true and false, so you will see those a bit in the following. Now the microphone is also having problems. So this is true and false. Okay, and now we start with the ... I just shouldn't move, I don't move. Projection is the first operation. And it's really simple. For each of the operations, I will give you the formal definition and then an example. So if you realize what's causing these noises in the microphone, just tell me. You see me, maybe it's subtle movements or I don't know what's happening. So what project does is, it takes a table, it takes a sequence of columns, and an additional Boolean which can be either true or false. And if you leave this out, it's just false. And what does it do? It produces a table, so this is essentially: select some columns from a table. Maybe let's first go to the example, maybe that's the way to explain it. First understand it by the example, and you have seen this in the last lecture already, it's just more formal. Here I have a table, and I always do it like this in this lecture: I give the schema of the table by just writing down how I created it. So this is about people. I have an ID, a first and a last name. Here we see two people with the same first and last name.
And now I have this SQL query, select distinct, and I just pick two columns here. So I have a list of columns. Here it's column names, first and last. And the operation takes as arguments a table, so that's this table, and column indices, starting from one, maybe I should write that on the slide. So first, last is two, three. And note a subtle thing: this is a sequence of column indices, so I could also have the same column index twice. I could even duplicate columns if I wanted to. I could write first, first, last, last here, and then I would write two, two, three, three there. And then distinct or not, here it's true. And distinct means the following: if I just take these two columns, Anne Hathaway is there twice, so the exact same row is there twice. In the original table, it was not the same row because the ID was different. But if I just take these two columns, it's there twice, and by using distinct this row is then there only once. So we have this simple SQL query, and that's the operation it translates to. Now let's go back to the formal definition here, and I just put this formal definition so that it's absolutely clear. An example is always good to understand how it works in principle. A definition tells you how it works in detail. So, an arbitrary sequence of column indices, and then I just pick those columns with those indices. That's why I have the double indices here. And then I just pick the corresponding values from the rows. And if this thing is true, I remove all duplicates. Is there any question? You should ask it now. Any question about the notation or about what this operation does? Because we will now see several operations like this. Okay, so, why does it say select from T and not from people on slide nine? It's a good question. The only reason is, I just took T as a shortcut for people here. I could have also written people here.
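The project operation as described above (a table, a sequence of 1-based column indices, and a distinct flag that defaults to false) can be sketched in Python as follows. The representation is an assumption of this sketch: a table is a pair of column names and a list of row tuples, and the domains are left out. The ID values and the third person in the example table are placeholders.

```python
def project(table, indices, distinct=False):
    """Pick the columns with the given 1-based indices; optionally drop duplicate rows."""
    columns, rows = table
    new_columns = tuple(columns[i - 1] for i in indices)
    new_rows = [tuple(row[i - 1] for i in indices) for row in rows]
    if distinct:
        # dict.fromkeys keeps the first occurrence of each row and preserves order.
        new_rows = list(dict.fromkeys(new_rows))
    return new_columns, new_rows


# Example table; the IDs are hypothetical placeholders, not real Wikidata IDs.
people = (
    ("id", "first", "last"),
    [
        ("id1", "Anne", "Hathaway"),
        ("id2", "Anne", "Hathaway"),   # same name, different ID
        ("id3", "Morgan", "Freeman"),  # extra row added for illustration
    ],
)

# SELECT DISTINCT first, last FROM people;
print(project(people, [2, 3], distinct=True))

# Without distinct, the duplicate name stays in the result.
print(project(people, [2, 3]))

# The same index may occur more than once, which duplicates columns.
print(project(people, [2, 2, 3, 3], distinct=True)[0])
```

In the input table the two Anne Hathaway rows differ because of the ID; only after dropping the ID column do they become identical, and only then does distinct remove one of them.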
It's a good question, but it doesn't really mean anything. It's just so that this looks more compact. Because in the following I will often have variable names here. Otherwise it's project people, and then it's kind of hard to see that people is a table and project is an ... and yes? Here it says ... oh yeah, you're right, it shouldn't be ID int, it should be ID text. Okay, now I have a problem because now the line will be too long. Okay, now I have to show my amazing PowerPoint skills. Let's see. It should be, oh yeah, you're right. My PowerPoint skills may be amazing, my other skills not. It should be like this, yeah. So you're completely right, thank you for paying attention. It's all text here because I took Wikidata IDs for IDs. Any other question about this slide before we move on, or about this definition? Okay, let's move on. So this was selecting columns. Now the next operation is selecting rows. And let's first look at an example and then at the formal definition. So here it's the same problem, okay. We will correct these small problems while we're at it, because it's the same table, so we have the same problem here. So now I'm selecting some of the rows. The operation is now given a table, and you have a predicate here. I think I wrote this differently in the following. Maybe let's go to the formal definition first in this case, I think it's better. So my input is a table, columns, domains, rows, and I have a function now. The function takes a row, so something from the domain for column one and so on. And then for each potential row, it tells me true or false. This is also called a predicate. And then I just return all the rows where this predicate evaluates to true. So this is the select function. It looks more complicated than it is. It doesn't do anything with the columns, the same columns, the same domains, it just changes the rows. So it selects certain rows.
And note the confusion here. This operation is called select, but it says something about the rows, and it appears in the where clause of the SQL query. You see it in this example here. Here I'm saying: select from this table the rows where the first name is Anne. So this is in the where clause, not in the select clause, but the operation is called select. And that's just how it is. That's historically so. These operations are very old. They come from these 1969 papers and they just have their names, and SQL is a bit younger. And this here is written in a bit of a strange way, and I want to write it a bit differently. I think in the following I often write it like this. The second argument here is a function, and now how do I write an anonymous function? Okay, I think I need to steal ... this is how I write it here. Let me do some, yeah. This is how I write an anonymous function in the following, many times. So this is just a function which, for a given row, just checks whether the second column is Anne, and then it's true, and otherwise it's false. So it's quite simple really, it's just introducing the formalism. Okay, let's go on. The Cartesian product, so that's a bit longer than cross product. You get two tables now, like this. And then the result is a table T. And this is what we have seen some slides ago. This is just concatenating the column names of the two tables and the corresponding domain names. And this is now the cross product. We have defined this in the last lecture. So it's all concatenations of tuples, rows, from the two tables, and it's just all combinations. Any row from the first table with any row from the second table. So this is the mathematically correct way to write the product.
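The select operation with an anonymous function as its second argument can be sketched in Python like this. The table representation (a pair of column names and row tuples, without domains) is an assumption of this sketch, and so are the ID values and the third row. Python indexes from zero, so the "second column" of the lecture is `r[1]` here.

```python
def select(table, predicate):
    """Keep exactly those rows for which the predicate evaluates to True."""
    columns, rows = table
    return columns, [row for row in rows if predicate(row)]


# Example table; the IDs are hypothetical placeholders, not real Wikidata IDs.
people = (
    ("id", "first", "last"),
    [
        ("id1", "Anne", "Hathaway"),
        ("id2", "Anne", "Hathaway"),
        ("id3", "Morgan", "Freeman"),  # extra row added for illustration
    ],
)

# SELECT * FROM people WHERE first = 'Anne';
# The predicate is an anonymous function from a row to True or False.
annes = select(people, lambda r: r[1] == "Anne")
print(annes)
```

The predicate is written without any reference to the rows that happen to be in the table, so it can be evaluated on any tuple with the right number of components. Columns are passed through unchanged; only the set of rows shrinks.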
And note, we have seen this in the last lecture, if you form the cross product or Cartesian product of two tables, the numbers of columns add up but the numbers of rows multiply. So you get these huge tables. And let's look at an example here, a different example. So I have here some arbitrary categorization of the human race into three genders, politically correct: female, male and other. We have these three in Germany. These are the correct Wikidata IDs. And then we also divide the world into stupid and wise people. And now you get all combinations. So this is a statement about humankind. However you divide the world, you always have stupid and wise people in all camps. It's not the privilege of any one camp. So here we have ... and this is also wrong. Oh my, there are a lot of small mistakes here. You can see the development process of these slides. You get the final product, but there were actually 500 versions before this. So what do we get? This is actually an example where all combinations make sense. Let's take this row here, and you get all combinations of this row with these rows. And you see here, the number of columns in this table is just 2 plus 2, so 4, and the number of rows is 3 times 2, so 6. You get the combination of every row from this table with every row from that table, and you just write this operation as Cartesian product T1, T2. So we have already seen this in the last lecture, it's nothing really new, but you should just write these things as operations now. And one thing which you have seen on the slides but which I haven't mentioned, let me go back and mention it now: there are also these symbols. So traditionally, this is called project and you often write it like this. In the literature you will sometimes see this capital Greek letter pi.
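The Cartesian product of two tables, with column names concatenated and every row of the first table combined with every row of the second, can be sketched as follows. Again, the table representation (a pair of column names and row tuples, without domains) is an assumption of this sketch, and the column names and ID values in the example are placeholders rather than the Wikidata IDs from the slide.

```python
from itertools import product


def cartesian_product(t1, t2):
    """Concatenate the column names; combine every row of t1 with every row of t2."""
    columns1, rows1 = t1
    columns2, rows2 = t2
    return columns1 + columns2, [r1 + r2 for r1, r2 in product(rows1, rows2)]


# Placeholder IDs and column names, chosen only for this illustration.
genders = (("gender_id", "gender"), [("g1", "female"), ("g2", "male"), ("g3", "other")])
minds = (("mind_id", "mind"), [("w1", "stupid"), ("w2", "wise")])

columns, rows = cartesian_product(genders, minds)
print(columns)
for row in rows:
    print(row)
print(len(columns), "columns,", len(rows), "rows")
```

The number of columns is 2 + 2 = 4 and the number of rows is 3 times 2 = 6, matching the counts given for the slide.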
Let me maybe also write project next to it; it's just the same thing as a symbol, nothing different. And select you often see as sigma. Let me just write sigma here, so that's the small sigma. So it's a Greek letter, it's just a different way of writing it, but we will often write it like this, or like this today. You have a question? Yes? Why are the inputs of this function D1 to Dk and not R? Okay, that's a good question, and that's one reason why I'm putting these formal definitions. This is a function which takes an arbitrary row as input. And, how do I explain it? So R is a subset of D1 times and so on times Dk. Let me first ask: do you agree with this? Okay, you agree with this, right? This here is the set of all possible tuples, and this is the set of tuples which we actually have in our table. And you are right. Why did I do it like this? I think there is only one reason, and I have to think about whether both are right or whether this is better: the function you put in is typically agnostic of what's in the table, right? You just give a function which you could evaluate on any row, as long as it's a valid row of this domain. So that's why I put it like this. Otherwise, if I put R here, this function would depend on the table, and I didn't want that. That's the reason why I wrote it like this. Yes please. Can you not have an empty column in the row, so that you have null for some value? Yes. So then you would need null in every domain? Yes, you are completely right. We didn't talk about null values yet, but yeah, why not? They're in our domain. I mean, we leave the domain open. We will see code in a second. It's again a matter of, so here we have some commands, this will be for the tables we see later here.
So this is the domain of strings and now it's again a matter of definition are we saying, let me maybe go back to the slide with the formal definition of the table. Yeah, I would just say that these domains include the null value and then we are fine, right? Because we are not doing anything special with the null value right now. So saying that there is nothing in the table, do you agree? I mean, we could just do that, right? We didn't really, they don't have a special treatment currently, so that's fine. Okay, so we had the short names, we have the, so we have seen these three basic operations, we got used to this formalism a bit, that was the purpose of this so far. And now we talk about what we actually want to talk about. So we have a SQL query and how we execute it on our machine. So now I claim, and we kind of already did that in the last lecture, but very informally and just intuitively, any query of this kind, select comma separated list of column names from some tables, where, some expression, you can translate it to a sequence of operations, which the database engine then executes. And right now you may think, yeah, well, that's all trivial. I kind of write the things. I mean, this is what you might think here, right? Here I'm just writing select star from tables and it translates to this operation. What's the point you might think? It's just writing the same thing differently. But you will see now what the point is. There's an important difference. And let's first, so let's take one example query. And I always do it as follows. So I have different examples, not necessarily always the same tables and so on. I put these, how I created the tables first so that we can also spot mistakes so that it's also clear, okay, this is a table with, okay, I didn't put the domains here. I omitted the domains for, I did that on purpose because there, I omitted the domains for space reasons only. Should write that here, the domains to save space. 
The purpose is really that you see how many columns there are. And now, what's the sequence of operations here? You already see that's not trivial. Now pay attention. So I have this SQL query, which is already a little bit more complicated. You have a question? I think it should be roles.person_id equals persons.id, because the ID column in persons is not called person_id. Yes, you're completely right. Thank you for paying attention. And I will just solve it by abbreviating this here, right? Then it's also right, is it? Let's quickly check whether it's right now. So I just call it ID here in movies and persons, and here I have to call it by a different name because otherwise I would have the same name twice. Is it correct now? roles.person_id, persons.id, roles.movie_id, movies.id, yeah, okay. So now how do I translate that to a sequence of operations? Here's how I could do it. I have the Cartesian product between three things, and I don't have an operation for that. Let's say our database just has implementations for the three operations which we have just seen. So let's first take the Cartesian product of movies and persons. This gives me a table T1. Now I take the Cartesian product of that with roles. Now I have a table T2. T2 is now the Cartesian product of all three. I could have also done this in another order. Note this. Now from T2 I have to select some rows. We don't have a picture of this table now; it is movies, persons, roles. So like they are written above here: first I have these four columns, then these three columns, and then these two columns, and let's hope that this is correct. So now I want this ID to be equal to this ID. So that's one, two, three, four. Maybe let's write it here. If I concatenate them, I don't want to show these big tables all the time, then it continues like this: five, six, seven, eight, nine, 10.
So now 1 should be equal to, no, this is not correct, what I did here. Birth date is not two things, it's just one thing. So it's seven, eight, nine. Okay, yeah, if I have the cross product, they all get concatenated. I concatenate in the order movies, persons, roles, so these are the column indices of my table T2. And now I want this ID to be this ID, and this ID to be this movie ID, right, as it's written here. So first equal to the eighth, and then fifth equal to the ninth. And then in the end I want just the title and the person's name, and that's project. Project gets column indices and, yeah, I deliberately left this out here: there's a third argument for distinctness. Let me note that down here: the third argument is false by default. And you see, that's already not so easy, right? Translating this SQL query into this sequence of operations. And that's what a query engine has to do. And that's also what you have to do for the exercise sheet, but you have to do it manually. It's maybe the most important thing you should learn from this lecture and exercise sheet: take a SQL query and translate it to a sequence of operations. This is exactly what a database engine does. And look, this is something we can now easily execute, right? This is what I said on an earlier slide; maybe it was not clear then. For each of these I could have a function. This is also what you should do for the exercise sheet. I can't show it to you yet because Daphne is down. You will implement all three; actually, instead of this one you will implement another one. And these are just functions which you can call. This is a function which takes two tables and computes another table. This is a function which takes a table and a predicate and produces another table.
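The translation just described can be written down almost literally as function calls. The following Python sketch is only an illustration: the toy rows, the exact column layout of the tables (in particular the third movies column and the order of the two roles columns) and the function names are assumptions, and Python indices start at 0, so column 1 of the lecture is index 0 here.

```python
from itertools import product


def cartesian_product(t1, t2):
    """All concatenations of a row from t1 with a row from t2."""
    return [r1 + r2 for r1, r2 in product(t1, t2)]


def select(table, predicate):
    """Rows of the table for which the predicate is true."""
    return [row for row in table if predicate(row)]


def project(table, columns, distinct=False):
    """Keep only the given column indices; optionally drop duplicate rows."""
    rows = [tuple(row[c] for c in columns) for row in table]
    return list(dict.fromkeys(rows)) if distinct else rows


# Hypothetical toy data. Assumed layouts:
# movies(id, title, year, score), persons(id, name, birth_date),
# roles(movie_id, person_id).
movies = [(1, "Movie A", 2001, 8.1), (2, "Movie B", 2002, 6.5)]
persons = [(10, "Person X", "1970-01-01"), (11, "Person Y", "1980-02-02")]
roles = [(1, 10), (2, 11), (2, 10)]

t1 = cartesian_product(movies, persons)   # 4 + 3 = 7 columns
t2 = cartesian_product(t1, roles)         # 7 + 2 = 9 columns
# Column 1 = column 8 and column 5 = column 9 (1-based, as in the lecture).
t3 = select(t2, lambda r: r[0] == r[7] and r[4] == r[8])
t4 = project(t3, [1, 5])                  # title and person name
print(t4)
```

Each step takes tables as input and produces a new table, which is what makes the sequence directly executable.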
This is just, you could write this almost in code and this is also how your code will look like. So this is, so what you will do for the exercise sheet, you will see a SQL query, you will write down the sequence of operations in code and then you can execute it. And here's one more important thing. So this was one sequence but it was not the only sequence. Okay, before we continue, before I say that, two other things. So there are two ways, other ways this could have been written. This is as a sequence of operations. I think it's the easiest to understand how you would write it in code. You could write the exact same thing as a nested expression. This is the exact same thing like in one line, right? So inside here, Cartesian product of movies, persons, I take that as the argument of Cartesian product roles. If I go back one slide, it's just plugging T1 into here, then I get this larger expression, then plugging this T2 into here, and then plugging this into here, and then I get a nested expression, right? And this looks scary maybe, so this is really, this SQL query has a mathematical expression. But think about it, this is not any more difficult than how mathematical expressions work, right? If you know how maths work, so this is some basic arithmetic here, you are not scared by such an expression, you will just evaluate it. You will see okay, let me start here, three plus four, seven times four, 28 minus three, 25 and so on, 57. And guess what the result is. And there's one other way, this is typically shown, and that's as a tree. You can also do the same for mathematical expressions by the way. This is also sometimes very useful. So this same query, so here we have the Cartesian product of two tables, movies and person. This gives me something, the result from that becomes one argument of this, the other argument is this. And here I've just written the additional arguments at the bottom. So this is, get an expression tree. 
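To make the analogy with arithmetic concrete, here is a small Python sketch that stores an expression as a tree and evaluates it bottom-up. The tuple encoding of nodes is an assumption for this example; the numbers follow the steps mentioned above (three plus four, times four, minus three).

```python
import operator

# Supported binary operators of this toy evaluator.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(node):
    """Evaluate an expression tree: a leaf is a number, an inner node is
    a tuple (operator, left subtree, right subtree)."""
    if isinstance(node, (int, float)):
        return node
    op, left, right = node
    return OPS[op](evaluate(left), evaluate(right))


# ((3 + 4) * 4) - 3 written as a tree.
tree = ("-", ("*", ("+", 3, 4), 4), 3)
print(evaluate(tree))  # 25
```

A query execution tree works the same way, except that the leaves are tables and the inner nodes are operations such as select, project and Cartesian product.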
You could do the same thing for a mathematical expression. And maybe let's do that, just so that it's absolutely clear. So if I have an expression like three plus four, times seven, I could also write this as a tree, right? And this is often done. So now I have the multiplication here, with 7. And I have 3 and 4 here for the plus operator. And in this tree you can also see how you would evaluate it: at every node you compute something, so here you do the plus, you get 7, and then 7 times 7, 49. Any questions about this part? So this was setting the scene for what comes next. Can it happen that the intermediate results are used in multiple places? That's a good question. Then this would not be a tree anymore, right? In principle you could have it like this, but actually how it's typically done is with trees. So what we do in our engine: we would just compute the result again, but we have a caching mechanism, so we just see, oh, this has already been computed, I don't compute it again. But you're right in principle. Let me maybe quickly show you that this is not just theory, but how it's actually done. So you remember this from the first lecture. This is now SPARQL. We will do that in two lectures. So a dialect of SQL, but it's the same thing. I execute a query here, and here I get a query execution tree. So this is a tree with all kinds of operations. This is how database engines always work. I'm surprised how big the tree is. I was expecting a smaller one. Do I have a simpler query? Yeah, I can do that. This here is maybe a little bit simpler. So also here you see it's a tree. Here the operations have different names, but this is an operation, it's also operating on tables, it's the same here.
And then you do something with that, you get another table, this table gets input to here. So it's like operations which take tables as input, produce other tables and so on. And up here you have the result. This is how all database engines work. Any questions about this before we move on? Okay. Now what I wanted to say earlier. So it's very typical that you have more than one sequences and actually it's typical yet that you have very, very many sequences of operations. How you do it. We will see that in the following. The results are the same for each sequence but the processing time may be vastly different. Let's just look at a few examples. So here is a very simple query where I just have a cross product of three tables. And obviously I could do this in two ways. I could first, okay here I have cast again, not roles, but yeah let's maybe also fix this for consistency. Let me just write roles here. I have three tables and now it's, since I only have a binary Cartesian product I have to make up my mind in which order I do it. So I could do it like this, first cross product of movies and persons, then with roles or first movies and roles and then with persons and there are more possibilities. Okay, and now the not so easy question, which one should the database engine choose? What do you say? Now, I deliberately wrote this. You have to make some assumptions on the data. So assuming this is real data, and real data means you know something about the sizes and the nature of the tables. Well what do you think? You are the database engine now, you have to decide, we go with this plan or this plan, this program or this program. Yes? Like not all movies will have the old rules, so I think the first one is better. You say the first one is better? I didn't quite understand the reason you gave. Not all movies will have all the roles, like all possible roles. Yes, yes, you are saying not all movies have all the possible roles, but this is the cross product. 
So this will compute all combinations either way. It will compute, the end result will be the same, and the end result will be huge. It will be the number of rows of this times the number of rows times this. Yes? Yeah, I think basically you should first do it the one way you want it, the person ID, because then the person ID that's the second matching sequence is set up. Which one is better? The sequence two, because of the matching key, person ID is in the table. Okay, interesting, but you said something similar. You said something of matching ID. This is Cartesian product. You're not matching anything. You're taking all combinations. Yes? Okay, both sequences should be the same. Interesting. So already with this simple example we get different answers and actually it's not so trivial, which is the point of the second part. Let's keep tuned for the answer and let's show two other examples. Depends on the data somebody writes in the chat, yes, I would take the biggest table at the end. That's a good thinking. I agree with that but we will come back to that. Yes, so there is a difference. It's a small difference because the end result is the same but it's interesting that already for this simple query it's not so obvious. Here's another example. So this is, and look at this, now I have two tables, I take a cross product here, but I do nothing with the roles table. And, now I do something, yeah, I do something with the... So which one is, yeah it's a strange query, but people write strange queries, I mean people also write strange programs, so I take the cross product of the two tables, movies and roles, and then I select all roles where the score is greater or equal to 8.0. I can do that. And now here are two ways to do this. First compute the Cartesian product. Oh, again another mistake here. Lots of small mistakes because I made... Yeah, but this doesn't really... So which way should I do it? 
First the Cartesian product, then select the rows, or first filter the rows and then do the Cartesian product. Yes? First filter in the movie database to print that down, then filter the Cartesian product. Yeah, yeah, first filter the rows, then cartesian product is very expensive, so it's all rows here times all rows here. So if you have anything that makes it smaller, the table better do that first, right? So the end result will be the same, that's important, but this year will make T1 is now significantly smaller than movies because it only contains the movies with the highest score. And so now this Cartesian product will be, T2 will be much cheaper than T1. And here's another one, which is also strange. So now I take again the cross product of two tables. I ignore the second table, I have a distinct here, distinct title and here's one table. It's kind of the obvious way how to do it. It says from two tables so I take the cross Cartesian product. Then I select the rows. It's kind of the literal translation of this query, right? Just start with the from, it's how we did it in the last lecture. Select rows, select columns. And because it's distinct here, we have a true here. Can you think of another sequence? Now you want to be a good database engine, you want to produce the correct result. Can you think of, and now we haven't defined efficiency yet really, but can you do it more efficiently? Think about what the end product of this query is. Yes? Can we throw out the Cartesian product? Can we throw out the Cartesian product? Because title comes from movie. Yeah. And it's also from movie, so like, in the result, the role statement is not related. Yeah, can we throw out the Cartesian product? I mean we have a Cartesian product here with roles, but we don't need it, we have no reference to roles here, no reference to roles here, and yes, I mean you can just do that, we just take, I mean what is the end result of this? 
The end result is will be the distinct title from movies, from popular movies with score greater or equal to 8.0. We could just do that right away, right? Just take the movies table with score greater or equal 8.0 and then take the second column from that, the title, and also do distinct here because we might have the same title twice. So now you see it's getting interesting right, these super simple operations translating it and now this is already totally different, I'm even dropping an operation here and it's the same result. And now imagine you are the database system and you have to figure out that this is actually the same, that I can do this. That's not so easy. And here I think it's clear which one is better, right? This one is kind of trivial. I just take movies, filter, just look at those with a high score and take the second column and here I have this Cartesian product and have to do the same thing on the huge table. Okay, so that's query planning. Query planning is you are the database engine, you have your SQL query, you have the operations which you have actually implemented and now you have to translate from this SQL to there to what you actually execute. And now you have to figure out, that was on one of the very first slides, which of these sequences do I take? And this part of this problem is to first realize which sequences there are, right? There are actually very many sequences. Sebastian, does the exercise sheet work? Or not yet? Not yet, okay, Dafna problems are, so some of the problems are, maybe let me just, so the exercise sheet will be, no, we will come back to it later, maybe we see it then. Only do select and project, yeah, same suggestion. So, the obvious question now is, you have many sequences, which one is the best? And every database engine in the world has to deal with that. And here's the basic idea. You have all these operations here, let's look at this. 
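The claim that different operation sequences give the same result can be checked on toy data. The Python sketch below is an illustration only; table contents, column layouts and function names are assumptions. Note that dropping the Cartesian product altogether gives the same result only if the roles table is not empty.

```python
from itertools import product


def cartesian_product(t1, t2):
    return [r1 + r2 for r1, r2 in product(t1, t2)]


def select(table, predicate):
    return [row for row in table if predicate(row)]


def project(table, columns, distinct=False):
    rows = [tuple(row[c] for c in columns) for row in table]
    return list(dict.fromkeys(rows)) if distinct else rows


# Hypothetical data: movies(id, title, year, score), roles(movie_id, person_id).
movies = [(1, "Movie A", 2001, 8.1), (2, "Movie B", 2002, 6.5),
          (3, "Movie A", 2010, 9.0)]
roles = [(1, 10), (2, 11)]


def popular(row):
    """Score is at index 3, also in the product, since movies comes first."""
    return row[3] >= 8.0


# Query 1: all columns of movies x roles with score >= 8.0.
product_first = select(cartesian_product(movies, roles), popular)
filter_first = cartesian_product(select(movies, popular), roles)

# Query 2: distinct titles; literal translation vs. no Cartesian product.
literal = project(select(cartesian_product(movies, roles), popular),
                  [1], distinct=True)
shortcut = project(select(movies, popular), [1], distinct=True)

print(product_first == filter_first, literal == shortcut)
```

The results agree, but the filter-first plan and the shortcut plan never materialize the full product.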
This is an operation, this is something which the database engine actually has to execute, has to call, has a function for this, and this is where the time goes. So what I can do, what's written on the next slide is, estimate the cost of this. How long is this going to take? How long is this going to take? How long is this going to take? And from that I just sum it up, I get okay sequence one, here's my estimate. I do the same for sequence two so that's what's written here. For each sequence look at each operation estimate the cost this gives you a cost for the whole sequence and now you just pick the sequence with the lowest cost. That's query planning. Now how do we know how much an operation costs? So let's look at this again. Okay here and we will see cost of estimates, very simple ones and the following. Yeah, I kind of can say how long it will take to compute the product of these two tables because I know the tables. I can just look, oh, this one has a million rows, this one has 1,000 rows one has 1000 rows so it's going to be 1 billion rows in the result and this is that's how much it's going to cost me. Here or here let's take this operation that's already harder because this takes as input something from another operation and how expensive is this depends on how big this table is. So to be able to estimate the cost of this operation, I need an estimate of the size of my input table. So cost estimates for query planning, and this is a whole, people are working on this their whole life and nothing else because it's important. What will I just explain to you? If you want good cost estimates, you also need estimates of the size of the input tables and we will see example for this now as well. So if you want how expensive is this project operation and project remember is a very simple one, it's just taking the second column from a table. But still you have to copy something. It's basically how big is this table? 
So to have a cost estimate for this one, I need a size estimate for that one. And that's why cost estimate is really, you need an estimate for the cost and the size of each operation. What it says here an estimate for the cost and the size of each operation. What it says here. Estimate for the size. And let's, and in real query planning, you can even gather much more statistics. Not just the size, maybe also the distribution of the data in a table, because that might be important. And you could do a whole lecture about query planning. As I said, people are, they are specialty in research, but let's do something very simple just so that you see how it's done in principle. Here's some simple estimates. Project just selects columns from a table. If a table you have column indices distinct or not. How much, let's say this input table has n rows, k columns and k prime is the columns in my result. So this is, so I'm claiming, this is the size of the result, right? You don't delete any rows, you have n rows, you have k prime columns in the result, you have to produce that result. To go to my table T and copy these values to the result. So that's essentially the cost. That's a pretty good estimate here. If you do the distinct, so I swept that under the rug here, you have to do additional work, you have to maybe sort it or compute the hash map, I just ignored it, it's an estimate, I just want an idea how expensive is this going to be. The number of rows, I can actually say it exactly, it's the same as in the input table, and the number of columns I can also say it's just it's an argument to the project operation how many columns. Here are some assumptions and let me just why doesn't the animation work here? Let me just fix that on the fly. That's interesting, so that looks obvious like for the simple, but even that is not simple. So I'm just giving you hints here at why that is actually a big research area and not so. So why is that a good cost estimate? 
I'm saying, yeah, you have to produce the result table, so you have to copy every element there. This is true if you really have to iterate over every row of your input table. You iterate over every row and then you say, okay, let me just pick these column values and copy them to the result. And whether this is actually what you have to do depends on how your table is stored. Do you store it row by row? Row order, that's how it's called. Then this is a reasonable estimate. But actually you could also store the table column by column. We didn't even talk about that yet: how do you store a table in your machine? Just store it column by column. And now if you have project, you're basically just saying, okay, pick these three columns. Maybe you don't even have to copy them. You could just say, okay, these three columns, they are already here, I just store a pointer to them. And then this suddenly becomes a constant-time operation. We're not going into any details here, but you see that how expensive operations are obviously depends on details of the implementation and of the representation. And depending on that, you might get totally different query plans, like these trees, for how you execute a query. But these things I just mention; we don't go into any depth in this lecture because this is an introductory database course. Now for select: that selects rows according to this function, which says whether a row should be in the result or not. And again here the animation, let me fix it. So I'm saying, okay, the result table is as big as the input table, so the cost is N times K. And now, how big is my result? Well, that depends. Here I make some arbitrary assumptions. I'm saying, okay, I'm iterating over each row of T, and now I have to evaluate phi. This here can be a very complicated function, right?
So now I need an estimate of how expensive is it to evaluate for each row. What we have seen so far is always is this equal to this, but it can be a super complicated function, depending on how do we know how complicated it is before executing it. It might be very expensive on some rows, very cheap on other rows, and then the question is, on how many rows is it true? If it's always false, I just have to evaluate it, not copy anything. So a lot of unknowns here and that's typical for query planning. To estimate the cost you have to know a lot which you don't know. So here I'm just assuming it's cheap to evaluate and it's true half of the time. Totally arbitrary, but actually better than nothing. At least I have something now to compare different operation sequences. Here are just some things you could do. You could just evaluate the function on a number of rows. Let me get a feeling for this function. I have a table with a million rows. Let me evaluate on a thousand of them to get a feeling, yeah, that's the average cost, it's a true one tenth of the time, or something like that. And also, it's more complicated, maybe your predicate is of the kind greater or equal to 8.0, and maybe your table is sorted by that row and you don't even have to go over the table, you just do a binary search and see, okay, this is sorted, let me just find the row with the first one where it's 8.0 and take everything above them. And then the cost estimates completely change. So again, what's here in the reality check section is just to tell you how it's really done, we are not going into any depth here. Okay we'll have a break in a second. Let me just make some things blue here because we can't continue before things which should be blue are blue. Okay, the estimates for the, that's actually easy because it's always expensive. So, but writing it down it's also non-trivial. I mean, what's the size of the result? 
The result has this many rows: the number of rows of one table times the number of rows of the other table, and the numbers of columns add up. So this looks strange, but if you think about it, that's just the size of the result table. And you have to produce the result table, so that's also just the cost of it. Actually, here we know the size exactly, so we could write equality, and let's just do that. We know how big the result is. So the cost estimate for the Cartesian product is actually pretty simple because it's always expensive, and I've written that there. These are pretty accurate because, I mean, let's say you're given the task of implementing the Cartesian product. What do you do? You have to compute it. You have to do two nested loops and compute all combinations. There's not much you can do. You have a large output, and most of the work is producing the output. And that's why any query engine will try to avoid Cartesian products as much as it can in a good query plan, which is the reason for the last part of the lecture, which we will see in a few minutes. And I think now we take a break and we resume in five minutes. Thank you. Let's go on. So, query planning. This is something we have seen earlier: we have two sequences and now we have to figure out which one we should take. And now we have some cost estimates. Maybe let's not do it for all three; you can do that as an exercise, it's a part of the exercise sheet. So I'm assuming everything is all right, right, you can hear me again and so on. Let's do the cost estimates here. So let's say this has these sizes, n1 and k1, so this is the number of rows, this is the number of columns. I always use n, m for rows, and k and so on for columns.
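The three simple estimates discussed so far can be collected in a few lines of Python. This is a sketch under the lecture's stated simplifications (row-wise storage, a cheap predicate that is true half of the time, no extra cost for distinct); the function names and the tuple layout of the return value are assumptions.

```python
def estimate_project(n, k, k_out):
    """Project on a table with n rows and k columns, keeping k_out columns.
    Returns (cost, rows, columns) of the result."""
    return n * k_out, n, k_out


def estimate_select(n, k, selectivity=0.5):
    """Select: go over the whole table once; assume the predicate is
    cheap and true for the given fraction of rows (arbitrary default)."""
    return n * k, round(n * selectivity), k


def estimate_cartesian(n1, k1, n2, k2):
    """Cartesian product: the cost is the size of the (known) result."""
    rows, cols = n1 * n2, k1 + k2
    return rows * cols, rows, cols


# A million rows times a thousand rows gives a billion rows.
# The column counts 4 and 2 are made up for this example.
print(estimate_cartesian(1_000_000, 4, 1_000, 2))
```

The size part of each estimate is what feeds into the cost estimate of the next operation in a sequence.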
And this here has these dimensions, n2 and k2, and this here has these dimensions, n3 and k3. So I claim that the cost of this final operation is the same for both. I mean the cost for a Cartesian product, that's what we said: it's n1 times n2 times n3, and then times the sum of the numbers of columns. It's actually quite simple, right? It's just a bit tedious to write. It's the same for both. In whichever order you evaluate the cross product, in the end you have this huge table with all combinations of all rows from the three tables. So the second operation, and that's what some of you said after a while, is actually the same. Now let's look at this one. What's the cost? This is movies and persons, so that's n1 times n3 times k1 plus k3. And here the cost of the first operation is, yeah, it's movies and cast, so it's n1 times n2 times k1 plus k2. So it's pretty simple, right, it's just these strange expressions, you write them down. And now, we said that it's about real data. So if we have real data, which of these tables would you say has the most rows? Let's compare these two, movies and cast. Which one has more rows? Yes? Yeah, cast should have more, right, every movie will probably have not just one actor but many actors, so we expect N2 to be much larger than N1. Let's just write that down: N1 is probably much smaller than N2. And what about N2 and N3? That depends a bit on who we have in persons. If we have just the actors in persons, then N2 is certainly at least as large as N3. But this is something, I mean, if the tables are given, we should just compute it, right? And then it's clear: the difference in the cost between these two is just the difference in the first operation, and then this one is more expensive, and this is also what some of you said, right?
Here I have the bigger table. The end result is the same, and that's actually a pretty general principle: if I do something in a certain order and the end result is the same no matter in which order I do it, start with the smaller ones first, right? This one here is starting with the bigger ones, N1, N2. So according to the above, N1 times N3 is smaller than N1 times N2, so we prefer sequence 1. Now you could do the same thing here, right? You could just write down the costs here, and please do it as an exercise, you should absolutely do it. I think it's even more instructive if you do it yourself. Here the end result will again be the same. But maybe let's do it for that one as well and leave the third one open, just so that you see it for one more example. So what's the cost here? It's n1 times n2, and please pay attention that I don't make any mistakes. Our size estimate here is, yeah, we know the size of this one, it's just the same. It's n1 times n2, so this is the number of rows, times k1 plus k2. We can estimate the size pretty exactly, and then we have a cost estimate for this one here. And what did we say? We had this very simple cost estimate. Let's just go with that one, where we say the expression is simple to evaluate and it's true half of the time. Now I think the cost was just going over the whole table, right, that was the cost, so that's again N1 times N2. I think that's what we said for select: the cost is essentially going over the whole table, because you have to go over each row and evaluate the function on each of them.
Okay, let's do the same thing here. What's the cost estimate of this one? Movies, what did we say? We said we have to go over the table, right? And so here the cost is n1 times k1. And the size estimate, what's the number of rows? We said we filter out every second row, so it's n1 over 2, and it's the same number of columns, so that remains. So the size is not n1 times k1, it's n1 over 2 times k1. That's what we said. So what's the cost estimate for that one? Now it's n2 times n1 over 2 times (k1 plus k2). So we take the size estimate we had for T1, and the size estimate is that this filters out half of the rows. So that's the n1 over 2 here, that was our assumption. And now if we look, I mean now this one is obviously better, right? Because here I have the same huge cost twice, I have this n1 times n2 twice here, and here I only have it once, and even with only half of the cost, and this one is relatively cheap. And here's one important message, and you should also do that for the exercise sheet. Let me say this because it's really important, there are even papers about this: here we use these super crappy estimates, like we know nothing about our evaluation function, we just say it will reduce the table somewhat, by a factor of two. It doesn't matter, it's still enough to determine that this query plan is better than this one. Even with a very simplistic estimate it was enough. Actually any estimate would do, because this plan just avoids going over this table twice. So that's really important. So yes please. Thank you. This is the same error from before, right? Thank you very much. Let's fix that.
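The comparison of the two query plans can be replayed with a few lines of Python. This is only a sketch of the crude estimates from the lecture: a Cartesian product costs rows times columns of its result, a select costs one pass over its input and is assumed to keep half of the rows. The function names and the example sizes are made up for illustration.

```python
def cartesian_estimate(n_a, k_a, n_b, k_b):
    """Return (cost, rows, cols) for the Cartesian product of two tables.
    Rows multiply, columns add, and the cost is the size of the result."""
    rows, cols = n_a * n_b, k_a + k_b
    return rows * cols, rows, cols


def select_estimate(n, k):
    """Return (cost, rows, cols) for a select.
    Cost: one pass over the input. Crude assumption: half of the rows survive."""
    return n * k, n // 2, k


def plan_product_then_select(n1, k1, n2, k2):
    """First the Cartesian product, then the select on the big table."""
    cost_product, rows, cols = cartesian_estimate(n1, k1, n2, k2)
    cost_select, _, _ = select_estimate(rows, cols)
    return cost_product + cost_select


def plan_select_then_product(n1, k1, n2, k2):
    """First the select on the first table, then the Cartesian product."""
    cost_select, rows, cols = select_estimate(n1, k1)
    cost_product, _, _ = cartesian_estimate(rows, cols, n2, k2)
    return cost_select + cost_product


# Hypothetical sizes: 1,000 movies with 3 columns, 10,000 cast rows with 2 columns.
N1, K1, N2, K2 = 1_000, 3, 10_000, 2
cost_a = plan_product_then_select(N1, K1, N2, K2)
cost_b = plan_select_then_product(N1, K1, N2, K2)
print(cost_a, cost_b)
```

With these numbers the first plan pays for the n1 times n2 table twice, the second plan pays for only half of it once, which is the point made above: even a very rough estimate is enough to rank the two plans.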
Yeah, I did a lot of iterations on these slides. It's actually not so easy to explain this stuff, okay. And this one I leave open. That was this interesting one where you can leave out the Cartesian product altogether. It's clear what will come out, right? Here you have the cost of this one, which is huge. Here you don't have it. Okay. Oh yeah. Let me show that one. That's interesting. Did I want to do it at that point? Yeah, let's do that. Let's go back here. And maybe now is the time to explain something about the exercise sheet. So for the exercise sheet, now we just copy because of Daphne, there's a problem with the authentication of the uni accounts. It doesn't work at the moment for anybody. So first thing for the exercise sheet, you now get much bigger data. In the last lecture we played with toy data, now you get bigger data. You get a lot of TSV files. They are somehow related to the last sheet, but with more information and bigger data. I've copied the tables here, so for example, now I have a movies TSV which is much bigger. Yeah, so over 100,000 movies, so it's a table here. Now let's look at the roles table. This now has three columns, it also has the role. Not for everybody, sometimes the role is missing. So here we have someone with this ID playing in a movie with this ID, the role of Ophelia and so on. So these tables are pretty big now. This is now over a million rows. It's still small compared to the real world, but for our purposes now it's a really big one, and we have all kinds of other tables here. Let me do the following now, let me just load them into SQLite as follows. So here I'm just creating the corresponding tables. I hope that's correct, and then I'm importing the data. And let me do the following. We already did that in the last lecture. SQLite.
Ah, now I want to actually, want it to be stored in a database file so that I can reuse it. Now this is just creating the tables and importing the data. You can see it takes some time, but shouldn't take too long. At least it works, that's good. And while it's doing that, let's maybe write a query. That was wrong SQL. Which query should we take? It doesn't really, I don't know. Select and let's also maybe take timer on. Select, I don't know, movies. Let's see, let's take from, we have movies, we have persons, we have roles, where, movies, now I'm not sure what, let's just look at the schema. Let's maybe, and I should maybe make this smaller now, no let's make it larger, nVIM. Let's, so movies, ah it's called movie ID now in my schema again, movie ID is role, oh thank you. Should be roles. Should be roles, if I make the mistake here, and it will, okay. And roles, yes, thank you. Nice one. Okay, so let's just go with that query. And yes, sorry, too many things going on at the same time. Thank you for correcting me. So I have this query and now let's query SQL. And now I can just pipe it. I already have my tables in here. Let's see what it does. SQLite3. Okay, that's too many movies. Let's maybe, I could do the following. I could just count. We didn't have count yet, but that's just counting the number. Now I probably crashed this machine. Okay, see already producing the, okay. So this is a very simple, yeah, timer on. You just get timing information, you get the number of rows in the result. And now let me do the following. Let me do the query twice and write just before the first one explain query plan that's what the slide see now I get this now I see what what it does scan role search so it's it's kind of telling me what it does. It's using different operation names than the ones we are doing. And we'll come back to that at the end. I just wanted to show it for now. That's how you can do it. 
You just have the query and you write, explain query plan before the query. You can also do just explain. That's also written on the slide. It's interesting. And now I get a really low level query plan. So this is something which SQLite provides. This is basically on the instruction level, what it does. Rewind, I think, is something. This is like nested loops here, and now it does all kinds. These are registered. So we will not go into that, but I wanted to mention that this exists, and you can do it to get an idea of what the query engine does. It's actually very important if you use these things in practice to see what's going on. Unlike much of SQL, this is not standardized, right? This is now what you just saw, is what SQLite gives us. Other database engines will also do something like this, but they might do it differently. And let's just very quickly go to what I showed you here. So here, by the way, you also, this is our explain, this is our own engine here. Just so that you see that you also have all the components here. So for example here you have our size estimates, here you have the operations, here you have estimates for the cost of the query, estimates for the size of the query, and so on. So that's our, and behind this is a text format which is just displayed as a tree here. Here it's also kind of a tree, right, shown by this ASCII. And we will come back to this in a second, but you might want to play around with it in the exercise sheet. So the last part, joins. That's also important for the sheet and maybe let me first explain a little bit and then I can now show you the sheet, I will show it to you. So the following is very frequent in a SQL query. We have seen this a lot, you have two tables and the from clause, you're kind of combining them and then you have something like one column from this table is equal to one column from that table. We have seen that a lot. And it comes, you did that for the last exercise. 
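If you want to try this outside of the SQLite shell, here is a small sketch using Python's built-in sqlite3 module. The table and column names and the two data rows are placeholders, not the schema from the exercise sheet. It prefixes the same query once with EXPLAIN QUERY PLAN and once with EXPLAIN, as described above.

```python
import sqlite3

# In-memory database with two tiny placeholder tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE movies(movie_id TEXT, title TEXT);
CREATE TABLE roles(person_id TEXT, movie_id TEXT, role TEXT);
INSERT INTO movies VALUES ('m1', 'Hamlet');
INSERT INTO roles VALUES ('p1', 'm1', 'Ophelia');
""")

query = ("SELECT COUNT(*) FROM movies, roles "
         "WHERE movies.movie_id = roles.movie_id")

# High-level plan: the last column of each row is a human-readable description.
plan = con.execute("EXPLAIN QUERY PLAN " + query).fetchall()
details = [row[-1] for row in plan]
for detail in details:
    print(detail)

# Low-level plan: one row per virtual machine instruction, opcode in column 1.
opcodes = [row[1] for row in con.execute("EXPLAIN " + query).fetchall()]
print(opcodes[:5])

# The query itself, for comparison.
count = con.execute(query).fetchone()[0]
print(count)
```

The exact wording of the plan depends on the SQLite version, so the output is best read as a hint about what the engine does, not as something to rely on.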
If you design your tables in a certain way, like spreading, yeah, how we did it. Here just have the movies, here have the persons, here have the roles with IDs. It's natural that you have such queries. And in this query we had it twice, right? Here I hope this is correct now, I have movies and persons, and then I have it twice here, I have this pattern twice. And now it turns out that it's a good idea to define an operation specifically for this combination. So we don't want to do, and I've said this many times, first compute the cross product, the Cartesian product of three tables, and then do these select operations. Select they are called, right, when you filter out rows. But I want an operation specifically for this and this operation is called the single column equijoin and I have a slide explaining why it's called like this. And here's how it's defined. So understand, so far I've just defined three operations, but I can define more operations. And I just tell my database engine, look there's also that operation. If you think this is useful, use it. And then you can generate all kinds of query plans, sequences of operations, and try to figure out which one is the best. So even if these three operations which we have seen are enough to cover everything, it makes sense to throw more operations into the mix because maybe I can be more efficient then. So this is not needed for correctness or coverage, it's needed for efficiency. And it's called a join, it takes two tables as argument and two column indices. And what it does is pretty simple. You have seen this a lot now, so this first column index is just column index from the first table, this is from the second table, and then the result is like in the cross product, you concatenate the columns, so that's like in the cross product, but the result is you just get those combinations for which the values in these two columns are equal. 
And you can already see here why it's called the equijoin: because you have equality here. And single column, because it's just one column for each table. I have a slide on this. This is kind of doing two things in one: combining the tables like in the cross product, but then only combining those rows for which this condition holds. It's doing two things in one operation, and this operation has a name, and it's like the most important operation in databases, which is why it has its own symbol here. It's the join symbol, but it's just the same as this one. And here's an example. Okay, something went wrong here with the formatting. I will keep it now, I will fix it later. It wasn't so easy to find an example which works for all the following slides. I hope this is correct now. So here I have two tables now. One is called profs, one is called studis. This one has IDs for courses. So these are IDs from HISinOne, abbreviated IDs, and I have the names of some professors here. So this is, I think, the artificial intelligence course from some semester, which used to be taught by Frank Hutter and Wolfram Burgard, who is no longer there, but that doesn't matter for now. This is our course here, this is another course here. And here I have some student IDs, made up IDs of course, and the courses in which they participate. And here I have the result of this join operation. Let's maybe not look at this yet. If it's not correct, please tell me. This is how I would write it in SQL. Combine these two tables. There shouldn't be a semicolon here, of course. That's part of the same query. And I don't know why, the indentation got lost here. So now we fix that. And let's just check that this is correct. So now I get all combinations of rows of these two tables, but the value in course has to be the same. And please check that it's correct. Maybe I made a mistake. And the size of this is now not so easy to predict, right?
It really depends on the tables. It could be empty, it could be the whole Cartesian product. Think about it. It can be anything in between. So this says: join these two tables just in the way you would do it for the Cartesian product. And there's again, you see, PowerPoint capitalizes even though I don't want it to. And it will only take those combinations of rows where the value in the course column is the same. And this is the 2, 2: just take this as join column for the first table, take this as join column for the second table. Consider all combinations of rows where this value is the same. So for example, we have here Brox, V720. We don't have this here, so we have no such row here in the result, right? This makes sense. We have Bast, V1304, and we have this 1304 here, so we would expect one result with this combination, yes. And now we have this V2040. We have it three times here and two times here, so we would expect six rows in the result, and we do have six rows in the result. So that's important to understand, and look at this at home. Also understand that the join operation is in principle like the Cartesian product. It's just filtering out the stuff where it doesn't match. But on a lower level you also have this Cartesian product thing going on, right? Now we have all combinations, and let's maybe highlight them here. So for example here, these rows all match, and so I get all six combinations of them, and it will correspond to these six results here. This is very typical for databases, and this is also why the equijoin can be pretty expensive if you have a lot of equal values there. Is there any question about this, what the equijoin does? Yes, please. Oh, an error, yes, you are completely right, there's an error. Thank you. That was PowerPoint correcting me. Thank you very much. Let's check if there are more errors.
And what you should get in the result is, by definition of the join, these columns should be identical. You should have the same column twice. Now it's debatable, and we debated this, whether you even want it in the result. You could define the operation such that you only have this column once. But in the literature and also in many engines, it's actually done like this. And I think it's nice because then you see how it's related to the Cartesian product. It's like the Cartesian product with fewer rows sometimes. And let's maybe stick with this table a little more. I think it's on the next slide but let me just say it here: understand the two extreme cases. If these two are disjoint, no intersection, then the result is empty. If all the values here are equal, there's just one value here and it's the same value here, then I get the full Cartesian product. So I can get anything between the empty table and the full Cartesian product. So it's not that the equijoin is always cheap. It typically is, because typically, if you think about real data, you will have many different values, and also some equal values. Here you only have equal values when you have several lecturers for the same course. Here you have equal values if you have several students for the same course, which will happen. It's also not easy to predict the size of the result. Okay, let's go on. So why the name? I think that became clear now. It's called single column. You could also have two columns for each table and say the values have to be the same in both columns. Then it would be a multi column join, and this thing exists, but we don't deal with it here. And it's equijoin because we have this equality here. So that's on this slide. And that's what I just said. Yeah, and that's just the obvious thing.
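To see the two extreme cases concretely, here is a deliberately naive sketch of the single column equijoin: it really computes the Cartesian product first and then keeps the rows where the two join columns agree. Tables are plain Python lists of tuples, column indices are 0-based, and the function names and the tiny tables are made up for illustration.

```python
def cartesian_product(x, y):
    """All combinations of a row from x with a row from y, columns concatenated."""
    return [row_x + row_y for row_x in x for row_y in y]


def naive_join(x, y, i, j):
    """Single column equijoin of x and y on column i of x and column j of y.
    Naive on purpose: Cartesian product first, then filter."""
    offset = len(x[0]) if x else 0  # column j of y sits at offset + j in a combined row
    return [row for row in cartesian_product(x, y) if row[i] == row[offset + j]]


# Extreme case 1: the join columns are disjoint, so the result is empty.
disjoint = naive_join([("A", "c1"), ("B", "c2")], [("s1", "c3")], 1, 1)

# Extreme case 2: all values are equal, so the result is the full Cartesian product.
same_x = [("A", "c1"), ("B", "c1")]
same_y = [("s1", "c1"), ("s2", "c1"), ("s3", "c1")]
full = naive_join(same_x, same_y, 1, 1)

print(len(disjoint), len(full))
```

Note that the join column appears twice in each result row, once from each table, as in the definition above.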
So this is our query here and now we have a new operation in the mix, that's what I said. What's the consequence? Well, we could do it like before, that's how we did it all the time. And I hope it's not too annoying, these small little mistakes. Oh, and again, studis was corrected by PowerPoint. So that's my one sequence, I just take the query literally. I say okay, two tables, let's compute the Cartesian product, then let's select the rows where this is true. Let's just check that this is two and this is four. Here we have another one, studis course, that was PowerPoint, I know it. And this is the second sequence. I just say, okay, this is one join. I have an operation for this, right? I have one operation which does the whole query. And similarly the other query, where I have three tables and two where conditions, that would be two joins. One way to do it would be two Cartesian products and then one select, but I could also just do two joins. And now which one is better? I think it's clear: depending on the implementation, which we'll talk about next, this one is better. Any questions about this? Okay, so now, is this better? That's the question. I mean now we have made one operation out of it, but it now depends on how you implement it, right? I could implement this operation by computing the Cartesian product and then going over the rows. Then I haven't won anything. So the advantage of this is about the implementation and we will talk about this now. And there are two ways to implement joins in principle, hash join and merge join. And you will see in a second a connection to the information retrieval stuff. So the hash join, that's the simplest one and you will do that for the exercise sheet.
And this is the formal definition, but I think we should do it by example and then maybe go back to the formal definition, or maybe not. Let's do that. I think the best way to understand this is by doing an example. So this is the hash join. We have two tables X and Y, and of course we have to decide which one is X and which one is Y. This is symmetric, so we could do it either way, and it's an interesting question which is the best way, but for now let me just do it. So this one is Y and this one is X and that's my result table. So the first thing I do, it's on the previous slide, it's called a hash join, so I build a hash map for the table X. I will just look at all the distinct values in the course column, so I will have V720, and now I remember row IDs here. I give these in any order, that's important. It's not important what the order is. I just take them in any order. And now I just say, okay, this here occurs in row one, then I have this one here, and this occurs in rows two and four, and then I have this other course. And note, I can hash anything I want, so here I'm hashing strings. And this is just, what, 1304 is row three, yes. So here it's not a very interesting hash map, but at least I have some duplicates here. It's two, four, so that's what I compute. Let's just go back to the formal definition, the first steps here. These are the rows of X in any order, so it's not important what the order is, I just go over them in some order for now, so that I can speak of row indices. This is just the set of distinct values of the join column. The column may have duplicates, so in my example this set has three elements, three different courses. And I need to make something blue here.
Now compute a hash map where I just remember for each distinct value in my join column, the sets of indices of row which contain it. It's actually harder to write it than to do it, right? So that's this here. Yeah, it's simple enough. Now I go over, now iterate over y. And you have to implement this. It's a good ints. Now I iterate over y and now I just, actually don't need the, now I just, in any order and let's start with 1304. 1304, I look it up in my hash table, yes 1304 is there and it's there with row 3. And now what I do, I will just, and sorry the things here are again, I've capitalized because PowerPoint is doing these intelligent things. And use the opportunity to think about it. It's not even trivial to correct it because it will always counter correct. But if you correct it enough, you win. Somehow remembers that when you correct it many times, it gives up, that's nice. And I will actually now do it in a different order than on the slide, so, on the way I showed you the results. So, 1304, I look it up here, it's three, and I say, okay, now for everything in this set here, I write a line to the result. So let's do that here, I will do that. So I will do seven, three I will do 7383436V1304 and here it says 1304, I think I should, it says 3, so I should take row 3. It's actually simple if you understood it, but you have to understand it. V 1304. And now let's go on. Now I take the second one. I say okay V 0240. Is it here? Yes, it's there with 2 and 4 and now I take all combinations of this row with rows yeah I do it with row 2 and I do it with row 4 so let's just do it so I will have this twice now 3, 4, 8, 7, 4, 6, 4, 3 and that's also pretty much how a database engine will implement it, how you will implement it for the exercise sheet. V2040 and now I will take row 2 here, which is this one and I will take row 4 here which is this one. And so on. I don't fill up the whole. You get the idea, right? Now I go here. 
This again, I look it up, I get two fours, so I take all combinations of this row with row two and four from my table X. This one does not exist, so nothing happens. This one again, so this is how I fill up four more rows. This one does not exist and so on. Let's go quickly jump to the formal description. So you go through those rows in any order, let's maybe see that in here so that I can correct the colors. Yes, also some blue. It's just the informal, it's just a formal description of what I just said. It's actually easy, it's just hard to describe it formally. Any questions about this algorithm, how it works? That's what you're supposed to implement. Maybe now is the time to show you the exercise sheet because we are almost done and I think it will also not say much more. The exercise sheet is really interesting and we spend a lot of time on this also on the lecture but also on this. Let's see, I should have... So the first exercise sheet will be to implement a very, very simple database engine where you just implement three operations, project, select, and these are really very simple, selecting a column, selecting rows, and doing this join, and you will implement the hash join. That's the first exercise. And you just have a, so here is a, and don't be afraid of the files we're giving to you, they are pretty big. So here it's like a function and you will recognize this. It's the function, it has basically the same signature as what I defined on the slides. And here you have extensive doc tests. And we are giving all the code for reading in the tables to you and also for pretty printing the tables. This is not something you have to do. So here it just says, okay, if for a certain input table here, which we also give to you, you call this operation, then you should get this table as a result and so on. And then you should just implement it here. And it's very little code, but you have to understand what you are doing. 
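The algorithm just walked through can be sketched in Python as follows. This is a minimal sketch and not the reference solution for the exercise sheet: tables are lists of tuples, row and column indices are 0-based (the slides count rows from 1), and the student IDs and the two course IDs without a professor are made up.

```python
from collections import defaultdict


def hash_join(x, y, i, j):
    """Single column equijoin of x and y on column i of x and column j of y."""
    # Build phase: for each distinct value in the join column of x,
    # remember the indices of the rows of x that contain it.
    rows_by_value = defaultdict(list)
    for row_idx, row in enumerate(x):
        rows_by_value[row[i]].append(row_idx)

    # Probe phase: go over y in any order, look up each join value and
    # write one result row per matching row of x.
    result = []
    for row_y in y:
        for row_idx in rows_by_value.get(row_y[j], []):
            result.append(x[row_idx] + row_y)
    return result


profs = [("Brox", "V720"), ("Hutter", "V0240"),
         ("Bast", "V1304"), ("Burgard", "V0240")]
studis = [("s1", "V1304"), ("s2", "V0240"), ("s3", "V0240"),
          ("s4", "V0001"), ("s5", "V0240"), ("s6", "V0002")]

joined = hash_join(profs, studis, 1, 1)
for row in joined:
    print(row)
```

The course with two professors and three students contributes six rows, the course with one professor and one student contributes one, and the courses that occur in only one of the tables contribute nothing.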
You should also implement select, and you should implement join. So that's exercise one. You should just implement these operations. Is there anything for the students to do in table.py or is this just given? Nothing to do? Yeah, we just give it to you. So Sebastian did this. It's all the annoying work of reading the table and so on and pretty printing a table. And it's actually nice that you can do this. So after this exercise, you have written a very small database engine yourself. And what's the other one? Oh yeah. Okay, and then the second exercise is, and now pay attention. Or maybe I show it to you in the result. Yeah. We are giving you a SQL query, only one SQL query, a pretty complex one, but it's a typical SQL query. You have more data, you have it in different tables, here it's movies, awards and so on. And here it's like we want everybody who played in a movie which won a Golden Globe and has a certain high score. So it has a lot of joins here. By the way, let me briefly mention that this cast is there because in our Python program, for simplicity, we don't have integers or reals, everything is text. That's the only reason why you have this artifact here in SQL. Does it still work, Fang? Sure. You are not sure? I'm not sure. There's a message. Today is one of those everything. I won't touch it. What does the message say? Merry Christmas or? R00. R0, that's certainly a database error code. Okay, R00. I mean the camera works here. The combination of things that went wrong today is pretty amazing. I don't see the relation between uni account elder problems in this one, but I'm sure there is a relation. So this here is just an artifact: in our Python program, everything is just text, and so if we want to compare something like an integer, we need this cast.
But you don't have to worry about this in your program. What you should do, you should translate this manually into a sequence of operations. This is something a database engine does automatically of course, but that would be a whole different game, automatically translating a SQL query to a sequence of operations. We have seen this many times. You do this manually here. You just write, like we did it, a sequence of operations. You implement these operations and now you call them. And you do this three times. You think about three different sequences of operations, you implement them and then you run them. Of course the result should be the same each time. And then you also run a SQLite query. You can just put this into SQLite and see if the result is the same, and you compare the running time. I think it's a great exercise to understand what we did today, to understand all kinds of details also, and you should absolutely do it of course. Okay, and I think, yeah, let's do the running time and then we should finish. No, I don't think it's important now to do the running time. Let me just mention, and we will do that in the next lecture because you don't need it for the sheet, there's also a merge join. We can do the exact same thing differently by sorting these columns, and we will just continue with that in the next lecture. But you don't need it for the sheet, for the sheet you implement a hash join. So let's see, there's a question in a zoom flash, cart is full, pretty briefly a minute ago. Cart is full. Should not be full, cart is full. To get the best sequence of commands for a query, one has to calculate the cost for every possible sequence. Should we know how to construct them? Okay, that's a good question. The question is for the exercise sheet, as I just explained to you.
That's a different problem which we are not considering, generating all possible sequences for a SQL query. It's a very hard problem. You can do this with dynamic programming, only generating the optimal one. It's not part of your task. Just generate three different sequences which come to your mind. And then estimate their cost and try to find one where you think this is a good one. You can also pick ones which are bad, it's up to you, but you don't have to, it's too much work. Just think about three different ones, and one which is hopefully good. Any questions about this for now? Okay, so do the exercise sheet. It's very valuable to understand what we did today, and we will continue in the next lecture. So thank you, bye.

Welcome everybody to lecture five, databases and information systems, the course which can also be taken as information retrieval this semester. Particularly nice weather today, which is maybe the reason for the slightly fewer than usual people in the room. But nevertheless let's carry on. We will talk about your experiences with exercise sheet four, which was about coding a simple database management system and asking a few queries on it. And today we will continue with all this. I'm not going into the details here. For the exercise sheet I've chosen the heading next level SQL, and we will talk about it later in the lecture. So it's one more lecture about databases and SQL, but next level. So if you sit down, please lift the chair and don't drag it over the ground. Thank you. So here are some quotes from your experiences with the last exercise sheet. The lecture and the exercise were very well explained. I really like that we discussed efficiency. Many of you said that, so I think in a typical database lecture course you don't hear much about efficiency, but I think it's just central. It doesn't make much sense to talk about databases without efficiency.
Just think about the Cartesian product, I mean this starts to become inefficient already for a very small query. The efficiency aspects of different approaches is even exciting. Thanks for this great exercise. The result table is motivating, so you could code something, try out your query, do the theory, then also see what the real running times were. And some people found that their first attempts were not so good, but then they saw what others achieved. And by the way, you absolutely have to do the exercises to understand this stuff. The sheet was the best one so far, I liked it. Interesting sheet, quite easy to accomplish but learned a lot. Okay, I don't think it was easy for everyone. But would have been cool to come up with a SQL query myself. Yes, for this exercise sheet we have more complicated questions and queries, and one task will be to come up with a SQL query yourself, and it will not be easy. Yeah, and a lot of things went wrong in the last lecture, basically everything, let's see what happens today. So maybe the good weather compensates for everything. Okay, so let's continue from the last lecture, and I've included a few slides from the last lecture, just to get into the mood again: the hash join. You already worked with the hash join in the exercise sheet. This was the formal definition, I will not repeat it, but let's look at the example. So please pay attention, understand the example again, and then we will move on to the merge join, for which we didn't have time in the last lecture. Let's understand these example tables here, they are very small so that we can put them on the slide. So we have here, okay the schema is not here, but you see the schema here, we have two tables: professors and the courses they give, so we have Brox, Hutter, Bast, Burgard here, and here we have students participating in courses, just very few because the tables are small.
And now we want to join the tables. And we have seen, okay the example, maybe let me show that again because that I think was, maybe I should have included that on the slides of the, just to get into this again. This should have been towards the end of the last lecture. This was the correct, here we have the schema and the table right, that's how you do it. So it's important also for what comes in ten minutes or so. There are some professors here which don't have students here, so for example Brox, there's a course here, there's no course here, so it will not be in the result. And there are some professors here where we have some courses with multiple professors, right? So for example this here was, so understand this again, here we have a course called V0240, that's an ID, and it has three students and two professors which means we have six results in the join. So let me say that once more, so this cross product, Cartesian product thing can also happen on a micro level and does happen on a micro level when you do joins. A course with two lecturers, two professors, three students, you get all combinations in the result. So that's why you have these six rows. Then you have one course here, 1304, with one student that gives one, another row, so seven rows, and this course here has no students in this table, so no more rows. So, and we talked about and you implemented this, how you do this with a hash map. Maybe let's also quickly recap that. There's enough time for that. I think it's important that you now, only for the purposes of the computation to give the row indices. I mean, it's not in the abstract model, this is just a set, but when you work with it algorithmically, then temporarily you have to give them row indices. So here we have one, two, three, four, here we have one, two, three, four, five, six. And now we compute this hash map for this one table. Also do it for the other table. 
We have three different courses, and the keys are the courses because that's the column on which we want to join. So three different courses, one course with two professors, and here we have a map. The values are not values from the column, these are the row indices where we have this thing. So for course 2, 0, 4, 2 we have rows 2 and 4. And now with that hash map, that's how hash join works, we can go over this table. Let's do that again. We do course after course. We can go in any order. V1304: is it in the hash map? Yes, it's there with this row three, and so I'm taking this and combining it with row three, and that's what I write here. And I take the next one, V0240: is it in my hash map? Yes, with two rows, two and four. So I take this row and this row and combine them with this row, which is why I have these two rows here. That was the hash join. You implemented it, at least those who did the exercise sheet. Any question about this before we move on to the new stuff, which is the merge join, which is also not much more complicated? Okay, so this was just warming up so far, and we already talked about this. Maybe let's also recap the running time, because the merge join has something similar. What's the time for this? I mean, I have to build a hash map over this table, so I go over the table once and I insert into the hash map, so if I have 20 rows here I have 20 inserts, and then I go over this one and do lookups in the hash table. So I have as many inserts as I have rows here, and as many lookups as I have rows here. And then I have to create this table, which has a certain number of rows and columns, and that's why I get this many. So this is the number of inserts, number of rows in the first table; the number of lookups, number of rows in the second table; and this is just the size of the result, which can be large due to this cross product effect.
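As a minimal sketch of the hash join recapped above (not the reference solution for the exercise sheet), this is roughly how it could look in Python. The function name `hash_join`, the student names and the exact course IDs are made up for illustration; only the shape of the data (four professor rows, six student rows, one course with two professors and three students) follows the example.

```python
from collections import defaultdict

def hash_join(x_rows, y_rows, x_col, y_col):
    """Equi-join two lists of tuples on x_rows[i][x_col] == y_rows[j][y_col]."""
    # Build phase: one insert per row of X.
    # Key = join value, value = list of row indices in X holding that value.
    rows_by_key = defaultdict(list)
    for i, row in enumerate(x_rows):
        rows_by_key[row[x_col]].append(i)

    # Probe phase: one lookup per row of Y.
    # Every matching X row produces one result row (the "micro" cross product).
    result = []
    for y_row in y_rows:
        for i in rows_by_key.get(y_row[y_col], []):
            result.append(x_rows[i] + y_row)
    return result

# Hypothetical data shaped like the lecture example.
professors = [("Brox", "V0720"), ("Hutter", "V0240"),
              ("Bast", "V1304"), ("Burgard", "V0240")]
attends = [("s1", "V1304"), ("s2", "V0240"), ("s3", "V0240"),
           ("s4", "V0999"), ("s5", "V0240"), ("s6", "S1300")]

joined = hash_join(professors, attends, 1, 1)
for row in joined:
    print(row)
```

The three students of the course with two professors contribute six rows, the single student of the other shared course contributes one more, and the professor whose course has no students does not appear at all.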
And also understand, let me also say this again: we are doing join operations because it's more efficient than first doing the Cartesian product and then filtering. This is really important. This cross product effect also occurs here on a micro level, and if it would so happen that all values here are the same and all values here are the same, even a hash join or merge join, any join, would compute the full Cartesian product. Just assume this is a big table and half of the values are the same here and here. This would also give you a very big table, which is why it's important that I have this here in the running time. So if my table is large, this will be an expensive operation. Now let's do the merge join. This is the abstract description, but let's first do it by an example. And now comes a very interesting connection to the first and second lecture, because now we will do something which very much looks like an inverted index, which is an inverted index. So, I'm not sure how much space I need, so this is my table X here and this is my table Y here. So, let's start with X. Now I do, not quite the same thing: I sort these values in the join column. And that's also what's written on the previous slide: we have the values in the join columns in sorted order. But I have to remember from which row each value comes. So let's do that, it's best understood by an example. So I'm sorting these here, and I sort them lexicographically. Let's see whether it works. So that should be V720, and this comes from row one, and maybe I do the one in another color so that it's clear that these are the row indices. So I temporarily give them row indices again, four, five, six. Okay, now I need to do a bit of pen switching.
And now, okay, what's the next one? It's V1304 comma something. I think I have to write it here on a new line: two zero four two comma, and then I have another one, two zero four two comma, okay. So I sort by the values and I remember from which row they come, that's a typical thing to do in sorting: two, four. Yeah, these are now the values of the join column. Now I have to do the same thing for the other table, and let's do that. So yeah, it has to go over multiple lines. So what's the first one? S comes before V, right? 13... Let me leave the space for the row index open. What's next? 1304, right? 1304 comma, and what's next? Now come four times V2040 comma something, V2040 comma something, yeah. A multi-line list. And this should remind you of an inverted index. So this here comes from row six. This one comes from row one. So it's really just sorting the values, and these I just do in this order: two, three, four, five. Any questions about this? Yes, please. "There are not four of these 2014s, one of them is a 2016." So I made a mistake. Where did I make a mistake? Which element in which list? "The last element in the Y list, line 4 in the Y table, it has another course." Oh, that was wrong. Yeah, yeah, yeah, exactly. Thank you for paying attention. It's not a mistake here, it's actually, so yeah, it's more interesting then. That's good. Two, no, that was actually fine. Thank you very much. I could just claim I did it on purpose to check whether you pay attention, but I didn't do it on purpose. So that's correct now, yes? Two, three, four, that's a six, five now, right? I'm sorry. Okay, now two, three, five, let's just check this here. Yeah, this is two, three and five.
And this V here is a bit malformed because I was thinking about something else, which is why you shouldn't think about different things when you... And this should very much remind you of an inverted index, right? This is an inverted index. Here I have the rows in order of some temporary row indices, and now I take them just in a different ordering. So if I would sort this by row index, I would just get the rows in this order, and now I sort it by the values in this column, so I'm inverting. It's effectively an inverted index. And now for the result, I can do the same thing, and that was on the previous slide, and that's the important message here: I can use the zipper algorithm. Now I have two sorted lists of integers, and whenever you have that, you can do this linear-time intersect. And maybe, I mean you don't have to compute what I now write, but just for the sake of... So the result is, so how do I do this? I start here and I start here. And now I compare these two values here. Are they the same? No, they are not the same. And then I advance in the list with the smaller one, right? S is smaller than V, so I advance here. Are these the same? No, they are not the same. So now I advance here. And now one important detail. So now they are equal. So I will certainly write this to the result, and let me do that. And now it's a little variation. The question is what do I write in the result? I mean, I don't really have to materialize this, but I am doing it, so maybe I just write three comma one. So this means I have this in row three from this table and in row one from this table here, and so on. And now, okay, we don't see it here, I have to check whether this occurs multiple times here or there.
It's not the case, but we will soon see it to be the case. So that's one important difference here. So now I'm continuing in both lists because I've already processed this one, and now I'm here. So that's another match, so let me just write V2040. And now that's a little variation inside the zipper algorithm, and this is actually something we could have talked about in the first lecture: what do you do in an intersection if you have the same value several times here and here? You somehow have to define what you do. I have this value twice here and three times here. And a very common thing to do, and the right thing to do for a join, is to write all combinations in the result, and that's what we will do. So inside the zipper algorithm you now need a little loop. You know, that's when you would implement this. You don't have to implement it for the sheet, maybe for the exam. You now have to check, okay, how long is my sequence of V2040s here? It's two. And maybe I symbolize this by having a dotted line here. And this is something you can easily check. Let me just check how many of the same ones I have. And now, and understand that this is efficient, I have to output all combinations. I mean, I have to produce them, right? It's six combinations, I have to write six rows to the result, so it's also efficient to just do all combinations here. I can't do it any better, because I have to write six rows to the result. So, let's just write them here. And 2040, 1234. And this result table which I'm writing here, you don't really have to materialize it. You can directly write the table on the left, but I'm just doing it here to clarify what I'm doing. That's a bit, because I don't want to switch the pen so often. And now I have all combinations here.
So it's a two from the first table and two from the second one, it's two three, it's two five, and now the same with the four from the first table: four, four, four with two, three, five. And let me just also clarify this: this is from X. Wow. Where does this come from? Outer space, toilet, cooking inside the body. So now I have the result in this abstract form, right? What does it say? Let's maybe look at two rows here. For example, this one. And let's write this here to this first row. It just says, okay, this value here, 1304, it's an equijoin so I will have that in both tables here, V13... And the result is exactly the same as for the hash join of course, just a different algorithm. And now I need from table X the third row, so it's Bast here, and the first row here, so it will be seven three eight three four three six. And whenever you spot a mistake, please tell me. So there will be six rows. Let's maybe take one other one, for example this one here. This is now one, two, three, four, five, one, two, three, four, five, it should be this one. This row here, so it's V... And think about whether you have any questions about this algorithm, because it's very important and it's basically the zipper algorithm. So now I have row four from here, that's Burgard here, and here I have row two, so that's three, four, eight, seven, four, six, four and so on. Any questions about this? Yes? "I think there is a mistake, the last one in the result, there is a Y, not a..." Oh yeah, yeah, I'm sorry. Thank you very much. This should be... Is everything else correct? Yeah, computers are much better at executing algorithms. Either they are completely wrong or it's correct. This is the merge join, any questions? "Why do we use strings as IDs? Wouldn't integers be much faster?"
It's a very good point and we will see in a second whether that's true or not. Is it a good idea to take strings here? I mean, ideally, of course, the IDs are not strings, but integers. But I actually deliberately chose strings here. I mean, it works, but now I have to compare strings. Zipper works with anything that can be ordered. It's a very good remark and question, and we will come to it in a second. Let's first discuss the running time. What's the running time? So for the hash join it was: size of this table, this many insertions into the hash map, this many lookups, writing the result. Well, what do we have here? We have two lists, and, well, you have to sort these lists. Actually, okay, haha, I forgot something important: it takes this time if the tables are already sorted, that's important, sorted by the values in the respective column. So if they are sorted, then the zipper algorithm is actually quite efficient, it just takes the time of the merged result, which is... this is also not true here. It's okay, but it's good that I realize this now, because this is not the size of the result, it's also n times k, right? I should also have a plus n times k. Maybe, where? Yeah, because the size can be larger than the sum of the two lists, because I have this cross product thing going on. Okay, let me just add this here. Plus n times k, now I have to make things blue, and now the "where" is too large here. How will I do this? Oh my. The merge join takes this much time.
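The merge join walked through above, with the little inner loop for repeated values, could be sketched in Python like this. It is only an illustration under assumptions: the names `sorted_index` and `merge_join` are made up, the course IDs are hypothetical stand-ins for the ones on the slide, and row indices are 1-based to match the blackboard example.

```python
def sorted_index(values):
    """Sort the join-column values and remember the (1-based) row of each."""
    return sorted((v, i) for i, v in enumerate(values, start=1))

def merge_join(xs, ys):
    """Zipper over two lists of (value, row_index) pairs sorted by value.

    Returns (value, x_row, y_row) triples, i.e. the abstract form of the
    result; the full result table is not materialized here.
    """
    result = []
    i = j = 0
    while i < len(xs) and j < len(ys):
        if xs[i][0] < ys[j][0]:
            i += 1          # advance in the list with the smaller value
        elif xs[i][0] > ys[j][0]:
            j += 1
        else:
            value = xs[i][0]
            # Find how long the run of equal values is in each list.
            i_end = i
            while i_end < len(xs) and xs[i_end][0] == value:
                i_end += 1
            j_end = j
            while j_end < len(ys) and ys[j_end][0] == value:
                j_end += 1
            # Output all combinations of the two runs.
            for a in range(i, i_end):
                for b in range(j, j_end):
                    result.append((value, xs[a][1], ys[b][1]))
            i, j = i_end, j_end   # continue after both runs
    return result

# Hypothetical join columns of tables X (4 rows) and Y (6 rows).
x_courses = ["V0720", "V2040", "V1304", "V2040"]
y_courses = ["V1304", "V2040", "V2040", "V2016", "V2040", "S1300"]

pairs = merge_join(sorted_index(x_courses), sorted_index(y_courses))
print(pairs)
```

With this data the zipper first skips the value that only occurs in Y, then the one that only occurs in X, writes one pair for the single match, and then all two-times-three combinations for the repeated value. The comparisons here are string comparisons, since the zipper only needs values that can be ordered.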
And think about what's already on the slide. It's not too bad if there are these small mistakes. So that's the time: I have the two lists together, so I at least have to go through both lists, and also, because of this cross product thing, the end result may be larger, so that's the number of rows and columns in the result. And I think that's also not quite it, this here is just n when I go through this. Maybe now I can untighten this again. It's actually not so bad that we are... let's check whether that's correct here. Please also think with me. So let's go back one slide, where's the algorithm, here it is. I have to iterate over the lists here, but I'm touching each element maybe more than once, right? So here, for this two, three combination, I actually do six iterations over the elements. First all combinations of this with that one, this with that one. So it's size of this list plus size of this list plus number of elements here. That's I think accurate for the merge of these two lists, and then I have to write the table, which is n times k. So I think that's correct: for the merging, just touching elements, it's the sum of the two lists plus n, and then it's this. And this looks very similar to the hash join. It's essentially the same, so it's basically also linear. I have the size of the result table and here I have the inputs, and the n I have here and there. But still, if the tables are sorted, this is faster. I mean, you wonder maybe why. A hash map is just more complicated than scanning two sorted lists, right? Inserting into a hash map, a hash map is a not so trivial data structure. Just imagine this table being very large, now I'm constructing a very big hash map.
Which means this constant, ci here, for the number of insertions, that's not such a small constant. Whereas for the merge join, this constant is very small. I mean, just think about it, just iterating over these two lists, I don't have to do a lot of work, right? I just advance a pointer, I compare two values, I have a loop-like thing. So that is what's written here below. This is asymptotically the same as for the hash join, but typically this constant here is much smaller than the ci. I don't have a complicated hash map. Okay, I think for the new exercise sheet you should work with cost estimates, I come to the chat in a second, which you didn't work with so far in the last exercise sheet. Here are some simplistic estimates: just take the sum of the inputs and the size of the output. And you can take these for both. Now this is actually not an estimate, we know how large. Think about it with all the operations, that should also be clear. The number of rows in the result is not clear, but the number of columns is always clear. I mean, let's just go back to this example. Just by looking at the input, it's hard to say how many, I'm sorry, wow, this was a new key combination, I've never seen it before. It's kind of hard to say in advance how many rows your result will have, but it's always clear how many columns it has. It's just the sum of the two columns here. So that's always easy. So we know that. Now what about the rows? So first, ignore the constants here, because, we already said that, cost estimates just need to be good enough. Why do I take the minimum of the two? Well, that's a super simplistic assumption. Let's go back to one result here. Just by looking at the tables I have to say how big my result is, how many rows it is going to have. And I just make this super simplistic assumption.
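Written down as a small Python sketch, the simplistic estimate just described might look like this. It is one reading of the spoken description, not the formula from the slide: the function name `estimate_join` is made up, the input sizes are counted in rows, and the output size is counted as rows times columns.

```python
def estimate_join(n1, k1, n2, k2):
    """Simplistic join cost estimate for inputs with n1 x k1 and n2 x k2 cells.

    Assumptions (hedged reading of the lecture):
    - result rows: the minimum of the two input row counts (a guess),
    - result columns: the sum of the two column counts (known exactly),
    - cost: sum of the input row counts plus the size of the output,
      with all constants ignored.
    """
    rows = min(n1, n2)
    cols = k1 + k2
    cost = n1 + n2 + rows * cols
    return rows, cols, cost

# The small example tables: 4 rows and 6 rows, 2 columns each.
rows, cols, cost = estimate_join(4, 2, 6, 2)
print(rows, cols, cost)
```

For the example tables this guesses four result rows, while the actual join has seven, which shows how rough the row estimate can be when values repeat.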
If I have four here and six here, I just say okay, everyone here will have at least one match there. Could be totally wrong, but it's not the worst assumption to make, right? Just assuming these values are all different and they all occur here, that's not a very good estimate, but as we have said in the last lecture, any estimate might be fine if it helps to differentiate different query plans enough. So for the exercise sheet you can work with that simple estimate. You can also come up with a better one if you want. So now back to your question. Now let's play around a bit with SQLite, and let's look in the chat. "So the merge join is better if the tables are of similar size, and hash join is better if one of the tables is a lot larger than the other, because we don't have to sort." Let's look at the tables again, it's actually a complicated question. I mean, people are doing hash versus merge join, there are like 10,000 papers about it I think, and people spending half... If one table is very small, then building this hash map is cheap. It essentially costs nothing. And then just going over the big table and looking it up in this small hash map, I think that's the most efficient way to go. Because creating the hash table costs nothing because this table is small, and a lookup in a small hash table is also cheap. Now when it becomes larger, yeah, this is now the slide we are going to talk about in a second, if the tables are already sorted... You don't do a merge join, I think, if the tables are not already sorted. Sometimes the tables are already sorted by the values, and that's what we will talk about next. And then a merge join is usually better, unless you have the special case where one table is very small. So there is, and we talk about that now, the operation create index, very important. You usually create it for a particular column of a particular table.
And it amounts to approximately sorting the table by that column, not really sorting it but computing a permutation that sorts it. So for example for my movies table I compute an index on the first column, on the ID. So I say okay, sort it by ID, or compute a permutation that sorts it by ID. Just a side remark, you could also specify a comma-separated list of columns here, but that goes too far. But that's something you can do, and we will do it in a second on the command line. And now, pay attention when you do that. For example, I have my movies table, and let's maybe go right away here to just check. So this is how our movies table looked like, here I have the ID in the first column, and let's maybe sort it on the command line, I mean this we could do with sort -k1,1n, this is how you sort from the command line. Yeah, so now it would look like this, right? So now I have it sorted here according to this. Actually, what create index does, it does not sort the whole table and store a sorted copy, but it will do something like this. So in this sense it's highly related to the merge join, right? This is not really sorting this table according to the column course, because then you would have a copy of the whole table, but just a sorted order of the values in one particular column and the row index of where each came from. This is something which create index could do. But actually we have another slide concerning what actually happens. And what this should support is: now you can iterate over the rows of the table in sorted order. So if I go here, if I would actually sort it, now I can go over them in order of sorted IDs.
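To make the "sorted values plus row indices" idea concrete, here is a small Python sketch. It is not what SQLite actually does internally; the names `create_index`, `scan_sorted` and `lookup` and the three movie rows are made up for illustration. A lookup helper is included as well, since a sorted order also permits binary search.

```python
import bisect

def create_index(rows, col):
    """Sorted list of (value, row_index) pairs for one column.

    The table itself is neither reordered nor copied.
    """
    return sorted((row[col], i) for i, row in enumerate(rows))

def scan_sorted(rows, index):
    """Iterate over the rows of the table in the order given by the index."""
    for _, i in index:
        yield rows[i]

def lookup(rows, index, value):
    """Find all rows with the given value via binary search on the index."""
    # (value,) sorts before every (value, row_index) pair, so bisect_left
    # returns the position of the first entry with this value, if any.
    pos = bisect.bisect_left(index, (value,))
    matches = []
    while pos < len(index) and index[pos][0] == value:
        matches.append(rows[index[pos][1]])
        pos += 1
    return matches

# Hypothetical movies table: (movie ID, title).
movies = [(31, "Movie C"), (27179, "Movie A"), (5, "Movie B")]
by_id = create_index(movies, 0)

print([row[0] for row in scan_sorted(movies, by_id)])
print(lookup(movies, by_id, 27179))
```

Deleting or inserting rows would require updating the sorted list, which is where tree-based index structures become attractive.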
And another thing I can do: I have this sort order now, and if I want to know whether there is a movie with ID 27179 in this table, then I can quickly answer that question, for example with a binary search here. If it's not sorted, I either need a hash map or I have to go through the whole table. If it's sorted, or organized in some way, we will talk about which data structures achieve this. So one way would be to just sort the values and remember the row indices. You could also sort the whole table. That's wasteful. Before we say how we do it, let's try it. This is now important, pay attention. This is a bit messy, I don't want to write this all from scratch, it takes too long. I've just commented out stuff you can ignore. So this is, by the way, how you comment out in SQL, it's just minus minus space, that's why it's in blue. Blue means commented out. And that's the data from the last exercise sheet, also from the new one, with a small variation in the new one, please download the new data. This is our movies table: movie ID, integer, title, text and so on, we don't go through it all. So here I'm creating all the tables, here I'm importing the data, and below I have a query. And we already had this in the last lecture: I have the query twice, once with explain query plan, so that the SQL engine, SQLite, will tell me how it's going to execute that query, and then without it. And let's also look at the query, it's not very important what the query is. It joins three tables, namely persons, movies and roles. So it's just the join of these, and it's just joining on the natural columns. So the person ID in roles with the person ID in persons, and movie with movie. And it's not doing anything particular. Here I select count. Why do I select count? Because I don't want to output the whole table. I just want to execute this so that it takes some time, and then I just want a one-liner. So I'm not actually interested in the result.
I just want my SQL engine to have work. And what it has to do, it has to compute two joins here. That should be clear. To execute this, you have to execute two joins. And the order is up to the engine. So now these create index things are commented out. And what we can do now is we can just pipe this into SQLite 3. It will always take a little because I'm always importing the data from scratch. And now we will see a number of interesting things. So this is just the result here, it's just a count. By the way, it's not important that I have the same count twice here. It's the same number, not important. The query took some time, four seconds. And here it says: use an automatic covering index. So what's that supposed to mean? I will tell you in a second. Let's just leave it there for a second, and now let's do the following. And it's also really interesting to do this yourself: do it, check the running time, and understand why it is faster, why it is slower, and do I understand this query plan here? Do I understand what the engine said? Now I say, okay, I'm joining these things on IDs here, movie ID, person ID, let me create an index. So now I will compute some data structure there so that I can quickly find values in this column or iterate over them in sorted order. Let's do that. Let's do the same thing. Let's see whether it helps. And I will not explain too much now, let's just observe, and then the slides explain. But this is really, really interesting. So two things you should observe: it's four times faster, and it's also writing something different here, namely it's saying, thank you for the indices, I used them here. But it's also interesting now that it says, thank you for no indices, but I used my automatic covering index. What's that? Not clear now? I will explain it in a second. Let's do another thing. And really interesting, you should do it yourself at home.
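For trying this at home without the command-line setup, a rough equivalent of the experiment can be sketched with Python's built-in sqlite3 module. This is an assumption-laden miniature: the column names (movie_id, person_id), the index names and the handful of rows are made up, the real exercise data is much larger, and with such tiny tables there is no measurable timing difference. What it does show is how to ask for the query plan before and after creating indices; the exact wording of the plan depends on the SQLite version.

```python
import sqlite3

QUERY = """
SELECT COUNT(*) FROM roles, movies, persons
WHERE roles.movie_id = movies.movie_id
  AND roles.person_id = persons.person_id
"""

def query_plan(con):
    """Return the human-readable detail strings of EXPLAIN QUERY PLAN."""
    # The detail text is the last column of each returned row.
    return [row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + QUERY)]

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE movies(movie_id INTEGER, title TEXT);
CREATE TABLE persons(person_id INTEGER, name TEXT);
CREATE TABLE roles(movie_id INTEGER, person_id INTEGER);
""")
# Hypothetical miniature data; one role refers to a movie that does not exist.
con.executemany("INSERT INTO movies VALUES (?, ?)", [(1, "A"), (2, "B")])
con.executemany("INSERT INTO persons VALUES (?, ?)", [(10, "P"), (11, "Q")])
con.executemany("INSERT INTO roles VALUES (?, ?)",
                [(1, 10), (1, 11), (2, 10), (3, 10)])

# First run: no indices at all.
plan_without = query_plan(con)
count_without = con.execute(QUERY).fetchone()[0]

# Second run: indices on the ID columns of movies and persons.
con.execute("CREATE INDEX movies_by_id ON movies(movie_id)")
con.execute("CREATE INDEX persons_by_id ON persons(person_id)")
plan_with = query_plan(con)
count_with = con.execute(QUERY).fetchone()[0]

print(plan_without)
print(plan_with)
print(count_without, count_with)
```

The count is the same in both runs, as it must be; only the plan printed by the engine changes. Declaring the ID columns as INTEGER PRIMARY KEY in the CREATE TABLE statements, instead of creating the indices, is the third variant to compare.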
Let's do, and that's why I'm importing from scratch, the following: don't create an index, but we have this primary key thing here, right? This is the same thing as before, but now I say movie ID and person ID are primary keys. So let's do that. And let's do the same. I'm not creating an index, but I said these are my primary key columns. Ah, interesting. Now it's even faster, and it's saying something else. Here it was using the automatic covering index, here it was using the index I created, and here, obviously, it's using the integer primary key. When you say primary key, it knows this is the primary key, this is going to be used in all kinds of joins, that's what the primary key is for. So it says, I'm going to build an index for it, some kind of index, and it's even more efficient than this one. And let's do one more thing. Let's revert this. Let's again say not primary key, just integer. And now let's build all kinds of indices. Here I just build indices on movies and persons, but there's a third table here, roles, where I also have movie ID and person ID. Let me also build indices for those, right? When I merge persons with roles or with movies, I have a primary key here, a foreign key here. It should help, think of the merge join. And let me go back to that slide. Understand that for this merge join to be efficient, I need this here somehow sorted and this sorted. So kind of an index here and an index here. I haven't done that so far. So now I'm creating all the indices on movies, persons, roles. Let me just create an index on movie ID in movies and persons and roles. And let's do that and let's see how fast this is. And now it's using a, okay, that's interesting. Now it's persons, roles... I'm a little bit... no, no, no, I have to compare it with this one here, right? I have to check. This was the one where I had two indices, now I have four indices and it's actually slower.
So it's using different indices now, but it's slower. Why is it slower if I build more indices? I mean, I can only hint at this. This is so interesting, and obviously very important in practice. Let's now go to the slides and understand everything on this page. We can understand all of it. It's actually quite easy if you know what's behind it. And then remind me please that we go back to this and replace this with text and see what happens. Because that was your question. We can easily now say, okay, let's say this is not an integer but text. Maybe it's faster, slower, same speed. Let's go to the slides and try to understand this. And play around with it yourselves and try to understand it. So two more slides. How is create index actually implemented? What does it do? And then how does SQLite do it? So it is not defined by the SQL standard what exactly create index does. It kind of says: do something so that when you join on these columns it's somehow faster. How exactly it does this: SQLite computes a B-tree index. I don't have time, and I would love to give a lecture about B-trees. A B-tree is a kind of search tree. Maybe let's click on that link and see if it works. So a search tree, if you want to know about search trees, is something like this. This is in German, but nevertheless you will understand this is a search tree, right? You have elements, here it's just integer values, and I somehow organize them in a tree. Here it's like everything to the left in the tree is smaller, everything to the right is larger than this, so if I want to find an element I'm just navigating through the tree. That's a search tree, so it's not exactly sorting the elements, but it's keeping them in some order so that I can easily find elements. And I can also easily iterate over them in sorted order. So that's what SQLite computes, and that's fair enough. And why B-tree and not just search tree? I can only hint at this here.
A B-tree is a search tree, but importantly, I mean, if I sort my elements once, then it works for that sequence, but what if I insert or delete something in the table? That's what search trees are for, right? Now I could go and delete this 15 and I still have a tree. If I delete something from a sorted sequence, I have a problem: it's an array, I have to remove elements or something. And a B-tree is optimized for the case that my data is very huge and part of it resides on disk. We will not talk about this further here. I could give three lectures about the B-tree. This we have already seen: when you say something is a primary key, SQLite 3 will automatically do this, create an index, but not for foreign keys. I think we haven't seen this. For foreign keys it doesn't do this, and you wonder why, because you also want to join on foreign keys. And we have also seen this: if I have two indices on both of the join columns it's even slower. Why is that? And now comes, oh, what was that? Can you draw here on the... I guess this was, was it me or was it one of you? How can I annotate...? Yeah, we just have to live with it, I don't know where it... So SQLite 3 does not have a hash join. That's very surprising, I mean it's kind of the most basic operation, and if you read here it even says, I have to go there and show this, yeah, it says it here: an automatic index is about the same as a hash join. That's a very dubious statement. I mean, an index is not about the same thing as a hash join. For someone who has heard an algorithms and data structures lecture, that's a very strange statement. But it says so here, so it doesn't have a hash join, but says such a thing. It doesn't have an implementation of a merge join either. So it implements neither hash join nor merge join. Every normal database engine has hash join and merge join. Instead what it does, it always does the following.
So if I go back to my example here, it always picks one table, goes over the elements from that table and searches them in the other table. But it doesn't search them using a hash map, it searches them using its index data structure. So if you have an index on that one, what SQLite does: it goes over this column, and each element here it searches in that one, which is typically not a good algorithm. But that's what it does. And now you can also understand why it's doing this here, because that's the only algorithm it has implemented. When it has to join, and you haven't computed any indices, it first computes an index from scratch. That's what it's doing here, and that's why it's taking so long. It says, okay, I have to join two things, for example these two tables, and I would need an index for this one, it's not there, so let me compute it. So it kind of does create index just for this join operation, then it uses the index to find all these values from this list in that list, and then it throws away the index again. So terribly inefficient. I was surprised myself. I always thought SQLite is not the fastest engine but reasonable. I stand corrected. It's very unreasonable. It's terrible. I mean, that's a nice way to say it. It's optimized for inefficiency, I think. So more powerful, yeah, every normal engine supports that. But it's very easy to use, that certainly goes for it. One side remark in the documentation, I don't know how true this actually is, is that this is supposed to work on small devices, embedded devices, so the code has to be very small, so they can't do complicated algorithms. That's actually what they, I don't know if there's, oh yeah, it was written here. It's so interesting, yeah. Adding, I don't know, this sounds like a, I'm so sorry that this goes, yeah. That's their excuse.
Adding a separate hash table implementation would increase the size of the library for minimal performance gain. Hmm: implementing a proper algorithm would cost time and make my code larger, and who knows if it even helps. That's a very, very, very dubious statement. So we were very surprised, Sebastian and I, when we read this and found these very dubious statements here. Of course they are buried deep, in 14.1, somewhere on a subpage about optimization. Let's do one final thing here: let's use text for the keys and see what happens. Text, what do you think? Same speed, faster? It's very hard to predict for SQLite 3 because it does this. We just have movies, persons and so on, so let's just pick text here. There's of course more to understand, but I will not talk about it here. Let's remove the indices here. So now my keys are text, and let's just execute it. It will import the files. Actually I'm not sure. Okay, it computes an automatic covering index: I don't have an index, so I compute one. Okay, so you see it's slower, as one would expect, but also not super slow. But that's only because the other stuff is already so inefficient. You also see the performance differences are not huge, right? Factor two, three. That's like in Python: everything is inefficient already in principle, and then algorithms don't make such a big difference. In C++ you would see a performance difference of a factor of 10 or so between strings and integers. Okay, any questions about this before we make a short break and move on to the other stuff? Yes, please. Go back to slide nine. Yes. We've added the k and then we moved it. As I said, we looked at n multiple times. But we also looked at k multiple times, to take the mini cross product on the left and on the right. So you are advocating for having n times k here? Okay, let me just clarify what the k is. The k is the number of columns.
And I think for what I'm doing here, the number of columns plays no role. So maybe it's a confusion about what the k is. The k only comes in when I write the results, because then I'm actually writing all the columns. Here I'm dealing with abstract representations. So for all this it doesn't matter whether this table has 500 columns or 2. Good question. Any other question for now? Okay, so let's have a break and then resume in five minutes. Thank you. Okay, so maybe you are a little bit scared that we only have one third of the slides so far and it's half time already, but that shouldn't be a problem, because this was very interesting and also a bit deeper. The rest, I think, is easier, with lots of examples, so I think we will be fine, but it's nevertheless important. So first, more join operations. I think much of the following will be easy to understand, but please pay attention. There's a reason why we talk so much about joins: it's a central operation in databases, combining different tables. What we typically did so far, let me just show it again: you have values here and values here, and in the result you only have rows for those values which occur here and there. And the question is, what about this one? It's here but not there; should we include it or not? There are variations of the join which include it, and we will talk about them now. So what we did so far is an inner join, and it boils down to list intersection: just consider those values which occur in both tables. We could also say, and this T2 is not blue. Oh wow, why is this not blue? We can't continue before we make this blue, I'm sorry. Roadblock. Let's just say: if a value occurs on the left, in the first table, and not on the right, let's include it anyway. That's called the left outer join.
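To make the difference between the inner join and the left outer join concrete, here is a small sketch using Python's built-in `sqlite3` module. The table names `profs` and `studis` follow the lecture; the column names other than `course` and all rows are invented for illustration. Rows from the left table without a partner on the right are kept, and the missing right-hand value comes back as SQL `NULL`, which Python shows as `None`.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE profs  (prof TEXT, course TEXT);
    CREATE TABLE studis (student TEXT, course TEXT);
    -- Hypothetical rows: course C2 has no student.
    INSERT INTO profs  VALUES ('Prof A', 'C1'), ('Prof B', 'C2');
    INSERT INTO studis VALUES ('Student X', 'C1'), ('Student Y', 'C1');
""")

# Inner join: only course values that occur in both tables.
inner_rows = con.execute("""
    SELECT profs.prof, profs.course, studis.student
    FROM profs JOIN studis ON profs.course = studis.course
    ORDER BY profs.prof, studis.student
""").fetchall()

# Left outer join: additionally keep left rows without a partner.
left_rows = con.execute("""
    SELECT profs.prof, profs.course, studis.student
    FROM profs LEFT JOIN studis ON profs.course = studis.course
    ORDER BY profs.prof, studis.student
""").fetchall()

for row in left_rows:
    print(row)
```

With these rows, the left outer join returns everything the inner join returns plus one extra row for `Prof B`, whose student field is `None`.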
And similarly, the other way around, it's a right outer join. And this should be, let me also fix this right away because it's quick, that should be a subscript here. And you can also include them from both sides; that would be called an outer join or full outer join. Here's an example. Without the yellow thing, it's the same table as before. I'm joining on the course column, and here's a value in the join column which is only in the left table and not in the right table. Then I'm including it anyway. But now what do I include here? I don't have a row in the second table, so I just put null. We haven't talked about null here; null is a special value, and I think that's good enough for now. It's a special value where you just know it's null. I mean, any programming language, any system has a special value for null. So I think that's simple enough. You have to know the syntax, and it's called left join. And I will come back to this thing here in a second. Left join: instead of the comma I write left join, and instead of where I write on. So now I'm also including rows where this here does not occur here. Let me just show you an example for the right outer join. Maybe let's also highlight that here: this is about this one here, right, which doesn't have a counterpart on the right. And here I should have two students, namely these here, for which the course does not exist on the left, and I want to include them anyway in the result. This can be useful sometimes. So I just include these rows. And if I do a full outer join, or just outer join as I think you could also write here, then I would have this row and this row. Yes? In the students' table, there is a student with the course number B2016. Shouldn't he be the one who did it?
Oh my, yes, yes, it's the same mistake like before, right? Yeah, yeah, it's my, hmm, I somehow also in the preparation I overlooked this poor, poor, poor student. You are absolutely right. And I will solve this as follows. Thank you. I think I will just, let me just, no no this is not what I, thank you. Let's do some, let's solve this right away and let's, because actually I added a row here just so that you see there's not always just one row which is added here, so. Let's do this one, this is this one and here we should get this 260 students. Let's check whether it's correct. Is this correct now? 7, 3, 4, 5, 3, 4 Do you agree? Does it look correct? Now this one which doesn't have a counterpart and table on the left and this one. Yeah? Okay, thank you very much for paying attention. Yes, absolutely right. So it can be any number of rows here, which okay, let's continue. And if you just want both, and so yeah, we'll see in a second that SQL is a huge language. There are many, then there's the natural join, and I just wanted to mention it. It's not actually something new, but typically, I mean, so far we always wrote tables, comma tables, where, and then say this column is equal to this column, you can also write join on, I will come back to this in a second. If the columns are the same, you can just write natural join, and it just will join it on all the columns which have the same name in both tables, which is frequently the case, right? You have something movies ID, then in every table which has that ID, you call it movies ID, whether it's primary or foreign key. So these three, so that's what I was talking about. Now I have my tables profs and studis from before, everything is text for simplicity. So let's join them on the course column here, which is the same in both, so that's the old fashioned way to write it. Profs comma studis, so like Cartesian product, but then only take those where the two are equal. You can equivalently write it like this. 
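Since the slide itself is not part of the transcript, here is a sketch of the three spellings mentioned here (tables separated by a comma plus `where`, `join ... on`, and `natural join`), run through Python's `sqlite3` module. The tables `profs` and `studis` and the shared `course` column are from the lecture; the other column names and the rows are assumptions made for this example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE profs  (prof TEXT, course TEXT);
    CREATE TABLE studis (student TEXT, course TEXT);
    -- Hypothetical rows for illustration only.
    INSERT INTO profs  VALUES ('Prof A', 'C1'), ('Prof B', 'C2');
    INSERT INTO studis VALUES ('Student X', 'C1'), ('Student Y', 'C2'),
                              ('Student Z', 'C3');
""")

queries = {
    # Cartesian product, then keep pairs with equal course.
    "comma_where": """
        SELECT profs.prof, studis.student
        FROM profs, studis
        WHERE profs.course = studis.course
    """,
    # Explicit inner join with the condition after ON.
    "join_on": """
        SELECT profs.prof, studis.student
        FROM profs JOIN studis ON profs.course = studis.course
    """,
    # Joins on all columns with the same name, here only `course`.
    "natural_join": """
        SELECT prof, student
        FROM profs NATURAL JOIN studis
    """,
}

# Sort each result so the comparison ignores row order.
results = {name: sorted(con.execute(sql).fetchall())
           for name, sql in queries.items()}
```

`Student Z` never appears, because course `C3` only occurs on one side and all three forms are inner joins.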
All three give exactly the same result. Join on, that's now an inner join, so no extra rows if only occurs on one side. Or you say natural join, a natural join is just an equijoin on the columns where the name of the column is equal, which is here just one column. The question, so that's one thing about SQL which one cannot like with some justice, I think that you have so many different ways to write the same thing. So which one should you take? This is I think from ancient times. This is like very basic SQL. This is how we started. You can write all select queries like select from where. And you can even only do it with Cartesian product and project and select. This here is the more modern way where you make explicit that this is a join operation. Here the join operation is kind of implicit. I mean, let me just say that again, when you give this to a database engine it will do a join even though you didn't write join here. Here you make it explicit and here it's implicit on which columns you are joining, so kind of you won't see that so often but it's very easy to write. Okay, that's the natural join and just to show it to you that this exists so far our join conditions were always this is equal to this. Let me give you one natural example that this doesn't have to be so. So here, let's assume I have bus stops. So they have a name and some geolocation. And I have restaurants. They have a name and some geolocations. And now I want to join them, all bus stops and all restaurants, but I only want those pairs where the bus stop and the restaurant are nearby. So within 500 meters, less than 500 meters of each other. Now this syntax requires some special extension, but I think you get the idea. So now I have a join, consider all combinations where the two are close together. This is called a spatial join, maybe I should write this here and that's not the topic. 
So this is called, and it's a very spatial databases where things have geo information and then you could do all kinds of fancy stuff, you could give a whole course about this and we are also doing this a lot, so this is called spatial join. But it's an example and of course you have the same efficiency problems here. If you have one million bus stops all over the world and one million restaurants, you don't want to compute the Cartesian product here, right? Then filter. That takes way too long. You need a data structure. Actually the data structure here would be an R-tree, some geometric tree, some division of the plane. So you have similar efficiency issues. There's a question. Is it possible to join on that, the object shows the row of the one thing, the one table is smaller than the thing from the second table? So just instead of equal less than? Absolutely, yes, you can do that. And for every join you can, and you can always compute it like Cartesian product first and then just filter, that's the inefficient way. Then there's always the question to ask, let's say it's less or equal, can I profit from an index data structure? And for less or equal you also can, right? You can somehow, if you have a search tree over one column, you can make it more efficient. But yeah, you can do that. You can write anything, the question is just how efficient it is. And here there are special data structures for making that efficient. And actually, let me just, I think we have this, let me just quickly check, I think we have this query here, right? On our, do we have, I'm blind. Oh, maybe we only have it for Germany instance, let me see. Restaurants in Freiburg near Tramstopp. Let's see whether this works. Oh yeah, it works. Okay, here we have all restaurants in Freiburg. That's a query on it. So here we have it, yeah, it's exactly such a query. So now you have, yeah, Spice Trails is near Bertolt's pond and so on. So you just saw it in action. 
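The bus stop and restaurant example can be sketched without any spatial extension if one is willing to simplify. The sketch below assumes planar `x`/`y` coordinates in meters instead of real geolocations, and it compares all pairs, which is exactly the inefficient Cartesian-product-then-filter approach; a real system would use a spatial index such as the R-tree mentioned above. Table names, column names and rows are made up.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Assumption: x and y are planar coordinates in meters.
    CREATE TABLE bus_stops   (name TEXT, x REAL, y REAL);
    CREATE TABLE restaurants (name TEXT, x REAL, y REAL);
    INSERT INTO bus_stops   VALUES ('Stop 1', 0, 0), ('Stop 2', 5000, 0);
    INSERT INTO restaurants VALUES ('Restaurant A', 300, 300),
                                   ('Restaurant B', 5100, 0),
                                   ('Restaurant C', 2500, 2500);
""")

# Non-equi join: the condition is "less than 500 meters apart",
# written with squared distances so no square root is needed.
near = con.execute("""
    SELECT b.name, r.name
    FROM bus_stops AS b JOIN restaurants AS r
      ON (b.x - r.x) * (b.x - r.x) + (b.y - r.y) * (b.y - r.y) < 500 * 500
    ORDER BY b.name, r.name
""").fetchall()

for stop, restaurant in near:
    print(stop, "is near", restaurant)
```

The same pattern works for a condition like `<=` on ordinary columns; only the expression after `ON` changes.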
Okay, but that's just one example of a non-equijoin. Now the whole section about grouping, because group by, and this will also be on the exercise sheet, is the one really basic operation which is non-trivial and which we didn't talk about so far. We will just see it by an example. Let's say we have our movies file and we want all movies from the same director and take their average score. This is also what you have on the exercise sheet. So there are all movies by Steven Spielberg, all movies by Christopher Nolan. So I want to group the movies, in this case by director, and then for each group compute something interesting, for example the average score of a movie from that director. This you do with group by. And I'm not sure, I will do this for the last time in this lecture. Actually, Sebastian, do we need this operation for the exercise sheet? Yeah, you have to implement it, but you don't really need the formal definition. I've added the formal definition here, and you can see it looks super complicated, but it's actually not so hard. I will not talk about this slide, it cost way too much time to make it, because writing down more complex SQL operations in their full detail is quite cumbersome. It's easier to implement than to write down the definition. But I just want to mention that this P of X is supposed to be the set of all multisets X prime which are a subset of X. It's the power set, just all subsets, but with multisets. And this was written too hastily, so I should just write it again, because I was thinking and writing at the same time. This is a calligraphic P. I mean, you can look at the slide where it's there: P of X, that's the set of all multisets X prime which are a subset of X.
I think I could spend half an hour explaining this definition. This is actually a good example where I'm not sure mathematics is so helpful. You have to implement this for the exercise sheet, but the implementation is way easier than the formal definition. So it's probably easiest to explain this in terms of the implementation. And I will just explain it by an example, because that's very easy to understand. Group by is really easy to understand, but it has a lot of border cases, and maybe we'll talk about some of them. So here I have my movies table in the simplest form, no IDs. It says name here; I think it would be a bit better to call it title, because we called it title before, so it's the title of the movie. Let me also call it title here. And I've abbreviated the movie titles a little bit, so this is All Quiet on the Western Front, but otherwise I deliberately chose short movie titles. So: this movie, in this year. Now this is my query: select something, we will look at this in a second, from this one table, no joins involved, group by year. So I'm grouping by year, which means in my result I will have one row per year. We have just three different years, right? It's 23, 22, 21. By the way, they don't have to be together in the table here. And now for each group I do something, and what do I do? That I have to say here. For the things which I group by, I can just write them here. I group by year, so I just have year here. And for the things which I don't group by, I have to say what to do with them, right? It's easy enough to understand. Let me just highlight this here. Here I have three entries in my result table. So understand why I don't have to write anything specific for year: because I'm grouping by year, all the things in the same group have the same year. So it's clear what to put here.
It's just the common year of this group. But now I have to say which score I should put here. The result is always a table. So here's one thing I could say: max, just take the maximum of these. So it's an aggregation function, which takes a number of values, a set of values from this domain, and gives me a value from another domain. Here it's the same domain, but it could also be a different domain. And here all kinds of functions are available. I could sum them up, I could compute the average, I could concatenate them. Yes, please? Should the header of the resulting table be max score? Yeah, absolutely right. It should be max score, yes. Actually, I'm not sure whether in SQL you are allowed to call it the same name. Maybe someone can look it up. So yeah, it should absolutely be max score. Thank you. Let me just write these. I'm really bad at drawing rectangular things, but it's very hard with this pen. Okay, that's group by. I think it's simple enough. And maybe let's go back to the definition now and try to understand it a little bit. I think it's clear what's going on here. The input is a table. These are the column indices by which I group; here it's the second column. You could have any sequence of column indices here. You could even have year twice if you want; it doesn't make sense, but you could. And here you say, for the other columns, so for example for column three, I also want to have it in my result, and then you have to say how to aggregate all the things in the same group: with max. Maybe let's go back to the previous slide, so we can understand it a little bit. So that's what it says here. So it has two arguments.
These are the columns by which I group, these are column indices, and these are the aggregation columns. So I also have column indices, and for each I have an aggregation function. Think about why that's correct. This is important, and you can understand it. Either a column is in the group by, and it defines what the group is, or it's a column for which I have to aggregate. If it's not something by which I group, I might have several of them, and I have to say how to turn several of them into one. And I must have at least one column by which I group, otherwise the group by doesn't make sense, right? So this M must be greater than or equal to one. I can't say group by nothing. But these can be empty. Actually, this is a special case: if I don't have the max score here and just the year, then this is equivalent to distinct year, right? I just have the distinct years here. So group by something without an aggregation function is the same as distinct. That's another example that you can write exactly the same thing in many different ways. In that case it would be good to just do a distinct. That's group by, and you have to implement it. I think I have some implementation advice. Here's one more detail. There's another keyword, having, and if you've never seen this, you wonder why you need another keyword. Assume I want to show only groups where this max score fulfills a certain property, like only show me the years where the max score is greater than 8.0, so this one would go away. Well, how do I write this in this query? I can't really write it in the where clause, because the where clause goes before the group by clause. The thing is that the max score only exists after the group by, right? It doesn't exist before. I mean, I could write here where year is after the year 2000.
This I could do, I could just write a condition on year here, but I could not write where max score is greater than 8.0, because the max score is only defined by the group by. And that's why I have a having. Here's a having clause. And this is how I could include it in the definition. So now I'm saying my group by creates new variables, max score in this case, and I have to make the same correction here: this is also wrong, this should be max score now. So this is having. If I have a constraint on something that is only created by the group by, I cannot write it in my where clause, which would go here. I have to write it in a having clause, which comes after the group by. That's why you have having. I think it's easy enough to understand. At some point you will need it, and then you will see. How do I implement it? Well, the natural implementation, and you should implement it for the exercise sheet, is this. Just think about a single column with values, then it's easiest. I go over the table and build a hash map where the keys are the years, and I just put all the values in there. So how would the hash map look here? It would look like this: 2023, and I just collect the values while I'm going over the table. Then I have 2022, and while I go over the table I collect the values 8.6, 7.8, 7.7. And then I have 2021, with 8.0 and 7.5. By the way, the rows don't have to be in order for that to work, and they typically are not, right? So that's how you would implement this: you would compute this hash map. And then you make a pass over the hash map, and for each set of values here you just apply your aggregation function, which means you would just apply max here. So if you do max here, you get 8.5. You do max here, you get 8.6. You do max here, you get 8.0. So it's not hard.
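The hash map idea can be sketched in a few lines of Python. This is only an illustration for a single group-by column and a single aggregated column, not a reference solution for the exercise sheet; the function name `group_by` and the sample rows are assumptions (the scores follow the values read out above, and "Placeholder" is an invented title). SQLite is used at the end only to cross-check the result, including `where` before and `having` after the grouping.

```python
import sqlite3
from collections import defaultdict

# (title, year, score); deliberately not sorted by year.
movies = [
    ("Oppenheimer", 2023, 8.5),
    ("Kashmir Files", 2022, 8.6),
    ("Dune", 2021, 8.0),
    ("Western Front", 2022, 7.8),
    ("The Whale", 2022, 7.7),
    ("Placeholder", 2021, 7.5),
]


def group_by(rows, key_col, agg_col, agg):
    """Group rows by one column and aggregate another column."""
    groups = defaultdict(list)
    for row in rows:  # first pass: collect the values per key
        groups[row[key_col]].append(row[agg_col])
    # second pass: apply the aggregation function to each group
    return {key: agg(values) for key, values in groups.items()}


max_score = group_by(movies, key_col=1, agg_col=2, agg=max)

# "having": a filter on the aggregated value, applied after grouping.
kept = {year: score for year, score in max_score.items() if score > 8.0}

# Cross-check against SQLite.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE movies (title TEXT, year INTEGER, score REAL)")
con.executemany("INSERT INTO movies VALUES (?, ?, ?)", movies)
sql_rows = con.execute("""
    SELECT year, MAX(score) AS max_score
    FROM movies
    WHERE year > 2000          -- filters rows before grouping
    GROUP BY year
    HAVING MAX(score) > 8.0    -- filters groups after grouping
    ORDER BY year
""").fetchall()
```

Passing `agg=sum` or a function that computes the average works the same way, since the grouping pass does not depend on the aggregation.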
Sebastian, do we need cost estimates for the group by operation for the exercise sheet? We don't need them. I've written them on the slides anyway, just in case. Now, we have a similar problem. If we need cost estimates, how do we know what the result is? It's very hard to know. It depends on how large the groups are, how often I have each year here, so I can just make some educated guesses. It's not important, I've just written it on the slides. Here's some random assumption. And how hard is it to evaluate the aggregation function? This max here is easy, but it could also be something very complicated. So the cost of this operation depends on the details of the table, the details of the function, and so on. And here's something interesting, and I think you also need it for the exercise sheet, so pay attention. Think about the following. Let's go back to this table. Here I have grouped by year and I have output the max score per year. Maybe let's go to the table without the having, because the having is not important here. Now assume for each year I want the movie which has this maximum score. Think about it, how would you express it in SQL? It's a very simple query: for each year the highest score, that's what I've done here, group by year and then max score; and now I want the movie with that score. So here I would want Kashmir Files, here I would want Oppenheimer, here I would want Dune. And if you think about it, you can't do this with the normal group by. What group by allows you to say is: of these three, Kashmir Files, Western Front, The Whale, give me the lexicographically smallest or largest. What group by does not allow you to express is: of these three, give me the one where a value in the other column is maximal.
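To see the limitation in action, here is a small sketch with Python's `sqlite3` module, using only the three 2022 movies named above with the scores read out earlier. Aggregating the title column with `MAX` picks the lexicographically largest title, independently of the score column, so the two aggregated values in the result row need not belong to the same movie. The sketch only demonstrates the problem; how to get the right title is left open, as in the lecture.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE movies (title TEXT, year INTEGER, score REAL)")
con.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        ("Kashmir Files", 2022, 8.6),
        ("Western Front", 2022, 7.8),
        ("The Whale", 2022, 7.7),
    ],
)

# Each aggregate looks only at its own column within the group.
rows = con.execute("""
    SELECT year, MAX(title), MAX(score)
    FROM movies
    GROUP BY year
""").fetchall()

year, title, score = rows[0]
print(year, title, score)
```

The row pairs the title `Western Front` with the score 8.6, although the movie with score 8.6 in this sample is `Kashmir Files`.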
Group by does not allow you to express that. That's what I want for example, in my example here. Give me, have a column title here where you should write the movie with a property that doesn't only depend on that column but the movie which achieves the maximum score, yeah maximum value in the score column. It's not easy, but it can be done, but with a detour. And that's a nice task for the exercise sheet. You have to think about how you do it. Because that's, and it's interesting that occurs very frequently and you can't do it easily. There is no, how do we get the, yeah, and that's what on that slide, and you have to figure it out yourself, how to do it. Which exercise is it, exercise two or three? Where you have to do that kind of query? It's three, yeah, it's the third exercise. Okay maybe we go to the exercise sheet now to just look at, to motivate you. So the exercise sheet is now more advanced stuff and it's like practicing all the things we have seen. So here you have, and now we don't, yeah, interesting queries. And by the way, we have new data with a little bit more information which we needed to make more interesting queries. Here we have all movies that won an Oscar in any category. The reason for the year restriction is just so that the result is not so big. And here you should play around with cost estimates. We have already talked about them in the last lecture, but you haven't really played with them in the exercise sheet. This is about implementing Groupby and then trying it out for a query. And this is, yeah. And this is the kind of query which we have just seen. You want the query, you want the highest IMDB score for each decade, again decade so that you don't have such a big result, so the best movie of the decade and then which movie it is. And you have to think about how can you do this with SQL given the problem I just said. So there are some questions in the chat. 
We cannot give one distinct title to the score because there could be multiple movies with the same score. Yeah, that's true, that's another complication. If you have multiple movies with the highest score, then you have to say what you want. Do I want all of them or just one of them? You could say give me just one of them. Or more of them, then you would have multiple rows. Okay, good questions. So here's some more SQL. This was just super basic SQL so far. There is much, much more. Let's look at the SQL standard, it's exciting. Here's the standard, and let's look at part one, which is also interesting because it's expensive. That's just part one of the SQL standard, I think it's 100 pages or so, and it costs 189, and that's only if you are a full member. So it's interesting that the standard is not free. It's actually typical. We discussed this yesterday. And there are, one, two, three, four, actually five, six, seven, eight, I don't know. So the SQL standard is huge; SQL is a monster, right? By the way, the C++ standard is similar, if you want to read the whole standard. It's a huge document, and it's not free, because standards committees are a lot of work and cost money. So in this course, we just cover the most basic operations, which are enough for many SQL queries. I already showed you this page here. If you click on select here, and we will see other queries in a second, you always see a nice diagram which shows you how you can build select queries. We have already seen some things: distinct, from, where, group by, having, so some things should look familiar. And there are actually not too many other things. We will see order by, limit and offset in a second. We will not talk about values, we will not talk about with, so it's not, yeah, it's a...
So these I will now explain by example, but just by example, no more formal operations, and they are very easy to understand, especially the first two. For subqueries I have two slides; subqueries are very important. And you need all of them for the exercise sheet. Order by is easy. So far things were always multisets, which means the order of the rows does not matter. But in the end result the order of the rows often does matter. So you can just do it like this. Should I give an example? Why not, let's give an example. If I select star from movies, let's just see whether that works. Let's pipe it into SQLite 3; I think I've already created the database file where everything is in there. Movies.db. No such table, okay. Let's see, do I have my tables in here? No. Oh my, my movies.db is empty. Ah, because I haven't copied it, I'm sorry, I forgot to copy it. I think I have it here: internal code, lecture 4, movies.db. That's why it didn't work. Okay, that takes a while. That's good, it has a certain size; otherwise I would have to import everything. It can't be that large, how big is it? 371 megabytes, that took a bit too long. Okay, yeah, that works. Let's make this a little bit larger, because I don't need the other window. Let's maybe just select title, year from movies. And I just pipe it into head, I just want the first ten lines. So I'm just doing it here on the fly: I pipe this string into this and look at the result. Now I get them in some order, it's probably the order in which they were in the input files, and you see no particular order, neither by the first column nor by the second column. Now let's order by year.
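The live demo runs against the lecture's movies database, which is not available here, so the following sketch uses Python's `sqlite3` module with a tiny made-up `movies` table instead. It shows the three keywords announced for this part: `order by` (ascending is the default, `DESC` reverses it, and a second key breaks ties), and `limit` with `offset` to take a segment of the sorted result. Titles and years are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE movies (title TEXT, year INTEGER)")
con.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [("Movie C", 2001), ("Movie A", 1990),
     ("Movie D", 2001), ("Movie B", 1995)],
)

# Ascending by year; ties within a year are broken by title.
asc = con.execute(
    "SELECT title, year FROM movies ORDER BY year ASC, title ASC"
).fetchall()

# Descending by year, still ascending by title within a year.
desc = con.execute(
    "SELECT title, year FROM movies ORDER BY year DESC, title ASC"
).fetchall()

# A segment of the sorted result: skip 1 row, then return 2 rows.
segment = con.execute(
    "SELECT title, year FROM movies "
    "ORDER BY year ASC, title ASC LIMIT 2 OFFSET 1"
).fetchall()
```

`ORDER BY`, `LIMIT` and `OFFSET` come last in the query, after any `WHERE`, `GROUP BY` or `HAVING` clause.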
Now I get them ordered by year, and because I did head, I just get the first 10. 1875, you see, it's ordered by year. How do I order in the opposite direction? Well, there is ascending and descending. If you don't say anything, it's ascending; I could also write it here, ASC is the same thing. Or I write DESC. And note that this comes at the end: the order by comes after everything else. Now it's descending. Okay, now I get some with an empty year. I wasn't aware that we had those; the empty year is apparently the largest here. Okay, that's order by. Then there's also limit. Here I did it with head; if I don't restrict it, let's go back to ASC again, I get the full result of everything. Maybe I want it to be part of the SQL query to just give me the first 10 rows, and you can just write it at the very end: limit. So title, year of all movies, order by year ascending, always good to write it explicitly, just give me the first ten. There we are. Or just give me the first twenty. These are the first 20. Or maybe I want a segment in between; I could also do that: give me two, starting from offset 10. And this should now be number 10 and 11 here, right? So this is if you just want a segment from the sorted result. I can also do it without order by, but that doesn't make a lot of sense, because I don't know which segment I'm getting. So that's a typical combination: order my table somehow, and then give me a segment of the result. Typically you use offset when you have big data and you want to read it in chunks. Simple enough to understand. And you could also have several keys here, separated by commas; then it's like, first order by year, and for the same year, if you want to break ties, use this column, and so on. That's written here on the slide. Any question about order by, limit and offset? I think it's simple enough. Can you exclude empty year with this line?
Yes, we can. Maybe we should try it together. How do I exclude empty years? Who has an idea? I think probably I want something like not null, right? Who knows how to do this? Where year is not null? I don't know, sometimes in SQL you can just write it. I could write it, but it didn't do what I intended. Maybe it's just the empty string and not null. Yeah, that was it, it was just the empty string. Okay. And that was interesting, because here again you see that the order is important: select, from, then you have all your where constraints, and then at the very end you say, okay, now I want my result ordered like this and just a subset of it. So, important but easy, right? Now I have just the top 10 movies when I order them by an actual year, which is not the empty string. What else? Okay, subqueries are important, but also not really new. Think about everything we did so far, and maybe let me quickly go back to the lecture where we had this. Anything the database does is really this: it gets tables, and it has operations which again produce tables. Everything is tables, tables all the way down. All operations take a table and produce a table. So a select query also gives you a table. So far we had one select query, and it gives you a table in the end. Because it gives you a table, you can also have a select query inside a select query, or many of them. And here's an example. I mean, this is typical for computer science: you have relatively simple concepts, but when you plug them together, you get complicated things. Let's look at that query and try to understand it. Let's first understand it on a syntactic level. I have select from table, join with another table on a join column, we have seen that syntax before, and join with another table on a join column.
But now instead of another table, I just write a whole select query, which gives me a table. So I can just do that, all I have to do is write parentheses around it. Wherever in a SQL query I can have a table, usually I put a table name, I now put a whole subquery and have to give it a name with as. By the way, as we haven't seen that, you can also use that for other table. I could write movies as M and then I could use M instead of movies or something like this. So I think understanding when you can do it is easy. Wherever you can write a table, you can also write a whole select query which gives you a table. The question is what it means. Maybe think about it a little bit and then let's look at what it means, this query. And I tell you how to. Does anybody see immediately what this query computes? Yes? All movies where the director won an academy award? All movies where the director won an academy award, yes. That's exactly right, very good. Okay, let's see how one could get at that. Actually these quickly become frightening because they can become big and you need this for exercise three. And hint, hint for those listening here, I had this, and this is often the answer, we said here if you want this, you want the movie, the highest score per year or per decade and the movie belonging to it. You can't do this with group by alone, you can do it with group by and subqueries. And very frequently when you can't do something immediately, you can somehow do it with sub-queries. So hint hint, that's also the solution here. How do we understand the query? Well let's just look, let's just assume the result of this sub-query were an input table. And let's look what this query does. It just, I mean that that's the subquery itself is easy enough to understand, right? It's just all, I have this awards table and I just look at the persons who have won an Academy Award and each person at most once. So let's just replace this with a table like if it was one of our input tables. 
And then we have this situation. We have our typical movies table, the director who directed which movie and who won an Oscar. I could have called this Oscar winners but then it would have been too long. And now my query becomes the following. It's just, I mean look at the query, it's just joining three tables. Movies, directors and the Oscar winners. And it's even joining them in the natural way. Namely here on movie ID and on person ID. So using what I wrote earlier, this is a natural join. So I'm just joining these three tables together in the obvious way, namely on the join column, movie ID, person ID. What happens if I, movies, directors, Oscar winners, then it's easier to see that this is all movies with a director who is an Oscar winner. I mean that's easier to see. So if you want to understand a query with sub-queries, think about what's the table produced by the sub-query, think about it as if it were an input table and you can do that recursively if you have many of those then it's usually easy to understand. Okay and you need that for exercise three and for this special group by thing. So now we go acid, last part. This is this is quick. Is there any question about this before we go to this last part, which I think will not take long, just five minutes? Okay, let's go on to an acid trip. So dynamically changing tables, so far we always did the following and it's actually a frequent use case. We read our tables and then we ask queries on it, right? And let me just say it again, that's a very frequent use case. For example, Wikidata, which we have seen now several times, many data sets out there, they get updated once per week, maybe once per day if they're smaller, some even once every three months because they're so big. So it's perfectly reasonable. There's a new version. I load this new version into my database, maybe takes a day and then I work with it for three months until the next version comes around. So nothing changes. 
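The subquery discussed above (all movies whose director won an Academy Award) can be sketched as follows. The table and column names (`movies`, `directors`, `awards`, `movie_id`, `person_id`, `award`) and all rows are assumptions made up for this sketch; the lecture's actual schema may differ. The inner SELECT plays the role of the "Oscar winners" input table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movies (movie_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE directors (movie_id INTEGER, person_id INTEGER);
CREATE TABLE awards (person_id INTEGER, award TEXT);
-- Hypothetical rows: person 10 won twice, person 20 won something else.
INSERT INTO movies VALUES (1, 'Movie One'), (2, 'Movie Two'), (3, 'Movie Three');
INSERT INTO directors VALUES (1, 10), (2, 20), (3, 10);
INSERT INTO awards VALUES
    (10, 'Academy Award'), (10, 'Academy Award'), (20, 'Other Award');
""")

# Wherever a table name may appear, a parenthesized SELECT with an AS name
# may appear instead. DISTINCT keeps each winner at most once.
query = """
SELECT movies.title
FROM movies
JOIN directors ON movies.movie_id = directors.movie_id
JOIN (SELECT DISTINCT person_id
      FROM awards
      WHERE award = 'Academy Award') AS winners
  ON directors.person_id = winners.person_id
ORDER BY movies.title
"""
titles = [row[0] for row in conn.execute(query)]
print(titles)
```

Without DISTINCT in the subquery, each movie by person 10 would appear twice in this sketch, once per award row.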
I read the data once and then I query it. But of course there are other applications where that is not so. A typical example is bank data. Banks store who has how much money in their accounts. This changes all the time. People are transferring money. And of course this is stored in some kind of database. Product data: you're a company, you have lots of products, the products change, you buy new products, you sell products, the price changes, the description changes. And of course HISinOne is also a database: people get grades, people join, people leave. That's also very normal. HISinOne just consists of hundreds of tables and the contents change all the time, constantly. And so, just two slides: how do you change rows? There are of course also SQL commands for that, and the basic commands are very easy. Let's just start with our very simple movies table from the beginning. This is very easy, I just wanted to show it to you. There are three things you can do with the table. Add a new row: INSERT INTO the table, and then you just specify the values of the new row. Maybe you want to change a row. Okay, I want to change the row where the title is Inception. Usually you would do this via the primary key, but let's do it like this here. Maybe I want to update the score. You do it like this. Or maybe I want to delete a row or many rows. Delete all rows where the year is before 2000, so all movies before 2000. So for changing the rows, these are the three main commands: insert, update, delete. Then changing the columns. We had this at the very beginning: don't do it. But you can do it. Again, here I don't have a table to go with it. You can add a column. When you add a column you have to say what should be put in the column; I think all null values is the default. You can remove a column, and everything that's in that column will then be gone. And you can rename a column.
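The three row-changing commands and the "add a column" case can be sketched like this, again with Python's sqlite3 and a made-up two-row movies table. The column names, the scores and the new `genre` column are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, year INTEGER, score REAL)")

# Add new rows (hypothetical values).
conn.execute("INSERT INTO movies VALUES ('Inception', 2010, 8.0)")
conn.execute("INSERT INTO movies VALUES ('Old Film', 1950, 7.0)")

# Change a row; in practice one would usually select it via the primary key.
conn.execute("UPDATE movies SET score = 8.8 WHERE title = 'Inception'")

# Delete all movies before 2000.
conn.execute("DELETE FROM movies WHERE year < 2000")

# Changing the columns is possible too: the new column is filled with NULL.
conn.execute("ALTER TABLE movies ADD COLUMN genre TEXT")

rows = conn.execute("SELECT title, year, score, genre FROM movies").fetchall()
print(rows)
```

In this sketch the remaining row has `None` in the new column, which is how Python's sqlite3 module represents NULL.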
And you can change the name and the domain of a column. So maybe I've changed score to rating and it should no longer be real but int. You can also do that. Wait, "careful" has just one L, right? I think that comes from "carefully"; let me correct it. Now it looks like all of this is perfectly reasonable, but I want you to understand this: it's not symmetric. Changing the rows in a table is perfectly normal. Changing the content in HISinOne, new grades or something, that's perfectly normal. Changing the columns is not perfectly normal, right? Think about HISinOne: there are all kinds of services doing something with the table, and they are relying on the structure of the table. If you now say, oh, let me rename that table to something else, let me remove that column, then a thousand things, a thousand people using services on your table, won't work anymore. So this is something where I say: don't do it. There are special cases where one wants to do it, but then it usually has a whole tail of things which follow. So this is not symmetric: think of the columns of the table more as fixed. Next, transactions. This is something I only want to mention because it's important. We don't go any deeper here, but I did want to mention it. These are the last two slides of the lecture. When you change the contents, you change a grade, you transfer money, you have to be very careful about what happens when something goes wrong. In our setting so far, where you just read the data once and then you ask queries, all that happens is that the query fails and you have to ask it again. But if you transfer money and something goes wrong, you have to pay attention to things like the money being subtracted from my account but not being added to the other account. Imagine the bank just said: oh sorry, that happens.
Something went wrong, it wasn't our fault. It just shouldn't happen, right? Subtracted here, not added here. That's very frequent. And so that's why in real databases in these situations where the data is not read only, you have the concept of a transaction. So things which somehow belong together, money being subtracted here and added here should be considered as one unit. And that's where ACID comes from and you could give again a whole course or many lectures about acid, but I just wanted to mention them and you should at least understand them. So atomicity is think of the money transfer thing, either this works or it doesn't, nothing in between. Either the money successfully is subtracted here and added here or nothing happened. There shouldn't be an in between thing because something. And you have to implement this. What happens, you're doing this, your code runs and then something crashes. You have to, you somehow have to take care of this. Then there is consistency, isolation, and maybe I will spend one more minute to explain them. And durability, that's again easy to explain. The transfer succeeded, you go to your web app, your banking app, the money was transferred, it says transfer successful and at that moment the bank computer crashes or something. After it told you that it actually worked, then somehow they have to make sure that it still doesn't get lost because there are actual systems behind them, data being stored somewhere and so on. So actually the Wikipedia article has very nice examples for all four of them for atomicity and I couldn't explain it any better here. And I'm not sure and they give four examples for each of them. So you maybe wonder what you should remember from this, from the exam you should know that acid exists, these four it's just you should just know them and you should understand what the difference between the four is. 
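Atomicity, the "either both happen or nothing happens" property from the money transfer example, can be sketched with a transaction in Python's sqlite3. The `accounts` table, the owner names, the amounts and the `transfer` helper are all hypothetical; the CHECK constraint is only there to make the second update fail in a controlled way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "owner TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, sender, receiver, amount):
    """Move money as one unit: both updates happen, or neither does."""
    try:
        # "with conn" commits on success and rolls back on an exception.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE owner = ?",
                (amount, receiver),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE owner = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed; the credit above was rolled back too.
        return False


ok = transfer(conn, "alice", "bob", 30)       # enough money
failed = transfer(conn, "alice", "bob", 500)  # would make alice negative
balances = dict(conn.execute("SELECT owner, balance FROM accounts"))
print(ok, failed, balances)
```

The second call credits the receiver first and then fails on the debit, so without the rollback the money would have been added on one side only.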
So I would say: go to the Wikipedia page and read the examples. Maybe before we close I should... no, maybe I do it at the beginning of the next lecture, because we're already over time now. Maybe I will come back to this; I don't want to start this from scratch now. But anyway, we're not going any deeper here, but I did want to mention it because it's very important in many applications. Any questions about the material today or about the exercise sheet? Okay, no questions, so have fun with the sheet and see you again next week. Bye bye.

So, welcome everybody to lecture 6, databases and information systems, which of course can also be taken as information retrieval. Today I will first say something about your experiences with exercise sheet 5, which was about more advanced SQL. And there will be no lecture next week. As you maybe remember from the beginning, and I think it's also written on the wiki, since the lectures are a little bit longer than usual (here we almost take two hours every time), there will be two dates, two weeks which are not holidays, where we just create our own holiday. One of these wonderful holidays will be next week, and you also have more time for your exercise sheet. Of course that doesn't mean that you should start one week later; you should start now and then you just have more time for it. It's not that the sheet is twice as much work, it's normal work, but still it's always good to start right after the lecture. So the next lecture will be in two weeks from now. And it will also be kind of a new topic, so that's why we did it like that. So no lecture next week, and then three more lectures before Christmas about something else.
Okay, today we have a slightly new topic. We had two lectures about search engines, three lectures about databases, and today we have a lecture about knowledge graphs and SPARQL. I will explain this and relate it to databases. It's not completely unrelated to databases, actually it's strongly related, and we will see that in the exercise sheet. I will talk about it later. Let's first talk about your experience with the fifth sheet. So, interesting: it was a little more work, it was more advanced stuff. And some of you wanted a new topic. Let's see. "The exercise was very diverse in terms of tasks." "Very interesting how queries are processed by an RDBMS." We looked at that in detail. "I really like the style of the exercises and that we have a nice environment to implement and test our written functions." "I like the sheet even though it was more work than the last sheets." "Very interesting to see which of the sequences of the first exercise was better, because my intuition was different at first." Very interesting: you have some intuition, but that is often wrong, and by just doing the math you get more precise results. "It felt a bit repetitive to write the queries." I think that has to do with the fact that it's already the third database lecture. Then GROUP_CONCAT was required. I think I wasn't aware of that, otherwise I would have explained it, so you had to Google that. When you do GROUP BY, the only thing I mentioned in the lecture is that you have to aggregate: for the columns that are not in the GROUP BY, you have to say how to aggregate them. Let's maybe quickly go through that again so that we all see it again. That was here somewhere for sheet five, the part about GROUP BY. Yes, and then the question is how you aggregate, and of course you have many ways to aggregate. We just talked about MAX, but of course you can also do MIN, or you can take the average, right?
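The aggregation options for GROUP BY, including GROUP_CONCAT, can be sketched on a made-up three-row table. The data is hypothetical, and the sketch assumes SQLite, where `GROUP_CONCAT(column, separator)` exists; the order of the concatenated values within a group is not guaranteed, so the sketch does not rely on it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, year INTEGER, score REAL)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [("A", 2000, 7.0), ("B", 2000, 9.0), ("C", 2001, 6.0)],
)

# Every selected column that is not in the GROUP BY needs an aggregate:
# one value has to be computed from all rows of the group.
rows = conn.execute("""
    SELECT year,
           MAX(score),
           MIN(score),
           AVG(score),
           GROUP_CONCAT(title, ', ')
    FROM movies
    GROUP BY year
    ORDER BY year
""").fetchall()

for row in rows:
    print(row)
```

For the year 2000 group this yields the maximum 9.0, the minimum 7.0, the average 8.0, and both titles joined by the separator.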
The question is: here you group by year, so all these rows have the same year; what do you do with these values? How do you get one value from all three? Of course there is MAX, there's MIN, there's the average, and there is also: concatenate them all together with a separator. That's GROUP_CONCAT, and you had to Google that; I just forgot to mention it in the lecture. And somebody wrote, I think several wrote, "craving for a new topic". I understand that. Of course I should say that in a real database lecture you have like 12 lectures only about the stuff which we now compressed into three lectures. I think these three lectures were needed, but now we actually go to a new topic. And the new topic is knowledge graphs. There will be a lot of examples and demos, and I think it's a very interesting lecture. It's not completely new, but we revised it a lot because now we already know a lot about databases, so it was again quite a lot of work for us and I'm sure there will be all kinds of small mistakes. I will come back to that later. So the Resource Description Framework is really a data model, and we will see what that means in a second. In the database world, your view on data is: data is tables. And now in the knowledge graph world, the Resource Description Framework world, RDF, you say: everything is triples. And what is a triple? A triple is like a simple sentence with a subject, a predicate and an object, and this full stop is here for a reason. It's like a full stop at the end of the sentence. And here are some example triples. Already here, by the way, you see that it also looks a bit like a table. So it's not totally different from databases. Actually it's closely related. You can just say: okay, this is just a table with three columns. The first column is the subject, the second is the predicate, the third is the object. And actually you're right.
In a way it's just that. And you can read these like a short sentence. Nicole Kidman acted in the movie Eyes Wide Shut. So here we are still with the movie theme. Some more examples are the following. Brad Pitt acted in Burn After Reading, the Coen brothers directed Burn After Reading. Who knows the movie Burn After Reading? Okay, some people with good movie taste here. It's very important for the exam, we might ask you questions about movie scenes. No, we won't. But it's anyway an important aspect of the lecture to educate you about good movies. These are good movies. This one is directed by Stanley Kubrick, and so on. So I think that's clear enough. Again we have no order. The input is a set, not a multiset: you can have each triple at most once. Why is this called a knowledge graph? So far I said it's simple sentences, triples. Well, you can view the exact same thing as a graph. And note that the logo of RDF, of this data model, is three things, but they are also connected in a graph-like way. And here's the exact same information as a graph. With a general table you couldn't do that, but when you have triples and the second thing is like a predicate, you can do it very naturally, right? For example, here we have a movie, Burn After Reading, and we have several triples associated with it, and we just draw them as edges. Ethan Coen, the director, directed it, so the predicate is the label of the edge, and here is an entity. So we have all kinds of entities which are nodes in the graph. Here it's directors, actors and movies, but it can be anything. And a triple, that's what's written here, is just an edge. So you can read one edge as: Stanley Kubrick, the subject; directed, the predicate; the movie Eyes Wide Shut, the object. So very naturally you get a graph: a set of triples and a graph are really the same thing.
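The two views on the same data, a set of triples and a labeled graph, can be sketched in a few lines of Python. The triples follow the examples mentioned above; representing them as plain string tuples and the names `triples`, `graph` and `nodes` are choices made for this sketch.

```python
from collections import defaultdict

# A set of (subject, predicate, object) triples: no order, no duplicates.
triples = {
    ("Nicole Kidman", "acted in", "Eyes Wide Shut"),
    ("Brad Pitt", "acted in", "Burn After Reading"),
    ("Ethan Coen", "directed", "Burn After Reading"),
    ("Stanley Kubrick", "directed", "Eyes Wide Shut"),
}

# Adding a triple that is already there changes nothing (set, not multiset).
triples.add(("Brad Pitt", "acted in", "Burn After Reading"))

# Graph view: each triple is one edge from subject to object,
# labeled with the predicate.
graph = defaultdict(list)
for subject, predicate, obj in triples:
    graph[subject].append((predicate, obj))

# The nodes are all entities that occur as a subject or as an object.
nodes = {s for s, _, _ in triples} | {o for _, _, o in triples}

print(len(triples), len(nodes))
print(graph["Stanley Kubrick"])
```

The same structure could equally be stored as a three-column table, which is the view used when relating RDF to databases.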
Now, I don't know who heard information retrieval before. This was always a lecture in information retrieval, but in a very basic form, and now that we do more databases and have already heard more about databases, we do more of the real stuff. What you just saw is like a toy version of a knowledge graph. They don't really look like that in the real world. In the real world, you have things like IRIs. An IRI is like the thing you have in the web browser, which is a URI, except that all kinds of funny characters are allowed, like German umlauts. So an IRI is just a little bit more general than what you can put in a browser. And IRIs are often not human-readable. For example, we've already heard about Wikidata quite a bit. Nicole Kidman has this one, and it looks like a URL, and actually you can also click on it. Let's just click on it, and it's interesting that on some browsers you get this. This is not the link which I typed. Didn't I type Wikidata? Yeah, I typed Wikidata, but I'm on a different browser. I did type Wikidata in Google. Yeah, Google is probably confused by what happened with OpenAI, so they're all going mad at the moment. It's not important, but it's very interesting that I googled Nicole Kidman Wikidata and I don't get the Wikidata link. Or do you see it, am I blind? The one with the picture? Oh, it says Wikipedia on top. Oh yes. That's confusing, it says Wikipedia but it's Wikidata, I'm sorry. Okay, so here we have a URL, and it's not quite that one, but it's important that you have this ID here, Q37459. That's just the ID of Nicole Kidman in Wikidata. I think we have already looked at that a bit in the first lecture, that things have an ID in Wikidata, so I won't do that again. So the identifier is not the name, and this looks like something we already had in databases, right?
It's just an ID, except that now the ID is not only this Q thing but a whole URL, so that it's unique on the whole planet, right? Wherever you type this on the planet, this has a unique meaning. And now here are some example triples, and let me introduce one more thing. These full IRIs, these extended URIs, are a bit long, which is why you abbreviate them. So here, and we will see this later, this wd: is just an abbreviation for this http://www.wikidata.org/entity/. We will see more of that in the following. And there is an inconsistency on the slide with the 1; let me just fix that. There should be no 1 here. And now we have triples. For example, we have the triple: this entity, which stands for Nicole Kidman, has the label, and here is the name. And this is the name in Hindi. This is Devanagari script, I think. Did I pronounce that correctly, anybody? Devanagari, I think. It's just Hindi script, so you have that, it's very nice. You actually have the names in 300 different languages. And I think up here I have some more information. So here I have the triple: this thing, which stands for Nicole Kidman, has this property, which is also an IRI, wdt:P19, which stands for place of birth. We will see more of that. And the place of birth is this, which is again an identifier, Q1809. No, this was a typo, I'm sorry, too many things at the same time. Oh no, the keyboard here is different. 18094, yeah, and we get Honolulu, right? Honolulu is Q18094. So please ask if something is not clear. This triple here just says: Nicole Kidman, place of birth, Honolulu. But everything is expressed in identifiers. So when you look at the data, and one of the data sets we will give you for the exercise sheet will look like this, it will be relatively cryptic.
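How a prefixed name such as wd:Q37459 is turned back into a full IRI can be sketched with a small helper. The function name `expand` and the dictionary `PREFIXES` are made up for this sketch; the two namespace IRIs are the ones Wikidata publishes for `wd:` and `wdt:` (note the `http://` scheme). The angle brackets are the usual RDF notation for a full IRI.

```python
# Prefix table: abbreviation -> beginning of the full IRI.
PREFIXES = {
    "wd": "http://www.wikidata.org/entity/",
    "wdt": "http://www.wikidata.org/prop/direct/",
}


def expand(term):
    """Replace a known prefix by its IRI; leave everything else unchanged."""
    prefix, sep, local = term.partition(":")
    if sep and prefix in PREFIXES:
        return "<" + PREFIXES[prefix] + local + ">"
    return term


# Nicole Kidman (Q37459), place of birth (P19), Honolulu (Q18094).
triple = ("wd:Q37459", "wdt:P19", "wd:Q18094")
expanded = tuple(expand(term) for term in triple)

for term in expanded:
    print(term)
```

Terms without a known prefix, for example a plain string such as a name, pass through unchanged in this sketch.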
The only thing you can sometimes read directly are these things on the right. On the right you can have a string, or here for example you can have a date. Here you even have data types. This will not be important for the lecture, it's just for the sake of completeness. Okay, and this is the date of birth. Please ask any question if something is not clear, otherwise I proceed. Okay, so is this used? Yes, this is used a lot. Wikidata, which we've already seen, is actually now 19 billion triples. This slide is not completely up to date, it doesn't matter. Here are some data sets. Let me just go back to the slide. This was an example with a few triples. And now people have compiled sets of such triples for really huge data. For example, there's such a data set, and we will see it in a minute, I have some example queries, about all the protein data in the world. All the protein sequences, the genes that encode them, to what diseases they are related, on which part of the gene I have what information, all just modeled as triples. So this triple idea seems to be a good idea, because people are using it. So 110, and that's a B, that means billion (I need to get some water), 110 billion triples. You see another interesting thing here: the number of predicates is relatively small, right? That is the vocabulary, so to speak, with which you express your sentences. There's a similar thing about chemistry, all the chemistry knowledge of the world, 124 billion triples. So it's like: this substance has this formula, this chemical property, is similar to this substance. There has been this experiment with a substance in this lab and this is the result. And that's the result if you put it into a mouse or a human. It's amazing, these data sets. We will see a bit more in a second. OpenStreetMap, who knows OpenStreetMap? Who has heard of OpenStreetMap? Great, we will see OpenStreetMap.
It's also a knowledge graph, or at least you can model it as one, 14 billion triples. And the other ones are consortia, so big groups of people curating this. This one is just crowd-sourced like Wikipedia; it's amazing that it works. We will see it in a second. Here's another one, that's all computer science publications in the world, which we have also seen. This is relatively small, but a billion is also large. And actually most companies, not only the IT-heavy ones, have their own knowledge graph. So of course Google, Amazon, Microsoft, but also companies like Walmart, Airbnb, and so on. So now we have data in a slightly different format, and the question is how you query it. There's a query language. We have this data in this triple form, which is like a database with one table with three columns, and there's a language specifically for this, and it's called SPARQL. I have a separate part only about SPARQL, but just to get a first impression, here is what SPARQL looks like. You already see similarities to SQL: there is also SELECT, and now there is WHERE but no FROM. I have a separate part, but let's already understand this query, because you can. You write triples. So just like your data is triples, your query is also triples. That's already different from SQL. If you remember, this is place of birth and this is Honolulu. So I'm looking for people who are born in Honolulu, right? For every triple that matches it, this variable, person, will be a person who is born in Honolulu. And then I also want to match this triple, and when I match this triple, I get the name of the person. So I just write triples and I put variables for the things I'm interested in.
And here I put the same variable because I want the name of the person that is born in Honolulu. And then in the select clause I put my things just as for SQL with the difference there is a question mark in front of variables and SQL we don't have question marks and there's no comma, right? So it's the usual thing. Every language has to do these minor little things differently so that to maximally confuse everyone. Okay, we have a whole part about SPARQL, just a first impression. And again, and here, not again, but I haven't shown it in that detail yet, here you see how these prefixes work, right? So I don't want to write this long IRI here, because maybe I have to write many of them, so I just introduce an abbreviation here, I say WDT, which is, yeah, WD stands for Wikidata, the T I'm not sure, it's just an abbreviation for this, so when I have this P19, I plug P19 here, and the angle brackets, that's just what RDF does, it's just put IRIs and angle bracket to distinguish them from strings, has no other meaning. Okay, so first this is all introduction. And this is just for your information. This is just for your information. So one of the strengths, so what you may ask yourself or you will ask yourself in the following, what's really the difference to databases? Think about combining different databases. You have made your database about movies, your friend has made another database about movies, totally different tables, I mean you've done this for three exercises now. The tables, the information is the same but the tables look completely different. It's very hard to put your two databases together if your tables are completely different. Just think about it. And this is in practice, it's a huge nightmare. You have data from this hospital, from that hospital, they all have patient data, patient diagnosis, what they did at the temperature they took on that date, but in completely different formats, you can't get it together. 
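One way to read the "born in Honolulu" query is as a query on a single three-column table of triples, where every triple pattern becomes one copy of that table and a shared variable becomes a join condition. The sketch below assumes such a table named `triples`, uses `rdfs:label` as the name predicate, and adds one made-up person with hypothetical `ex:` identifiers; it is one possible translation, not necessarily the one used in the lecture or on the exercise sheet.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)"
)
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("wd:Q37459", "wdt:P19", "wd:Q18094"),         # born in Honolulu
    ("wd:Q37459", "rdfs:label", "Nicole Kidman"),
    ("ex:person2", "wdt:P19", "ex:city2"),         # hypothetical, elsewhere
    ("ex:person2", "rdfs:label", "Someone Else"),
])

# SPARQL query being mirrored (prefix declarations omitted):
#   SELECT ?person ?name WHERE {
#     ?person wdt:P19 wd:Q18094 .
#     ?person rdfs:label ?name .
#   }
sql = """
SELECT t1.subject AS person, t2.object AS name
FROM triples AS t1
JOIN triples AS t2 ON t1.subject = t2.subject   -- same ?person in both
WHERE t1.predicate = 'wdt:P19' AND t1.object = 'wd:Q18094'
  AND t2.predicate = 'rdfs:label'
"""
rows = conn.execute(sql).fetchall()
print(rows)
```

The constants in the triple patterns turn into WHERE conditions, and the variables in the SELECT clause turn into the selected columns.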
Now in RDF, in this triple world, all you need is to put the two together, and let me just show you one example because it's so nice. Let's take Germany, the country we're in at the moment. Germany is Q183 in Wikidata, and here I have all the triples about it in Wikidata, saying it's a state and so on, here's an image. And now somewhere in here, and that's very typical, the people who entered it say: I also know there's this huge knowledge graph or data set OpenStreetMap, and there Germany also exists, and there Germany has the ID 51477. And we can just click on it, because these identifiers are all like URLs. If we click on this (I don't know why it takes so long), now we are in OpenStreetMap. In OpenStreetMap, Germany is 51477. It's called a relation because it's several things together; that's not important now. And you also see that you have key-value pairs here, which are also like triples. So it's like saying: Germany has the ISO code DE, Germany is at admin level 2, which means it's a country, Germany has this currency, the euro, and so on. And if we go further down, we also see that we have some information here which we also have in Wikidata, like the name, but typically it's other information. And here it also says: oh, by the way, in Wikidata this is Q183. So that's very typical for these data sets: each data set knows of other data sets and says, by the way, in this other data set this is the ID. So OpenStreetMap says: in Wikidata this is the ID. Wikidata says: in OpenStreetMap it's that ID. And that's also just triples, right? For example, here I have written them out for you. In Wikidata, in the input data, when we look at it as a set of triples, we will have this Wikidata entity, and then there is a special predicate which has this funny name.
It's just a predicate for "this is the OpenStreetMap thing", and this is related to this. And then in the OpenStreetMap data, if you download it, you will have this triple: this thing, which is Germany in OpenStreetMap, has this ID in Wikidata. So one of them would be good enough to connect the two, but you usually have both of them. Let me just show what's possible, just one first glimpse. So this is now SPARQL; we haven't talked about it a lot. Now I can write one query, and you see some elements here of the things we have already seen, and it's just one query which queries both data sets. Let's just do that. Okay, that was quick. And this is the power network of the EU. Let me just zoom in here. So here I have a power line which is probably somewhere in OpenStreetMap. Yeah, so that's a power line, it even says how much voltage it has and so on. And why is that possible? I just said: take this part from OpenStreetMap and take this part here from Wikidata. I couldn't have done it with a database without knowing a lot about the schema. And here's a quiz question for you, if I zoom out. Without knowing much about SPARQL yet, look at this map for three seconds. What do you think this MINUS is? Any idea? Just guessing from the language, from the query. And why? Think about it. You don't know a lot about the language yet. It's the power network of the EU, let me go back here. And why do I need to combine the two? OpenStreetMap has the information about the power lines; you don't have geographic information like this in Wikidata. Wikidata knows which countries are in the European Union; this information is typically not in OpenStreetMap. So this part is asking which countries are in the European Union, and this part gives me all the power lines, and then it's just combined. So what do you think? Why the MINUS?
While you're thinking, let's just comment it out and ask the query again. Okay. What's the difference to the picture before? Can you say it again? Did you see the difference or did you not see it? Ah, now you see it. Okay. Yeah, you have to be careful. It's not easy to ask this correctly. So this here says "is a member of the European Union", and this says "minus the countries which have been a member but for which there is an end date", right? There's one country with an end date and I have to subtract that. Otherwise I get countries which have been in the European Union but no longer are. No UK, that's correct. So this kind of thing is possible with knowledge graphs, and there are big knowledge graphs out there, so it's pretty fascinating stuff. Now let's talk a bit more about the query language. So SPARQL, what is SPARQL? You already see SQL in there, and that's deliberate: it's SQL, but a bit longer. SPARQL is defined as "SPARQL Protocol and RDF Query Language". So SPARQL stands for something where the S stands for SPARQL. That's called a recursive acronym, right? The S in SPARQL stands for SPARQL, think about it. In the last lectures we were a little bit more formal about SQL, not about all parts but about the basic SELECT FROM WHERE. This we defined pretty formally, and then we proceeded more by example. SPARQL we will just learn by example, because we just have one lecture and because it's enough, as you will see. And like for SQL, the basic language is actually very simple. A basic SQL query, SELECT FROM WHERE, is quite easy to understand, but then of course you have this huge standard around it. The nice thing is that for SPARQL, the standard is freely available.
So here, this W3C, it's from the World Wide Web Consortium, the w3.org, and they just have a standard here. So if you like to read RFCs or standards, here's the document which just defines everything, including, I think at the very end, let me just find that if you are interested. So this is really the complete standard, it's just one big HTML document. Yeah, the grammar is also here somewhere. Here you have the grammar in some Backus-Naur form. So it's all there. It's one big document, actually not completely unreadable. SPARQL and SQL are essentially the same, like two programming languages are the same, which doesn't mean it doesn't make sense to have both of them. And so the last part of the lecture and also half of the exercise sheet will be about translating SPARQL to SQL. Now of course there are engines, and once again it's similar to programming languages, which process SPARQL directly. But there are also engines which are just database engines that take a SPARQL query, translate it to SQL, and then execute the SQL query. It's very instructive to understand how this is done, and it also exists in the real world. It's surprisingly easy. And it will be exercise one of the exercise sheet to implement it. Very interesting exercise. So let's take one example query, let's maybe go back to this data here. So this is data containing information about who acted in which movie, and also who is married to whom or was married to whom. Here we don't have information about the time of marriage. Nicole Kidman was married to Tom Cruise at some point. Let's just look at "acted in" and "married to". And now let's go back to this query. So look at this query, you don't know a lot of SPARQL yet. What do you think that query will output? Think about it, you can write it in the chat or just tell me, yeah?
Like married couples that have both played in the same film. That's correct, married couples that have both played in the same film. That's exactly what the query computes. And I have another slide, but let me already say here how to think about this. You're looking for assignments to these variables such that all the triples here exist. That's what you do, right? I'm looking for two persons and a film so that all these triples exist in my graph. That's it. So I want this person to have acted in a film, and this other person to have acted in the same film. That's why I have the same variable here, right? If I write film one, film two, I get something different. And the two people are married to each other. And even without having a formal definition of the semantics, you can understand what this query does, especially after you've seen a few more examples. And we will see more examples, and even more on the exercise sheet. Like in SQL, the result is always a table. So here you will get a table with three columns, person one, person two, film. One row for each person, person, film tuple. And that's what I just said, so this is like the semantics of SPARQL in one sentence. It's not formal, but I think it's enough for this lecture and exercise. You have one row for each assignment to these variables so that all the triples here exist. So that's SPARQL. Quite a nice and intuitive query language. So, no questions. And what will help us to understand the semantics more is that in the last part we will translate SPARQL to SQL, and then you see what the corresponding SQL query is. And here's another way to understand SPARQL queries. So let's look at that query again.
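The one-sentence semantics above (one result row for each assignment of the variables such that all triples of the query exist in the data) can be sketched in a few lines of Python. This is only an illustrative brute-force matcher, not how a real SPARQL engine works; the toy triples, the predicate strings and the helper names `is_var` and `match` are assumptions made up for this sketch.

```python
# Toy data in the spirit of the movie example (rows chosen for illustration).
TRIPLES = {
    ("Nicole Kidman", "married to", "Tom Cruise"),
    ("Tom Cruise", "married to", "Nicole Kidman"),
    ("Nicole Kidman", "acted in", "Eyes Wide Shut"),
    ("Tom Cruise", "acted in", "Eyes Wide Shut"),
    ("Tom Cruise", "acted in", "Top Gun"),
    ("Meryl Streep", "acted in", "The Iron Lady"),
}


def is_var(term):
    """Variables start with a question mark, as in SPARQL."""
    return term.startswith("?")


def match(pattern, triples):
    """Return one dict per variable assignment for which every
    triple of the pattern exists in `triples` (brute force)."""
    solutions = [{}]
    for pat in pattern:
        extended = []
        for binding in solutions:
            for triple in triples:
                new = dict(binding)
                ok = True
                for p, t in zip(pat, triple):
                    if is_var(p):
                        # Same variable used twice must get the same value.
                        if new.setdefault(p, t) != t:
                            ok = False
                            break
                    elif p != t:
                        ok = False
                        break
                if ok:
                    extended.append(new)
        solutions = extended
    return solutions


# The query from the lecture: married couples who acted in the same film.
QUERY = [
    ("?person1", "acted in", "?film"),
    ("?person2", "acted in", "?film"),
    ("?person1", "married to", "?person2"),
]

rows = sorted(
    (b["?person1"], b["?person2"], b["?film"]) for b in match(QUERY, TRIPLES)
)
for row in rows:
    print(row)
```

Because `?film` appears in two triples, both actors are forced to share the film; replacing it by `?film1` and `?film2` would drop that constraint, which is the "I get something different" case mentioned above.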
Since it's triples, and I told you in the beginning that triples are very naturally viewed as a graph, you can also view a query as a graph. So this is my query, which just consists of three triples, so I have a graph with three edges. My nodes are now variables, or at least some of them; some nodes could also be fixed things, but here they are variables. So we have a person and another person, married to, acted in, acted in, and this is the same film, so they both point towards the same thing. And so you can think of answering the query like this, and I just leave it for you to imagine: think of the data as this big graph, and now I'm taking this as a template and looking where it matches in my graph. Where do I have the structure of two people married to each other acting in the same movie? And there's a problem in mathematics which is doing exactly that, you find subgraphs with particular properties. And actually that's the reason I'm saying this first, because it's useful to think of it like that, and also in the standard you find a lot of talk of patterns and graphs. It's often called knowledge graphs, graph patterns, and that's the reason, because it's like pattern matching. And that's what I just said. So wherever you have a match in the graph, you just read off the three values for the three variables, and that's one row in your result. So this was again with our toy examples, let's go to the real things. In the real world we have IRIs, so this would be a more typical query. And let's maybe go to a real SPARQL engine now and let's type a query here. So let's say we are looking for people who are born in Honolulu. So first I'm writing a triple here, I think I also call it person here. And here you see something, I will talk more about it in a second.
Now I want to know place of birth. I could go to Wikidata and look up the ID for that, but here I can just type place of birth and it tells me that's a wdt, maybe let's make it a bit larger. Okay, it looks a bit funny now. Can you see it like that in the last row? Is that large enough? Yeah? Okay, good. So now this is the predicate wdt:P19, and I even get a mouse-over that it's the place of birth. And I want a particular place of birth. I even get suggestions here, they're even ordered, so that probably means that most of the people in Wikidata are born in Prague for some reason. Here we have Honolulu and indeed it's Q18, 0. So if I were to do that, now I just get all people, and here I can write star, which means all the variables. You can also do that in SQL by the way, but we haven't done that, so let me just do person here. Now what will this compute? All people who are born in Honolulu, right? And here they are, and I don't get the people, but I get their IDs. So 1032 people, let's click on one and check it. So that's Puanami van Dorp, who is human. Okay, place of birth Honolulu, yeah, so it's correct, it's in the data like that. And now maybe we also want the name, and here it would be like this, so there's a predicate which is called label. Now why is it called rdfs and wdt? These are just abbreviations, right? Instead of this I could just equivalently write the following, just so that that is absolutely clear. It's really just notation. But that's just not nice to read, right? That's now a full IRI. Maybe let's leave it like that for a second, and now I have, yeah, it's called person label, maybe let's call it person name. I can call the variables however I want. And let's also output the person name. Okay, why do I get the person so many different times, do you think? Why do I get it so many different times?
So if I go here, do you see the difference? It's actually the name in different languages, right? It just happens to be the same. So these are not very well known people, so the names are just all in Latin script. But if we go to a more, I don't know, let's take, okay, Albrecht Durer, the painter here. If you look at all the entered languages, now we also get other scripts, right, Devanagari and Cyrillic and so on. Of course in most languages it's just written the same. So you just have the names in many different languages. There's something, it's not important for the sheet, but I'm showing it to you. Now you can say: give me just the things in the English language. Now I just have the people born in Honolulu with the name in English. Now I have each of them once. And again, see, this should remind you of databases, right? Since it's a table and I have the name four times, the person is just repeated four times, like in a join, cross product sort of thing. Okay, maybe I also only want those with Hindi names. Okay, nine people, for nine of them I have the Hindi names. Or maybe I want the Arabic names, that was "at", doesn't exist, I think "ar". Arabic names, I have 309, and so on. So it's amazing, right, and it's actually there in 200 other languages. All this information is in Wikidata, this one data set, here we see it by the way, 19 billion triples. People I think have problems with this, right, when it's millions, billions, you easily lose track of how large it is. These three more digits, that's quite a difference, right? 19 billion is really a lot. Let me just, I think I showed this world wide web size page. There's this page which measures the size of the World Wide Web, like the real pages with content, and it's always around 50 billion. So a billion is a lot, even nowadays.
Okay, so nice, so now we have that information, and here again, this is just a full IRI. I could just abbreviate this as rdfs:label. That's why you always see it like this. This is exactly the same query, just nicer to read, and here you just have all the abbreviations defined. Yes, please. [Question from the audience:] And then I'll just use them like variables, and then can I define my own, with my own names? I think it doesn't make sense, but whatever. So what exactly is the question? You want to define your own what? The prefixes. Oh yeah, absolutely, very good. You could just call this, hoo hoo hoo. What did I do? I don't understand this. Okay. This UI is buggy, so if you want to do a project with us, this UI needs work. It's a great UI, but it needs work. Yeah, this is not correct. This is what I wanted to do. It's a very good question. The question is, does this work? And the answer is yes. It's perfectly okay. I crashed the backend. Okay. But I'm professional, so let's just restart it. It's our own engine. So, the clever scripts in the, okay, I'm not in the, wow, what happened there? Let's see, yes, so you see it's live, this here is our server running this engine. It's starting, you also see the data here, actually it has even more triples internally, 26 billion. And let's see whether it's up again, doing all kinds of things. Yeah, and people are using this of course, I immediately get, yeah, and there it's back up again. So yeah, I can just call the prefixes however I want, but typically you call them how they are called in the input data. Okay, and here I just have these again for your reference. So these are prefix declarations. Of course there's also ORDER BY and LIMIT, and this I think is very easy to understand because it's also in select queries, just with these parentheses.
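That prefixes are "really just notation" can be made concrete with a small Python sketch that expands a prefixed name into its full IRI. The three namespace IRIs below are the standard Wikidata and RDF Schema ones; the function name `expand` and the self-chosen prefix `hoo` are assumptions for illustration only.

```python
# Prefix declarations as a plain mapping from prefix name to namespace IRI.
PREFIXES = {
    "wd": "http://www.wikidata.org/entity/",
    "wdt": "http://www.wikidata.org/prop/direct/",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}


def expand(name, prefixes=PREFIXES):
    """Turn a prefixed name like 'rdfs:label' into the full IRI it abbreviates."""
    prefix, local = name.split(":", 1)
    return "<" + prefixes[prefix] + local + ">"


print(expand("wdt:P19"))
print(expand("rdfs:label"))

# The prefix name itself is arbitrary: a self-chosen name bound to the
# same namespace yields exactly the same IRI, hence the same query.
own = {"hoo": "http://www.w3.org/2000/01/rdf-schema#"}
print(expand("hoo:label", own) == expand("rdfs:label"))
```

This matches the answer to the question above: you may name a prefix however you like, as long as it is declared with the namespace IRI used in the data.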
So here I can just say people born in Honolulu, let's also include the date of birth and let's order it. Let's maybe write that right here. And let's do it like this. And now I can just say person and I also want the date of birth. Okay, this is still warming up. That's why it takes the time. See here, it's the warm-up queries for the autocompletion. Actually the queries which give you suggestions, this box here, is also the result of a SPARQL query on the same data, which is quite ingenious if you think about it. So you get the suggestions for things which you can type here from the database by just asking a query to the database. This box here is just the result, it's a table, right? It's the result of a SPARQL query on the database on which I am searching. Date of birth, so I just have a variable for date of birth. Let me include it here. Now I get people born in Honolulu with their Arabic name and date of birth. And now, very often you want to order things somehow, and this is how you can do it: DESC, date of birth. And note the difference here: in SQL, you would have written ORDER BY date of birth, without the question mark, and DESC as a suffix. But these are the typical syntactic differences to confuse you. So now I get the youngest first. So this is the youngest person in Wikidata born in Honolulu which has an Arabic name. In case you didn't know, it's Scott Moore. Okay, very good to know. You learn a lot from this, so much interesting information. Here's a very incomplete list of differences. I already told you more than those which are listed here. Here are some principal differences. I mean it's very similar. In SQL you have FROM, in SPARQL you have WHERE, and you can even leave out the WHERE. You don't have a FROM. Why don't you have a FROM? Because you don't have these many tables. Think about it, right?
Your input is really just one big table of triples: subject, predicate, object. So it's not even called a table, it's just called the input database. In SQL you have these different tables which are connected, primary key, and then a foreign key and so on. You don't have that here. You just have one big table. So you don't need the FROM clause, you don't need tables. You don't have explicit joins. In SPARQL, and of course there are more constructs, we have already seen a FILTER, let me go back to the slide, queries are really just triple, triple, triple, where at any place you can write a variable. And if you use the same variable twice, it means something. This means I want the name of that same person and the date of birth of that same person. And actually, maybe you already get an idea now: when you have the same variable here and here, this is exactly a join, and we will see that in the last part. This 100% corresponds to a join operation, and actually if you click here on the analysis tree you will see join operations. I've already shown it in one of the last lectures. When you process this you also get sequences of operations. More about that in the last part. And then there are these minor differences: you have no comma between variables, let me write that here, and variables start with a question mark. And it's just so that when you see a query you can immediately see, oh, that's SPARQL. And funny differences like this, right? In SQL you would write, that's of course wrong, oh, why didn't it go to that, I'm sorry. You would write this without the question mark, like this. Yeah, these minor differences, right? This is how you would write it in SPARQL: ORDER BY DESC, and the variable with a question mark. And in SQL: the variable without a question mark,
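The remark that using the same variable twice "is exactly a join" can be illustrated with Python's built-in `sqlite3` module. This is a hedged sketch of the general idea only, not the translation scheme from the exercise sheet: the single `triples` table, the plain-string predicates and the invented persons are all assumptions for illustration. Each triple of the query becomes one copy of the table, and a repeated variable becomes an equality between columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The whole input is one table with three columns: subject, predicate, object.
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        # Invented toy data; dates are ISO strings so they sort correctly.
        ("p1", "place_of_birth", "Honolulu"),
        ("p1", "name", "Person One"),
        ("p1", "date_of_birth", "1950-01-01"),
        ("p2", "place_of_birth", "Honolulu"),
        ("p2", "name", "Person Two"),
        ("p2", "date_of_birth", "1990-05-05"),
        ("p3", "place_of_birth", "Prague"),
        ("p3", "name", "Person Three"),
        ("p3", "date_of_birth", "2000-01-01"),
        ("p4", "place_of_birth", "Honolulu"),
        ("p4", "name", "Person Four"),
        ("p4", "date_of_birth", "1970-03-03"),
    ],
)

# Query in toy SPARQL-like notation:
#   SELECT ?name ?dob WHERE {
#     ?person place_of_birth Honolulu .
#     ?person name ?name .
#     ?person date_of_birth ?dob
#   } ORDER BY DESC(?dob) LIMIT 2
#
# One table copy per triple; the shared variable ?person turns into
# the join conditions t2.subject = t1.subject and t3.subject = t1.subject.
SQL = """
SELECT t2.object AS name, t3.object AS dob
FROM triples AS t1, triples AS t2, triples AS t3
WHERE t1.predicate = 'place_of_birth' AND t1.object = 'Honolulu'
  AND t2.predicate = 'name' AND t2.subject = t1.subject
  AND t3.predicate = 'date_of_birth' AND t3.subject = t1.subject
ORDER BY dob DESC
LIMIT 2
"""
rows = conn.execute(SQL).fetchall()
for row in rows:
    print(row)
```

Note how the SQL side needs a FROM clause and explicit column equalities, while the SPARQL-like side expresses the same thing just by repeating `?person`.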
descending afterwards, but it's exactly the same thing. And very easy, there's, oh, I don't have LIMIT, yes, I had LIMIT here, I'm sorry. I just didn't talk about it. LIMIT is of course the same thing. I order by, and now I say, give me the first 10 or so. Let's also do that in our query. Here I have 350 names. Maybe I just want the first 10. Now I only get the first 10, right? Simple enough. So this has the exact same syntax. Okay. [Question from the audience:] Who checks these triples for correctness? Yeah, so this is of course an interesting topic of research: is my data meaningful, correct, free from contradictions? It's not, actually I think we might see an example in a second, but it's surprising how good the quality is. Like Wikipedia, right? You read the Wikipedia article, is it correct? It might be completely made up, it's written by people, you can edit it if you want, but still the quality is quite amazing. And just to make that point, where is my, yeah, here. I mean this is OpenStreetMap. This used to be the business of large companies which made a lot of money. So Google, in their first 10 years, I think Google Maps was there from the beginning, bought their map data from a company. You would see copyright by Tele Atlas, which was a big company, and they paid millions, tens, hundreds of millions of dollars I think for this service. And nowadays these companies don't exist anymore because you have this information. And this is just done by people, right? Somebody walks around somewhere and says, here's a street, here's a building, up to the level of, let's go here, here's a tree. Let's see, yeah, this tree, number 727, it says leaf cycle, okay, twice per year, leaf type, natural, maybe even count the number of leaves, whatever you want.
I mean the level is amazing, right, this is crowdsourced, and look at this: it's the buildings, it's the streets, it's everything, it's the individual trees, and not just in Germany or in Europe, but you can go anywhere, let's go here, and you have the same thing. It's not the same level of detail overall, but it's pretty amazing. Okay, so last, just some example queries and then we have a break. I think we already had that one with the first names. Let's just take Sebastian. It's just people with a certain first name in Wikidata and where they are born. Okay, Sebastian seems to be a Central European name, in case you didn't know. Notable things that happened today. Okay, so let's just ask for things that happened today, on this date in some year. So the month is 11 and today we have the 21st. I have no idea what the result is. So this one takes a little longer, so you can see what's happening. Still querying. Oh yeah, you even see where, so it's busy with this join. Now it's finished, okay. Oh my, Vladimir Putin was baptized 1952. And Voltaire was born, and North Carolina was founded, and some Pope was born. Oh my, what a day. And Björk was born. I don't know if Björk, the Icelandic singer, knows that she was born on the same day as Vladimir Putin was baptized. And Aston Villa was founded. Okay, so all this interesting information, right, that's Wikidata. Here's something that's actually very useful, so that's all German universities. I mean these data are really useful. So getting a list of the universities, how many universities do we have in Germany? About 100, here you can see them on a map. This kind of information you don't get very easily. It's clear that it's out there, but how do you get it? Well, if you search a source like Wikidata, you get it. So these are the universities in Germany. So universities, not Fachhochschulen, and their student count and where they are.
So you can also see that distribution, they are fairly well distributed over Germany and so on. So that's really useful information. Yeah, let's look at UniProt. Oh, I see, I can't show it to you because I'm just building a new index. Oh my, let's see whether I can fix that. Oh yeah, it's just building a new index here. 110 billion triples. It's just indexing the latest data set, which is fairly big. 111 billion, 100 million. Let me just look at the, this takes a while. It's using, hmm. Okay, if I now start this, I risk killing everything, but I will do it anyway. I don't know. Maybe let's come back to this in a second. Okay, yes, PubChem, information about chemical data. So these are just chemical substances which are processed by the body easily because they have a small molecular weight. So here we have paracetamol for example. That was just an example. Here is again something: tell me any country or bigger region in the world, it can also be the Black Forest or whatever, a region, a country, a city, something big. Australia, okay. Australia, that's big. So let's see all the streets in Australia. Okay, this is wrong here. This is asking for all the streets in Australia in OpenStreetMap, 2.6 million, let's look at them on a map. All the streets in Australia. Yeah, that looks correct. If I now go here, that should be a street, and a piece of street in OpenStreetMap. Interesting, right? The distribution is also interesting. I mean, it's a bit lopsided. In Germany, it would probably look different. Let's do it for Germany. I mean, that's a lot of data. I don't know if this works. All streets in Germany, all pieces of street. Let's do that. And look at how fast this is. 14 million streets. Let's look at them on the map. Drawing 14 million streets on the map, is that possible? Yes, it's possible. So that's the street network of Germany.
So here, yeah, you also see Berlin, here's the Oberrheingraben. If you go in here, you see the individual streets. Okay, fast food restaurants in Germany. That's again combining information from Wikidata and OpenStreetMap, let's see. Ah, interesting. There used to be a mistake here, which is no longer there, that's a pity. There used to be a ferry line which was mistakenly labeled as McDonald's, so you would always have these things here. It's no longer there. You would have this nice map, and then you would have one ferry line, which was just a mistake in the data. But yeah, if you do these queries, and you can play around with them yourself, you will see very few mistakes. There are mistakes, but, so these are all fast food restaurants in, I don't know, does this work? That's the wrong query, right? Yeah, that's just the wrong query. And the nice thing here is it's always the same, you just put in the data set. You have very many different data sets here, Wikidata, proteins, chemistry: same query language, same principle, you can even combine them, that's very nice. Okay, and I think it's time for a break now, and then we'll resume in five minutes. We are back online. There was a question about a particular query, and an answer to that. So the link to this engine is in the slides, and of course feel free to play around with it to get a feeling for the language. And I also got this query to work here. I just started it, it wasn't started. And just look at this enormous number. That's the thing about numbers, you add a digit and it's a factor of 10, you add three digits, a factor of a thousand. This is quite a large number. This is 110 billion, which means 110 thousand million. And this is also amazing: this is running on this one machine here, which is just a 1000 euro desktop, nothing special. And it's running here.
And it's just a query which, yeah, it has a GROUP BY. Let's just look at it for a minute and appreciate the enormousness of the data. For every organism it just counts all the proteins that we know. So here we have Gammaproteobacteria bacterium, and there are 1.8 million proteins known for it. Here's an example protein. Here's the amino acid sequence, just the enormity of the data, right? Just to show an example. And here we just ordered the organisms by how many proteins we know about them. So here we have human immunodeficiency virus 1. We also know a lot, and I would expect that somewhere down in the list we find Homo sapiens probably, yes, there we have it, 68 to 207, 892 proteins where we know everything. So enormous amounts of information, right? And this query basically groups, processes all the information, all these triples. Okay, and maybe I kill this now to not impede my index building process here. Just building something new. So two more parts. Reification. That's important for the exercise sheet and it's also important to understand, but it's simple enough. I have brought a simple example. How do you model complex information? I mean it's always nice, you have a very nice model, that's how we started. Everything is just triples, but maybe it's too simple. Maybe I just cannot use it for everything because some information does not fit into this framework. So let's go back to our movie theme. Meryl Streep is famous for having won many Oscars, three of them. It's quite hard to win three Oscars, and here are the three movies she won them for. Kramer vs. Kramer, that's the old divorce drama, 1980, Best Supporting Actress. Sophie's Choice, 1982, Best Actress. The Iron Lady, a portrait of Margaret Thatcher, 2011, Oscar for Best Actress.
Now the question is, so Meryl Streep won all of these three, I haven't written it yet, it's the Oscar, how do I cast this into triples? So let's just make an attempt. Here's an attempt. So if I do it like I did it on the first slide, okay, she won an award. This Oscar, yes, she won it. She won this award. I just abbreviate it here. Now I want to say she won this award in a certain year, right? This is not necessarily the year of the movie, that's the year where she won the award. So let's take the predicate: Meryl Streep won an award in the year 1980, and she won an award for this film. So what do you think about this? Is this a good way to model this information? I mean certainly the information is there, but what do you think? Yeah, exactly, we are losing information, right? We see here she won an award, but we don't have the connection to which film. That's the problem with triples. We're kind of losing the connection between things. So we know she won an award in three years for these three movies. We also don't have the connection here. The order doesn't mean anything. And here's even a slightly subtle thing. She won Best Actress twice, but you can't write the triple "Meryl Streep, won award, Best Actress" twice. It's a set of triples, not a multiset. It just says, yeah, she won an Oscar; whether she won it once, twice, multiple times, you don't know from that data. So that's somehow not the right way. So, yeah, that's what we just said. And it seems that this information is just inherently different. If you think of "won award", that's called a predicate, and in the RDF world it's a binary predicate, right? Because it has a subject and an object. So from the view of the predicate I have two pieces of information which I connect, which is why I also have the graph view, right: I have an edge going from one entity to another. So it seems to be suitable for binary predicates.
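The two problems just described (the connection between award, year and film is lost, and a set of triples cannot say "twice") can be checked with a small Python sketch. The three awards are the ones from the slide; the predicate strings and the helper `objects` are made-up names for illustration.

```python
from itertools import product

# The three Oscars as listed in the lecture: (award, year, film).
AWARDS = [
    ("Best Supporting Actress", 1980, "Kramer vs. Kramer"),
    ("Best Actress", 1982, "Sophie's Choice"),
    ("Best Actress", 2011, "The Iron Lady"),
]

# Naive modelling: three independent triples per award, all with
# Meryl Streep as subject. RDF data is a set, so duplicates collapse.
naive = set()
for award, year, film in AWARDS:
    naive.add(("Meryl Streep", "won award", award))
    naive.add(("Meryl Streep", "won in year", year))
    naive.add(("Meryl Streep", "won for film", film))

# Nine triples were added, but "won award Best Actress" exists only once.
print(len(naive))


def objects(predicate):
    """All distinct objects for a predicate, in a stable order."""
    return sorted({o for (_, p, o) in naive if p == predicate}, key=str)


# Every combination of award, year and film is equally consistent
# with the naive triples; the true pairing cannot be recovered.
candidates = list(
    product(objects("won award"), objects("won in year"), objects("won for film"))
)
print(len(candidates))
```

Only 3 of the 18 candidate combinations are the real awards, and nothing in the naive triples tells you which ones.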
But here this award thing, for the kind of information I just showed you, seems to be more than binary, right? The award connects this person, Meryl Streep, to a particular award, Best Supporting Actress, in a particular year, for a particular movie. So this here looks like a 4-ary predicate, right? This here is the predicate, connecting four pieces of information. It looks like I want to write the following in my data. Not a triple, but a quintuple, right? So this corresponds to a 4-ary predicate, if I write it like this. That's a predicate which just connects four things. Actually, in the graph view that would be a hypergraph. Yeah? Oh yeah, that's true, it's 2011, right? For some reason this loses track of 2011, I think. Oh and there's this nice auto already here in these forms. Clippy. So is it 2011 according to my, yes, 2011. So yeah, in a database that would of course be no problem. We just have a table, and we can have as many columns in a table as we want. But in the knowledge graph, RDF world, we are somehow constrained to a table with three columns. And that's how you do it. Actually, if you think about it for a while, you would also get there, but let me just tell you. You use additional entities which just serve as a hub for more information. And let me immediately show it to you in the real world. So this is how it looks. And let's break this down and understand this, because this slide kind of explains reification. Let's just take the information that Meryl Streep won an Oscar, and actually a particular Oscar, Best Actress in 2011 for the movie The Iron Lady. Here we just have, by the way, why is it called @prefix, with this funny at sign? This is because this is input data here, hence @prefix, not PREFIX. A bit strange, but that's how it is.
In the SPARQL query we had PREFIX in capital letters; when you have it in the input data you write it very similarly, but like this, just in case you wonder. So these are just abbreviations, we have already seen that. Now we have some new abbreviations, p, ps, pq. What do they mean? Not so easy to understand, but we are explaining it now. And now it says Q873, and let's just go to Wikidata to check what these things mean. Actually, let me do it the other way around. I can tell you that Q873 is the entity Meryl Streep. So this is Meryl Streep. P166, what's P166? You can also just type it here. Okay, it's not like that, I have to google it, the URL is apparently a little bit different. Property talk, oh yeah. Oh, it's Property colon, the URLs are a little bit different. That's the predicate for receiving an award. And it has a funny prefix, p. Let's forget that for a second. So this now goes to some new entity which has this long name, wds, it's some artificial entity, and this entity now stands for a particular award of Meryl Streep. So this predicate leads me to an artificial entity, called a mediator entity, or a statement node in Wikidata, which is just created. It's just a node that stands for the information about a particular award. And from that node, now I have triples with that node as subject. And this ID is actually longer, we will see it in a second in real life, for the real data it's actually an even longer ID. And so now I have the triples. Okay, this particular award has, and now again I have 166 for award, and this here is the Oscar for Best Actress. This here is information about when the award was, and here it says 2012, that should be 2011, but actually it's a, and this I think is not easy to understand when you see it the first time. So this here stands for the predicate P585, we can look up 585. What's the property? It's just point in time. There's a predicate for something at a certain point in time.
This particular award has point in time 2011. Then there is 1686, let's look at 1686. That stands for something given for something, for some work. So you see the vocabulary is important. The Wikidata people have thought a lot about this. So we have a predicate for "this was given for this work here". And these are now my triples. So Meryl Streep has won this award, and that's now not the name of an award, it's a new entity which I introduced, which stands for a particular award, and with this entity as subject I have all these kinds of information. And I think it's useful to draw a little picture here. So I have Q873, and now I have this one thing here, this is now this wds, this is the statement node, and from this statement node I now have all kinds of additional information, right? So I want to connect this to this, and I just introduce this intermediate node, and then this information is all connected, because it all belongs to this node, and this problem goes away. And I think you have to try this yourself to really understand it. So I'm just giving you the example and how it works. And I think a good way is to just try to write the query for it. I think that also helps. So let's type Meryl Streep here. So Q873, and now let's say we want her awards. It's P166, we have already seen that, and now we get her awards. If I just put a variable here and write it here, I get a list of the awards, not their names. Actually here's a way to automatically add the names, just adding some triples to the query. So that's just all the awards she won, but now I don't have the information for what she won them, right? And now look what happens if I write a p here. I just choose a different predicate, and now this predicate doesn't go to the name of the award but instead goes to these funny things here. So this is now Meryl Streep.
This is award received, but a variant of it; technically it's a different IRI. And each object here stands for a particular award of Meryl Streep. Let's just output them, let's take a variable for them. So these are these artificial identifiers, each of them stands for a particular award, and these are also URIs, so let's click on one of them. If I click on one of them, you see that in the database this actually stands for this whole award, right? So I can actually go to this link. They wouldn't have had to do it like that, it's just very interesting that they did. So this stands for a particular award, and I'm on the Meryl Streep page here, right? You can also click on another one here. And now it's this, OK, honorary doctor of art. It doesn't even have to be an award for a movie, right? Wow, I didn't know that she got an honorary doctorate from Harvard. It's a bit strange. Silver Bear for Best Actress, and now you have additional information. It's really quite clever, right? So why these different prefixes? WDT, we have seen that. Let me just do it again here. If I have Meryl Streep and I just want to know which awards she got, did she get an Oscar or not, I don't care for what, then I just use this WDT. This leads me directly to the information, the name of the award, there are no statement nodes involved. P leads me to an intermediate node from where I can then go to more information. And PS, let's also do that now. If I do the P thing here, I take a variable here. Now I go from here and take this as a subject. Oh, this is not doing what I, OK. Now PS, this leads me from the statement node to the actual name of the award, so this is now a detour.
If I just wanted that information I could have taken WDT directly. But from that statement node I get to the name of the award like this. And now I also have more information, like the point in time. This has another funny prefix, PQ. Now I don't want the point in time, I want the for work here. Now you see you get movie names here, so I can just say movie here, and so on. And we could continue this, for example: I only want Oscars. So this should be, and there we have it, Academy Award. Now I have award received, movie, let me automatically add the name triples, and here I have the three awards, and each of these is the object which leads to that information. So there's this naming scheme: I have several versions of a predicate, they are all called P166 but with different prefixes, so technically they are different IRIs. One leads directly to the main information, one leads to a statement node with more information, one leads from the statement node to the main information, and one leads from the statement node to other information. You have to try it out yourself to fully understand it, and also the ingenuity in this, right? So here P1686, which is the predicate that something was given for something, you can use it not just with awards, it can be useful in other contexts. Maybe one more example: let's take Germany and the population, that just comes to my mind. P1082, now I get the population of Germany, it's 83 million, right? But maybe I want to know when this population was measured, so let's just do the P thing here. And now you can see there are actually 37 population measurements in Wikidata.
Each of these things now stands for one measurement. So PS now leads me to the actual population, and then there's also point in time: when was this measured? If I want a certain population measurement I can pick it, or maybe I can just take all of them, right? It's just a date, and now I can just take the population and the date and order by date. Now it's ordered by date, and I have the population measurements over time. So you see in 454 it was 70 million. All that data is there, so you can model everything. And the way they are named: technically these are just abbreviations, each prefix stands for a different IRI part, but they have this common suffix so that you see that they all belong to the same thing. It's really quite clever, but you have to play around with it yourself. Yes, please. Can you go back once? Yes. So those four triples at the end are the input, and if I want to search for something like that, or some combination of those, then I replace some of those with a variable name? Exactly, that's exactly what I did, yes. So this is the input, and in my search here, again it's the same thing: I'm searching for assignments to these variables so that all the triples here exist. So I'm looking for statement nodes and populations and dates so that this effectively gives me what we've just seen.
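The idea of "searching for assignments to the variables so that all the triples exist" can be tried out on a toy scale. The following is a minimal sketch, not the engine used in the demo: the statement-node IDs (`wds:s1`, `wds:s2`), the dates, and the population figures are made up for illustration, `Q183` is assumed as the Wikidata ID for Germany (it is not named in the lecture), and the prefixes are written in lowercase with a colon.

```python
# Toy triple store with Wikidata-style statement nodes.
# Statement-node IDs, dates and figures are illustrative, not real Wikidata values.
TRIPLES = [
    ("wd:Q183", "wdt:P1082", 83_000_000),   # direct: Germany -> population
    ("wd:Q183", "p:P1082", "wds:s1"),       # Germany -> statement node
    ("wds:s1", "ps:P1082", 83_000_000),     # statement node -> main value
    ("wds:s1", "pq:P585", "2020-01-01"),    # statement node -> point in time
    ("wd:Q183", "p:P1082", "wds:s2"),
    ("wds:s2", "ps:P1082", 82_000_000),
    ("wds:s2", "pq:P585", "2000-01-01"),
]


def is_var(term):
    """Variables are strings starting with '?', as in SPARQL."""
    return isinstance(term, str) and term.startswith("?")


def match(patterns, triples, binding=None):
    """Yield every variable assignment under which all patterns exist as triples."""
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        new = dict(binding)
        ok = True
        for p, t in zip(first, triple):
            if is_var(p):
                # Bind the variable, or check it against an earlier binding.
                if new.setdefault(p, t) != t:
                    ok = False
                    break
            elif p != t:
                ok = False
                break
        if ok:
            yield from match(rest, triples, new)


QUERY = [
    ("wd:Q183", "p:P1082", "?statement"),
    ("?statement", "ps:P1082", "?population"),
    ("?statement", "pq:P585", "?date"),
]

rows = sorted((b["?date"], b["?population"]) for b in match(QUERY, TRIPLES))
for date, population in rows:
    print(date, population)
```

The shared variable `?statement` is what ties a population value to its date; with the `wdt:` predicate alone there would be no node to hang the date on.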
It took me a while to understand this. I think no amount of explanation can make it clear to you immediately, you have to play around with it. But it's really clever, and the clever thing is, look, this P585, this is point in time, and you can have a point in time for a lot of information, right? I think one of the queries on the exercise sheet is the monarchs. That's when something started. There are also predicates for something started, something ended. So this here is not a predicate that's in any way specific to population, it's just a predicate that can be used as additional information for all kinds of statements. Very clever. And of course, this is what we have just seen, so I'm just showing it again here. The concept is very easy, but now formulating queries: that's the query which I've just shown you, which we have constructed together, right? That's the query for getting all the Oscars of Meryl Streep, this one here. Let me maybe remove the statement node, I don't want it in my result, right? These are the three Oscars. This looks like a simple query. You could type it into Google, and you would even get the right result, I think, because it's so popular. But you had to write a pretty complex SPARQL query and you have to know a lot of information. I summarized this for you. What do you need to know to formulate such a query? That's basically what you have to understand about SPARQL. You have to know the right prefix definitions, but of course a tool can help you with that by just putting them there automatically. You have to know these things: OK, I'm searching for Meryl Streep, Q873. I can go to Wikidata and look it up, we saw the autocompletion.
This is the hardest one: how do you know P, PS, PQ? That's really hard, even for experts. And then you have to know these additional language constructs. You don't need them for the exercise sheet, but I've shown them to you, right? If you don't do that, you get the names in all 300 languages and you don't want that, so this just filters by language, and so on. So even for a relatively simple query, you have to know a lot. But if you have understood this once for one query, then you have actually unlocked a whole world. And this is just an appetizer for the next lecture in two weeks. So what you have seen here, what I've used all the time, was our own engine. Many students like you have already contributed to this. Also this UI started from a great student project. So you see here, now I'm looking for Oscars. How am I supposed to know what the ID of the Oscars is? I already see it in the list here. The list also depends on what I've typed so far, it's context sensitive. But I could also just type the first letters here, and now I get Q19020. I could have also searched for it here. They're called the Academy Awards, it's Q19020. So it helps me with that. This, by the way, is on GitHub if you want to have a look. It's a big repository, very active. And we have many thesis and project topics related to this. So if you want to contribute to a big open source project that's actually used, with many users and many applications, that's a great way to do it. People love this, and students have already contributed a lot to it. This offers autocompletion, and we have seen that that's very important: you type something and you get suggestions which make sense.
And this is actually something we will talk about in two weeks, because we skip next week. This autocompletion thing, if you talk about information systems, and this is databases and information systems, you need it in all kinds of contexts. You know what you want to type, you just don't want to type the whole thing, you want a suggestion. So how do you do that? That will be the topic in two weeks. OK, so last part: how do we translate this to SQL? We will do this by example, so it's just five more slides, and for the exercise sheet you implement it yourself. First, when we want to translate to SQL, we have to say how the data is stored in our database. And for this example, now pay attention, let's just consider two predicates, because I just want couples who played in the same movie. So let's say I have two tables, one for who acted in which movies, and I'm not using any IDs here for the sake of simplicity. So Sean Bean played in Game of Thrones and so on. Let's see how big it is, it's just something I've compiled. OK, not so bad: 600,000 pairs of who acted in which movie. And I also have something like who is married to whom, or has been married to whom, 73,000 pairs: Barack Obama, Michelle Obama, Ronald Reh, so it's somehow ordered. Oh, we see Anne Hathaway again here. Albert Einstein was married twice, we don't have the information at which time, and so on. For now just two different tables. That's important: we have one table per predicate here. That's one natural way if you have a knowledge graph, just one table for each predicate, so you don't need the name of the predicate here, right? I don't have to say married to, because this whole table is only about married to.
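As a minimal sketch of the one-table-per-predicate layout just described, using Python's built-in `sqlite3` module: the table and column names (`acted_in`, `married_to`, `person`, `film`, `person1`, `person2`) are assumed spellings, and the two rows are only the examples mentioned in the lecture, not the real 600,000 and 73,000 pairs.

```python
import sqlite3

# In-memory database; the lecture data itself comes from TSV files.
conn = sqlite3.connect(":memory:")

# One table per predicate: the predicate name is the table name,
# so each table only needs two columns.
conn.executescript("""
    CREATE TABLE acted_in (person TEXT, film TEXT);
    CREATE TABLE married_to (person1 TEXT, person2 TEXT);
""")
conn.executemany(
    "INSERT INTO acted_in VALUES (?, ?)",
    [("Sean Bean", "Game of Thrones")],
)
conn.executemany(
    "INSERT INTO married_to VALUES (?, ?)",
    [("Barack Obama", "Michelle Obama")],
)

tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
print(tables)
print(conn.execute("SELECT * FROM married_to").fetchall())
```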
For the exercise sheet you should put everything in one big table, like it is in the real world. So you don't have a different table for each predicate, you have one big table with three columns: subject, predicate, object. We deliberately do it like this so that you can, or have to, do some thinking yourself. The exercise sheet is really nice, Sebastian put a lot of work into it. So we have these two tables, I think that's clear enough, married to and, let me show it again, acted in. This is my SPARQL query. And now I want to know the corresponding SQL query, and you have to tell me how it works. The exercise sheet will be to make that general and implement it. It's not trivial, but it's easier than one might think. So maybe think about it for a little bit, and as usual let's start with the FROM clause: which tables do we need? That's the last intellectual effort we do today. Yeah, you have an idea. I mean, we probably need both tables, right? So let's just write acted in, comma, we are back in SQL world, married to. And by the way, let me do this right away, this is something we haven't really introduced, but it's quite simple: you can introduce abbreviations, AS a, so when you want to refer to that table you can just use the abbreviation. Very simple. AS m. OK, so we have that table a and that table m, and now we want WHERE. Suggestions? What should I write here? So a has just two columns, person and film, that's acted in, and married to is person1, person2. So what's the problem? How do we get the two people here connected, right? One way if you're stuck: you certainly want to say something about m.person1, right?
Equals to something, and then you probably want to say, that's clear, person2 equal to something. I mean, you can act just like a language model and just write something and see if it makes sense. You just have this one column person here. It's somewhat clear you need both tables, so let's just write both tables. And now we probably need some constraints here. person1 is this person, person2 is this person: does this make sense? What would be the result if I do that? Let's just type the query while you still have it in your head, like a chess board. Let's just import. Oh, what comes first, the TSV or the name of the table? Oh no, this is not what I wanted. Oh my, quitting SQLite is hard. I'm sorry. sqlite3, import acted in. Oh no, I forgot the separator, then it doesn't work. I'm sorry that I'm interrupting your thinking. Separator tab, import the acted in TSV into acted in, takes some time, that's good, then the married to TSV into married to, and schema. OK, that looks good, we have two tables like this. Let's now do SELECT * FROM acted in AS a, comma, AS is just the abbreviation, married to AS m, WHERE m.person1 is equal to a.person and m.person2 is equal to a.person. We don't really know what we are doing, but let's just do it. What will be the result of that query? What do you think? You should always be able to predict what comes out. OK, I see. Before I go to what's suggested in the chat, what will be the result? Yes? I think it's the married couples where each one of them played in a movie, but not necessarily the same one. OK. It's empty. That was a good guess. I also wasn't quite sure, but SQL is also not so easy. If you just read it like math, a.person is the same here and there, so it's like saying m.person1 is equal to m.person2, which can't happen.
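The empty result can be reproduced on a toy table. This is a minimal sketch with placeholder names and assumed table and column spellings (`acted_in`, `married_to`), not the lecture data: since both conditions compare against the same row of `acted_in`, they can only hold if `person1` and `person2` are the same person.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE acted_in (person TEXT, film TEXT);
    CREATE TABLE married_to (person1 TEXT, person2 TEXT);
    -- Placeholder data: a married couple acting in the same film.
    INSERT INTO acted_in VALUES ('Person A', 'Film X'), ('Person B', 'Film X');
    INSERT INTO married_to VALUES ('Person A', 'Person B'), ('Person B', 'Person A');
""")

# First attempt from the lecture: only one copy of acted_in, so a.person
# would have to equal both m.person1 and m.person2 in the same row.
first_attempt = conn.execute("""
    SELECT *
    FROM acted_in AS a, married_to AS m
    WHERE m.person1 = a.person AND m.person2 = a.person
""").fetchall()

print(first_attempt)  # prints []
```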
So this is not the right way to do it. Yeah, only people that are married to themselves, which are apparently rare in this data set. Somebody wrote: can we join the acted in table with itself to get all pairs of actors? Oh yeah, joining with itself, let's do that. We can have the same table twice. I think we didn't do that so far, but nothing speaks against it. You can just take the cross product of a table with itself. Let's do it here and then write it on the slide in the end. So let's take the acted in table and compute the cross product, the Cartesian product, with itself. That's a big cross product, right? And let's add A1.film is equal to A2.film. Now I combine every row with every row where the films are the same. What do I get now? What's a row? Yeah? All pairs of actors who acted in the same film, exactly. And let's just look at it, with a LIMIT, that's the nice thing here. OK, so this is one row from the table and one row from the same table, and the movie is the same. So I could just select A1.person, A2.person, since the film is the same anyway, because I have an equality constraint here. So now I just get pairs of people who acted in the same movie, right? And now it's easy, I think. I just also add married to, no, not here, let's add another table, AS m. And now I just add the constraint. What constraint do I need now? Now I have this information, and now there's also the information about married couples. What do I add? Let's see what the chat says. Oh yeah, there's a suggestion: m.person1 equal A1.person and m.person2 equal A2.person.
OK, and there was also the suggestion A1.person not equal A2.person. That's just to avoid people married to themselves, but we already saw that we don't have them. Let's see what we have now. Something is wrong? Does the table also contain the reverse information? I think so, right? Yes, we see it's always like this. I didn't put it in a particular order, but yes. And it makes sense, right? One easy trick to avoid that, I'm just mentioning it here, would be to output them in lexicographic order. So I want A1.person less than A2.person, they are not identical anyway, and now I just get them in the order where the alphabetically smaller one comes first. Let's just verify: Laura Dern, Jeff Goldblum. I'm watching the movies, I'm not following who's married to whom. Oh yeah, apparently they were married. Yes, romance found a way on the set of Jurassic Park. OK, so it seems to be correct. These actors, they act in a movie and then they marry. OK, let's just write this here, and it's also something to think about yourself. I think we can leave some of it here. So this is FROM acted in AS, I want this twice, and married to once, and then I have m.person1 is A1.person, I think that's what I want here, and m.person2 is A2.person. And please ask if there are any questions. It's just one more constraint, which one is missing? I'm just writing the obvious stuff at the top. Same movie, yeah, that's correct, same movie, so A1.film equal to A2.film. And let's not forget the semicolon. OK, and now we have to come to an end.
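The query assembled above can be run end to end on a toy database. This is a minimal sketch with placeholder people and films and assumed table and column spellings; it shows the self-join of `acted_in`, the two constraints that tie in `married_to`, and the optional `a1.person < a2.person` trick that keeps each couple only once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE acted_in (person TEXT, film TEXT);
    CREATE TABLE married_to (person1 TEXT, person2 TEXT);
    -- Placeholder data. Ann and Bob share a film, Cid and Dee do not.
    INSERT INTO acted_in VALUES
        ('Ann', 'Film X'), ('Bob', 'Film X'), ('Cid', 'Film X'), ('Dee', 'Film Y');
    -- married_to contains both directions, as observed in the lecture data.
    INSERT INTO married_to VALUES
        ('Ann', 'Bob'), ('Bob', 'Ann'), ('Cid', 'Dee'), ('Dee', 'Cid');
""")

BASE_QUERY = """
    SELECT a1.person, a2.person, a1.film
    FROM acted_in AS a1, acted_in AS a2, married_to AS m
    WHERE a1.film = a2.film          -- same movie
      AND m.person1 = a1.person      -- first actor is one spouse
      AND m.person2 = a2.person      -- second actor is the other spouse
"""

# Without an ordering constraint every couple shows up in both orders.
both_orders = conn.execute(BASE_QUERY + " ORDER BY a1.person").fetchall()

# Keeping only the alphabetically smaller person first removes the duplicate.
once = conn.execute(
    BASE_QUERY + " AND a1.person < a2.person ORDER BY a1.person"
).fetchall()

print(both_orders)
print(once)
```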
And now the question is, and it's on the slides, but you don't have to look at the slides if you want to try it yourself: how do you turn this into a general algorithm? I give you a query with any triples here, with some variables anywhere really, and you should come up with this query. I'm giving you a hint now; if you don't want to hear the hint you can close your ears. It's no coincidence that I have acted in here twice. I have two triples with acted in here, and I have the table here twice. I have married to once. So for each triple here, I have a table here, the table of that predicate. Or if you have everything in the same table, you just have one copy of that table per triple. And now, why do I have m.person1 equal to A1.person? Well, that's because I have the same variable here in two different triples. The same variable in two different triples corresponds to a join. So I have person1 in two triples, that's why I have this constraint; I have person2 in these two triples, that's why I have this constraint; and I have the variable film here in two triples, which is why I have this constraint. So the general algorithm is actually quite simple, but you have to do the exercise to understand it. That's the point of the exercise, and I really don't want to give too much away. For the exercise sheet you will have everything in one big table. If you have k triples, you just repeat it k times, and then you keep track.
Sebastian has done the parsing of the query for you, so you just keep track: OK, a certain variable occurs in these triples, and you have to remember at which position, whether it's subject or object, that's important. And then you just add all these equalities. And just as a last hint: if you have the same variable three times, you need two equalities to say that they are all equal, but that's on the slide. One more thing: you have to write this in Python, so you should write a program where you enter a SPARQL query, it's automatically translated to SQL, and then it's executed on a SQLite database, on data which we give you, which I will show in a second as the last thing I do. We didn't do that yet, calling SQL queries from Python, but it's very simple. Here's just a piece of code showing how you do it, how you call SQLite conveniently from Python. And please think about whether there are any more questions. I just want to give a glimpse of the exercise sheet. It's again a new lecture, a lot of work, also for Sebastian doing the sheet, because we did something new. We could have just given you the whole Wikidata data set, and you don't need super expensive machines for that, but 19 billion triples is maybe a bit too large. And getting nice subsets which are suitable for an exercise sheet is a lot of work, so we are still not quite finished, but we have a first version. Oh yeah, there it is. So you should implement this SPARQL to SQL translation: you get a query of this form and you translate it into a SQL query. Sebastian has already written the parsing for you. And then we give you two versions of a nice data set, and here's one version.
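The hint above (one copy of the table per triple, one equality per repeated variable, n - 1 equalities for n occurrences) can be sketched as follows. This is one possible reading of the hint and not the master solution: the function name `sparql_to_sql`, the table name `triples`, and the input format (already parsed triples as tuples of strings) are assumptions, and the data is placeholder data.

```python
import sqlite3

COLUMNS = ("subject", "predicate", "object")


def sparql_to_sql(variables, triples, table="triples"):
    """Translate already-parsed triple patterns into one SQL query over a
    single (subject, predicate, object) table. Returns (sql, parameters)."""
    occurrences = {}          # variable -> list of "tN.column" references
    conditions, params = [], []
    for i, triple in enumerate(triples):
        for column, term in zip(COLUMNS, triple):
            ref = f"t{i}.{column}"
            if term.startswith("?"):
                occurrences.setdefault(term, []).append(ref)
            else:
                # Constants become parameterized equality conditions.
                conditions.append(f"{ref} = ?")
                params.append(term)
    for refs in occurrences.values():
        # A variable occurring n times needs n - 1 equalities.
        conditions += [f"{refs[0]} = {other}" for other in refs[1:]]
    select = ", ".join(occurrences[v][0] for v in variables)
    tables = ", ".join(f"{table} AS t{i}" for i in range(len(triples)))
    sql = f"SELECT {select} FROM {tables}"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql, params


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("Ann", "acted in", "Film X"),
    ("Bob", "acted in", "Film X"),
    ("Ann", "married to", "Bob"),
    ("Cid", "acted in", "Film Y"),
    ("Dee", "acted in", "Film Z"),
    ("Cid", "married to", "Dee"),
])

sql, params = sparql_to_sql(
    ["?p1", "?p2", "?film"],
    [("?p1", "acted in", "?film"),
     ("?p2", "acted in", "?film"),
     ("?p1", "married to", "?p2")],
)
rows = conn.execute(sql, params).fetchall()
print(sql)
print(rows)
```

Passing the constants as `?` parameters instead of pasting them into the SQL string avoids quoting problems with names that contain apostrophes.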
So we give you one version where you don't have these funny names, P something, Q something, but nice names like in the toy example in the beginning. You also don't have reification, like the statement nodes, and it's also not so big. Not so big doesn't mean small, but it's only 14 million triples, a tiny data set. And then you should figure out: all CEOs born in Germany, with gender and birth date, ordered. And then we give you another data set. It's not yet finished, we're still working on it, but it should be finished tonight or tomorrow at the latest, where you have another query and where you have the real data with these P and Q names and statement nodes and everything. So I really urge you to do the exercise to really understand this stuff and all the subtle things we have seen. And once again, no lecture next week. You also have two weeks for the exercise sheet. We have a lecture again in two weeks. Are there any questions right now about anything? OK, then that's it. Thank you and see you again in two weeks.

So, welcome everybody to lecture 7, databases and information systems, which can also be taken as information retrieval. Fewer people than usual, end of the year atmosphere. So I will say something about your experiences with the last exercise sheet, which was two weeks ago. I hope you enjoyed the free week, or at least free from this lecture. It was about knowledge graphs and SPARQL, and today we will talk about fuzzy search: edit distance, q-grams, q-gram index, and so on. And the exercise sheet will be to implement this with a very nice data set, which I will show you in a second. First about the last exercise sheet. RDF SPARQL is awesome, interesting exercise, also challenging for some. Here are some quotes. It has been very challenging but fun.
So we have been doing three lectures on databases and then knowledge graphs. Knowledge graphs kind of build on databases. It's like databases below, but with a more meaningful standard and format. Needed time to get used to knowledge graphs and SPARQL? Yes, that was one of the points of the lecture. Fascinating way to structure and query data. Somebody updated the engine to add support for OPTIONAL, and I will say in a second why. Exercise three was challenging and a bit annoying, because you were just given the data set with these Wikidata style IDs, the IDs are just numbers, and you have to find out what they mean. Very curious to see how exercise three can be solved with one query: I will show you in a second. Overall this sheet was very difficult for me: so some people were struggling a bit. Would be nice if the sheet changes would be communicated: Sebastian, we did communicate the sheet change, right? I think we did, we always pay attention to that. It would be nice if you subscribe to the forum. And yeah, the youngest CEO, that was the query from exercise two, is a convicted scammer. So let's look at that. Let's go to the solutions, which are also online now. Let's go back one, solutions, sheet seven. Ah, this is sheet seven, that's what you have to do now. This is sheet six. There we are. What I already did for you, because that takes a little while, is to create the databases from the TSV files which we gave you. Let's have a very quick look at this. And absolutely, if you didn't do the exercise, you have to do it at some point, because I think without doing the exercise you don't really understand this stuff. And for sure this will be a topic in the exam. So this was the one data set, just triples, but triples which you can read, so cites work, Dungeons and Dragons and so on. And then there was the one which we called complex, we could have also called it realistic.
This is the data like you get it out of Wikidata, so you just have IDs here. We could search for this in the data set, or maybe just grep it, and somewhere there will be a triple saying which label this has. Ah, OK, this now also gives us IDs which continue with more digits, so maybe here we need a tab. Yeah, OK, so here: Changeling: The Dreaming, not sure what it is, maybe a movie or something. You just get these IDs, and then you have other triples which say what the name of this thing is. Oh, here we see, a tabletop role-playing game. So finding out what things mean in this data set is a bit more complicated, which was exercise 3. OK, so these were the data sets given to you, and the first step was just to build a SQLite database out of them. I already did that just before the lecture started. It takes a little while, like one minute or so, longer for the complex one, faster for the simple one. Let's maybe also quickly count. These were not so small data sets: that's, I think, 60 million, and let's look at the simple one, 17 million. And then the task was to write a Python program where you enter a query. So this was the correct SPARQL query for finding the CEOs. It was people who are CEOs, you had to find that there was a predicate chief executive officer, so pretty obvious. Date of birth, country, United States, because there were just more of them, gender, and then order by birth date and take the first 10. So that was the query. And then your task was to write a script, SPARQL to SQL dot py. Let's just take a look at the arguments. It takes two arguments: the database, the SQLite database on which to ask the queries.
So that would be in this case the Wikidata simple DB, which I have prepared, and then just the SPARQL query which you see above. Now this SPARQL query is translated to a SQL query, the task was to do that automatically, and here you get the result. And the funny thing is indeed that the two youngest of each gender are convicted. Yeah, they did something which they shouldn't have done. Elizabeth. I don't know what that says, but that's just how it is. OK, and now the second one, the British monarchs. It was already written on the exercise sheet that it's not so easy, and in the master solution we also have three queries. So this was one query, and you have to find out that you have this predicate P39. Let's look that up in Wikidata. I'm not sure how you find predicates here, let's just do it like this: P39, position held. You have to find that out. Another way to do this is that you take someone you know, like maybe Queen Elizabeth II, and then you look in the data where it says what she was, and then you find, oh, it's position held, and it says monarch of this, monarch of that, and so on. So this is how you could have found this out, or you just search your way around the data set. So you see she's monarch of a lot of things, it's called the Commonwealth, and somewhere down here, so many that it's hard to find, is the United Kingdom, and if you look at the mouse-over here it's Q9134365, and this is indeed what we have here in this query. And now let's execute that query in the same way by feeding it into, oh no, I have to call the Python script, and now the Wikidata complex DB. And this is already a lot of technology coming together here, right?
We have our SQLite database, we could have even taken our own database, but we made it simpler here. So you have database technology here, you put in the SPARQL query, it's translated to SQL and executed on that database. Not complicated, but it uses a lot of technology. So it's translated to this query, and now indeed you get Elizabeth II, but it doesn't go further back than George V, where it says 1927. And the reason is that it changed over time what Britain is. I think several of you found that out. So if we look at the second query here, now here it says Q1, let's just look that up in Wikidata. We enter this. This is now monarch of the United Kingdom of Great Britain and Ireland. So before that, Britain was that, and if we ask that query, now we get everything from George V back to George III, but it still doesn't go all the way back. So you have a third query, and let's look at what the Q is here. They all have the same form, but it's always monarch of something else. It's a complicated thing, this British monarchy. Now it's just monarch of Great Britain. So you have monarch of Great Britain, monarch of Great Britain and Ireland, and then monarch of the United Kingdom, I think. And if we do that, now we get back to Anne of Great Britain. And what was before that, that was the last question on the sheet, is something completely different. And now the big question was: how do you get all this in one query? Many of you said you tried but had a hard time doing this. So what's one way of doing this? Well, one way is you look this up and then you look whether up here it says this is part of something bigger. And then you find out, for example, that this is part of monarchy of Great Britain, or part of British royal family. We click on this, British royal family, Q645968.
And indeed, I think there was a hint in the last three triples of the data set, so it's a little Easter egg here: the last three triples actually say that these three things, monarch of this, monarch of that, monarch of that, are part of the British royal family. And so indeed you get them all with this one query. Let me just do this here. So now it says position held is something, so you introduce a variable here, and the something is, and this is the predicate for part of, British royal family. Or maybe you could have also taken monarchy of Great Britain, I think that would have also worked, but maybe the triples were not in there. So let's try that query by just feeding it into our script. So that's the query, that's the SQL translation, and now we get all of them back to 1707. And that's actually quite typical, and that's also why we did this as the exercise: you have to understand quite a bit about the collection to find this information. But it's also very interesting and you learn a lot about history. So what happened, why doesn't it go further back? I mean, there are kings in this area of the world which go back to before 1000, I think. I think there is a list of English monarchs. Let's do a little bit more history. The English monarchs start in 849, and then they go until, yeah, 1707, with Anne. So what happened in 1707? Actually several of you researched this a bit. Anne of Great Britain came to power, it seems, yeah, that's true, that happened in 1707. The Acts of Union were passed, so I think that's what happened. Before that you had the Kingdom of Scotland. So you have this here, you have the monarchs of England, and they go back to 895, and there is a list of Scottish monarchs, you probably also have, yeah, you also have a list.
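Coming back to the one-query solution for a moment: here is a minimal sketch of what "translate the SPARQL pattern to SQL and run it on SQLite" can look like. It assumes a single table of triples with made-up column names and made-up identifiers. This is not the exercise data set and not the master solution, just an illustration of how the shared variable between the two triple patterns becomes a join.

```python
import sqlite3

# Toy triples table. The schema and every identifier below are hypothetical
# stand-ins for illustration, not the actual exercise database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("Anne", "position_held", "monarch_of_great_britain"),
        ("George_III", "position_held", "monarch_of_gb_and_ireland"),
        ("Elizabeth_II", "position_held", "monarch_of_united_kingdom"),
        ("Elizabeth_II", "position_held", "monarch_of_canada"),
        ("monarch_of_great_britain", "part_of", "british_royal_family"),
        ("monarch_of_gb_and_ireland", "part_of", "british_royal_family"),
        ("monarch_of_united_kingdom", "part_of", "british_royal_family"),
    ],
)

# SPARQL-style pattern being translated:
#   ?person   position_held ?position .
#   ?position part_of       british_royal_family .
# Each triple pattern becomes one copy of the table; the shared variable
# ?position becomes the join condition t1.object = t2.subject.
SQL = """
SELECT DISTINCT t1.subject
FROM triples AS t1
JOIN triples AS t2 ON t1.object = t2.subject
WHERE t1.predicate = 'position_held'
  AND t2.predicate = 'part_of'
  AND t2.object = 'british_royal_family'
ORDER BY t1.subject
"""


def british_monarchs(connection):
    """Run the translated query and return the matching subjects."""
    return [row[0] for row in connection.execute(SQL)]


print(british_monarchs(conn))
```

The point of the extra "part of" hop is that one query covers all three monarch positions without listing them one by one.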
And sometimes they were the same. I don't know how far this goes back, 858, so about the same. So we had all these different realms, and then at some point they were unified. And this is what happened in 1707, at least for some of them: Scotland and England united to form Great Britain. "What a time to be alive." Yeah. "1707, I suppose at least on one day there might have been rain falling." Yes, I agree, the question on the sheet was not very specific. And somebody wrote: "I asked a British friend about the monarchs", and they did that before they worked on the exercise sheet, to find out how to get them all together in one query, "and she told me about the Acts of Union in 1707 before I even noticed the question at the end of the sheet. So that is my final answer." And yes, you won a million euro. Congratulations. Okay, and as one of you noticed, that was the optional remark: why is Prince Charles, or King Charles, he actually became king, not here? Well, because there is no end date, right? There is just a start time, no end time, and to handle the missing end time you would need the OPTIONAL. It would be like those joins which we talked about in the database lecture, or OPTIONAL, and then you also get, I don't know, should we do that? No, I think we should continue. So somebody just implemented it, because it's relatively easy, and then you also get those which don't have an entry for end time, and you will also see Prince Charles. Okay, let's continue with the lecture and a new topic, which is related. The next three lectures will now kind of tie together everything which we have done so far, and this exercise sheet and the next two will be really interesting because they combine kind of everything into one product, which I think will be very fascinating to see. So we start with fuzzy search.
And fuzzy search is, I mean, it's doing what I have been doing here in Wikidata. So I was looking for something, Queen Elizabeth. You do that all the time. This is the lecture Databases and Information Systems, and a very common feature of an information system is that you have some field and you want to find things from some big list. So here it's just all the entities, 100 million entities. I type Queen Eli and I find Elizabeth II. And another feature: maybe I don't know how to write queen and I write it like this, and I don't find her. So actually it would be nice if I could also make typing mistakes, but I can't on Wikidata. For the exercise sheet we should allow this. Thank you, Frank. And this is our topic today: not just to type something and get prefix hits if it's correct, like Frei and Freiburg, but you're also allowed to make mistakes. So there are like three variants. Prefix search, this is what the Wikidata page does. Then fuzzy search, where you do a complete match: you can type something with mistakes, but it has to be the whole thing. And what we do today is actually fuzzy prefix search: just type the beginning of something, you can even make mistakes there, and you still find what you're looking for. That's what you really want and that's what we're going to do today. And Sebastian has prepared this wonderful data set for you, and let me show it to you. It will be the data set for the next three sheets, and it is not independent of what we have done so far. It's the entities from Wikidata, like the entities from the data set with which you worked in the last lecture. Let me just show it to you. I think I have to go back here, public code lecture 7. You can download it from the wiki, I already did that here, and it looks like this. Maybe I'll make it full screen again. So the columns are names, scores, synonyms and so on.
In the first column you just have things from Wikidata. Let me just cut out the first column, so that's the main information. It's just these entity names, starting with the countries here, because it's somehow ordered by popularity, popularity measured by the number of Wikimedia links and some other things. So you get all kinds of entities here: names, people, chemical elements, whatever. But there's also more information here, part of which you have to use and part of which you can use: a score, so a kind of proxy for popularity; synonyms, so other ways to search for the entity, that's optional but very interesting to use; the Wikidata ID; a short description; and more information which will be relevant for the later exercise sheet. Okay, so that's the data set, and our main goal will be to solve this fuzzy prefix search. And we have to solve two challenges. When you allow typing mistakes, you have to say somehow what's relevant. It's like in the first lecture: what's similar to what I have typed? We need a notion of relevance. And then we will see it's not so easy to make this efficient, so we need to build an index. So it's kind of what we did in the first two lectures, just for a different problem now. Very briefly as background: where does a list like this come from? There could be many sources. Here it comes from Wikidata. One typical source for a search engine: Google just knows what people search. So when I type something on Google here, Google does the same thing, right, only this completion here will not only be from a list like Wikidata, it will be a combination of such a list and what people frequently search. You could also have a collection where you don't have a query log yet, you don't have experience from people searching it. Then you could just search it for common phrases.
That's an interesting problem by itself: I give you any text and now you want to find interesting names or phrases in there. Or, what we do for the exercise sheet, you can take some knowledge base and take a list from there. Whenever you have a new problem, you should first look at the simplest solution to see whether maybe it's good enough. And what's the simplest solution? Well, let's just do what I did here again. Let me cut the header line and let me just... So this is now just the names. And now I can use a command line facility like grep. So let's look at grep. I'm looking for Frei, maybe I'm looking for Freiburg. Oh, you see a lot of things here. So that's now just doing prefix search. How big is this file? It's 25 megabytes, and that's a very fast machine we're on here. So maybe type a little bit more: Freiburg. Yeah, and now you see I also get Freiburg im Breisgau. So you could just do grep, and it's not that bad if the data is not very large. There's also agrep, for approximate grep. Let's try that. With agrep you can specify how many typos. So I'm allowing two typos now with agrep and I'm typing something. Okay, I don't find anything, that's too many. Now I find a lot, because I allowed two typos. So that's also interesting already, right? If you allow two typos, you get a lot of things, so you also have a ranking problem here: what do you show first? Let's maybe only allow one. I still get a lot of things. But anyway, you can use grep, you can use agrep, and they are not so bad. They are pretty fast: of course they are written in C, and the file we have just seen was pretty small. And what do these tools do? You can also use them to... yes, there is a question. How is it defined? That's what we will do next, in just a second. Exactly: how is it defined that I can make two mistakes?
A very appropriate question, and the next part of the lecture is about that. Two errors, next slide, okay. Is it the next slide or the slide after that? So first: what's the time complexity of this simple solution? As I said, first understand the simple approach. The simple approach is you just go through everything. So you have a number of records (doesn't have to be names, can be anything with text), and then you have the time for checking a single string. So very simple: it's just the number of records times the time for checking a single string. Now, how long does it take to find out whether such a thing is a match? It's actually not so easy if you would do it in Python. I think it's about one microsecond to just check whether this string matches this other string up to typos; we will find out what typos mean in a second. What do we want? Just going back to this here: if you type something here, you don't want to wait a minute, right? You see it's waiting a little bit here, you get a little bit of a spinning wheel, and it's very annoying that you can't make mistakes here. So a typical number: 200 milliseconds is something where people say that feels kind of instant, 50 milliseconds feels really instant. With one microsecond per check, this means we would be fine for up to 200,000 records. For the sheet we already have 1.6 million records. In reality you have even more: this search here is in a list of 100 million. So you need something better. But it's pretty good. Never underestimate a baseline if your data is not so large. And how to do better, that's the rest of the lecture. So now we will talk about edit distance. This is our measure for when two strings are similar. It's also called Levenshtein distance, after Levenshtein, who died a few years ago. And this is a kind of similarity measure between two strings x and y, any two strings. How is it defined?
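Before the definition, the baseline from above can be sketched in a few lines of Python: a plain linear scan over all names, plus the back-of-the-envelope calculation with the rough numbers mentioned in the lecture. The tiny list of names and the function name `prefix_scan` are made up for illustration, and the one microsecond per check is the lecture's estimate, not a measurement.

```python
def prefix_scan(names, query):
    """Baseline: go through every record and keep those starting with the query."""
    q = query.lower()
    return [name for name in names if name.lower().startswith(q)]


# Hypothetical mini data set, standing in for the real list of entity names.
names = ["Freiburg im Breisgau", "Freiburg", "France", "Frankfurt"]
print(prefix_scan(names, "frei"))

# Rough budget: total time = number of records * time per check.
time_per_check_s = 1e-6  # about one microsecond per string check (lecture's estimate)
budget_s = 0.2           # 200 ms still feels "kind of instant"
max_records = round(budget_s / time_per_check_s)
print(max_records)       # number of records a linear scan can afford in that budget
```

With 1.6 million records the same estimate gives about 1.6 seconds per keystroke, which is why the baseline is not enough for the exercise sheet.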
We define three operations or transformations to get from one string to the other, and here they are. This is not blue, so we have to make it blue before we can continue. Otherwise it won't work. So we have three operations. One is insert one character, one is delete one character, and here I also give you the signature, and one is replace one character by another character. And let's just do it. It's like these riddles which you find in magazines, and I always take the same two words. So let's take the German word doof. Let me write it like this: DOOF. And let's assume we want to get to another German word, blöd, written BLOED. So we want to get from DOOF to BLOED with these operations. How do we do it? There are many ways to do it, and we will see a systematic approach in a second. So the D, I mean, it's actually harder than it looks. It looks like we want a D at the end, but there's no operation to move a letter to the end, right? We could now delete all these three here. You can think about whether you find a better way, but for now let me just replace. So what operation is this? I did a replace: I replaced the first character (let's start counting positions at 1) by B. Okay, now I have BOOF. Now let's replace the second one by an L, so that's BLOF. And now we could insert something, maybe now it's time to insert the E: BLOEF. And you see I'm almost there. So that's an insert, and at which position? One, two, three, four. I'm taking the position where the inserted character ends up, so I'm inserting an E at position 4. And the final operation is again a replace: I'm replacing the last one, which is now position 5, by a D. And now, very important, so let me write it here: important. This does not show that the edit distance is 4, right? This only shows that the edit distance is at most 4. It could be smaller, right?
Is it smaller? What do you think? There could be another sequence of operations. Maybe I just wasn't clever enough. Is there a sequence with three operations? Think about it a little while. And you could think in both directions: you could think about finding a smaller one, or you could think about an argument why there can't be a smaller one. Any opinions? Maybe the edit distance is three, maybe it's two. We didn't really make use of the fact that there is a D here, which is also there. We kind of just replaced the D and then later added a D again. That looks wasteful, right? So what do you think? What's the edit distance? Four, three, less? Yeah? "I would say that there is no sequence with three operations. First you have to look at the letters that are shared between the words. But also the order matters. So I think we find the longest sequence of letters in both words that are in the same order, and then look at the length of the word and just subtract the number of shared letters. I think that's basically all." Okay, these were interesting thoughts, yeah. So you said looking at the same letters is not enough. There are two same letters here, D and O, but they are in the wrong order: the D is at the end here, and there it comes before the O. So look at the sequences of letters which are in the same order, which here would be only one, right, only the O or only the D. And so you were arguing why, and it's correct, you can't do it with less than four. But it's very interesting that already for this simple problem, a string of size four and a string of size five, it's not so easy. If I would ask you to mathematically prove that four is the best, I think you would have a hard time. These were very good arguments, but how do you turn this into a proof? And so let's talk a bit about this. So let's first introduce some simple notation.
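As a small aside before the notation, the four-step transformation from the example can be replayed in code. This is only a sketch: the helper names `replace`, `insert` and `delete` and their 1-based position arguments are chosen here for illustration, following the convention used on the board (positions start at 1, and an insert names the position where the new character ends up).

```python
def replace(s, i, c):
    """Replace the character at 1-based position i by c."""
    return s[: i - 1] + c + s[i:]


def insert(s, i, c):
    """Insert c so that it ends up at 1-based position i."""
    return s[: i - 1] + c + s[i - 1:]


def delete(s, i):
    """Delete the character at 1-based position i."""
    return s[: i - 1] + s[i:]


# The sequence from the example: two replaces, one insert, one replace.
operations = [
    (replace, 1, "B"),
    (replace, 2, "L"),
    (insert, 4, "E"),
    (replace, 5, "D"),
]

steps = ["DOOF"]
for op, pos, char in operations:
    steps.append(op(steps[-1], pos, char))

print(" -> ".join(steps))
```

Replaying a sequence like this only shows an upper bound: four operations are enough. It says nothing about whether three would also do.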
We will not go into super depth here, but at least I will tell you something about it. Actually, in the Algorithms and Data Structures lecture, there is a whole lecture about this. So the empty word: you need a character for it, because if you just write nothing, it's hard to see, you don't know what is meant. So epsilon is the empty word. The length, you just denote it by putting these vertical bars. Substrings, we just do it like this, Python style, only Python would put a colon here; here we put dot dot, and we start indexing at one, it's just nicer in the math. Here are some simple properties. These are also popular exam questions, or related ones. Okay, this was the wrong one. I need this one, yes. So why is that? Can you give me an argument? Why is it symmetric? Is it symmetric? Yes, I said it's symmetric. So if I go the other way around, why is the edit distance from DOOF to BLOED the same as from BLOED to DOOF? And is this true in general? Yes? "I think each step is reversible." Each step is reversible, yeah, that's a very good answer, it's true, exactly. For every operation you can find an operation in the other direction, which means the two correspond to each other. So if you had a shorter sequence in the other direction, you could turn it into a shorter sequence in this direction, so they have to be the same. That's how the proof would go. And I would recommend to just try to prove this. When you prepare for the exam, try to prove these little things. Here's another simple one: how do you get from the empty word to a word? Well, there are not so many options, right? You have to insert one character after the other, and that's the best you can do. You can always do it worse: insert some things, then replace them by something else, delete things.
But obviously the best thing is: if you go from nothing to a word with five letters, you just insert the five. This one just says that to go from one word to another, if there's a difference in length, you need at least that many operations. One is of size three, the other of size seven: you will need at least four operations, because you have to add these, and every operation adds at most one character. This next one is a little bit more complicated and we will not go into the detail here. There's a recursive formula which I can only briefly explain. I would need a whole lecture or half a lecture to explain it. But that's the way to compute it in practice. So if you want the edit distance, then it's the minimum of the following three, and we will turn this into a scheme to compute it in a second. It's the edit distance where you omit the last character of y, plus one; or you omit the last character of x, plus one; or you omit the last characters of both strings, and here you add one if the last characters are unequal and nothing if they are equal. That's maybe the easiest one to see, right? If the last characters are equal, then one option is to just leave that last character in both strings and see how you get from the part before the last character of the one string to that of the other. So basically what you're doing here is making assumptions, right? Let me briefly explain that. You're saying: okay, I want to get from x to y, and there are kind of three possibilities. I leave the last character here in y, so that would be the, and then, I don't know, what this says is I have to replace the last character, yeah, I think this formula is, let me try it anyway. So what this says here: the last operation will be to change the last character of y. This would be case number one.
So I do everything to get from x to y without its last character, and then I do something with the last character, that's this plus one. This is the same for x, and this here is the case when I do something about the last characters of both. And there are no other possibilities. And what you can see with this formula is that it's recursive, and it takes you from a problem with two strings x and y to a problem that is simpler in the sense that at least one of the two strings is shorter. So the recursion actually makes sense, right? Here the y is shorter, here the x is shorter. Now when you again do the recursion with this one, again one of the two will become shorter. So eventually one of the two strings will be empty. And here's the base case, which we know: if one of the strings is the empty string, we have just seen what the edit distance is. For either of the two strings, if one of them is empty, it's simple. And as I said, the proof is not trivial. If you are interested, here is a reference to a whole lecture where you can see it, but you don't have to understand it, you just have to know the formula for now. But what we should do, and what you should be able to do, is use that formula to compute the edit distance. So let's just do that. The recursive formula actually gives rise to a scheme. I will do the following. Let me take the same two words which I had in my example a few slides ago, and let me write them like this. So this is epsilon, the empty word (you will see in a second why I do it like this), and then B, L, and so on, BLOED, here. Okay, and now let me take purple here, write some numbers, and then explain to you what I'm doing. So what does this number mean?
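For reference, here is the recursive formula described above written out in the notation from the earlier slide (indices start at 1, epsilon is the empty word). This is reconstructed from the spoken description, so take it as a sketch of what is on the slide rather than a copy of it.

```latex
\mathrm{ED}(x, y) = \min
\begin{cases}
  \mathrm{ED}(x,\; y[1..|y|-1]) + 1 \\
  \mathrm{ED}(x[1..|x|-1],\; y) + 1 \\
  \mathrm{ED}(x[1..|x|-1],\; y[1..|y|-1]) + c
\end{cases}
\qquad
c =
\begin{cases}
  0 & \text{if } x[|x|] = y[|y|] \\
  1 & \text{otherwise}
\end{cases}

\text{Base cases:} \quad
\mathrm{ED}(x, \varepsilon) = |x|, \qquad
\mathrm{ED}(\varepsilon, y) = |y|
```

Every recursive call shortens at least one of the two strings by one character, so the recursion always reaches one of the base cases.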
This is the edit distance from epsilon to the word DO. So what I will have in this table is the edit distance for all combinations of a prefix of this word and a prefix of that word. So I'm looking at all prefixes, and this entry is their edit distance. Here it's the empty word, so the smallest possible prefix of BLOED, and here it's the prefix up to the O. That's why I write the epsilon here, so that I can also get the empty word. And that's obvious: if I want the edit distance from the empty word to something, then I just get zero, one, two, three, four, right? Because it's just the length of the prefix. DO has length two, so it's just two here. And please do ask a question if anything is unclear. So filling these is easy, and it's the same in the vertical direction. So why not also add that here to clarify. This box is now the edit distance from BLO, the prefix up to there, to the empty word, and that is three of course, because you have to delete three letters. Yeah, let's see if we can do this together. Let's go back one slide and... So what you have to do, let's turn this into an algorithm. You take away the last character from one string, compute the edit distance and add one; or you do the same for the other string, plus one; or you take away both last letters, and then you add one or not, depending on whether the last characters are the same. So you consider these three things and then you take the minimum of them. Let's just do this here. To compute this number here, I'm considering these three here. I should have maybe written this here: this is my x and this is my y, as you can see here. So this is the first word, this is the second word. I want to compute something for this entry now. If I go one to the left, this is like taking away the last character of y, it's just one prefix before that.
If I go up, I'm taking away the last character of x. If I go diagonally, I'm taking away the last character of both. So what I'm doing, and that's really what the formula on the last slide is saying, is taking the minimum of these three plus one, except, and I should say that again, that for the diagonal one, where I'm taking away the last character from both strings, I'm only adding one if these last characters are not equal. Otherwise I can just take them away at no cost. So I have to pay attention. Here D and B are not equal, so it's just the minimum of these three plus one. The minimum of these three is zero, and plus one is one. And let me do one row and then I will ask you to tell me the numbers. So I'm here, I'm taking the minimum of these three and I'm checking whether the characters are the same. O and B are not the same, so indeed take the minimum plus one. Otherwise I would use one less here. So it's two, one, one. The minimum is one, plus one, so that's two here. And here it's three, two, two. O and B are again not the same. There is no B here, so I can finish that whole row: it will always be just the minimum of the three plus one. Three, and here it's four. Now it's your turn. What's the correct number here? Two, I agree. Here? Two, because there's a one here, it's the minimum of these three plus one. What is it here? Three, I agree. Here? Four, yeah, that's correct. And here, now we have to pay attention: there's an O, but here it's O and D, they are not the same, so what is it here? Three. And now we have to pay attention, now it's O here and O here, so what is it here? Yeah, it's two. Exactly, because the edit distance from BLO to DO is two, it's the same as from BL to D, that's what it says. And for the second O here, what is it? Also two, that's correct. What is it here? Three, yes. Okay, let's continue with E.
We don't have an E here, so we can always just take the minimum plus one, and you please check whether it's correct. Three, three, three. And now here, D. Okay, the D is the same, so we take the four, and now there's no more D in the rest, so it's always just the minimum of the three plus one, so it's four, four, four. And of course, all we wanted is this thing on the bottom right. This is now the edit distance of the full words. And actually I did it in the other direction now, which was not intentional, but it doesn't matter. This is now the edit distance of BLOED and DOOF. And by the correctness of this scheme, which we didn't prove here (we proved it in the Algorithms and Data Structures lecture), we now know that it's actually four. You can't do it with less. And how much time did this take? Well, I have a string length here and here, and I made a quadratic scheme and filled out all the numbers. What I did was constant time for each entry: I just had to look at three other entries and then compute the minimum. So it's constant time for each entry, and I have length of x plus one times length of y plus one many entries, but the plus one doesn't change it asymptotically. Any questions about this? Let me just check where we are. So two more things before the break. Now prefix edit distance. What we actually want is to type only the beginning of something and get matches which are longer. The prefix edit distance is defined as follows. So the x here is now something shorter. Let's maybe take this here, right? I type something and I'm making a mistake. So this is now my x, and I want to see: does it match Queen Elizabeth II with some mistakes, and how many? Maybe this is too much, right? This is a lot of mistakes now, maybe I just make one mistake.
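The table-filling scheme that was just completed on the board can be written as a short nested loop. The following is a minimal Python sketch of the standard dynamic program for edit distance; the function names are chosen here for illustration, and this is not the implementation provided with the exercise sheet.

```python
def edit_distance_table(x, y):
    """Fill the (|x|+1) x (|y|+1) table of edit distances between all prefixes."""
    rows, cols = len(x) + 1, len(y) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = i  # prefix of x versus the empty word: i deletions
    for j in range(cols):
        table[0][j] = j  # empty word versus prefix of y: j insertions
    for i in range(1, rows):
        for j in range(1, cols):
            # The diagonal step is free only if the last characters are equal.
            diagonal_cost = 0 if x[i - 1] == y[j - 1] else 1
            table[i][j] = min(
                table[i][j - 1] + 1,                  # left: drop last char of y
                table[i - 1][j] + 1,                  # up: drop last char of x
                table[i - 1][j - 1] + diagonal_cost,  # diagonal: drop both
            )
    return table


def edit_distance(x, y):
    """The edit distance of the full words is the bottom-right entry."""
    return edit_distance_table(x, y)[-1][-1]


# Same orientation as on the board: BLOED labels the rows, DOOF the columns.
for row in edit_distance_table("BLOED", "DOOF"):
    print(row)
print(edit_distance("BLOED", "DOOF"))
```

Each entry looks at three neighbours, so the running time is proportional to the number of entries, (|x| + 1) times (|y| + 1).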
So here I just forgot the E, and I want to know the distance between Quen Eli, where I just forgot the E, and Queen Elizabeth. And of course Queen Elizabeth II is much longer, but I just want to know how good of a prefix match it is. So what do I do? If I want to know the prefix edit distance of this to Queen Elizabeth II, I just compare this to all the prefixes of Queen Elizabeth II and I take the edit distance where it's the smallest. So I take all the prefixes of the second one here, look at the edit distance, and take the smallest value. And let's just do that for an example here, two examples actually. First, what's the edit distance between UNI and UNIVERSITY? Just the edit distance like we had it before between these two words. I see nothing in the chat. "Seven." Seven, I agree. Why is it seven? "Because you basically only have to add the VERSITY." Yeah, you have to add VERSITY, that's right. It's a perfect prefix, so you just leave these three letters and then there's nothing better than adding V, E, R, S, I, T, Y. What's the prefix edit distance? So for the prefix edit distance, now just look at all prefixes of UNIVERSITY and take the one where the edit distance is the lowest. What do you think is the prefix edit distance? Yeah? "Zero." Zero, yeah, that's correct. Which prefix? "UNI." UNI, yeah. Here you have a prefix which is perfect. So it's the edit distance of UNI to this prefix UNI, which is zero. Here's another example: UNIWER, so maybe that's what I type, and UNIVERSITY. First, what's the edit distance, like how we did it so far? "Five." Five, I agree. It's five, right? The proof wouldn't be super trivial, but it looks like it should be five: you want to replace the W by the V, this looks optimal, and then you add SITY: S, I, T, Y.
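The definition just given (compare x to every prefix of y and take the smallest edit distance) can be written down almost literally. The sketch below does exactly that and nothing clever: it recomputes an edit distance for every prefix, so it is meant for checking small examples, not for the exercise sheet. The function names are made up for this illustration.

```python
def edit_distance(x, y):
    """Standard dynamic program, keeping only the previous row of the table."""
    prev = list(range(len(y) + 1))  # row for the empty prefix of x
    for i, cx in enumerate(x, start=1):
        curr = [i]  # first column: prefix of x versus the empty word
        for j, cy in enumerate(y, start=1):
            cost = 0 if cx == cy else 1
            curr.append(min(curr[j - 1] + 1, prev[j] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def prefix_edit_distance_naive(x, y):
    """PED by definition: minimum edit distance from x to any prefix of y."""
    return min(edit_distance(x, y[:k]) for k in range(len(y) + 1))


print(edit_distance("uni", "university"))               # plain edit distance
print(prefix_edit_distance_naive("uni", "university"))  # best prefix is "uni"
print(edit_distance("uniwer", "university"))            # replace W, then add SITY
```

Note that the prefix edit distance is not symmetric: x is what was typed, y is the candidate, and only y may be cut short.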
What's the prefix edit distance? "One." I agree, it's one. You look at all the prefixes and take the best one, the one where you have the edit distance between UNIWER and UNIVER: that's a W and that's a V. Okay, and that's what we want. When we type something like this here, we want to get in our list (and we don't get it here in Wikidata) the things which have a small prefix edit distance. Maybe we say up to one mistake is okay, and then we would expect Queen Elizabeth here, and this is what you will do for the exercise sheet. All the large web search engines have this. Wikidata doesn't, and I think Wikipedia also doesn't have it, it's a pity, but of course in Google you can make mistakes and you will still find the correct answer. So how do we compute the PED? Here we saw how we computed the edit distance, and the prefix edit distance is defined in terms of the edit distance. So let me just start by doing it and then explain to you what is written here. Let's take the following example. Let me take FREIBURG here, so that's now my y, and let's take FIBU here. So let's say I type FIBU, what do you think, what's the prefix edit distance? "Two." Two, yeah, it's actually two, right? It's pretty small, it doesn't look like it, and why? Because the best prefix can actually be pretty long, right? It's this one, it's FREIBU. From FIBU to FREIBU it's just two operations: you just add the R and the E and then you are there. So it's actually two. So the question is how you compute this. Well, what you could do is compute the whole scheme. Let's just write zero, one, two, three, four, five, six, seven, eight here, and one, two, three, four here, but we don't fill out everything. Let's just do the last row. I'm not filling this out now, it would take too long, but the values in the last row are just the edit distances from the whole thing, FIBU, to the prefixes.
Let's just write the right numbers here without doing this scheme. What's the edit distance from FIBU to F? What's the right number here? Three, I agree. What's from FIBU to FR? Three, I agree. From FIBU to FRE? Also three. From FIBU to FREI, also three, right? I think it's boring. "Because you can replace, you can replace." Yeah, it's a good question. It's actually tricky, you have to pay attention. From FIBU to FREIB, what is it? It's also three, right? Yeah, because the I is now. And now from FIBU to FREIBU, it's two, right? And now it gets larger again, I think: three, four. So it's interesting, right? This small little problem has interesting properties. Here it even goes down: you can add a letter and then it becomes easier. I think that's correct. Please tell me if we made a mistake here. So now, by the definition: if we just look at the last row here, it's just all the prefixes of y, and that's how the prefix edit distance is defined, right? It's just the minimum of all the entries in the last row. So this here is the PED of x and y. And here's one important thing, then we can almost go into the break. I can just fill out the scheme as before, it has the same complexity as before, and then I just look at the last row, take the minimum, and I get the prefix edit distance. So I don't need a new algorithm, I just do the same scheme, but instead of taking the value at the bottom right, I take the smallest value in the last row. Now let's look at what's written here. Here it says, and this is how it will be for the exercise sheet (let me take a different color here, maybe green): we only want to know, is this a match up to two mistakes? Is the prefix edit distance less or equal to delta?
Let's take delta equal to two. If it's more, I don't want to know how much exactly. I just want to know: is it less or equal to two, and if yes, how much? Then I claim it's enough to look at the first |x| plus delta plus one columns. And what would that be here? The length of x is four, plus delta is two, plus one, that's seven. One, two, three, four, five, six, seven. So I'm claiming there is no need to look beyond here, no need to look to the right of this. And you tell me why. Now I'm wondering why it is delta plus one. Oh, I know why it's plus one: it's including the first column. One, two, three, four, five, six, seven. Okay, now I know why I had this. It actually saves some time, and this is important when my x is short. So this is here the sixth letter of Freiburg. So what I'm claiming here: if I just want to know whether the prefix edit distance of this to this whole thing is less or equal to two, I don't have to look beyond this green line. Why is that? Yeah? Yeah, that's exactly right. So starting from the R, the prefix is three longer than fibu, right? So whatever I do, I have to add letters. Now I'm confused myself. No, no, if I can take prefixes here, the best one will be among these here, right? Yeah, exactly. So I'm already checking all these shorter prefixes here. The prefix ending in R or in G can't be one of the candidates, because to get from fibu to freibur I would need at least three edit operations, just because it's three longer, and to get to freiburg I would need at least four, just because it's four longer. And indeed it's three and four here, and it could be even larger. So if I just want to know, is it less or equal to two?
No point in looking further than this. So no point in looking beyond the length of this string plus delta. And the plus one is here because the first column is the empty word column. Okay. So, yes please. If we shorten the table like this, will the minimum always be the box at the bottom right? No, that's a good question. I actually chose this example. Usually it's not like this. It's a bit unusual that it goes down here, which is because of the special structure. No, it can be further to the left, it can be anything. Could it be at any position? Yeah? With a different x it would be zero here, that's true. I just wonder whether the smallest value can be anywhere. Yeah, it probably can be, but it doesn't have to be at the rightmost position. Okay, and now unfortunately it's not part of the sheet to implement this, although it would be a lot of fun, but the sheet is already a bit of work apart from that. So we provide the implementation for you. Of course you can implement it yourself if you want, it's not that hard, you can imagine it. How do you implement this? Yeah, it's just a nested loop where you implement the recursion, right? And then you just look at the last row. So you could implement it that way, but you don't have to, we implemented it for you. And actually Sebastian provided an implementation in Rust, because the data set is not so small and Python is really slow with these things. So doing 100,000 PED computations in Python, you feel it. And the way it's done, the template which we provide will automatically look for this package here, which you can just install via pip, pip install rdfribo-qgram-utils. If you did that, if the package is there, the code template will use it. You can just look at it in the code, and if it's not there it will use the Python fallback.
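The scheme described above, restricted to the first |x| + delta + 1 columns, can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation provided with the exercise sheet; the function name `prefix_edit_distance` and the convention of returning delta + 1 when the threshold is exceeded are assumptions made here.

```python
def prefix_edit_distance(x: str, y: str, delta: int) -> int:
    """Return PED(x, y) if it is <= delta, otherwise delta + 1.

    Hypothetical helper for illustration; the return convention
    (delta + 1 for "too far") is an assumption of this sketch.
    """
    # Only prefixes of y up to length |x| + delta can be within delta,
    # so we need |x| + delta + 1 columns (column 0 is the empty prefix).
    m = min(len(y), len(x) + delta)
    # Row 0: edit distance from the empty string to y[:j] is j.
    prev = list(range(m + 1))
    for i in range(1, len(x) + 1):
        # Column 0: edit distance from x[:i] to the empty string is i.
        curr = [i]
        for j in range(1, m + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            curr.append(min(prev[j] + 1,          # delete x[i-1]
                            curr[j - 1] + 1,      # insert y[j-1]
                            prev[j - 1] + cost))  # replace or match
        prev = curr
    # The last row holds ED(x, y[:j]) for every considered prefix of y.
    ped = min(prev)
    return ped if ped <= delta else delta + 1


print(prefix_edit_distance("fibu", "freiburg", 2))
```

Only two rows are kept at a time, since each row of the scheme depends only on the row above it.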
It should be as simple as that. You just install it and then you will get fast prefix edit distance computation. If for some reason this doesn't work for you, there is the fallback. And also, you will see this in a second, you will need a routine to merge lists, and we also implemented that for you. And it's also implemented in Rust if you use that package. Okay, it's time to make a break, but before that, is there any question about edit distance before we continue? Okay, no questions, so five minute break and then we resume.
Okay, we are back online. On to the second part, which is: how do we now turn this into an algorithm for fuzzy search? And we will talk about q-grams. So here's the intuitive idea. When two strings have a small edit distance, they will have parts in common, right? We have already seen that they somehow have parts in common. We have to formalize that. So for example, Freiburg and Stuttgart. We're not talking about semantic similarity, but about the strings. Freiburg, Stuttgart, is the edit distance small? What do you think? Just from your intuition, do they have a small edit distance? No, because they are totally different strings, right? Freiburg, Breifurg, yeah, they look similar, right? They are similar in many places: there's ei, there's ur, there are differences, there are similarities. So that's what we want to formalize now. Here's a q-gram. The q-grams of a string are just all the substrings of a certain length. The q just stands for the length, so these here are 3-grams. It says here that this is the set of 3-grams. So the q is just an integer, and it's all substrings of that length. And it's not any three letters in any order, it's consecutive letters, consecutive substrings of a certain length. Here they are.
We define it as a multiset because, and that's something to think about later, you could have the same 3-gram twice in a word. If it just occurs again later, then we also want to have it twice in the set. And actually here's an example. Here we have the same 3-gram twice, once here and once here. It's just two different substrings which happen to be the same. The number of q-grams of a string x is, well, up to you. So obviously it has something to do with the length of the string, but it's not the length of the string, right? So what is it? How many q-grams does a string of length |x| have? People in the chat, you can also answer. Minus two for 3-grams, that's true. And for general q? Minus q is not quite correct yet, because for 3-grams it's minus two. Minus q plus one. Yeah. So that was proved by incomplete induction. You do it for one example, you assume everything is linear, and that usually works. Sometimes it doesn't, but yeah, you do it for three. And it's clear, right? It's not the whole length of the word because you need three letters, so you can't go until the very end. You start at the beginning and then you have to stop short of the end because you still need three letters. And if you just try it out, it's |x| minus q plus one. And oh, that's where I should have written this. Okay, let me write it there. It's |x| minus q plus one. Okay, next slide. We said that to have a small edit distance, words have to have stuff in common, and now we can make it more precise. You need to have q-grams in common. And here's the lemma which we are going to prove. Let's first understand this notation. So let's first understand what A without B is. Let's draw some Venn diagrams.
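The q-gram multiset and the count |x| − q + 1 discussed above can be sketched as follows. The function name `qgrams` and the example word "barbara" are chosen here purely for illustration and are not taken from the slides.

```python
from collections import Counter


def qgrams(x: str, q: int) -> list:
    """Return all consecutive substrings of length q, in order of occurrence."""
    # There are len(x) - q + 1 start positions; none if x is shorter than q.
    return [x[i:i + q] for i in range(len(x) - q + 1)]


word = "barbara"           # hypothetical example with a repeated 3-gram
grams = qgrams(word, 3)
multiset = Counter(grams)  # a Counter keeps multiplicities, like a multiset

print(grams)
print(len(grams) == len(word) - 3 + 1)
print(multiset["bar"])
```

Using a `Counter` rather than a `set` keeps the second occurrence of a repeated q-gram, which is what the multiset definition asks for.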
Now, where do I draw it? If I have a set A here and a set B here, and this is the intersection of A and B, then this part here, that's A without B. So you just take the intersection away, right? I think that's simple enough. And if the two are very similar, then A without B will be small, because you will be taking most of it away. If these two sets overlap a lot and you take B away, then you won't have much left. Note that this is not symmetric here, I'm taking one away from the other. It will be important in a second. So if the two sets are very similar, then this will be a small set, this will be a small number, and here it says how small it is: it is bounded by the edit distance. Let's first do an example and then prove it. And now we're talking edit distance again. We'll come back to prefix edit distance in a second. So here we have freiburger and breiurger. What's the edit distance? Two, yeah, I think so too, right? If you have a lot in common it's actually easy: replace the F by a B and then delete the B. I think it's two, yeah. Okay, let's look at the 2-grams. Let's take 2-grams, I mean this works with any q. Let's check if this is correct. Maybe I made a mistake. Looks correct, right? So how many are these? I need a number, an integer, preferably a correct one. Nine? Yeah, I agree. Nine. And how many are these for the other one? I deleted one letter, so I think it's eight, right? Now I want to have those from x without those from y, so let's see which ones I have here. This one I don't have in y, so this remains there, right? So I'm taking away everything from this one which occurs in that one.
So let's just do that. That's fr, which stays. Then re, I also have it here. ei, I have it here. ib, I don't have it, so that should be in there. And then bu, I also don't have it, right? And try to understand what's happening here. And then the others: ur is there, rg is there, ge is there, er is there. If I have long substrings in common, I will have all their 2-grams in common, right? So the size of this set is three. And the lemma here says that the size of Q2 of x without Q2 of y is less or equal q, which is two here because I'm looking at 2-grams, times two, which equals four. So the lemma says it's at most four, it's three, so it's correct for this example. Do you have any idea why? That's a hard question, but maybe you have an idea. Oh, you have the idea before I ask the question? Why does the lemma say four, yes? We have one letter that changed, and that letter can be in at most q of the q-grams. Yeah. We have two things that changed, right? Yeah, but each letter can be in at most q of them. Yeah, this is exactly right. So for example, here we deleted the B, and this B affected some of the 2-grams. The B was deleted, and that's why the ib and the bu have gone away in the other word. And if you think about it, that's exactly what you said: you delete one letter or you change one letter, and it can affect at most q q-grams. It's like you shift a window over the word, and for q points in time the letter will be in there. But here only one 2-gram is affected by changing the F, because it's at the very beginning. That's why we have one less than could possibly be affected. And we'll come back to that in a second. So that's the idea, and that's what's written here actually.
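The worked example above can be checked mechanically. The following sketch recomputes the multiset difference of the 2-grams of "freiburger" and "breiurger" and compares its size with the lemma's bound q · ED(x, y); the helper name `qgram_multiset` is made up for this illustration, and the edit distance of 2 is taken from the discussion above rather than recomputed.

```python
from collections import Counter


def qgram_multiset(x: str, q: int) -> Counter:
    """Multiset of all consecutive substrings of length q of x."""
    return Counter(x[i:i + q] for i in range(len(x) - q + 1))


q = 2
x, y = "freiburger", "breiurger"
edit_distance_xy = 2  # replace f by b, delete b (as argued in the lecture)

qx = qgram_multiset(x, q)
qy = qgram_multiset(y, q)

# Counter subtraction is a multiset difference: counts are subtracted
# and only positive counts are kept.
difference = qx - qy
size = sum(difference.values())

print(sorted(difference))            # the 2-grams of x that are missing in y
print(size, "<=", q * edit_distance_xy)
```

The three remaining 2-grams are exactly the ones touching the changed F and the deleted B.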
So if the edit distance is one, this means that to get from one word to the other, it's one operation. It's either an insert, a replace or a delete. Now just think about it. You have your set of q-grams and now I change one letter. Then this can affect, we just said it, at most q q-grams. Which means the set difference has size at most q. All the other q-grams will be the same. So this is basically what we have just argued. And now the rest seems obvious, but there's still something to prove here. Now pay attention. So for the general case, the edit distance is k, which means there's a sequence of k operations to get from x to y. Just take these intermediate strings, think of the blöd dorf example in the beginning, I think it was dorf, bof, blöf, blöw, blöd. So these are my intermediate strings here now. And now I know that from each to the next the set of q-grams changes by at most q. How do I get to this number here? There's something to prove, and this is what we have to prove: if I have a sequence of sets and I know the difference of one set to the next is at most q, then the difference from the first to the last is at most the sum of these. So that's what it says here. It's even a bit more general here. Let me just write that down. I will now prove this, but let's first verify what it implies. If this thing is a subset of this thing, then the size of A0 without Ak is less or equal to the sum of the sizes of A i minus 1 without A i. Right? Let's first understand this, then relate it to what I've shown here, and then prove it. So this is just saying: if this left-hand side is a subset of this, then of course its size is less or equal to the size of this. So this is the size of this set difference, and what's the size of a union?
The size of a union is at most the sum of the sizes of the things in it, so it would be that. And why do I need this here? This is what I need to show. We have already seen: if I have the sequence of string transformations, then from one to the next it's always just edit distance one, it's one operation, which means the set of q-grams can change by at most q. And now I have to show that from beginning to end it can change by at most k times q. So if I show this, then I've shown the lemma. And I think you have to convince yourself that this is not obvious, it's something we have to prove. You have a sequence of sets. Maybe before I prove it, let me draw it. So I have a sequence. I have a set, this is the set of q-grams of my original string. Let me call it A0, which is my Q q of the x I'm starting with. Then A1, my Q q of x1, that's the word I get after one operation. And now I get the next one, it's my set A2. So I've changed one more thing, I get a different set of 3-grams or q-grams, and so on. And now I'm continuing this here. And they are kind of all similar. And this is now my Ak here, which is the set of q-grams Q q of y, of the last thing in the sequence. And now I have to prove: if the difference between each of these sets and the next is small, then the difference from the first to the last is bounded by the sum of these intermediate differences. Yeah, it's intuitive, but I have to prove it. And this here is making it exact. So the step from the first to the last is bounded by the sum of these, and how do we prove this? Let me just prove it for you. Maybe you have an alternative proof. How do you prove that something is a subset of something else? Well, you take an element from the set on the left and you have to show that it's also in the set on the right. So let's start with an element here.
So now we have an element x that is in the first set but not in the last set. So it's in A0 but not in Ak. Let me write my sets like this: A0, A1, A2. Let me just write it for a fixed k, k equals five. So I know that it's in here, but it's not in here. Now that's a typical trick: when I know it's in here and not in here, then there will be some point in between where this switches for the first time. Because I want to say something about two neighboring things. So there will be a point where in the set before it was still included, and in the set afterward it wasn't included. It just has to be like this. Let me write this up mathematically. So there will be an i such that x is contained in A i minus one, but x is not contained in the next one, A i, right? It has to be like this. There's just no other way. It's just a fact whether it is contained in a set or not. I know it's contained here, I know it's not contained here, so somewhere along this sequence I will have the situation that it's contained here and not contained in the next one. And that's already it, it's actually a very elegant proof. This means nothing else than that it's in A i minus one without A i, and that's one of the sets here in the union of all these individual differences from one to the next, and that's the proof. Now that's the kind of proof, I can't emphasize it enough, that you have to do for yourself to appreciate it. And maybe what's even better: first forget everything I've just shown to you and try to prove it yourself. Try to prove this lemma for yourself, that's I think the best way to learn. Okay, there's a relation between similarity of the sets of q-grams and the edit distance. How would I go about proving this? Maybe you'll find a better proof, then tell us.
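For reference, the argument just given can be written out compactly. This is a summary in LaTeX of the steps from the lecture; the symbols x_0, ..., x_k for the intermediate strings are notation introduced here, and the argument is stated for plain sets (for multisets, the lecture's remark that the same statement is needed applies).

```latex
Let $x = x_0, x_1, \dots, x_k = y$ be the intermediate strings of an optimal
sequence of $k = \mathrm{ED}(x, y)$ edit operations, and let
$A_i = Q_q(x_i)$, so $A_0 = Q_q(x)$ and $A_k = Q_q(y)$.

\textbf{Claim.}
\[
  A_0 \setminus A_k \;\subseteq\; \bigcup_{i=1}^{k} \bigl( A_{i-1} \setminus A_i \bigr).
\]

\textbf{Proof.} Let $z \in A_0 \setminus A_k$, that is, $z \in A_0$ and
$z \notin A_k$. Let $i$ be the smallest index with $z \notin A_i$; it exists
because $z \notin A_k$, and $i \geq 1$ because $z \in A_0$. Then
$z \in A_{i-1}$ and $z \notin A_i$, hence $z \in A_{i-1} \setminus A_i$,
which is one of the sets in the union. \qed

\textbf{Consequence.} Since one edit operation affects at most $q$ of the
$q$-grams, $|A_{i-1} \setminus A_i| \leq q$ for every $i$, and therefore
\[
  |Q_q(x) \setminus Q_q(y)|
  \;=\; |A_0 \setminus A_k|
  \;\leq\; \sum_{i=1}^{k} |A_{i-1} \setminus A_i|
  \;\leq\; k \cdot q
  \;=\; q \cdot \mathrm{ED}(x, y).
\]
```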
I think this is an elegant proof, but maybe there's an even better way. Okay, and now let's turn this into an algorithm. So what we have shown here: if the edit distance is small, then the sets of q-grams are similar. That you should have understood on the high level, and I think it's easy enough to understand on this high level. One question regarding slide 19: why is it stated for sets and not for multisets? Oh yeah, it's multisets. Yes, you're absolutely right, this should be multiset. We need it for multisets, but there's no difference in the argument. But one has to pay attention to the details. Thank you for asking this. So that's what we know now. Similar words have many q-grams in common, and we have proven a lemma about it. And here's a corollary now. Let's first just see whether it works and then see how it follows from the lemma on the previous slide. It's actually a simple corollary, just a small step from the lemma on the previous slide. Let's just check it first. So how many q-grams did they have in common? Let's just go back one slide. Where did we have these? What's the exact number? Let's just write this here, maybe in another color. Ah okay, the size of the intersection? I want to write a number here. Yeah? Six. Yeah, it's the six equal signs here. It's six, it's correct. So they're similar, they have six in common. So let's go back to the slide where we have the corollary. So we know that this is six, and according to the corollary, this is the max of, how long is this, one, two, three, four, five, six, seven, eight, nine, 10. We have already seen that. The other one is one shorter. Then minus q times the edit distance.
We are talking about 2-grams, so q is two, and the edit distance was two, so the maximum, 10, minus 4 is 6. And that's correct. So this corollary seems to be correct. It says the number of q-grams these two have in common is greater or equal to that. And now let's prove it. And let's first understand this. That's also easy, but let me just draw it. If I have two sets A and B, what is this saying? This is the intersection here, A intersected with B. Not the size of the intersection, but just the intersection. If I want the size of A without B, that I should maybe add again. It's the same picture that I already had. So this here is A without B. How do I get the size of A without B? Well, I just take the size of A, all the elements of A, and I subtract the number of elements in the intersection. I think that should be clear enough. That's just how the difference between two sets works: I take away all elements of B, and the ones that actually get removed are those in the intersection. And that's also where this comes from. Yes, please. I think there is a mistake in the maximum of 10 and nine. I think it should be the maximum of nine and eight, because it's the number of 2-grams. Oh yeah. Yeah, I was already surprised that there was no discrepancy, and wondering where the plus one vanished, thank you. It's nine and eight, right? Yes. But it's still correct, of course. So the bound is five. The actual number, six, is larger, and this is just saying it's at least that large. So this here, that's just by what we have just shown. The number of q-grams of x minus the size of the intersection, that's just the same as the size of this one, Q of x without Q of y. So this here is just a simple set equality. If I take away from the q-grams of x the intersection with the q-grams of y, I get the q-grams of x without the q-grams of y. And this here is just the lemma. It's just the lemma that we have shown, and maybe I write it here: this is the lemma.
And now the thing is, we can do the exact same thing in the other direction, because the edit distance is symmetric. We can also just say: if we take away from the q-grams of y the ones common to x and y, then we get all the q-grams of y without the q-grams of x. And the size of this is also at most q times the edit distance. I think it would be clearer here if I write it like this. So this also follows from the lemma, I'm just applying the lemma in the other direction. And this here is ED of x comma y. And now you just put these two together. You bring the intersection to the other side, and then you get that the size of the intersection is greater or equal to the size of Q q of x minus this here. So I get this once with Q q of x and once with Q q of y, the number of q-grams, and both inequalities hold. So the intersection is greater or equal to this, and greater or equal to this, each minus q times the edit distance, so it's greater or equal to the maximum. Okay, so if I put them together I get this corollary. Even if you didn't understand the math, and you absolutely have to do the math yourself at home, what does this say? It says how many q-grams two words have in common. And here it says: if the edit distance is small, so if they are similar, then this will be a small term that is subtracted here, and this will be a fairly large number, it's the number of q-grams of the longer string, right? So it just says they will have many q-grams in common. Let's look at a special case: they have the same number of 3-grams, let's say 10, and the edit distance is 1. Then this says they have at least 10 minus 3, so seven, 3-grams in common. And we have a slide about that. So that's what it says qualitatively. That was for edit distance, and here's another one for prefix edit distance. Here there's a q missing.
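Both bounds can be checked on the lecture's example words. The sketch below assumes the corollaries in the form |Q(x) ∩ Q(y)| ≥ max(|Q(x)|, |Q(y)|) − q · ED(x, y) and, for the prefix variant, |Q(x) ∩ Q(y)| ≥ |Q(x)| − q · PED(x, y), both without padding; the helper names `qgram_multiset` and `last_row` are invented for this illustration.

```python
from collections import Counter


def qgram_multiset(x: str, q: int) -> Counter:
    """Multiset of all consecutive substrings of length q of x (no padding)."""
    return Counter(x[i:i + q] for i in range(len(x) - q + 1))


def last_row(x: str, y: str) -> list:
    """Last row of the edit distance scheme: entry j is ED(x, y[:j])."""
    prev = list(range(len(y) + 1))
    for i in range(1, len(x) + 1):
        curr = [i]
        for j in range(1, len(y) + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev


q = 2

# Edit distance corollary on freiburger / breiurger.
x, y = "freiburger", "breiurger"
qx, qy = qgram_multiset(x, q), qgram_multiset(y, q)
ed = last_row(x, y)[-1]                      # bottom-right entry
common = sum((qx & qy).values())             # multiset intersection size
ed_bound = max(sum(qx.values()), sum(qy.values())) - q * ed
print(common, ">=", ed_bound)

# Prefix edit distance corollary on brei / freiburger.
p, w = "brei", "freiburger"
qp, qw = qgram_multiset(p, q), qgram_multiset(w, q)
ped = min(last_row(p, w))                    # minimum of the last row
common_prefix = sum((qp & qw).values())
ped_bound = sum(qp.values()) - q * ped
print(common_prefix, ">=", ped_bound)
```

Neither bound is an equality here, which matches the lecture's point that these are only lower bounds on the number of common q-grams.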
I have to add it. It's basically the same, except that we don't have the max here. Let's see if you can do this by heart. What's the size of this intersection, without going to the previous slide? How many 2-grams do these two have in common? Yeah? Two. Which are they? re and ei. Yeah, let's maybe write them down exactly. You just have to look at these here, the ones of the smaller word. And it should be burger here, right? It was not on purpose that I omitted the letters, but it's still correct. So br is not in there, re is in there, and ei is in there. So it's two. And what does the corollary say? It says the length of the first one, which is 4, minus, oh, and by the way, what's the prefix edit distance between the two? One, that's correct. Minus two times one, that's the prefix edit distance here. And it's two. So the corollary says these two, because the prefix edit distance is one, have at least two in common, and they actually have exactly two in common. And I'm not sure whether I will go through the details. It's basically the same proof as before, now you just have to make use of the fact that we have the prefix edit distance here. I think you have to do it yourself too. It will not help, I think, if I now explain in words what's written here. Just to continue and to be able to understand the last part: it's again saying something about how many q-grams I have in common. If the prefix edit distance is small, then I will have many q-grams in common with the word. And it makes sense, right? If I'm a perfect prefix of this one, then all my q-grams will also appear here. And if there are some small differences, then it will be a little bit less. That's what this says here. So intuitively, I think that's easy to understand. Just the math. Yes?
Sorry, what's the question? Oh, you are right, you are completely right. This shouldn't be four, it should be the number of 2-grams, which is three, right? Yeah, you're completely right. But now I have to draw. Shape outline, no outline. Some PowerPoint magic. You are completely right, thank you. Q2 of x, and x is brei. Okay, and now, yes? Oh yeah, I need another box, you're completely right, three minus two, it's one. Yes, but now we know how to do this. Again, no outline. I'm sorry. And now it's one. Maybe it would have been quicker to delete everything. Okay, now maybe you didn't understand all the math, but the algorithm is simple enough to understand. Let's go back to the high level, to our problem. That's what we wanted, right? We have a big list of words that's called D, it's our dictionary, and we are given something, an x, and we want to find all the words in our dictionary which are similar, which means the edit distance is below a threshold. I will have the same for prefix edit distance in a second. The simple algorithm was to just compute the edit distance for every y, and now I can do the following. I can just look at how many q-grams they have in common. Do they have a sufficient number of q-grams in common? And I'm taking exactly my bound from the corollary, right? And if this is not so, then I know the edit distance can't be less or equal to delta. This is just by way of the corollary. And I don't think I will go through the details now. Just understand this qualitatively, right? We have shown that if the edit distance is small, then there are many q-grams in common. And we have shown precise bounds.
And for the exercise sheet, of course, you have to get it exactly right. Now I'm just checking how many q-grams these things have in common. If they don't have enough q-grams in common, I know there is no way that they can have a small edit distance. This I think is simple enough to understand. And now, that's important: if they have very few q-grams in common, then I know there's no way they can be similar. If they have enough q-grams in common, I still have to compute the exact value, because all these were just bounds. There's no formula of the kind, the edit distance is equal to the difference in the number of q-grams. We just know: if there are too few in common, the edit distance can't be small. If there are enough in common, I have to compute the edit distance to check whether it's really below that bound. So that's the algorithm which you will implement for the sheet, and it's exactly the same algorithm for the prefix edit distance, just with a different bound here. And for the exercise sheet you have to make sure you understand this, because of course if you make a small mistake here, plus one, minus one, then it doesn't work anymore. You will throw out things which should have been part of the solution. Is there any question? That's the algorithm now. I think that's simple enough. But we're not done yet. We still have to compute the number of q-grams in common, right? Oh, here's one more detail. We will actually pad the strings with dollars on both sides. So for example, for the edit distance, if we have 3-grams, we put dollars on both sides, and we put two dollars, not three dollars. And why does this work? Well, the edit distance between two strings with the padding or without the padding is the same, but this gives us more q-grams. And that's actually, oh, this should have come first here, this is the sentence which I wanted.
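The filter-then-verify idea described above can be sketched as follows, here in its naive form that still looks at every word of the dictionary. This is an illustration under assumptions, not the exercise-sheet solution: the function names are invented, padding with q − 1 dollar signs on both sides is applied as just described, and the bound used is max(|Q(x')|, |Q(y')|) − q · delta on the padded strings x' and y'.

```python
from collections import Counter


def padded_qgrams(x: str, q: int) -> Counter:
    """Multiset of q-grams of x padded with q - 1 dollar signs on both sides."""
    padded = "$" * (q - 1) + x + "$" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def edit_distance(x: str, y: str) -> int:
    """Standard dynamic-programming edit distance, keeping two rows."""
    prev = list(range(len(y) + 1))
    for i in range(1, len(x) + 1):
        curr = [i]
        for j in range(1, len(y) + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similar_words(x: str, dictionary: list, delta: int, q: int = 3) -> list:
    """Return (word, edit distance) for all words with ED(x, word) <= delta."""
    qx = padded_qgrams(x, q)
    result = []
    for y in dictionary:
        qy = padded_qgrams(y, q)
        common = sum((qx & qy).values())
        bound = max(sum(qx.values()), sum(qy.values())) - q * delta
        if common < bound:
            continue  # too few q-grams in common: ED(x, y) must exceed delta
        # Enough q-grams in common is only necessary, not sufficient:
        # the exact edit distance still has to be computed.
        ed = edit_distance(x, y)
        if ed <= delta:
            result.append((y, ed))
    return result


words = ["freiburg", "stuttgart", "freetown", "freibur"]  # hypothetical dictionary
print(similar_words("freiburk", words, 1))
```

In this form the number of common q-grams is still computed separately for every word; the expensive part that is saved is only the edit distance computation for words that fail the filter.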
It's an improvement because of the padding, and we will see this in the last part of the lecture. Unfortunately I am going a bit over time. I'm a bit slow today, I'm sorry. So for now just accept the padding. I will have an example. It's four more slides, and there will be an example, and then I will explain again why you have the padding. So for now you just add padding, and it's just one less than q. So if you have 3-grams it's two dollars, at the beginning and at the end. And for prefix edit distance you only add it at the beginning, and we will see in a second why. So here's the last part of the puzzle which you need for the exercise sheet. There was all this math, but the algorithm was actually quite simple. The algorithm was: just check for every word how many q-grams they have in common. How do I compute how many q-grams they have in common? Now of course I could take my input word, break it into q-grams, take every word from the dictionary, break it into q-grams, and compute the intersection, but that also looks pretty expensive. We somehow want to speed this up, and that's what the q-gram index is about. And here's what we do, and you will see how this gives us exactly what we need. We will compute an inverted index. So back to lecture one. But now it's not an inverted index for words, it's an inverted index for q-grams. And it's very easy to understand. So for example, this is a q-gram with padding, $fr, and since the padding is here on the left, this says nothing else than: a word containing this q-gram is just a word starting with fr. So here I have an example list: Frankfurt, Freiburg, Freetown, Fresno. Here's another q-gram, ibu, and here I just have the list of all the words containing this 3-gram.
So that's a 3-gram index. It's just an inverted index, just like the index we have seen in the first lecture. There we had for each word a list of the documents containing that word. Now we have for each possible q-gram the list of words containing that q-gram. And of course, like in the first lecture, where we didn't actually store the documents but had IDs for documents, for the q-gram index you will have the q-gram here, and you will give each word an ID, and then you will just have lists of IDs. You don't want the strings here or something like this. So that's a q-gram index. Okay, and this I still want to do. It will take a few minutes, but I want to do it anyway, just to show you, and it's also a preparation for the exercise sheet. Let me just copy the code from, I think it's lecture one. Let me call it qgram_index.py, and let's edit it together. And now I'm claiming that with very few changes I can do the exact same thing. Let's just do that. Okay, my index now should get a q here, an integer, and this is self.q equals q. Okay, and now build_from_file. My class is now called, I'm sorry for the overtime, but you had a whole week without me, QGramIndex. build_from_file, self, file_name, and I think I will leave out the unit test here. Build the q-gram index from the given file. Before we do this, we should definitely write a function that gives us the q-grams of a string x. Compute the set of q-grams of the given string. And let's do a type hint here, it's a string. So that would be qi equals new, no, we don't have new. QGramIndex, and we have to say which q we want. Let's maybe take two. Frank, we are a little bit late, I'm sorry. So, qi.qgrams of, what do we want? Okay, now we get fr.
Let me take the unpadded ones for what I'm doing now. fr, re, ei, ib, bu. I'm a bit slow today. I think it's the weather, I'm very sorry. Okay, now a for loop, for i in range, how many do I have? It's the length of the string minus, yeah, we already had that number, right? It should be this. And here I have my q-grams, and now I append. I just need the respective substring, I think, right? x, and I think it goes from i, starting from zero, and here it should probably be just i plus self.q. We will see in a second whether it's right, but you tell me if I make a mistake. Is this correct? Looks correct to me. I hope you can still focus for the last minutes. Okay, let's use this up here in build from file. I will remove the unit test here. So now I have record ID. I have my line in file. I will not take the actual file, I will just take this file here, which I've already extracted. For the exercise sheet you have to parse the real file, but here it's quite simple, each line is just a name. And for q-gram in q-grams of line.lower(), so I'm going over the q-grams, and I think that should be self.qgrams. Maybe I should do it like this. I don't know, the newline will probably be there. And I'm lowercasing this. So just like I broke the documents up into words, now I break the word up into q-grams. If the q-gram is not already in the inverted lists, then I create the inverted list for that q-gram. We did the same thing. And then I append. Maybe I shouldn't call it record ID here, but word ID. So I'm just giving the words IDs as I encounter them, 1, 2, 3, 4, and so on, like I did for the documents, and I'm starting with one here.
So it's very similar. I'm basically using the same algorithm, I'm just calling the things slightly differently. Does this work? Did you see a mistake? I hope not, so let's see if this works. It worked. This can't be, it's impossible. Let me make a mistake here. No, I'm sorry, I didn't make a mistake. I'm slow, but at least it's correct. Now we can do one thing. Remember, in lecture one we just output the sizes of all the inverted lists, so the frequencies of the words. We can do the same thing here. It's not called an inverted index here, it's now called a q-gram index, and now I just output the lengths of the inverted lists, and you can already think about which q-gram will be the most frequent one. And now that I think about it, I think we should add the padding, right? The padding was a dollar times, and we can just write it like this, q minus one. Let's just add the padding here. x padded is self.padding, then x, and the padding at the end. And this is x padded and x padded. And now we need to add a few, ah, we just need to add one, right? This one is $f, which stands for starts with f, and this here is g$. Oh, it's still correct, it's amazing. My coding is better than my speaking today. Okay, and now, word, yeah, I can name the variable however I want. What do you think is the most frequent q-gram in the collection, if it's a 3-gram? And here I need to pass a parameter. I want 3-grams. Okay, does this still work? python3 qgram_index.py. Okay, it wants a file name. So now it should compute all the q-grams and then give me the lengths of the inverted lists. i is not defined. Here it should be qi. Oh my, so much overtime. You need to stay five minutes longer, but then we're done.
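The index built in this live-coding part can be sketched roughly as follows. This is only a minimal reconstruction under assumptions, not the code from the lecture or the exercise sheet: the names `QGramIndex`, `qgrams` and `build_from_words` are illustrative, the words come from a Python list instead of a file, and padding is added on both sides as in the demo (for prefix search you would pad only on the left).

```python
class QGramIndex:
    """Toy q-gram index: maps each q-gram to the IDs of the words containing it."""

    def __init__(self, q):
        self.q = q
        self.padding = "$" * (q - 1)  # q - 1 padding characters
        self.words = []               # word with ID i is stored at position i - 1
        self.inverted_lists = {}      # q-gram -> list of word IDs

    def qgrams(self, word):
        """Return the q-grams of the padded word, in order of occurrence."""
        padded = self.padding + word + self.padding
        return [padded[i:i + self.q] for i in range(len(padded) - self.q + 1)]

    def build_from_words(self, words):
        """Assign IDs 1, 2, 3, ... to the words and fill the inverted lists."""
        for word in words:
            self.words.append(word)
            word_id = len(self.words)
            for qgram in self.qgrams(word.strip().lower()):
                self.inverted_lists.setdefault(qgram, []).append(word_id)


qi = QGramIndex(2)
qi.build_from_words(["Frankfurt", "Freiburg", "Freetown", "Fresno"])
# Print each q-gram with the length of its inverted list, most frequent first.
for qgram, ids in sorted(qi.inverted_lists.items(), key=lambda item: -len(item[1])):
    print(f"{qgram}\t{len(ids)}")
```

Note that a word containing the same q-gram twice ends up twice in that q-gram's list, just as a document containing a word twice did in the simple inverted index from lecture one.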
Okay, these are all the q-grams and how often they occur, right? So this one is simple, it occurs in two words. I'm just showing the length here, but what I can do now is sort by frequency. So I take the tabulator as a separator, and then I sort. And now I get them sorted by frequency. It's now 3-grams, right? And this is words ending with a. So that's the most frequent. The most frequent ones are ending with this, ending with this, starting with c. Now we have ion, okay, makes sense, right? These are frequent 3-grams in our words. And if I plotted this, which I won't do now, you would again see Zipf's law. And now come the final two slides, and I hope you can bear with me for these two more slides, but that way it's also on the recording. So obviously you want the q-grams of your input word, and you only have to compute them once. Then for each of the q-grams of the input word you fetch the inverted list, and then you compute the union of these inverted lists and count how often each word occurs. I know what I will do, I will just explain it by example, that's what I should do before everybody runs away. The previous slide was just the abstract description. So let's assume I type brei, and I want to find all words for which this is a fuzzy prefix match, with a 2-gram index, up to an error of one, so delta is one. So the first thing is I pad to the left, $brei, and then I get these 2-grams here: $b, br, re, ei. Now I fetch these inverted lists. The inverted list of $b is everything starting with a b, and so on. br is everything containing a br somewhere, so also Gibraltar and so on. Now I know, and these are the bounds which we previously computed, that if the prefix edit distance to a word is less than or equal to one, then they must have at least that many q-grams in common.
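The filtering step just described (fetch the inverted lists of the input's q-grams, merge them, count, and keep only words with enough q-grams in common) can be sketched like this. It is a simplified illustration, not the exercise-sheet solution: the function names are made up, the index stores words instead of word IDs for readability, the counting uses a `Counter` instead of a merge of sorted lists, and the threshold is the one from the lecture's example, number of q-grams of the input minus q times delta.

```python
from collections import Counter


def left_padded_qgrams(word, q):
    """q-grams of the word padded with q - 1 dollars on the left only."""
    padded = "$" * (q - 1) + word
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def build_index(words, q):
    """Map each q-gram to the words containing it (each word once per q-gram)."""
    index = {}
    for word in words:
        for qgram in set(left_padded_qgrams(word, q)):
            index.setdefault(qgram, []).append(word)
    return index


def candidates(x, delta, q, index):
    """Words sharing at least len(qgrams(x)) - q * delta q-grams with x."""
    qgrams_x = left_padded_qgrams(x, q)
    threshold = len(qgrams_x) - q * delta  # if this is <= 0, the filter is useless
    counts = Counter()
    for qgram in qgrams_x:
        counts.update(index.get(qgram, []))
    return sorted(word for word, count in counts.items() if count >= threshold)


index = build_index(["bangalore", "beijing", "freiburg", "gibraltar"], q=2)
print(candidates("brei", delta=1, q=2, index=index))
```

With these four words, gibraltar shares only br with the input and is filtered out, while the other three share two 2-grams each and remain as candidates for which the prefix edit distance still has to be computed.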
So the number of q-grams here is one, two, three, four. Four minus two times one. I should write that on top. This is what the lower bound gives me: four, my q is two, and my delta is one. So that's what my bound gives me. It's easy to understand now, don't worry. My bound just tells me that anything that has a chance of being similar must have at least two 2-grams in common. So what I do now is merge all these lists. What do I get if I merge all these lists? This merging of lists we give to you. I get all the words which have at least one q-gram in common, right? With any of these here, I mean, these are the inverted lists of the q-grams of my input word. And some words I will have several times. So when I'm computing the merge here I will get Bangalore twice, for example, I will get Beijing twice, and Freiburg is, I think, also here twice. So these are the only ones. All the other ones I only have once, so I don't have to check them, because I've established that I need at least two. So for these I actually have to check whether the prefix edit distance is at most one or not. And let's do that as a final thing together. So what's the prefix edit distance of brei and Bangalore? Hope you're still listening to me. It's three, right? So pretty bad. We wanted one. It's the last slide, I promise. No match. Why then did it have two 2-grams in common, if the prefix edit distance is so large? I mean, why was it even a candidate? Why couldn't I throw it out before? Can you see it? It's the last question I'm asking today. It starts with a b, that one it has in common, but it also has an re, right? But the re is at the end, in a position which is not relevant for the prefix match. So it was a candidate, but no match. What about brei and Beijing? What's the prefix edit distance? Two?
One, I think it's one. It's a match, yeah. brei and Beijing are actually more similar than you thought. So that's a match. So here it worked. So Bangalore was interesting, and brei and Freiburg is also one, right? That's also a match. And believe it or not, now we are done. Is there anything else? No, it's just the references. So the exercise sheet is to implement this. I'm very sorry for the overtime, but at least it's all on video now. I thank you very much for your attention and see you again next week. Thank you.
So, welcome everybody to lecture 8, databases and information systems, which can also be taken as information retrieval. Today's lecture is about web applications, part 1. But first we will talk about your experiences with the last exercise sheet, which was about fuzzy prefix search, and I have a slide about the infamous AYSA folder, and I'm curious what you will answer to my question. Then we will talk about how to build a search web application, with a lot of live coding. It's a very demanding lecture and I need your help. I hope we won't go over time. I don't think we will, but let's see. And the exercise sheet will be to build a web application. Actually it will be two exercise sheets now before Christmas. We timed it that way deliberately. So first a simpler one, a static application, and then the next exercise sheet, a dynamic application, which will sort of integrate everything we have done so far. And since there are always a few people who ask why we are doing web applications in a lecture about information systems and databases: well, I think basically every information system has a kind of web interface. Even your mail program, which is also a kind of information system, has a kind of web browser inside, with JavaScript and so on. So it's really a must to know this kind of technology when you build any kind of information system. First back to the previous sheet, which was about fuzzy search.
Here are some quotes. Nice lecture and well-crafted sheet. So, very sorry for the overtime. I have to see about that for next time. It was a little much: explaining the edit distance, explaining the mathematics, and then how you do it. So much interesting stuff, but I think it's a bit too much for one lecture. I have to see about that for next year. Nice lecture and well-crafted sheet about an interesting topic. Many of you said it's a very interesting topic. Very enjoyable and instructive exercise. Took around 10 hours. So as usual, there's a wide spectrum of how much time people take for this, depending on your experience and on a lot of other things. Less work than previous sheets, took about four hours. So you see the spread. My first implementation had quadratic runtime. Judging from the comments on the forum and also in the feedback, several of you, I think, had the problem that even for the small data set it took very, very long, and that's typically an indication of something quadratic going on, or worse. Would be more exciting to implement prefix edit distance myself. I agree, it used to be part of the sheet, but then it would be even more work. Every year we provide more and it's still challenging. I think it's cool to get to know how fuzzy search works. I agree. The mathematics behind this is still confusing to me. Yeah, please look at it for the exam, it's important. I think it wouldn't be okay to just tell you, here's the algorithm, use this formula, then it works. We have to explain the mathematics, right? It's not deep mathematics, but you have to understand what's going on. It's this interesting kind of mathematics which is very close to the algorithm. You just need this formula for when you actually need to compute the PED and when you can filter a word out by just looking at the q-grams. You just need that formula, and so you just have to do the math.
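For reference, the prefix edit distance (PED) that the q-gram filter tries to avoid computing can be written as a short dynamic program. This is a plain textbook-style sketch under the usual definition (the minimum edit distance between x and any prefix of y, with insertions, deletions and substitutions at cost one), not the optimized function that was provided with the sheet; the name `prefix_edit_distance` is just for illustration.

```python
def prefix_edit_distance(x, y):
    """Minimum edit distance between x and any prefix of y."""
    # previous[j] is the edit distance between the first i - 1 letters of x
    # and the first j letters of y; for the empty prefix of x it is just j.
    previous = list(range(len(y) + 1))
    for i in range(1, len(x) + 1):
        current = [i] + [0] * len(y)
        for j in range(1, len(y) + 1):
            substitution_cost = 0 if x[i - 1] == y[j - 1] else 1
            current[j] = min(previous[j] + 1,           # delete x[i - 1]
                             current[j - 1] + 1,        # insert y[j - 1]
                             previous[j - 1] + substitution_cost)
        previous = current
    # The last row holds ED(x, y[:j]) for every prefix length j.
    return min(previous)


for word in ["bangalore", "beijing", "freiburg"]:
    print(word, prefix_edit_distance("brei", word))
```

This version always fills the whole table, which costs time proportional to the product of the two lengths. That is exactly why a good filter formula matters: every candidate that the q-gram count rules out saves one such computation.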
It's actually a great example of the kind of math you very often need in programming. You just need the formula for when you have to compute the PED, and the better the formula, the fewer computations you have to do. So the math here is really important. Tricky one, nice challenge as usual, but also a bit much. Some of you wrote that, not all of you. A little complicated to understand. Yes, it was not easy. Okay, we will do a lot of coding, but maybe first let's go back to the master solution. This was the file we gave you. If you wanted, you could mostly only look at the first column, which was just a lot of entities from Wikidata. And you will also be working with that again. Let's just look, yeah, about 1.6 million. And let's just try it. The font is a little smaller today because otherwise it won't work with the coding we will do in the following. I hope you can still see it all right in the back rows, but you still have young eyes. For the video it works fine, I tested it. So you can see it's a not so small file, it takes some time. What you see here is Python. Python is a convenient language, but just very slow. So in the meantime, how big is this? Solutions, sheet eight. No, sheet eight is not there yet. I'm sorry, you have to do it. Ah, there it is. Yeah, it's 269 megabytes. That's not huge. Terabytes is huge. Here it is. Let's type something. Let's type Freiburg. Okay, that was too many. Let's type Arnold Schwarzenegger, didn't even make a mistake, okay. Let's type anything, I don't know, Forrest Gump. Okay, I'm making too many mistakes here. Forestry, okay, Forest Grove, Forrest Gump, okay. Freiburg im Breisgau, it's really fast because the PED computation was done in, so yeah, let me maybe type Freiburg im, okay. This works, so I can certainly make some mistakes.
Frib, that's too many mistakes. Fribu. The longer the word, the more mistakes are allowed. Fribourg. Ah, that's interesting. A transposition counts as edit distance two, so, don't we allow two errors here? I'm a bit surprised that we don't get Freiburg im Breisgau here. I don't see anything. Ah, I see, it's because PED one comes first. That's the reason. Okay, it's interesting. The ranking was such that you first get the hits with a prefix edit distance of one, even if they are less popular. So anyway, you see that it works, and you can play around with it for the next exercise sheet, because you will also use this, as we will see in a second. But it's pretty fast and it works. If you don't make any mistakes, it's just there. So pretty cool that you could do this in one exercise sheet. Okay, let me just go back to the folder where we will be in a second. The AYSA folder, are you still active. It was actually very hard to get all of you to react, and now I'm interested in what you will tell me why. We sent an official announcement. There's this forum, Official, and we said in the first lecture that you have to subscribe, because that's where the official announcements go. The topic was prefixed with IMPORTANT in capital letters. It was sent on November 15th, and there was also a mail via HISinOne, the campus management system. And there was a reminder on November 24th or maybe 27th, I'm not sure, doesn't matter. And then on December 6th I sent another reminder with the prefix VERY IMPORTANT in capital letters, and I put the word exam in there. So I thought maybe it's a good idea to put in the word exam. Again via Daphne and via HISinOne. Before this last reminder, which was the third one, about 100 people had done this, and we were a bit suspicious. Is it really only about 100 people who will take the exam? Because there were over 300.
I mean, initially always a lot of people sign up, and then it usually reduces by about half, or not quite half. So 100 looked a little bit small to us. One day after this one it was 200, which means there are 100 people who needed this third reminder with VERY IMPORTANT in all caps and exam. And we were trying to understand why that is. Is it that these 100 people are basically not following, and you just have to take extreme measures? I mean, we are living in times where you have to take extreme measures to get people's attention, but still. Okay, here I get some feedback. You only got the mail on the 6th of December. The first mail never arrived. There should have been two mails before, okay. Interesting. Here some of you wrote something about this, even though there was no question. I almost never look in the forum. That's of course a mistake, because official announcements are official announcements. I agree, I thought about this, I think I forgot to mention it on the sheet, or in the lecture. And it has a build error on Daphne. Yeah, that's correct. We did it like an exercise sheet so that we can see on Daphne who did this. Okay, anyway, interesting. So now on to the topic of the lecture, it's a quarter. Yeah, let's just dive right in. Oh, you have a comment about this one? Yes, please. Yeah, I also just got the last email, and I never want to go to the forum. Okay, but you didn't subscribe to the forum, right? Because it's for sure that this was posted on the forum. No, I don't want to go to the forum, but the email is posted. Yeah, but if you subscribe to the forum, I mean, there are two issues here. Let me just clarify that. I don't know if it works. Thank you for the feedback. This is the forum here. And if you subscribe, so here's IMPORTANT, VERY IMPORTANT, here you see fire. What, what exam?
So if you subscribe to this forum, you get an email. So if you didn't get an email, two things happened. You didn't subscribe to the forum, that's for sure. And the HISinOne mail somehow didn't work, which we sent twice. Anyway, interesting experience. Let's move on to the lecture. Thank you for your feedback. So, web applications. We are going to build a web application today in the lecture, a very simple one, but even building a very simple web application requires understanding a lot of things. Simple things, but a lot of things, and when they interact it's not so simple anymore. Super exciting topic if you ask me, and an absolute must to know, not just for information systems but for computer science in general. Everything has a web application in some form. The basis is socket communication. What is socket communication? First a few slides about what it is, and then we will dive right into the coding. In socket communication you have two programs, processes, talking to each other. That's obviously important when you build a web service. Typically they are on different machines, but they can be on the same machine. Thank you, Frank. For a typical web application, it looks like this. You have your browser here. You type something in the browser, for example this, and now the browser talks to our machine where this wiki is hosted, gets the web page, and displays it. So here you have a browser running on this machine right here, talking to the web server in our group, which returns this page. There are static web pages, we will do that today. And then there are dynamic web pages, where on the web page you have code which is executed and also communicates with other machines. That will be the next lecture. So you have this machine-to-machine communication with two endpoints. It's like two people talking, then the people are the endpoints. The endpoints are called sockets.
On each machine you have a socket. Each socket belongs to a particular machine and also has a unique ID, because that machine can talk to a lot of things at the same time. Unlike humans, who are not really good at multitasking. And that unique ID for a particular channel is called a port. So you have a machine, called a host in this context, and a port. And there can be many different channels open at the same time, so you have many different ports. And as we will see, a port just has an integer ID, like 8512. Here's a very basic protocol, which we will also use today, and which is also how the basic web application protocol works. On the server side, so you have a server, for example the machine hosting our wiki pages. There you create a socket, like an endpoint, so that people can talk to you, on a certain port. So: you can call me under this number on that machine. And then you just listen on that port. You're sitting beside the phone and waiting for someone to call. If someone calls, you hear what they want. You do your work: they want something, you have to fetch a file or compute something, and you send it back. Then you wait for the next call. How does it look on the client side? The client is, for example, the web browser. The browser wants this page from our machine and calls up the machine, so it needs to know the machine and the port. The client then also gets a port, automatically. If the browser calls someone, it is not important which port it uses itself. It just needs to tell the server that port, and this is done by the operating system, as we will see in a second. Then it sends its request, waits for the result from the server, and does something with it. Protocols can be much more complex, much more dynamic, but that's kind of the simplest one.
You call somewhere, you say this is what I want, and the other side takes the call, does something, gives you the result, over, next. That's what we will do today, and that's also how web applications, basically most of the time that's enough. So we have the server side, the one doing the work and the client side who wants something. All programming languages have standard libraries for this kind of communication. For the server and for the client in Python the library is called socket. In Java server socket in C++ boost ASIO for asynchronous IO. It's not in the standard. It's pretty complicated stuff, all this network stuff. We will see a little bit of it. We will provide for the exercise sheet today, maybe let me show that to you already. I did some preparation. So that's the kind of code you will get for the exercise sheet. I will write a lot of additional code with your help today, but then I will not give it to you because if I give it to you, you can copy it and you have half of the exercise sheet done. And let me address two issues here. First, why don't we give you that code? And why can't you just use HTTP library in Python, right? There's a library for building web applications. And with that, 80% of the exercise sheet would be done. But then you don't learn anything, right? To really learn how this works on the lower level, and that's the point of university, right? Understanding how things really work. You have to do this yourself at a lower level. Not at the lowest level but at the level I'm going to explain in a second. So it's really important, there are also always exam questions about this, really important that you understand this. Of course in the future, if you have understood the low level then you use frameworks but for the exercise sheet, you need to do it yourself. So that's what's written here. So we will just provide this very basic, it also in the file, I will just explain it on the slide. 
So yeah, you create, you use the socket library, we will do it in Python. You say I want a socket, you have to say okay internet. You have to set some options. This here is important when you start the program again, that you can just reuse the same port again, otherwise it takes some time for the operating systems to clean it up. These are a lot of annoying things, small details, if you don't do them, you have a lot, so many small details, we just give them to you. This is important, I mean, you're running on your local machine, this means basically anyone can connect no matter how they call you up. You have an IP address, a telephone number, machine also has a name, different ways to call a machine, name, IP address, many IP addresses, many names. This basically says however you call me I will react. It's also an annoying detail if you don't get it right. The port, I just explained it and then you say okay I'm ready listening. Then this is how you accept the connection. This just waits in the standard mode. This is blocking. I'm just sitting near the phone and waiting. And if someone called me up then I have a connection object now with which I can work and the address of the client, IP address of the client. I will not explain a lot of network, I mean there are layers, many layers below this, all the low level network stuff. This is not a network lecture, we will not talk about that. That's also important. Maybe someone calls you and then there is no one on the other side, just heavy breathing or no breathing or I don't know what. Some I have to say, okay, if nothing happens for five seconds, I will move on to the next. Especially important for our code, we will not do multi-threading today, right? Normally what you would do, a lot of people are asking you and then you have like 10 threads waiting for requests or if you have a request and you work on it in a separate thread, you're ready to work on the next one. 
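The setup steps from the slide can be sketched with Python's standard `socket` module roughly as follows. This is a hedged illustration of the steps named above, not the file handed out with the exercise sheet: the function names, the backlog value of 5 and the five-second default are assumptions made for this sketch.

```python
import socket


def create_server_socket(port):
    """Create a TCP socket that listens on the given port on all addresses."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Allow restarting the program on the same port right away.
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # "0.0.0.0" means: react no matter under which name or address we are called.
    server.bind(("0.0.0.0", port))
    server.listen(5)
    return server


def accept_one(server, timeout_seconds=5.0):
    """Block until a client calls, then return (connection, client_address)."""
    connection, client_address = server.accept()
    # Without a timeout, a silent client would block a single-threaded server.
    connection.settimeout(timeout_seconds)
    return connection, client_address
```

A server would call `create_server_socket` once and then call `accept_one` in a loop. The returned `client_address` is a pair of the client's IP address and the port that the client's operating system picked for it.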
We will just do it single-threaded, one after the other, which means if someone is not talking to you, we have to somehow cancel this after some time, otherwise your whole server is, it does not go on, somebody calls you, doesn't say anything, and then nobody can talk to you anymore. We will send that and see that in a second. I mean, then you read the request, then you do something, you send back your answer, and you close the connection. That's the basic server loop, we'll implement it in a second. This one is important, let me say it already now, remember it, we will implement it in a second. When you listen to what the other side is saying, it's network protocol, they are sending to you, maybe they are saying something, it's a bit think of it like a phone call, they are saying something, then they are saying something more, when do you know that they are finished, right? You somehow need to, there's no universal end of message. When you communicate with people, it's kind of when somebody makes a longer break, you know that now they are finished. Usually they don't say end of message, now you can speak. We have some cultural norms for this. In the computer world, we have to agree on something. But it's important that you not just read once, but you say, okay, is there more? Is there more? Is there more? We will see it in a second. And on the client side, yeah we will not implement the client side just the server and we will see that. Let's start coding and let's see how that works. Okay here's the, I have a lot of windows open here and you will see in a second why. Okay, so let's do the server loop. Okay. Yeah, so the server loop is wait for requests and process them. And we do that, that's the typical server loop, it's kind of running forever. So I'm a server forever, I will just, yeah, so let's just wait for an incoming, wait for incoming request, and we have already seen that on the slide. 
Let's just write that we are waiting, let me maybe put that at the top. Waiting for incoming request on port, and I think I want a formatted string here, that's a bit nicer. Yeah, let's put a space here. So I'm just waiting, and how do I do that? That was written on the slide. I get the client connection, or socket, let's call it socket, and the client address, from accept. This is now blocking and waiting for someone to call. Okay, and if somebody calls, let's just say okay, we received a request from this address. Okay, great. Let's already run that. And how do I start my server? It's called searchserver.py, that will be on the right. Let's just see. I just have to specify the port. That's where I can be contacted. Let's just take 8080. Okay, waiting for incoming request. I think I want a new line here, so that it's a bit nicer, we will see this a lot. So I'm waiting. How do I call this machine now? I have a third window here, and one very simple program, it's on this slide. Just for testing, and that will also be useful for the exercise sheets: a very old program, telnet. Telnet is just: pick up the receiver and call this machine, the machine so and so on this port. Let's just do that. So I'm just doing telnet. The machine is called Tura. I'm also on Tura, but I don't have to be on Tura. Let's go to some other machine here. Let me go to Indus, we have a lot of machines. Let me call up Tura on port 8080. Ring ring. Okay. So: received request, and you see this 108152. This is telnet. Let's maybe briefly exit telnet again. It's not so easy to exit telnet, it's one of the hardest problems, how to exit certain programs, but it tells me what the escape character is.
If I take this IP address here, that should be, I think I have to ask for the host name here. Yeah, that's Indus, that's a machine. So a machine has a name and an IP address. And you see the machine from which I was calling also got a port. That's the port now on Indus, 55700. Okay, let's call again. So, let me just type hello, hello. A lot of things can go wrong in communication between humans, and also between machines, right? The server has already moved on. We see it here, it's waiting for the next request, because we haven't implemented the rest yet. So nothing is happening here. I can talk all I want, nothing is happening. So let's just continue with our code. A lot of things can go wrong, right? The client can ring up and then the server just moves on, or died, or whatever. Or the client just hangs up again, or starts a little bit and then something goes wrong, or it takes a long time. Okay, so now we read bytes from the client in rounds. This is what I said earlier, it's very important, and here's another very important slide. Bytes versus strings. This is very important for the sheet and also for what we are doing now. The data sent via the network is always bytes. It's just bytes. In Python, how do you deal with bytes? There's bytes and bytearray. bytes is immutable, you can't change it. We don't want to change our objects here. We receive something and then we don't want to change it. So we can just use bytes, we don't need bytearray. But usually you want strings. What's the difference between strings and bytes? Well, here's an example. Knödel is a string, right? The ö is a German umlaut, and how do you represent it in bytes? For example in UTF-8, and we will talk in the next lecture about how you encode funny characters in bytes. In UTF-8 the ö actually is two bytes, the bytes with the hex codes C3 and B6.
And in Python you write a bytes object just like a string, but prefixed with a b. So Knödel has six letters, but seven bytes, right? That's very important. Great source of errors. And here is the other way around. The functions in Python, and you need them, are called encode and decode: turning a string into bytes, or turning bytes into a string. So here I have two bytes, which are just the German umlaut ö, same as here, and if I decode them, I have to say which encoding I am using. A big part of the next lecture will be about UTF-8, because it's super important. Every information system somehow has to encode text. So you have to say how you encode all these funny characters in bytes. Here, if I use this encoding scheme, I get back a normal string. So this is a string with one character, length one. This is a bytes object, a sequence of bytes, of length two. And you need to convert between the two. All the network communication is bytes, what we actually work with is strings. So here we are. My request is just, let me, it's something like this. And while, yeah, while what? Now we have to read the next part of the request. It's just a receive function, and let me do extend here. If you want to append to a byte array, I think it's called extend. And then you have to say how much you will read at most in one batch. But it's very important that we agree on a protocol. How long do we read? When do we say, okay, now I'm done? Let's do the following. Let's just wait for a new line. Let's use find, and yeah, let's do that. And we will see that in a second. This here, that's carriage return and new line. That's how Windows encodes a new line, as two characters.
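The bytes-versus-strings point from the slide can be checked directly in Python. The snippet below only uses the built-in `str.encode` and `bytes.decode`. The ö is written as the escape `\u00f6` so that the example does not depend on how this text itself is encoded.

```python
word = "Kn\u00f6del"             # the string Knödel, six characters
encoded = word.encode("utf-8")   # str -> bytes
print(len(word), len(encoded))   # 6 7
print(encoded)                   # b'Kn\xc3\xb6del'

umlaut_bytes = b"\xc3\xb6"       # two bytes
umlaut = umlaut_bytes.decode("utf-8")  # bytes -> str
print(len(umlaut_bytes), len(umlaut))  # 2 1
```

Everything received from a socket is of type `bytes`, so the request has to be decoded before it is treated as text, and the answer has to be encoded before it is sent back.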
Carriage return comes from the typewriter world, where the carriage on the typewriter goes back to the left, and then new line is go to the next line. That's why it's two symbols, in case you didn't know, right? On the typewriter you have this carriage, you have typed, and now it's over here. You have to get it back and onto the new line. \r is the carriage return to the first column, \n is the new line, next row. And that's also what is used on the web. So that's one new line and that's another new line. So this is basically saying: empty line. So let's just read until I find an empty line. Okay, and now let's convert that. I want to decode it as UTF-8, and let's just print it. Request was: request. Okay, let's see where that goes. Let's just start our server here again. And where's our telnet? Here. Okay, let's call again. Okay, what did we do wrong? 'bytes' object has no attribute 'extend'. It doesn't have extend. Then what do I do? Is it plus-equal? Let me try plus-equal. Okay, this seems to work now. Let me call again, okay, let's say hello. Our protocol was: it's receiving and waiting until I send an empty line. Okay, are you talking to me? And now a new line. Okay, that worked, right? Request was: hello, are you talking to me? Everything until I typed the empty line. And now again you see, this connection here is closed, the server moved on, but here it hasn't moved on, yeah? Here is my request. Yeah, it's not talking to me anymore, right? Somehow we didn't agree on the protocol. So here it already moved on. Okay. So we have settled that. Let's go on to the next stage. Now we have a basic protocol established. So far we used Telnet, but actually a browser is also a kind of program that calls up other machines. So this is running on Tura.
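The reading-in-rounds protocol from above (receive batches, append with plus-equal, stop at the empty line) could look roughly like the following. This is a sketch, not the code typed in the lecture: the function name `read_request`, the parameter `batch_size`, and the extra check for a client that hangs up are assumptions added here.

```python
import socket


def read_request(connection, batch_size=1024):
    """Read from the connection in rounds until an empty line arrives."""
    request = b""                          # bytes is immutable, += builds a new object
    while b"\r\n\r\n" not in request:      # \r\n ends a line, \r\n\r\n is an empty line
        batch = connection.recv(batch_size)
        if not batch:                      # recv returns b"" when the client hung up
            break
        request += batch
    return request.decode("utf-8")         # from here on we work with a str


if __name__ == "__main__":
    # Try it without a network: socketpair gives two connected sockets.
    client, server_side = socket.socketpair()
    client.sendall(b"hello, are you talking to me?\r\n\r\n")
    print(repr(read_request(server_side)))
    client.close()
    server_side.close()
```

Without the check for an empty batch, a client that disconnects early would leave the loop spinning, which is one of the many things that can go wrong here.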
This is now internal, so I can just write Tura here, because it's in our network. Let me just write Tura:8080 here. When you type in the browser the name of a machine or an IP address, colon, 8080, what you're saying is: call this machine on this port. Let's just do this. And you see something is happening here. So this is just the same thing, right? Now it's not Telnet that was calling me, but the browser. Tura:8080 just says call the machine. It's like telnet Tura 8080. And you see, it sends something to my machine here. It sent this whole thing, we'll explain it in a second. The important line for us is the first one. We will ignore everything else here, that's all kinds of header information, additional information, cookies, a lot of interesting things to understand. I think we will mostly ignore it today. The first line is important for us. What does a web address typically look like? Typically I have something here, right? Machine, port. Usually you don't specify the port. If you don't specify it, it's implicitly 80, right? If we look up Q80 on Wikidata, that's Tim Berners-Lee, and port 80 is the port of HTTP. But here we specify the port with a colon. If we send this, what do we get in the first line? Now we get GET, right? So that's what a web address basically is: it's the machine, it's the port, and then some information I send. So this is now sending GET, and then: do something with this, please. It's telling the server, do something with this string. And here, that's the name of the protocol. We'll talk about it in a second. And you see, why is it still, what's going on here? Why is this going back and forth like ping pong here, and nothing is happening? Yeah? Yeah, it's not responding, right? It's easy when it's explained like here, but when you're doing it yourself, this can be hard to figure out, right? My server already moved on. The browser has no idea.
The browser is on a different machine, right? It's on a different machine and still waiting for a response. It doesn't know that the server has moved on; the web browser doesn't see this window. So networking is simple in principle, but so many things can go wrong. Now let's talk a bit about HTTP before we continue with our implementation. We are talking about machines talking to each other in the context of a web application. The protocol there, how we talk to each other, is called HTTP. And it's a basic request-response protocol. We've already seen that: this is the host, the port, and then it sends something. If you add something like this, and typically you add pages, this will be sent to the server. And then the server sends back a result. Maybe let's first send back some result. Let's go right back to the coding. So what do we do now? Handle the request. And let's handle the request in a separate function. Yeah, I think that's a good idea. So we have a separate function here which handles the request, and this self is just because it's a method of the class, and it gets the request as argument. So this handles the request and returns a suitable response. Let's just formulate a response, and let's be polite but lazy: thank you very much for your request. Okay. That's it, that's our response. And we send it back as bytes. I think I want to encode it here. Response is equal to, I want to encode it, and then I don't need this. Okay, now I get my response.
I'm not sure if I need two lines here. So this handles the request. And now we somehow have to send the response. How do we do that? We saw it on the slides. It's just sendall with the response. That's correct. And now we should close the connection and move on to the next one, and that's just close. Now we have implemented the full server loop, in a very simple form already. Let's just see how that looks. So I'm reading the request in rounds until I get an empty line. And we already saw that's exactly how HTTP works. That's the convention used in HTTP: read until there is a completely empty line. So we already implemented that. And then just send back a stupid response. Let's just see how that works. Let's go back to our browser here and send the same thing, and bam. Okay, we get something here. Thank you very much for, oh, I'm sending back the whole request, okay. Let's fix that. Actually the request was not only this thing. I mean, we're really only interested in what's after the slash here, right? But we get this, we get GET and the protocol name and all these headers. Let's forget about all that, let's just extract this part here. Everything after the GET and before the protocol, and we also don't want the slash. Let's just implement that. So, extract the part. We are only interested in the first line, and there in the part after the GET and before the HTTP/1.1. Okay, let's see. We know that there's a new line, so let's just take the first line, and let's split by the space. Yeah, splitting by the space, that's very reasonable. That's one way to do it, there are many ways to do it. Let's just try that. And let's just print the request after this. Request... yeah, I don't know what it's called.
There's a name for the thing after the GET, but let's just print: request part after GET was blah. And maybe, yeah, let's put this in a try block. I think that's safer. You should do proper error handling. We don't have to do it all now. I mean, maybe I get a different request, we'll see that in a second. Maybe it doesn't have this format. Then of course the server shouldn't crash. It should output a proper error message and so on. I'm just explaining the basics here. Let's try the same thing. Oh yeah, this works. Request part after GET was... let me write it a little differently. GET request for. That's what it's called, what we receive is a GET request, so the standard is GET request: give me something. Let's just do that again. And we're already getting closer to something real, right? GET request for blah. So this is exactly the part after the slash here, I'm sending back an answer, and I see the answer here. So we already have a simple web application. Now some very interesting stuff. I now press F12. Let me see if it's on the next slide. Yeah, F12, the browser development console. You have to know this, not only for this exercise sheet but for the rest of your life, for everything. You just need to know that your browser has this development console. And here it is. I press F12, I open it. You can have it on the right or on the left or at the bottom. I have it at the bottom now. This is a huge tool, and a number of things in it will be interesting. The console we don't need now, the network tab will be the most interesting one. This is a browser and it's talking to other machines, that's what browsers do. And the network tab is just saying what's going on. Let's just do reload, and now we see, aha, it sent two GET requests. And we see a lot of information here, all of which we are going to understand.
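The extraction step from before (first line only, split by spaces, take the part after GET without the leading slash, all wrapped in a try block) might be sketched like this. The function name `extract_path` and the choice to return None for anything that is not a well-formed GET line are assumptions for illustration, not the lecture's exact code.

```python
def extract_path(request):
    """Return the part between 'GET /' and the protocol name, or None."""
    try:
        first_line = request.split("\r\n")[0]           # ignore all the headers
        method, target, protocol = first_line.split(" ")
    except ValueError:
        return None                                      # not three space-separated parts
    if method != "GET":
        return None                                      # we only handle GET requests
    # Drop the leading slash, so "/blah" becomes "blah".
    return target[1:] if target.startswith("/") else target


if __name__ == "__main__":
    print(extract_path("GET /blah HTTP/1.1\r\nHost: example\r\n\r\n"))   # blah
    print(extract_path("hello, are you talking to me?\r\n\r\n"))         # None
```

With this, a request that does not follow the expected format no longer crashes the server; the caller can turn a None into a proper error message.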
Why did it send two GET requests? Let's first look at the first one. So here it says, I sent to Tura 8080. It's maybe a bit small. Let me try to make it larger. Yeah, that's one size larger. So it sent a GET request to Tura on port 8080 with this. This is what we did. And here are the request headers. This is exactly what we saw when we didn't extract the first line. This is what it sent. GET, oh, it's HTTP 0.9. Wow, this is an old browser. HTTP 0.9. I'm surprised. Okay, but my code worked anyway, because it just splits off the HTTP part anyway. I can also see it in raw form, so it's really just sending that text. If I don't use the raw form, you see it's actually just key-value pairs, right? These headers. So it sends all kinds of key-value pairs, the meaning of some of which we will learn, but we can ignore most of them. And then there's a response, which we can see here, and that was the response. Thank you very much for dollar. Why is there a dollar? Did we send a dollar? There was a dollar in our code apparently, right? Handle the request. Oh yeah, there was a dollar here for some funny reason. Anyway, so that's the development console. We can see what's going on. Actually, we didn't do this properly. If you go back a few slides, the way we are supposed to do this, we are not just supposed to send some contents, we are supposed to send headers. Just like the request had headers, the response should also have headers in the HTTP protocol. And these are the minimal headers. One says: okay, this worked fine. Then one says: I'm going to send you a message, and it's of this form. We will see more about this in a second. And one says: it has this many bytes. So let's just do that in our code. We want to write a proper web server. So now we compute the headers. Let's start with, yeah, let's take HTTP 1, okay. Let's add the content length.
Okay, that's fine, and let's add text/plain. So these content types, let's maybe go back to the slides for a second. Content length is clear, it's the length in bytes. The content type: what are the content types? This comes from the mail world, I think, originally. Multipurpose Internet Mail Extensions. You send something via mail, you have an attachment, and you have to say what's in this attachment. It's just encoded as a sequence of bytes. What is it actually? So this used to be called MIME type, in the web context it's called content type, and the correct modern name is media type. What kind of thing is this? And the way you write it is always two parts separated by a slash. The first part says what it is in principle, and the second part is more specific. So here we have a text format which says this is just text, it's just considered as a string: plain. This is just text, like what I have sent so far. HTML we will see in a second, this is text with a special meaning. For example, program code is also text with a special meaning. Maybe you also want to send an image. Or code: application is all kinds of code. So you just have to know these. It's not hard, right? You just have to understand there's this convention for naming things. text/plain, this is what we want here. And okay, the headers should also end with an empty line, and I think we want to encode them as bytes. That's exactly right. And now we want to send back the headers and the response. So let's do that. So now I did the same thing, but I'm sending back headers as well. Will we see the headers here? Let's see if we run it again. No, I get the exact same thing here, right? Nothing changed. But when I look here now, I also get response headers.
Here are the response headers I sent: 47 and text/plain. I can also see it in raw. This is exactly what I sent, right? I just sent headers, an empty line, and then the actual content. And again, note, I need some convention so that the other side knows when it's done. That's why I sent the content length, right? Otherwise, the other side doesn't know: will there be more, will there be more? For the request I did it like this, that I just waited for an empty line. I can't do it like that for the response. The response could be anything, empty lines could be part of the data. So the way it's done there is like saying: okay, I will send you some headers, there are a lot of different headers. Then I will send an empty line, and one of the headers will say exactly how many bytes I'm going to send, and then I send these bytes. So these are the bytes and this is the header, and here it only displays the bytes, and in this case it's text, so everything is fine. And here it also says the OK. And by the way, I think it's totally arbitrary what you write here. I can also write Alles supi. So it's really very simple, this HTTP. Let's just see whether that works as well. Okay, now it says Alles supi. Yeah, you see that it actually works. It doesn't really care about the string, right? The 200 is important. That's really interesting if you do it on this level yourself, actually running it and seeing what's happening here. That's the way you really understand this stuff, right? Not if you use an HTTP package for Python. And so far everything worked fine. There are a lot of things which can go wrong. And these things in particular you have to understand.
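Putting the response format together: a status line, the two headers, an empty line, then exactly as many bytes as the content length announced. The helper below is a sketch with hypothetical names (`build_response` and its parameters); it only assumes what was said above, namely that the length is counted in bytes and that the text after the status code is free to choose.

```python
def build_response(body, status_code=200, reason="OK", media_type="text/plain"):
    """Return headers, an empty line, and the body as one bytes object."""
    headers = (
        f"HTTP/1.1 {status_code} {reason}\r\n"
        f"Content-Length: {len(body)}\r\n"     # body is bytes, so this counts bytes
        f"Content-Type: {media_type}\r\n"
        "\r\n"                                  # the empty line that ends the headers
    )
    return headers.encode("utf-8") + body       # bytes + bytes, never str + bytes


if __name__ == "__main__":
    body = "Thank you very much for Knödel".encode("utf-8")
    print(build_response(body, reason="Alles supi"))
```

Note that the length is taken after encoding: for a body containing an ö, `len` of the string would be one too small, and the browser would cut off the last byte.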
Let's maybe look at why it is sending this favicon request. Okay, oh, it's also getting something. So favicon, you're welcome to implement this. Let's go to Wikidata. You see this little picture here? Browsers always like to show little pictures here on the tabs. Some people have many tabs open and it's good to know what's running on which tab. So you have these little icons here, and where does the web browser know this icon from? Well, it's asking the server to provide the icon. So it's always asking for favicon.ico. Right now my server is returning something for it, but, if I go back here, I'm saying thank you very much for favicon.ico, so it's not exactly an icon, and that's why I don't have a little icon here. Feel free to do it for the exercise sheet: also return an icon to that request, and then you will have a nice icon here. So let's see what else we do before the break. Let's go through the slides once more. So we did this, we received the GET request, we sent back our reply with headers and then the contents. This I already said: HTTP wants new lines encoded like this. We only do GET requests. There are many other request types. POST requests, this I think is worth talking about for one sentence. With a GET request, you're writing everything in the URL, so it's for short requests, things which fit into a URL. What if the browser wants to send one gigabyte to the server? This doesn't work with the URL. There's actually a limit here. This you do with a POST request, and then you can send arbitrarily large data, but with a slightly different protocol. So request types, we will not talk about them more here. Then there are many more status codes for the result. We just sent back 200 OK. There's also not found, forbidden and so on.
You should implement at least 404 and 403. And we will implement one of them, I think, in a second. We already had the media types, we already had the development console, and we will see more of it. Today we will only look at the network tab, and the console will also be very important for the next lecture. Okay, and I think that's a very good point to make a brief break, and we resume in five minutes. So let's continue. So far time management was perfect. Let's not spoil it for the rest. As you can see, we are already past the middle. That was the hard part actually. And please do ask if you have any questions. Let's go back to the slides. So what have we done? Quick recap: we have a server here, we can start it, and in the web browser we can now talk to it. It reads the request, it gets two requests, the actual one and one for the icon, which we don't provide. We return the answer, we do it in proper format, the browser displays it, but so far it's meaningless. And for testing we can also do this once more with telnet. Ah, interesting, this doesn't work. So you see, I didn't do proper error handling here, right? I was expecting a GET request. Let's maybe do that once more. If I stick to the protocol, it works. I can also send a GET request here: GET blah blah HTTP 6.5. Now it works, right? And now I get the reply. So I can play web browser here in telnet, right? I just have to send the request in the right format, the one my server wants, and now I receive this result. So, not bad. And now, HTML. HTML is for when I want to display something in the browser that is a bit nicer than just text. That's what HTML is about. It's a very simple language, XML-like, so it has these tags in angle brackets. We already know them from SPARQL. And let's just write a very simple web page now.
There are some basic elements. There's the head section, which has meta information, and body is the actual contents of the page. And html all around it. In the head you can have things like a style sheet, we will talk about it in a second, maybe you want to give the page nice colors or fonts. Or code, we will talk about that in the next lecture. And in the body section, the actual contents, you can have things like these. It's semantic markup, right? You don't say show this in this font size from this font family, you say this is a first-level heading, this is a paragraph, this is an input field. So these tags are semantic information, not low-level formatting information. Let's just start by writing something. It's not hard, but it's more stuff now. We now create a different file, search.html. Now I'm creating an HTML page. So I'm starting with html. Here I have a head. Let's just use an empty head for now. And a body, and now let's do a first-level heading, My first search engine. Okay, h1, and then I finish the body, and then html, done. So that's very simple. Okay, and now we want to deliver that with our web server. How do we do that? It's called search.html, and that's the typical way the web works: here I want to type search.html. I mean, I could do it now, and it says thank you very much for search.html. That's how I programmed it so far. What I actually want my web server to do, and that's the typical mode, is: please look up whether you have this file on your machine and then serve it to me. That's how the web worked in the beginning: asking for static files, and then they were served. So I have this file here on my machine, and let's just extend our code to do that.
So we are in the handle request function here, and let's interpret the request as a file name, read the file if it exists, and return its contents as the response. If we have done that, it's a real web server. Okay, so how do we do that? I think we should start by writing a try block. Okay, yeah. So we try to open the file. We just take this as a file name, and maybe call it file name instead of request, I think that's more meaningful. Yeah, so I'm just reading the contents, okay. And file read, will it return the contents as bytes? I'm not sure. And if the file was not found, what do I do? Response is: file not found. Okay, let's just do that. Let's just try it and see. So now I try to read this file. If it exists, I read it, and that's my response. Let's just see how that works. Let's go bit by bit, that's also a good way. Yeah, and now it crashed. Why did it crash? Can't concat str to bytes, that's the error message. You tell me what's going wrong. We had one slide on this, very important, yes? Yeah, exactly. So apparently here we get a string as the result, and we said we send this back as bytes, right? So this is really important, but you usually get good error messages. You can't just concatenate bytes and a string. So we just encode it here, and then this error should be gone. Let's go back and reload. Voilà! It's not exactly what we expected, right? Now we get the HTML as text. Okay, but we already know why. That's easy to fix. Let's do the following. Let's have a default status code and media type. So by default we want 200, let's say 200 OK, and we want text/plain, that's what we did. And let's go back to that slide where I had that. I think it was here.
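The str-versus-bytes crash when serving the file can also be avoided at the source. In the lecture the file contents are read as a string and then encoded; an alternative, sketched here as an assumption and not as what was typed, is to open the file in binary mode so that read returns bytes right away. The function name `read_file_as_bytes` is made up for this example.

```python
def read_file_as_bytes(file_name):
    """Return the file contents as bytes, or a short message if it is missing."""
    try:
        # "rb" = read binary: read() returns bytes, no encode step needed,
        # and it also works for files that are not text (images, icons).
        with open(file_name, "rb") as file:
            return file.read()
    except FileNotFoundError:
        return b"File not found"


if __name__ == "__main__":
    print(read_file_as_bytes("search.html"))
```

Either way, what is appended to the encoded headers must already be bytes.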
Now we have to tell the browser what I'm sending. Please interpret this as something. The browser doesn't do it by itself. I now send it as text/plain, so it's just showing me plain text. That's what I told it, right? Here it says, in the headers, content type text/plain. I should say: this is HTML, interpret it as HTML, and when it interprets it, it will actually show me a heading, My first search engine. So I just have to change my server to do that. But that's easy, right? Let's just set the media type according to the suffix of the file. If you run a real web server, that's standard stuff, or you can configure it. Here I'm not explicitly saying please interpret this as HTML. I'm just taking the suffix as a hint. It's .html, so please interpret it as HTML. That's the normal way to do it. So let's just do this. If it ends with .html, we set the media type to text/html. That's it. Default is text/plain, and now all we have to do here is insert the media type, and here in our message we take the status code, and yeah, we said Alles supi is better than OK. Okay, let's just do that. Let's start the server again. Let's run it again. Bam! Now it works. Now it interpreted it as HTML. Let's just look here what it did. Alles supi, here's the response. And I can look at the response in raw format. So this is what I actually sent. But now let's go back to the headers. Now I said 89 bytes, interpret as text/html, right? And then it's displaying it like this. And there's a new line here for some reason. Is that new line also in my file? Oh, why is there a new line? I think it's the Unix convention. Ah, yes, yes. So there is a new line at the end of the file, which is not shown here. So interesting little details. Okay, is there anything else? I think there's something else, but I wonder where it's written. Something I wanted to show you.
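Choosing the media type from the file suffix, with text/plain as the default, is only a few lines. The sketch below uses a small lookup table; the names `MEDIA_TYPES` and `media_type_for` are hypothetical, and the table contains only the one suffix handled at this point of the lecture.

```python
# Suffix -> media type. Further suffixes can be added here as needed.
MEDIA_TYPES = {
    ".html": "text/html",
}


def media_type_for(file_name, default="text/plain"):
    """Take the suffix of the file name as a hint for the media type."""
    for suffix, media_type in MEDIA_TYPES.items():
        if file_name.endswith(suffix):
            return media_type
    return default


if __name__ == "__main__":
    print(media_type_for("search.html"))   # text/html
    print(media_type_for("notes.txt"))     # text/plain
```

The returned value is what goes into the Content-Type header; the file contents themselves are sent unchanged.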
Request cookies, we will not talk about cookies, although that's also interesting. I'm looking for the info about the quirks mode, actually. Right now it's in some mode, but where is that info? Anyway, maybe it comes back to me. So now I requested a file that exists. What if I type something else? Now I just get file not found, but here it says 200 Alles supi, okay, and it just sent back our default, text/plain, file not found. Why do I get that? Because I wrote it here, right? I said, if the file is not found: file not found. Well, that's also easy to fix. I mean, look how easy it is now to make this stuff real. The media type is fine, I'm sending back a text message. And you can do this, and you should do this: also send back an image or something, right? You want to send back a meme or whatever, you just set the media type to image/jpeg here, and the response is an image. Then the display will be an image. But I'm fine with text/plain, it's the default, and let me set the status code, okay, 404, Not Found. Or whatever text I like. So let's do that. Now I get the same message, and now I get a 404 here. It tells me, okay, it doesn't exist, I get proper codes now. And also for the favicon it now gets a 404. If we go to Wikidata, I think Q404 is HTTP 404, not surprising. So what else? Now we have HTML. Okay. And let's add a bit more to our HTML page. Now I have a running server. Okay, my first web page in 1990s look and feel. Yeah, this is how web pages looked in the 90s. Why is it displaying this in red? Because the ampersand, &, is actually a reserved symbol in the HTML and XML world. You need to escape it like this, &amp;. That will actually show an ampersand. And now I want an input field, and I don't need a value. What do I want? I want it to have size maybe 40. Let's just see what happens. Now notice, I changed the HTML page.
I don't have to restart the server now. I'm just changing this, and now, let's go here, search.html. Now I get the new page, right? I didn't have to rerun my server. I'm just changing the content which is being served. Now I get this 1990s look and feel, the ampersand is there, right? I get my search field. Now I can keep writing HTML here, right? Let me just put a placeholder here: type your query. And I think there should be a slash here. So now I get search engine stuff. I mean, I'm just sending HTML, and the browser is interpreting the HTML. I don't have to write code for this. My server is still running, doing its work. So now I'm just working on the web design part of the problem. We're not using any libraries or anything, we did it all by ourselves. So we have all the basic elements here, and a div here, we will look at that in a second. So now I can type my query, my query, and nothing happens. So let's also have a search button. How do you do a search button? Like this, it's also an input element, let's just do that. Now I have a search button. Okay, let's search, nothing happens. Okay, that's the next part. How do we get something to happen when I click on search? I mean, we now go full circle: I have a query, I search, my server does something and sends back the result. I will talk about the exercise sheet in a second, but let's implement that now. Or let's go to the slide. Now we want to have an action for the search button. Okay, before the action we have to make our web page nicer. How do we make it nicer? I will just add one element now. I mean, this is now 1990s. Now I can do all kinds of stuff. I don't know, maybe I want to make this here blue, or I want to make this larger. How do I do this? I have to specify a style sheet, and I do this in the head section here.
Oh, one more thing. Here it says Wikidata in the tab. Here the tab just contains the URL. If I want this to have a title, I should put it in the head. My first search engine, I'm just taking the heading here as the title. Now watch this. Now this is the title. This is on the page, and the title is what's displayed in the tab or in other programs. Okay, and let's give it a style. So now I'm linking a style sheet, which is typically named like the main file, with suffix .css, for cascading style sheet. What's a style sheet? That's a separate file which says how I want my page to be formatted. It's also very simple, but a huge standard by itself. You can study CSS for the rest of your life. It's huge, very interesting. Here's just how it works, the basics. Here's just a tag, h1, right? We have seen that. h1 specifies a first-level heading. This rule says: first-level headings, please put them in blue and boldface. Yeah, let's just do it. Now we added search.css, and now let's, I don't know, take the paragraph and make it red. So let's see, and now the paragraph becomes red. I'm surprised that it does. Why did it become red? I mean, how did that happen? Now you see here, it's reading four files. It's also reading search.css. I mean, I already wrote my server that way. So the browser sent a GET request to Tura 8080 for search.css. That file is on my system. It was returned. And here's the response. So it says color red. But now I'm really looking for something. It should complain, because I sent this as text/plain. I should have said this is a style sheet. Oh, here it complains, there's something red here. No, that's just the favicon thing. Does somebody have an idea where the complaint is that this is not the right content type? It should complain somewhere that I sent this as text/plain and not text/css.
And I'm wondering where it is. It's complaining nowhere. And I know that there is a mode where it does. Does anybody have any idea? Some web developers here? I was expecting the message to occur here. Okay, I know one thing, maybe I can provoke it. Oh no, this is not correct, right? There should be a... yeah. Let me check that. That's actually what you should do: at the top you should say this is HTML. And now let's try that again. Ah, okay. Now at least it's not red. Why is it not red? Now it ignored my search.css. Let me see that. Where is it? I have to do it again. search.css. Response. Where is the error message now? Ah, okay. The style sheet was not loaded because its MIME type, text/plain, is not text/css. So it's even telling me why it did not load it. Browsers are very lenient with all kinds of mistakes, because people writing web pages and stuff make so many mistakes. One way to do it would be like compilers do it: there's one tiny mistake in your page or in your whole setup, and then the browser says, sorry, something went wrong, I'm not showing it. That's how many programs work, but for the web you don't want this, right? You just want the web browser to be very lenient and maybe ignore some parts and show the rest that works. By writing this here, I'm telling it to be strict. Be really strict, and if something is not properly done, don't show it. So here I didn't send the right header for the CSS, the right media type, text/css, so it's not loading it. We can fix this very easily here. If the file name ends with .css, give it the right media type. Now we have to restart our server. And now when I reload, it should be red again. And now it's just complaining about the favicon. Okay, that was that. And let's maybe make it blue. And one other important thing: web browsers are doing a lot of caching. Which means...
Yeah, maybe I can... Let's just close. Let me check that whether it works, might not work. Now I close the development console. I can reload here. Let me just turn this red now again. Oh, it's turning red, okay. Depending on what the browser does, it may or may not reload it because it has already loaded it before. So for web development, that's a super important tip. Please remember it. Open the development console. When you don't want your browser to cache stuff, open the console, also for other pages, F12 and make sure that disable cache. Then it will reload everything from scratch. You made a small change or something, you really want the newest content, open the development console, disable cache. Here it reloaded it anyway, I'm a bit surprised, depends on the browser. Now it's blue again. Okay, now we have style. Why is it called cascading style sheets? Because there are many ways to give rules. You could have another rule later on and the most specific rule wins. That's cascading. You can nest these sheets and so on and it's used a lot, it's a huge standard. So for the exercise sheet at least use some style, don't use, make it at least 1992 or something, but preferably 2023. So you can spend a lot of time on this, spend at least a little time on this to see what's possible. Okay, why are there two gets for the CSS? That's a very good question, right? Do we have an idea? Why are there two gets? I think that's a very good question and I don't think I have an answer. Does anybody have an idea? Why does the browser send the GET twice? I don't know, let's ask the Oracle if it's a... Why does the browser send the get request for search dot, it doesn't have any context, but twice. Let's see if it's, okay. It's interesting, that would have also been mine, generic reply, browsers do all kinds of stuff to prefetching, opening connections just for the future, sending it twice to check whether it, high latency, I think the first one is a good one. They check. 
So just sending it twice to see. Okay, that's one explanation. Let's go back to the slides. So we don't have the definitive answer, but the browser is doing it to check something. Now we want something to happen when we press the search button, right? Nothing is happening right now. And today we will do it the 90s way. That's how it started, and that's also how you should first do it. I think that's complex enough for this first sheet. Let me just say that again. The individual parts are all simple: HTML for itself, CSS for itself, the basics of socket communication for itself, networking. But if you put all these together, bytes and strings, so many things interact. That's why it's good to start with the simple stuff. Let's just put a form around this. So form is like you fill out a form. And this looks like a form, right? It has a field. So that's where this notation comes from. So let's just put form around this, let's indent it here, and let's close the form. So that's all I did, I put a form around it. I don't have to reload my server. And now let's type something and do search. Ah, now something happened. Now I got a request for search.html? and it says search.html? not found. So just by putting the form around it, something happened. And now I think, I'm not entirely sure, let me see what happens when I give this a name. I say name query. Let's go back here. Let's reload this. Here's our search log. Ah yeah. Okay. Now what happened? Let's go back. I typed xxx, I clicked search. This is what my HTML says. It says here's an input field of type text, it has the name query, and here I have a submit button, and it's all in the form.
And what happens when I click on search is that it will send a GET request to the server of the form: name of the file, question mark, this name here (I put query, that's why I get query here), equals, the content of the search field. So this is now what my server gets. And what does it do currently? Well, it just interprets this as a file name, looks it up, and says it's not found. So now we have to change our server to interpret this correctly. And what should it do? It should interpret this as a query and then return the proper page. So let's make that change. Okay, we have to go back to our server. So now, check if we have a query string of the form query equals blah blah. Okay, let's do that. If the request starts with... no, that's not what we want. If the request ends with a query string of the form query equals blah blah. Do I need a regular expression? No, it doesn't start with it, it can be wherever. re.match, that's also not what I want. What do I want? I want to know if it contains a query. I'm sorry for the confusion, I'm back here. So I want to process something like this. And what I want is to split this off and get this part here, this part is what I'm interested in. Actually I just want to find this substring. How do I find a substring? I think find is correct, request.find, I just want the position of this. And you can help me or write it in the chat if you know how to do this.
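One way to do the substring search asked about here is str.find, which returns the position of the marker or -1 if it is absent. The following is a small sketch, assuming the request has the shape search.html?query=xxx described above; the function name split_query is made up for illustration.

```python
from typing import Optional, Tuple


def split_query(request: str) -> Tuple[str, Optional[str]]:
    """Split 'search.html?query=xxx' into ('search.html', 'xxx').

    Returns (request, None) when there is no query string at all.
    """
    marker = "?query="
    pos = request.find(marker)  # -1 if the marker does not occur
    if pos == -1:
        return request, None
    # Everything before the marker is the file name,
    # everything after it is the (still URL-encoded) query.
    return request[:pos], request[pos + len(marker):]
```

Using len(marker) instead of a hard-coded offset avoids off-by-one mistakes when the parameter name changes. Note that None (no query) and the empty string (empty query) stay distinguishable.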
I want to, so if this is, then I want to, I want the request to be just everything until, okay the query is now, yeah what's my query? It's everything after the position plus six or something. Yeah, that makes sense. And the request is everything until the position. Okay, and now then I should print that I have a query. Okay, I think that looks good. Now we check is there a query? If yes, remove this part, that's what I wanted. Remove this part and just print the query. Let's do nothing for now with the query, let's just see what it does. Let's go incremental here. Let's type something. Bam. Okay, now it just, where do I have it? Yes, here. This is correct. And now it's showing the webpage again because I was asking for search HTML. Okay, now I have to do something with the query. How do I do that? Interpret as a file name. Now that's the last important part for the exercise sheet. That's something which you don't do nowadays anymore, but as a first step, I think in some contexts you still have this. Now I want a page, and let's do it the way Google always did it and still does it in some form. I somehow want to have the same page as before with my result. So now I click on this and I want to have a version of the same page but with the result added, right? That's basically how Google works. I'm going here to Google and typing something, blah blah. You always, oh yeah, okay, you don't get a hit for everything. Let's do blah blah. Okay, and I get this, right? So I'm typing something else here and I'm getting something else. So I'm getting the version of the same page and the results at the bottom. This is what we want to do and we do it as follows. So in my page I'm just putting a placeholder like a template and now I'm changing, if there was a query I'm now changing that into the result. And let's just do that. You will see it when I do it. So if there was a, no I don't want suggestions here. Let me disable them. 
If there was a query... if the page is search.html and there was a query, replace percent result by the result. That's what we will do, right? So let's just do that. If request is search.html. And I think we should always set the query to None by default. I think the empty string is not exactly None. So what do we want now? What should our result be? Let's first write a result: "TODO: put the result here". Let's just first understand the mechanism. Just understand where we are now. I've clicked the button, so now I've got a request like this. This is after I click the button. So I've stripped this off, so what remains is I'm just loading search.html again, but I have xxx in my query. I do something with it and I just want to replace the percent result in this page by my result. So all I do here is, now I change something in my response, and you will need to do this for the exercise sheet. I just have to replace the result, and I think I already encoded it here. Okay, I think it's better we do this here now and then encode the response as bytes, so that way I don't have to deal with bytes here. Yeah, let me just do that, result by result. Let's just do this for now and see what happens. And so, yeah, this is not what I want, right? If I don't have a query string, now I get the result here. So that's easy to fix too. So you see these suggestions, you have to take them with a grain of salt, right? I want to replace it by nothing. So if I had a query string, I want to replace it by my result. I will compute it in a second. Otherwise, replace it by nothing. Let's see whether that works. It doesn't, why not?
Oh, because I didn't restart my server. Okay, the result was not replaced, why not? I think first of all, I should only do this if I'm here. I'm a bit confused, so if query is not none, so if I have a query, I do this. If I don't have a query, my result is just the empty string. And in any case, I only want to do this for, I want to replace percent result by result. That looks good to me, but it doesn't work. And you have to tell me why. I don't want this here here and I don't understand why it's still there. I mean it's clear that it's there because I put it here but now I was expecting that if my request, yeah? Where? Oh that's the reason, thank you. Too many trees. Search HTML. Still there. Yeah? So where's the mistake? Which line of code should I change? Yeah. I don't understand why am I not in the... Sorry, I have to reassign what? Oh, stupid. Okay, I'm just re- yeah. Thank you. It still doesn't work because I didn't restart the server. It's a good example. I mean, yeah, now sometimes I have to restart the server when I just change the page. I don't have, now it will work. I'm pretty confident, yes. So you can get confused very quickly. So far it worked swimmingly, I'm surprised that we didn't have more problems. This stuff just gets complicated because you have so many things interacting with each other, but we are doing pretty well so far. So let's just see, now I type something here, I press search, put the result here, it works. So let's put the result there, that's all we want to do. Let's put the result there, let's put a piece of HTML there, so now I want a piece of HTML and let's just do it with like this. Your, what do we do? p, p or not p, yeah. Your, let me put a format string here. Your query, let me just write query, let me just put the query there again. Query, no that's the request, no let's also put, and then the result. Yeah, result, I haven't computed the result yet. And, but I think, I think is almost the last. 
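The "I have to reassign" mistake above is a common one: Python strings are immutable, so replace returns a new string and leaves the original untouched. Here is a minimal sketch of the template step, assuming the placeholder is spelled %RESULT% (the exact spelling in the lecture code is not visible in the transcript) and using a made-up helper name fill_template.

```python
from typing import Optional

PLACEHOLDER = "%RESULT%"  # assumed spelling of the placeholder in the page


def fill_template(page: str, result_html: Optional[str]) -> str:
    """Return the page with the placeholder replaced by the result.

    Without a result (no query was sent) the placeholder is replaced by
    nothing, so it never shows up in the browser.
    """
    replacement = result_html if result_html is not None else ""
    return page.replace(PLACEHOLDER, replacement)


page = "<h1>My first search engine</h1>%RESULT%"

# Bug: the return value is thrown away, so page still has the placeholder.
page.replace(PLACEHOLDER, "<p>42</p>")
unchanged = page

# Fix: assign the returned string back.
page = fill_template(page, "<p>42</p>")
```

Doing the replacement on the string and encoding to bytes only at the very end keeps all of this string handling free of bytes objects.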
So I'm just putting this in parentheses so that I can break the line. If things are in parentheses and you break the line in between, you don't need a line continuation character. And let's make this result string here. And this also result string here. Yeah, and let's also replace this here with result string. Okay, fine. And now I have to compute my result. So what do I want? Let's just consider it as a mathematical expression and just write eval query here. Okay, and let's maybe put it in a try block, another one. Try: result is eval of query. Except: I think the indentation level is wrong here. Result is "invalid expression". So that's a very simple web app. I type something, a little math question, I evaluate it, or it's invalid. Let's just see how it works. Okay, let me just do some tests here. Oh my, I get some errors here. Trailing whitespace, terrible. Let me just fix some problems here. I don't need sys, okay. re imported but unused, okay, I don't need regular expressions here. Too many blank lines in line 69. Okay, there shouldn't be a blank line here. What else do I have? Do not use bare except, okay. I should use Exception. Okay, wonderful. And line 120, f-string is missing placeholders, thank you. Yeah okay, if I don't have placeholders, I don't need an f-string. Okay, now it compiles and I don't have checkstyle errors. Let's restart the server and let's see. So yeah, I have the string here. This is in the static case, so let's type something. 6 times 7. Bam. Invalid expression. Okay, and it's blue. Okay, why is it blue? Let's fix that little problem. I only want the first paragraph to be blue. So how do we do that? Yeah, first-child, blue. Okay. Yeah, you can do a lot with CSS. Oh, but that's the second one, okay. first-child apparently is not it. Is it nth-child one? Does it start at one, or maybe it should be zero.
Oh, now none of it is, okay. Can anybody find out how I make the first paragraph blue? For now it's not. Why do you make a paragraph blue like that when you can use a class or something? Yeah, yeah, that's stuff you can do for the exercise sheet. We are doing very simple things now. But isn't it a paragraph? So I'm not understanding why it doesn't make the first paragraph blue. I think it looks like a nested element, so inside of p maybe. Inside of p, but I don't have something inside of p, right? Yeah, okay, let's do that. So that's my look and feel. So I can also do that, I can give elements an ID. Yes, so look and feel, that's one way to do it. Now I just make the element with that ID blue. Okay, back here, we are almost done. Now what's happening here? Look at what it sent. Let's go back to my search page. I typed six times seven here with spaces, and now what it sends is six plus times plus seven. What are the pluses? I think that's the last thing I have to tell you about. So we did this template thing here. Yeah, in a URL, you can't write everything. The stuff you can write in the address bar is quite restricted. Here's the set of characters, so it's typically the letters, dollar, percent, a few special characters, but no space, for example. There's no space in a URL. And here we have a space in the input and we want to send it. Okay, so which means when this form sends its request to the server, it will somehow encode this. And when we interpret it, we have to decode it again. Now that's not UTF-8 encoding and decoding, it's URL encoding and decoding. These two things together also can drive you crazy.
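To make the decoding step concrete, here is a small sketch. The hand-rolled function simple_url_decode is a made-up name and deliberately handles only two cases: a plus sign stands for a space, and a literal plus arrives as %2B. The standard library function urllib.parse.unquote_plus handles the general case.

```python
from urllib.parse import unquote_plus


def simple_url_decode(text: str) -> str:
    """Very limited URL decoding: only '+' and '%2B' are handled.

    The order matters: first turn every '+' into a space, then turn
    '%2B' back into a literal '+'. The other way round, the restored
    plus signs would wrongly become spaces.
    """
    return text.replace("+", " ").replace("%2B", "+")


# What the server receives for the inputs "6 * 7" and "6 + 7".
print(simple_url_decode("6+*+7"))    # 6 * 7
print(simple_url_decode("6+%2B+7"))  # 6 + 7

# The standard library version gives the same result for these inputs
# and additionally decodes all other percent escapes.
print(unquote_plus("6+%2B+7"))       # 6 + 7
```

The hand-rolled version is fine for getting started, but any other special character in the query would still arrive in its percent-encoded form.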
You have string to bytes, UTF-8, different encodings, and then you have URL encodings, and these come on top of each other, several times maybe. If you see funny things in mails which are strangely formatted, it's typically because of this. It's also a whole world, a whole rabbit hole to go down, very interesting. We will talk about this more in the next lecture. So in particular, you can't have special characters or spaces, as I said. For the exercise sheet and for now we just do a very simple decoding. I will just say, okay, I know that space is encoded as a plus. So let me just revert the pluses to spaces again. So let me just do that in my code, and you should also do that in your code. You can also use a URL decoding function if you like. So, simple URL decoding of the query, and I think I should do that up here. Let's just replace all pluses by spaces. Let's see if that works. Okay, let's do that again. Six times seven, 42, it works. So it did this here: it extracted the query, it replaced pluses by spaces, it evaluated it. Now I think this is internal, I am seeing no requests from the outside. I will not do this now. I did something terrible here. Nobody said something so far, but I did something extremely terrible. I wrote eval in my server code. The user typed something and I evaluated it. Oh my, we will talk more about this. What if I write rm -rf *, some command here, and then it executes it? I don't know. Okay, it says invalid expression, but I tell you, I could write something here which will cause my Python server to execute arbitrary code. And you don't think about this when you program this, but you just shouldn't do it. So the minimum I should do here is look at my query, if I really want to do it: does it have a certain form? If not, then I say you're not supposed to do this. So this is terrible code.
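The "does it have a certain form" check can be made quite strict. One possible approach, which is not what the lecture code does, is to parse the query with Python's ast module and accept only numbers combined with plus, minus, times and divide, instead of calling eval on user input. The name safe_eval and the set of allowed operators are assumptions for this sketch.

```python
import ast
import operator

# Only these binary operators are accepted; everything else is rejected.
ALLOWED_OPERATORS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def safe_eval(expression: str):
    """Evaluate a plain arithmetic expression without calling eval.

    Raises ValueError for anything that is not built from numbers,
    parentheses, unary minus and the allowed operators.
    """
    def evaluate(node):
        if isinstance(node, ast.Expression):
            return evaluate(node.body)
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in ALLOWED_OPERATORS:
            left, right = evaluate(node.left), evaluate(node.right)
            return ALLOWED_OPERATORS[type(node.op)](left, right)
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -evaluate(node.operand)
        raise ValueError("invalid expression")

    try:
        tree = ast.parse(expression.strip(), mode="eval")
    except SyntaxError:
        raise ValueError("invalid expression") from None
    return evaluate(tree)
```

A call like safe_eval("6 * 7") gives 42, while anything containing a function call, a name or an attribute access is rejected with ValueError before any code runs. Division by zero still raises ZeroDivisionError, so the caller should catch that as well.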
Right now, I'm not showing it. There would be a way to write something in the query field which would prompt my server to actually delete all the files on my system. This is just for your reference, you can read it yourself. I think we are done. We have written a whole web application, which is pretty impressive. Oh, I think we can't do the plus, right? Yeah, plus does not work, because the plus is now also encoded, as %2B. Let's also fix that last thing. Because space is encoded as plus, the plus itself also has to be encoded, and plus is encoded as %2B, so this should be reverted to plus. Let's do this last little hack and see if it works. We have to restart our server, and yeah, this now works. So now I can type arbitrary expressions here, divided by five, and search it. Okay, divided by, I would also have to handle that, I don't do this now, but I think with plus and times it works, okay. So we have written a very simple web app with everything that belongs to it, that does something, returns it, shows the result. Let's quickly go to the exercise sheet. The exercise sheet will be to do just this as I showed you. You get the starting code just as I did today in the lecture. But now what you should do is what you did for the last exercise sheet, but in a web app. You type something here, the beginning of something, and then the result is, just like I showed at the beginning of the lecture, prefix matches of what you type. And for this exercise sheet, we will do it in the static way. You type something, you press a button, and then you get the list of matches for what you type. In the next lecture, we will do it dynamically, as you type. And then you have a super cool web application. This is a lot of text, but it is just to help you. And you should pay attention to really understand what you are doing. There will be a lot of mistakes on the way. I also had some today.
You should do proper handling of all kinds of cases here. It's explained here. So you're basically connecting this web application stuff to the last sheet. One last thing, you don't have to do it for this sheet: you should also connect this to our SQL and SPARQL thing. I mean, what you are searching is these entities, right? So that's what you will do. You type something, just like for the last exercise sheet, you get a list of matching entities. In our SQL and SPARQL lectures, we also had these entities and we had information about them. So what we will do in the next exercise sheet: you type something, you do not only get Freiburg im Breisgau, but you also get triples for Freiburg im Breisgau, which means your application will just ask a SPARQL query and show the corresponding triples. And the SPARQL query is of course translated to a SQL query, which is then executed. And this is very typical for web applications. So you will have all of this together. You don't have to do it for this exercise sheet, but if you want, it's actually not a lot of code, you can already do it. And for the next exercise sheet, it will be part of the sheet. And then you kind of have everything we did so far in one application, which I think is really cool and you will like it. Let's check if there is relative path, search server. Let's reply to that. There was a question, let's just check that. Can we render our own Python script here? server.py. Yes, there it is. Yeah, and you have to pay attention, and that's part of the next lecture also. Now I could ask my system to render all kinds of files, right, if I don't pay attention. Probably if I write this here. I will not do it, because then you will see our... is there secret information in there? Full names of the people. Full names of the people, okay, so I shouldn't do this.
If I would do this, I think it would... is there another file? Maybe let's do /proc/cpuinfo. Ah okay, it doesn't like the slashes. Okay, there's some protection here because I've done something not quite right, but you really have to pay attention. It will be part of the next lecture. Any questions for now? So a little over time, but I think we did well. Any question? Please do the sheet. Otherwise, just by hearing it, you don't learn it. You have to do the sheet, but it's really fun. Okay, that's it for today. Have fun. See you next week. Bye.
So, welcome everybody to lecture 9, databases and information systems, the course that can also be taken as information retrieval, and today it's about web applications, part 2. There are relatively few people in the room, I'm not sure why, but more people than usual on Zoom. So I will say something about your experiences with the last exercise sheet, which was web applications, part 1. This should be "applications", I think. And two important announcements. The next lecture is in four weeks from now, and there's an important deadline coming up, which is before the next lecture, which is why I'm telling you about it now. And today we will just continue what we did last time, web applications, we will make them dynamic. We will talk about vulnerabilities and Unicode, important stuff. And the exercise sheet will also be to continue the last exercise sheet: make everything dynamic, secure, handle Unicode, and connect everything together. So you will also bring in the stuff we did in the first lecture, SPARQL, SQL, and so on, just using it, but having it all together in one system, which will be very nice I think. Very briefly, your experiences with the last sheet. Most of you enjoyed the topic, the lecture and the sheet very much. I think almost everybody said that the lecture was super fun.
Excited about the topic for next lecture which is today. Extremely cool to also learn about this part of search engines. As I said, web applications is a major component of every information system. Even your mail program runs a web browser. Thunderbird, everything it displays. It's JavaScript, what we will learn today. First followed the lecture closely, then started to grasp how it works. So several people wrote something about how much they oriented themselves regarding the lecture, how much they did it themselves. Nice sheet, a lot to learn when solving it without the lecture. So I guess what a lot of you did was you followed the lecture and then you tried it on your own and maybe you went back to the lecture some more, some less, but that was of course perfectly okay. Live coding lectures are always very fun. It would be nice to have a little sheet about, I think I'm missing the sheet here about HTML and CSS. Yeah, we were thinking about that, but that's just a very, yeah, that's just so much. I think it's easier to just Google it because there are so many things. I think this was by far the best exercise sheet so far. Forgot how much fun you can have with making web apps. Actually several people wrote that they did web apps somehow in their youth when they were very young. You're still all young but very young and now you just did it again after some years and you found it to be a lot of fun. Without the video I would have been completely lost, so a few of you already said that, but yeah, it was of course perfectly all right to follow it closer. So some of you have done nice stuff. In the next lecture, which will be in four weeks, I will show demos of some of the things which you have done. So if you have done something particularly nice or put a lot of effort into it, we will see it, design, whatever special features. So the next lecture will take place on Tuesday, January 16th, that's 2024. That means lectures officially already start again on January 8th. 
There's a two-week Christmas break. But as we said, because these lectures are typically two hours long and not one and a half hours with the break, we said that we would just have two weeks without a lecture. We already had one such week, I think two weeks ago, and now we have another one which will be the first week after the Christmas break without lecture. So the next lecture is in four weeks from now, just so you know. And the deadline for the sheet is also then the day of the next lecture at noon as usual. Of course you should start, you get a lot of inspiration and impressions from this lecture. You should at least start I think this week. This is highly recommended. Oh and let me say something. There's always in the exam stuff about web applications. There's no way to get it right if you don't do it yourself. I think this is no way to just look at the slides trying to, at least you have to do this yourself. So yeah, you cannot do anything now and then for preparation for the exam do the sheets but of course it would be much more meaningful to do the sheets now while everybody is doing them and while everything is fresh and you get feedback. So you need to do this and we can just tell from experiences from exams there are always people who make the impression like they never did it before and then of course they don't get it right. So just do the exercises, highly recommended. So next year there will be three more lectures just as what's remaining after this with exercise sheets and then the typical final lecture. So on January 23rd, so this is not correct, right? Now I got something wrong. Is this correct or is this just shifted by one week? Sebastian, what do you think? Now I'm slightly confused. Somehow I added a week to, hmm? Everything is shifted by one. But does it mean that the last lecture is on February 6th? Yes, okay, I'm sorry. I shifted everything by one, but we'll just, the algorithm I think, the correct one is to do it like this. 
Okay, yeah. So what's left? The last three lectures in the next year will have to do with modeling things as vectors and using linear algebra. That's like a completely new topic but very important. And yeah, I would like to spend much more time on this. We just take the time we have, three more lectures, and there is a space here, linear classifiers, so a little bit about learning, and we will also at least start with language models. Of course, only one lecture, so I can only give you an intro, but better than nothing. And the last lecture, as usual, is no longer a lecture with a new topic, but important stuff. We will talk about the evaluation, info about the exam, and an intro of how it is working in our group. So that's very important. There's a strict deadline coming up, January 14th. So this is not a deadline by us but by the Prüfungsamt (the examination office), so an official one. And unlike us, who will send you five reminders and stuff and will even accept things if they come maybe a few hours or days late, the Prüfungsamt will not do that. I mean, you should know that by now. If you miss that deadline, you can't write the exam, and it's not up to us to do anything about it. And just a quick explanation: we are a very big faculty of engineering, technical faculty, and there's no way the Prüfungsamt can handle 1,000 requests of the kind "Oh, I missed this, I'm sorry" with this or that reason. So that's why for a long time already these deadlines are absolutely strict. So there's a very simple consequence. If you know already now that you want to take part in the exam, register now. And certainly before Christmas, there's absolutely no reason to wait until January 14th. Because if you miss it, you can't write the exam. And there's nothing we can do about that. Okay, it's a quarter past and we start with the actual contents of the lecture. So a quick recap. So what I did, I just copied the files from last time. Here they are.
So just a quick recap. What did we have? We had a server which we wrote in Python. I need to get used to typing again. We had an HTML page, super simple, just look and feel. And by the way, I think the right way to do this... I'm a bit confused. search.html, search.css. first-of-type, I think, is the right way. No, I don't want color red, I want color blue. This is what we were wondering about last time. If you have several paragraph elements, like we had here when we entered the result, I just want to have the first one blue. That's the way to do it. Let's also run the server. I think you should all remember this, and now if I go to this page, this is what we had. Our first search engine, six times seven. I press the search button. Oh, by the way, this thing here, this is because the browser remembers what you previously typed. Let's make a few fixes to get into the mood. So here I think it's autocomplete off. This is not some fancy auto-completion, it's just the browser showing you what you have typed previously. So now I think, yeah, you don't get this drop-down anymore. You click on the search button, you get the... oh, this is still blue, even though I have first-of-type here. Okay, if you find out why that is... I'm a bit confused. Why is the first one still blue? Okay, maybe we have to live with it, but I don't really understand why. Okay, and one more thing, let's also fix that now. This favicon, we don't want to get this 404. This is for displaying a little icon here. Let's maybe try to fix that very quickly; if we don't succeed, we leave it. If you want a favicon, let's just look at how Uni Freiburg does it. Let's go to the home page, let's do F12 here. And let's just reload again and look for the favicon here. There it is. Okay, it's favicon.svg, so vector graphics. So you can provide it in several formats.
So let's just copy that URL here. So yeah, that's the icon. It's vector graphics, which means you can scale it up as you like and it just doesn't lose quality. So let's just download this here. Yeah, it's just a URL, so now I have a favicon.svg. Let's maybe have a quick look at it here. Yeah, it's just XML, it's just drawing this basically, so it's a vector graphic. Okay, let's try to include it. This is just getting warmed up. We have to go to our server, and I think one thing we have to do here: if we have SVG, we need the media type image/svg+xml, this is absolutely correct. And I think we have to tell our page, because it's not favicon.ico and it's also not PNG, that it's SVG plus XML. So we are telling it: look, our favicon has this format, it's scalable vector graphics, and it has this name. So let's just see how it works. Let's restart the server and let's see if we now get rid of this. Oh yeah, wow, it's working. Not bad. I was expecting some problems, but as you can see here, there's now this tiny icon here and we don't get the error message anymore. So yeah, once you have everything in place, it's really easy to add new stuff. And I really don't understand why this is still blue. Do you have any idea, even though it says p first-of-type here and this is the first? I don't understand it. Maybe I will just revert to look and feel. Thank you, Frank. Yeah, so now I'm just doing it by ID again, anyway. Yeah, that's one way, I just give things an ID here. We will use that again also today, in a few minutes. Look and feel, and then we're putting a hash here, I'm saying apply this color to only this item. Okay, let's go back to the slides. JavaScript.
So just five minutes for the favicon and getting back into things, I think that's good. JavaScript. It's a new programming language now, and it's for code that runs inside the web browser. So now we will have code that runs inside the browser on the machine of the person who is opening the page, not on our machine where we have the server. Back in the day people would say: JavaScript, it's the work of the devil, I switch it off. If you switch it off nowadays, almost no web page will work anymore. Basically every page contains some kind of code. JavaScript has also become a very powerful and useful programming language on its own. You can use it to code outside of the browser; that's Node.js, for those who know it. Some universities even teach it as the first programming language in the first semester. Here in Freiburg it's Python, but JavaScript is also a good language for that. So in principle, JavaScript is a fully featured programming language, Turing complete and just as powerful as any other: anything you can do with computers, you can do with JavaScript. Of course, when you run it in a browser, it has to be restricted somehow. In particular, it shouldn't access stuff on your file system. A web page should not be able to remove all files on your file system or even just read them. So there are some restrictions, but you don't have those restrictions when you run it somewhere else as a normal programming language. What it's particularly good at in a web page: you have all these elements here and you can modify them, which means you can use JavaScript to modify the page, and that's also what we will do today. We will see in a second how that works.
I will not give you a full introduction to JavaScript. We will just learn by example, and for the exercise sheet you can just Google it or ChatGPT it. It's just that things have slightly different names, as usual. Why is it called JavaScript? Well, it has absolutely nothing to do with Java. It was a mere marketing move at the time. At the time, Java used to be popular, in case you didn't know, and in particular for web stuff. There were these things called Java applets, little Java programs running in the web browser and showing stuff. It was clear already at the time, at least to me, that Java applets wouldn't last. Today you have maybe a few websites left with these little Java programs running in them. So they just thought: okay, we have a new web language, it should have the name Java in it. Speed is similar to Python. It's an interpreted language, a script language, hence JavaScript. Modern browsers do just-in-time compilation, which means they compile parts of your JavaScript and then run the compiled code, which is much faster. Most script languages have something like this. Variables are untyped. There is TypeScript, a variant where you have types; that's similar to the type hints we have in Python. You don't have to use types in Python, but we do. Similar thing in JavaScript, but we don't use it today. Then here are some examples for assignments and variable types. You don't have types here: an integer, a string, an array, an associative array or map, it's the usual stuff. If you have seen other programming languages, it's not like you have to learn completely new stuff from scratch. And we will see examples today. Variable declaration, that's important I think. Why does it say let here? There are three ways to declare a variable in JavaScript. The old one is var, so you could just write var here.
That was the old way of doing it. It's a terrible way, don't do it. When you declare a variable with var, its scope is not the enclosing code block: it is visible in the whole function, and outside of a function it's a global variable. Just don't do it. With let, the scope is the code block, and as usual a code block is delimited by curly brackets. And then you have const, for when you declare a variable, assign it, and then don't change it anymore. So the ones you typically use are let and const; var is old, it's only here for completeness. Not so much theoretical talk, let's just start right away. How do we do it? I will switch between live coding and the slides; many of the slides are just for reference. First we have to include some JavaScript. I just put it in the head, and it's clear what it means: there's some code in search.js and I want to include it. So let's just do that. Let's start with: it's me, JavaScript. So alert, what will alert do? We will see in a second. We don't even have to restart the server. Oh, it already works: it's me, JavaScript. Alert just opens a pop-up box, which I can then close again. Okay, it worked even though I didn't give it the right type. The content type is text/plain, so the browser is very lenient here, but let's briefly fix that. That's also easy: if my file ends with .js, it should get the media type application/javascript. Now I need to restart my server. Let's do it again. I get the same pop-up again, and if I go to the JS file here, it says content type JavaScript. I'm going pretty quickly here because you have already seen that last time, and it's very easy now to add these new things. Now we are loading JavaScript. Opening alert boxes is also a no-no, you shouldn't do it, so let me comment it out. No, the comment sign is like this.
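The kinds of values and the three kinds of declaration mentioned above can be tried in a few lines. This is only a minimal sketch, not the code from the slides; all variable and function names are made up for illustration. It uses `typeof`, which yields the string "undefined" for a name that is not visible at that point.

```javascript
// Values of different kinds; none of the declarations carries a type.
let count = 42;                               // a number
let title = "six times seven";                // a string
let items = [1, 2, 3];                        // an array
let entry = { query: "6 * 7", result: 42 };   // an associative array (object)

function scopes() {
  if (true) {
    var oldStyle = "declared with var";    // visible in the whole function
    let blockOnly = "declared with let";   // visible only inside this block
    const fixed = "declared with const";   // block-scoped, cannot be reassigned
  }
  // Outside the block only the var declaration is still visible.
  return [typeof oldStyle, typeof blockOnly, typeof fixed];
}

console.log(scopes());   // [ 'string', 'undefined', 'undefined' ]
```

The var variable leaks out of the if block while let and const stay inside it, which is the reason to prefer let and const.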
Here's an important thing for debugging, super useful for the sheet: console.log. So there's a console; let's see what the console is. You can apparently write to something that is not the web page. Let's see what happens. Here is a console tab, and now it's written there. So it's not console as in consolation, but console like the Linux console. It's just writing something here. You can use this for debugging. It also shows error messages here, super useful. Don't use alert boxes. Okay, back to the slides. One important use of JavaScript is to modify the contents of the page, so let's do that. If you want to modify the contents of the page, you have to say: this element, I want to change it. Let me go back quickly to the HTML. That's what the HTML looks like, and there's a thing called the document object model, which is just a scheme for how you address the elements in the web page. One way to do it, which we already used, is to give an element of the web page an ID, for example result, and then in JavaScript you can address it. We already saw how we did it in CSS, in the style sheets; in JavaScript you can say: okay, this element, I want to change what's inside it, its inner HTML. So let's do that in our JavaScript. I will not spend too much time on the slides, but always go right into the code. So: document, and there I have the query selector. My editor is suggesting funny things. Let's change something right away here, because I'm not using the templates anymore now, so let me just delete this. What I want to have here is a p with the result, and I also want to have a p with the query. So I just want to have a paragraph with the query and one with the result.
And let me just write there for now: no query yet, no result yet. Let's do the following. There's this element, it has the ID query. I'm selecting that element and I'm setting its inner HTML to no query yet. Let me do the same with the result. And let's see whether that works. So I'm loading this, then I'm doing this. I don't have to restart the server, I just changed something here. Okay, this didn't work. Why? I will make a number of mistakes now, some deliberately, some not, because you will also encounter them. That's a typical one. It says document.querySelector(...) is null. Is null means this thing here returns null, which means the element doesn't exist. So did I make a typo? It says query here and query here. I can tell you I didn't make a typo. What do you think? Does anybody have an idea? Yeah? Because there's no text inside the block or something. Okay, there's no text inside the block. Let's test this hypothesis, that's what live coding is for. Okay, it doesn't change anything, we have the xxx here now. Any other ideas, or maybe some of you know already? Yes? Yes, exactly. That's an important thing to know. So let me show you the whole thing here. This is processed in the order in which it's encountered. This synchronicity or asynchronicity is super important to understand in web apps. Many things happen at the same time or not at the same time, and you have to understand what happens when. When this is read by the browser, it processes it line by line. At this point, it will read the JavaScript and execute it.
It will execute this fairly quickly: it will print this to the console and execute these two lines, and these two lines will not find the two elements because the browser hasn't parsed that part of the page yet. There are two ways to deal with this. One is async, it's not on the slide. This says: okay, load the JavaScript in parallel while I continue parsing this page. Let's try it. It also doesn't work, because executing the JavaScript is faster. Depending on the browser it might work. Here's another one, which we will use: defer. Defer means load it, but don't execute it before you have parsed the whole page. So it just waits until the end, and this should work, but it doesn't. Okay, now I'm surprised myself, this was a not deliberate mistake. In the query selector? I was wondering about that. Okay, then that is also wrong: it needs the hash, like in the CSS, and yeah, I think that's true. So with the hash but without the defer, it's still null, and now let's add the defer, and now it works. Okay, no query yet is wrong here, it should be no result yet. So that worked now. My previous explanations were still correct, I just wasn't proving them here: you also have to write the hash, like in the CSS. Okay, is this correct on the slides? Yeah, it says this here. And one more thing, I want to do this a little differently. I want to have a paragraph here, and I want to write query colon, and then I want to have this thing in a span. Span is an HTML element which is just a piece of HTML with no particular semantic role. It's just a piece which I want to address, maybe change its content or whatever. So span is a good name for it: just a piece of HTML which I want to give a particular name or property.
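The role of the hash in the selector, and the null that comes back when nothing matches, can be tried in isolation. The following is a minimal sketch under assumptions: `page` is a hand-made stand-in for the browser's `document` (so that the snippet also runs outside a browser), and `setContent` is a hypothetical helper, not something from the lecture code. In a real page one would call `document.querySelector` directly.

```javascript
// Stand-in for the browser's `document`: it knows two elements and, like the
// real querySelector, returns null when no element matches the selector.
const fakeElements = { "#query": { innerHTML: "" }, "#result": { innerHTML: "" } };
const page = { querySelector: (selector) => fakeElements[selector] ?? null };

// Hypothetical helper: returns true if the element was found and updated.
function setContent(selector, html) {
  const element = page.querySelector(selector);
  if (element === null) {
    console.log(`No element matches "${selector}"`);
    return false;
  }
  element.innerHTML = html;
  return true;
}

setContent("#query", "No query yet");     // with the hash: select by ID
setContent("#result", "No result yet");
setContent("query", "No query yet");      // without the hash: would look for a <query> tag
```

Without the check for null, the last call would fail in the same way as in the live coding: assigning to a property of null raises an error.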
The JavaScript doesn't care what type this is, whether it's a span or a p. Now I should get query and nothing in the beginning, and then this should change it. Let's see how it looks: query, no query yet; result, no result yet. So that's fine, now we have some basic JavaScript in place and it already does something. Yeah, maybe it should say hello, it's me. Let's go back to the slides. Okay, let's do that right away, that's of course something we want now. That's one very typical application: the user does something, for example typing, and you want to react to the typing, and there's a JavaScript way to do it. You have a query selector on an input field, let's give it the ID input, I will do that in a second, and then you add an event listener. It's self-explanatory: just wait for something to happen, and in the string you say what it is that you are waiting for. There's of course a whole world of events, but this is what we need here. Input says something changed, some user action in the input field. So let's do that. Let's go back here and give this thing a name. This is our input. I mean, it's the query, but we already have the ID query here. So this is input here, and I want to add an event listener. I don't want keyup; you see, it suggests keyup here. There are keyup, keydown, keypress, input, with subtle differences. And now the event listener, that's an important concept. I want to say: when something happens in the field, do something, and what to do I want to specify in a function. So I need a function as the second argument here, and one thing you see a lot in JavaScript is anonymous functions, what's called lambdas in other programming languages. It's just a function and I can define it on the spot here.
I don't really care about the arguments. This is just: okay, execute this unnamed function without arguments, and now it should do something. What should it do? Well, let's first read the query from the input field. Let's read the value from the input field and log it to the console. So what we are doing now is: whenever something happens in the input field, read what's in there and show it. And now you have another typical thing which you don't see in other programming languages, this funny combination of a closing curly brace and a closing parenthesis. This closing curly brace matches the opening curly brace of the anonymous function, and this parenthesis is from the function call where this was an argument. So this is a very typical thing, closing curly brace, closing parenthesis, semicolon, and you just write it like this. Let's see whether this works. We don't have to restart the server. Let's see if I type something here. Yeah, it works. I'm typing something and it's telling me what I typed, so we are already one step further. That's a great thing about JavaScript: you can do powerful things, I mean this is not super powerful, but it's something, very quickly with a few lines of code. Okay, what do we want to do next? By the way, since it's on the slides: what I did here is the typical way, you say there is JavaScript code, load that file, execute the code from there. You can also use the script tag to write code directly in the HTML, but it's always nicer to have it in a separate file. This is the modern way, the way you should do it. The same with my style sheet, I have it in a separate file. You could have everything in the HTML directly, but we don't do that. The defer we already talked about. So what do we want to do next? It's obvious I think. We don't want this stupid search button. We want things to happen automatically.
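The event listener with its anonymous function can be sketched as follows. As an assumption for illustration, the input field is replaced by a plain `EventTarget` with a `value` property, so that the snippet also runs in Node.js; in the page it would be the element selected via `document.querySelector("#input")`. The array `seen` is only there so that the effect can be inspected.

```javascript
// Stand-in for the <input> element of the page.
const input = new EventTarget();
input.value = "";

const seen = [];   // hypothetical: records what the listener has read

input.addEventListener("input", function () {
  const query = input.value;   // read what is currently in the field
  console.log(query);
  seen.push(query);
});  // "}" closes the anonymous function, ")" closes the call to addEventListener

// Simulate the user typing "6" and then "6*7".
input.value = "6";
input.dispatchEvent(new Event("input"));
input.value = "6*7";
input.dispatchEvent(new Event("input"));
```

The listener runs once per dispatched event, which corresponds to one call per change of the input field.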
I type something, and immediately when I type, the action should take place which so far took place when I clicked search. Just recall how we did that. We had this funny form thing here, which is also from the 90s. You don't see that a lot anymore, and basically I don't want that anymore. Now I want my JavaScript to do this. Whenever something happens in the input field, send something to my server, ask for the result and then put it in place. So I don't have to reload the page or anything, I just work with the page dynamically. How do we do that? It's on the slides, but let's first do it and then I explain it, because talking to other machines is not hard in the code, but there are a lot of things to understand and to get right or wrong. So I want a response, and it already suggests fetch here. How did we do that so far? So far we requested search.html with a query parameter, right? Fetch means talk to this other machine, and I think this is how we did it: search.html, question mark, query. This probably doesn't work anymore because I removed the percent result placeholder, but I think I can still send something. Let's see via the network tab. So that's how we did it last time: we appended something to the URL, question mark, query, six times seven, and what I get back is a whole page. And the only reason that the result doesn't show here is because I removed the percent result. Actually I could probably just insert it again. Result, I think it was called. Let's revert to that for a second. Yeah. We just had this template mechanism, right? It just replaced it. So let's use that for the JavaScript as well. This is now calling the server with this request. And now I need to do something which I will explain on the slides in a second. Now I want to say then. Then means this is something... oh, my voice.
Let me take a second and read what's in the chat. There's a question in the chat, let me briefly respond to that: what's the difference between having this in the head or in the body? The answer is that it just happens where you put it. Typically you want the code to be executed before everything else, so that's the main reason for putting it in the head. You could also put it first thing in the body, but wherever you put it, that's where it happens, unless you add a word like defer; then the execution is delayed, but the loading still happens. I don't think there are any safety issues, you just have to be aware of when it's executed. Okay, then. This gets back a raw response here. Let me do it like this first: fetch is something which happens asynchronously, right? You have to get back a resource from somewhere else. This is asking the server, and the server could take 10 seconds, 5 minutes, 5 hours to process this. So this is something which takes time, and it's kind of happening outside of the browser. So we have to wait for it, and await is an important keyword here. Let's just output this response and see what's happening. I think I have to go back to the original page, and I will remove the template thing here again, I don't need it. Oh, I'm sorry, this was back here. So now let's type something, and let's go to the console and see: await is only valid in async functions. I will explain that in a minute. So here it says await, and if there's an await, the function has to be async, because this is now a function which does asynchronous stuff: it's sending the fetch and it's waiting for the fetch. Let me put that on a new line here so that the line doesn't overflow.
Maybe I explain it right now. This could take a long time, as I just explained. So of course you don't want your whole web page or the JavaScript to be unresponsive while waiting; that's why you write async here. Async means: while this is running and not doing anything except waiting for the result of this fetch, you can do other stuff. I will have it on the slides in a second, but let's first do it hands-on. Let's see what happens now. Okay, I have to type something here: six. And now I see I get a response object here. So it did something: it talked to the server, it sent this. So it actually works: I'm getting all these requests on the server side when I'm typing something, and I'm getting stuff back. And the stuff I'm getting back is a complex object. It's even showing me the whole object here. It has a body and all kinds of header information from HTTP. So a bit too much information. What I really want is just what the server sent to me. So let's get that. The result was actually HTML, so I want to get the HTML from this, and this is also something which might take time, so I'm awaiting it as well. Let me log the result here. So first I'm getting a raw response, and then from that raw response I just want the HTML. Let's do that and see what happens. I'm typing 6 here, and now: response.html is not a function. Okay, I think I have to choose text here. I think what matters is the first part of the media type, it's text, and it doesn't interpret it in any way. We can talk about that more if you like. Six. Okay, so what happens now? Try to understand this. I type six times seven here; why don't I get 42, but this? I want 42 when I type six times seven, but I get the whole HTML page. Just try to understand why you are getting this.
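The two awaits described here, one for the raw response and one for its body, can be sketched like this. Everything that talks to a server is replaced by an assumption: `fakeFetch` is a hypothetical stand-in that resolves to an object with a `text()` method, as a real response does, so the snippet runs without a server. In the browser one would pass the built-in `fetch` instead; `requestPage` is also just a name chosen for this sketch.

```javascript
// Hypothetical stand-in for the browser's fetch.
async function fakeFetch(url) {
  return { text: async () => `<html>page for ${url}</html>` };
}

// "async" is required because the function body uses "await".
async function requestPage(query, fetchFunction = fakeFetch) {
  const response = await fetchFunction("search.html?query=" + query);  // raw response object
  const text = await response.text();   // extracting the body can also take time
  return text;
}

// The call returns a promise immediately; the rest of the script keeps running.
const pending = requestPage("6*7");
pending.then((text) => console.log(text));
```

Calling `requestPage` does not block: the value only becomes available later, which is why the caller either awaits it or attaches a `then`.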
This is the JavaScript doing this. So this code is giving me the whole web page in return. Why? Yeah? That's just how we did it, I mean we always sent the complete page with the result filled in? Yeah, that's exactly right, that's how we did it so far. We are requesting search.html, question mark, query. Our server gets this request, and what it did so far is return a whole new web page with the result. That's how it was done in the 90s, and now we don't want to do that anymore. So let's go to our server. We are always switching back and forth between the individual components. I think I need to make this a little larger. Where did we do this? Here, right? If the request is search.html, then we compute the result, replace the template placeholder, and return the whole HTML. Okay, now we don't want the whole page, we just want to ask for the result of the query, so we shouldn't ask for search.html. Now how do we write this request? We need to have some name here, and a typical name is api, because I'm just making a request to the server. API is an interface, the I stands for interface. What does API actually stand for in full? Is it a programming interface? It's application programming interface, yeah. We're just talking to the server, and we could choose any name we want here. Let's call it api, question mark, query, and now let's handle that in our server. The parsing of the query we did already here, this doesn't change: if we have question mark query, strip it away and parse it, so we already have that in the query variable. If the request is api, then do the following. And let me make a string out of this here, so that I have a string either way.
No, I don't really need this thing here. Now I have a string with either the result or invalid expression, if it's not yet a valid expression. And now I guess I just send that back, right? And here the result is nothing; api should not be called without a query, that should not happen. And then my response is just this result. That's what I send back, either the result or invalid expression. Let's see what happens. I changed the server, so I have to restart it. Let me type six. And now I get result six. Okay, let's look at it. Here I see my API call: that's what I sent, api, question mark, query equals 6. It's really just a string I sent, and I'm just following some conventions here. And for some reason it says 404. Why does it say 404? I mean it worked, but it still sends a 404. What did I do wrong in my server? Oh, I understand: we first try to read this as a file. The file api does not exist, so it sets not found, 404, and then it does this thing here, which actually does the right thing. We had to do it in that order before, because we first read the search.html file, then processed the query, replaced the placeholder in the HTML and sent that back. Now we don't do that anymore, so we can do this before that: just evaluate the query and compute the result. Okay, and then the file handling should only happen otherwise. So let's put an else here and indent this by four. So either the call is api, then I do this, and my response is just whatever I computed, here just a simple expression. Otherwise, interpret the request as a file name, read the file and return its contents. Yeah? So our server now does different things, right?
It's either computing stuff, when I call it with api something, or it's just serving files. Was there a question? Did I miss something? There's a question in the chat: with more than one script with defer, in which order would they be executed? Okay, that's a good question. If I had another script here with defer, I strongly assume that they would be executed in the order in which they are encountered. So if I have search one, search two, search three here, they would be executed in order one, two, three, but only after the page is loaded. So let's try our change. Okay, now type something here: 6. Now I don't get the 404. The page is not changing yet, I'm just outputting it in the console so far. I will go back to the slides and explain it in a second, but let's do a little bit more coding first. Let's see how we send this. Let's go to the network tab. What are the headers? Content type text/plain. So I just send my result back as text, the six or whatever. I think I have to make this a little larger here. Now I see it: the request was six times seven, that was the query, and what's the response? It's 42, and the content type is just text/plain. Typically the answer is something more complex, and what you want to return is actually a JavaScript object, something which you can process directly in your JavaScript. You don't want to return text which your JavaScript then has to parse or something like that. That's what we will do now.
So let me do the following here. We are in Python, so let me use a formatted string. I want the query with the query and the result with the result. Okay, I have to close the string here. Let me quickly explain what I'm doing here: I'm just constructing a string, in case you're wondering, and I'm using a formatted string. I want to use curly braces, but curly braces have a special meaning in a formatted string: they say evaluate this, so here it takes the variable query and puts in its content. That's the meaning of curly braces if you have a formatted string, with the f in front. And if I really want a curly brace, I have to write it twice. That's just escaping the curly brace. So now I should get, in curly braces, query, colon, the query in quotes; I also have to escape the quotes. And result, and this is the result I computed. Let's do that, and I have to restart the server because I changed something in my server. Let's go back to the browser and see what's happening on the console. Six. Now I'm getting back this. Oh, and why do I have two things here? Probably because I made a mistake. Oh yeah, I forgot the f in the second part, so it's just taking this verbatim and not plugging in the value. So I have to restart my server again. You see, nothing conceptually hard, it's just the combination of these many things. So now I get query six, result six. This is already looking very good, right? For six times, invalid expression; for six times seven, the query is six times seven and the result is 42. So now I'm sending a result which is already a valid JavaScript object, which is very convenient, so let's make use of that in my JavaScript.
So now I can read the result. Why do I call it JSON? This thing which I'm sending is a JavaScript object, and JSON is JavaScript object ... what does the N stand for? Notation maybe. I think it's JavaScript Object Notation; you can look it up and tell me. So this is now a JavaScript object, and I can just say it has a result part and a query part. Actually I don't need the query part, I already had the query above. And now let's plug that into our web page. I have these document query selectors for query and result, and now I'm just reading the result out of what the server sent me. Let's see whether this works. Six times seven, okay. The query works, but the result is undefined, so somehow it's not reading the result. Let's check that. Why is that not happening? The text is in line 11. Line 11? In line nine. In line nine, what did I do wrong? Oh, okay, that's true, but where's the error message complaining about this? Yeah, that's a bit mean. Now I'm sending this as a JavaScript object, and I also have to say that this is a JavaScript object. Actually there are two places. First I should say here: interpret this as a JavaScript object. I will explain it once more on the slides in a second. Let's see if this alone already works. Oh yeah, now it's also showing it as an object here, right? Do you see the difference? Let me do that again, I think that's very useful. If I write text here, it's just interpreting it as text. It looks like a JavaScript object, but it's just printing it as text. And then if I say give me the result part of this text, it doesn't work, because it's just a string.
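The difference between the text and the parsed object can be reproduced without a server. In this sketch the string `body` is an assumed example of what the server might send (the exact format produced in the lecture may differ, for instance in whether the result is quoted), and `JSON.parse` plays the role of interpreting the response body as JSON.

```javascript
// What arrives is a string that merely looks like an object.
const body = '{"query": "6*7", "result": "42"}';

console.log(body.result);          // undefined: a string has no "result" property

const parsed = JSON.parse(body);   // interpret the text as JSON
console.log(parsed.query);         // 6*7
console.log(parsed.result);        // 42
```

Reading a missing property of a string does not raise an error, it just yields undefined, which would explain why no error message showed up in the console.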
If I write json here, it's interpreting it as a JavaScript object and also showing it in the console as a JavaScript object, which has two parts, a query and a result, and then I can indeed write something like give me the result part and put it there. Okay, and it didn't complain, although it should have: let's look at that in the network tab, I was returning it as text/plain, not as JSON. I don't know why the browser is not complaining, but let's fix that. No, this is not something I should do here, I should do it up here, where I set the media type. It should be application/json, because I'm sending JSON. I have to restart my server now. Let's look at it again: six times seven. Now it says application/json, as it should. In the console it's a JSON object. I'm putting the query here, the result here, and it works. Okay, so now I have a working ... okay, invalid expression. Oh, there's one more thing. What's the %20 doing here? You also see it here. Six times seven doesn't work, I get invalid expression, and this is what my server gets. What's the problem here? Yeah, it's the encoding of the space. The form sent the space as a plus; now the space is encoded as %20. Let's look at the ASCII table: 20 in hex is the space, so %20 is the space. So let's fix that in our code. I think we had some very basic unescaping here, up here. Now it's not the plus, but %20 that should be replaced by a space. Let's rerun the server: six, space, times, seven. Now it works. It's quite impressive, right? I already have a working web app here. Not bad. It has search-as-you-type functionality. It's really quite good.
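Where the %20 comes from can be seen from the browser side with JavaScript's built-in encoding functions. This is only an illustration of the encoding itself, not the fix made in the Python server, which replaces %20 by a space; the variable names are chosen for this sketch.

```javascript
const query = "6 * 7";

// Percent-encoding turns each space into %20 (hexadecimal 20 is the space character).
const encoded = encodeURIComponent(query);
console.log(encoded);   // 6%20*%207

// Decoding reverses it, for every escaped character and not only for spaces.
const decoded = decodeURIComponent(encoded);
console.log(decoded);   // 6 * 7

console.log(String.fromCharCode(0x20) === " ");   // true
```

Replacing only %20 handles spaces; other characters that get percent-encoded would need the same treatment, which is what a general decoding function takes care of.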
And now when you get that in place, it's easy to also do more complex stuff. So pretty neat. Now let's go through the rest of the slides, just recap some stuff, and then let's have a break. Is there, would you mind sharing your, if we had more, okay, that's something. So yeah, let's do that before the break and then have a break. Just recap some things, the fetch here. So yeah, this is how we wrote it, right? We had this fetch. We didn't use then, let's maybe do that very quickly. It's just an equivalent way of doing what we just did. So here I'm waiting for the return of the call, I get the raw response object, then I want the JSON from that. Here I have two steps and you can just write them as one. Let me maybe just comment that out and let me write it how one would usually write it. So you have a fetch and when that is done, do something. And now here's another thing, this funny notation, I will explain it on the slide. What does this mean? I mean, what I want to do, I want to take the response and do this with it, it's what I did here, and then this should be my result JSON now, so I don't need this intermediate object. Result JSON. And let's explain that on a slide. The async and the await I will explain after the break, but let's maybe first look at this one here. So that's what I just did. I did a fetch, and when the answer comes, then what's in the then will be executed, which is a function. And that's something you have a lot, I already said that: an anonymous function. So it's a function which gets the response from the fetch and does something with it, and it does something very simple, it just gets the JSON from the response. I don't want all the other HTTP information.
That's a very typical kind of anonymous function. It gets an argument and does something very simple with it, and because that's so frequent you have a shortcut for it. The bold thing here is exactly the same as this here. Take this as an argument, do this. It's exactly this, it's just a shortcut. So if you wonder about this in JavaScript code, it's just an anonymous function shortcut. It also works for multiple arguments. I will explain the rest in a second. So that's the way you will typically write it in JavaScript. You fetch something and then you say, okay, I want the JSON from this, and then you assign it here. Let's just check whether this still works. No, because I did something wrong. Note how nice it is that you get this debug info here. It says in line 11; it's a scripting language. I did something wrong, and of course the semicolon here doesn't work. So it is complicated, all those things together. Okay, now it's time for a break. A few more explanations on this and then on to the next part. Five minute break. So on to the second part. We have a working JavaScript web app now. And that of course will be the first part of exercise sheet nine: to make your web app from the last sheet dynamic. As usual, you can just follow what I did in the lecture. Maybe you want to do more yourself or follow the lecture more closely, and you have a lot of explicit instructions here how you do it. I want to explain a few more things. I mean, you should understand it, but to do it you can just... let me start that sentence again. It's about this await, fetch, then thing, right? There are always two levels of understanding. One is, okay, you just have to write this and then it works. But there's actually some stuff which is interesting to understand here and I want to briefly explain it. Let's spend a few minutes.
Why is there an async here? Why is there an await here? What is this then here? I already explained the syntax. So please listen for a few minutes, it's actually really interesting. So here's the first thing to understand. Pay attention because it's really interesting and important. JavaScript is single-threaded. This is quite surprising because many things are happening at the same time, right? Here something is loaded, here something is blinking, and an image is still being loaded. It looks like it should be multi-threaded, right? Many things happening at the same time on a web page. We just said you can write async in the HTML and then the page is loaded while JavaScript code is executed. But JavaScript itself is single-threaded. And what does it mean? Let's just go here, here's JavaScript. This might be very big code for a big website. At any given time, you're at exactly one point here. There are no different strands of execution, right? But still, you have asynchronous functions. And the most typical asynchronous function is like fetch, get something from another server, like we did here. It was something very simple here, but it could be the result of a big computation. And you can declare functions as async, and the important part to understand here is that if you call such a function like fetch, then this is time which is not spent in the normal JavaScript, it's spent elsewhere, right? I'm now asking the server, now a server on another machine is doing something which might take a long time, and there's no JavaScript happening in the meantime. It would of course be completely wrong to now wait for this, right? And do nothing, like really wait here. Like the JavaScript at this point stops, it freezes, and then the server takes one minute. In the meantime you can't do anything with the web page.
So what you want is that in the time where something is happening outside you can continue with your JavaScript, and that's what JavaScript is doing, right? So when you write this await here, JavaScript will say, okay, let me just wait for this and in the meantime carry on with some other part of the JavaScript. Maybe you added five different event listeners here and some other event is happening and there's something to do there, then it will just continue there, right? So what JavaScript has is kind of a stack of things it has to do, and whenever it has to wait for something, then it says, okay, let me put this on hold and continue here. It will only do one thing at a time but it can jump between functions. So that's important to understand. It's single-threaded but it can jump between different frames of things. And the nice thing about writing it with this syntax: the old way of doing this, which I'm not explaining here, is that you would have callbacks, right? You would say, okay, fetch this, and when this is done please call a particular function in my code, and then this code is executed. You don't have to care about this here. It will happen automatically, right? The await will say, okay, while I'm still waiting, do something else, whatever you have to do. And as soon as I'm done, I will continue execution here. It enables much nicer code. But you do have to understand what's happening. And it can be really tricky because the JavaScript code jumps between different things here. So that's one important thing to understand: single-threaded, but it can jump between frames. And now these are the keywords which are important, async, await and then, if you don't want to use callbacks. So async means this is a function which might spend time elsewhere and just wait, so you must declare it async. We got an error if this was not async. Await means, yeah, I just explained it, wait and do other stuff in the meantime if you want, or do nothing in the meantime.
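The same single-threaded jumping can be observed in Python's asyncio, which uses the same async and await keywords. The following is only an analogy under that assumption, not the lecture's JavaScript; slow_fetch and other_event are made-up names, and the sleep stands in for the time spent on the server.

```python
import asyncio

log = []

async def slow_fetch():
    log.append("fetch: request sent")
    await asyncio.sleep(0.1)      # stands in for the time spent on the server
    log.append("fetch: reply arrived")
    return 42

async def other_event():
    log.append("other event handled")

async def main():
    task = asyncio.create_task(slow_fetch())
    await asyncio.sleep(0)        # give the fetch a chance to start
    await other_event()           # runs while the fetch is still waiting
    result = await task           # execution continues here once the reply is in
    log.append(f"result: {result}")

asyncio.run(main())
print("\n".join(log))
```

There is only one thread here, yet the other event is handled between sending the request and receiving the reply, because every await is a point where execution may jump elsewhere.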
And then also belongs to this. I also explained it, but there was a question in the chat, so maybe I explain it again here. This fetch is something which can take time, and this means wait for the fetch, and this is how I wrote it here, right? Doing an await on the fetch: if it takes time, do something else in the meantime; as soon as it's done, then execute this next function here and wait for that too. And you can have a whole chain of this. Here I could have another then and another then and so on, where each then awaits the reply from the previous one, whatever it is. And this final await just awaits the final then. So this await really is an await for the final then, however many thens you have there. Yeah, and all three of these are used here and you also need them for the exercise sheet of course. And this is one thing now that's one level below, but let me at least briefly explain it: how is this done in the JavaScript code? You could actually code this yourself. It's done using promises, and you could even look at that, but we don't have time for it now. What this fetch does, it actually returns immediately. You could just write it in your code and look at the object, and what it returns is an object called a promise. And a promise is just an object which has two components. One component says what's the current status of this thing, right? The fetch returns immediately and then I could look at the object and it would tell me, okay, something is still ongoing here, this has finished, or this has failed. So these are these three things and I can ask that object repeatedly if I want. And the other component is the result, if it's fulfilled. So let me say that again. Instead of writing await here, I could just have a normal assignment, let some variable equal the result of the fetch.
I will get a result immediately and now I can look at the result now or in 10 seconds or in the future and I can always ask that result, okay, what's the status of this operation? And it will tell me ongoing, finished or failed, and if it succeeded, what's the result. And it's a lot of fun, so if you have time or you are interested, play around with this by just looking at the objects in the JavaScript and outputting them. You can also write promises yourself. If you write library stuff you would do that, but typically you don't need it because you have nice functions like fetch which hide all this from you. And whenever you have a promise, then you can use these, await and then. I think that's enough for now because there's other stuff I want to show you. JSON I already showed you. One last thing just to mention, you don't need it for exercise sheet 9. This thing here is now very fast, right? It's sending a lot of requests. Just imagine each request here would take seconds or minutes. And it will be so for exercise sheet 9 because each request you send requires computation on the part of the server. Here it's very fast, but imagine it takes time, then the server has a lot of requests it has to handle at the same time. So it's very natural for the server to be multi-threaded. Every reasonable server which processes stuff coming in from one side or from many sides at the same time will want to be multi-threaded. So what would a typical server do? It would actually be very easy to do in our code. Let's just go to the Python code one more time here. What does it do? Let's go back to the socket communication. I receive the request and now I call handle request, which does the actual work. It would be very easy here even in Python to just say do this in an own thread, right? Open an own thread or even an own process, handle that request there, and wait for the next request to come in. Because of course I want to be responsive.
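The idea of handling each request in its own thread can be sketched as follows. This is not the lecture's server code: handle_request and serve_all are hypothetical stand-ins, there is no socket involved, and the work is simulated by a sleep. That last assumption matters, because threads only overlap this nicely when the work consists of waiting rather than computing in Python.

```python
import threading
import time

def handle_request(query, results):
    # Hypothetical stand-in for the real work: wait a bit, then store a reply.
    time.sleep(0.2)
    results[query] = query.upper()

def serve_all(queries):
    # Start one thread per request, then wait until all of them are done.
    results = {}
    threads = [threading.Thread(target=handle_request, args=(q, results))
               for q in queries]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results

start = time.perf_counter()
results = serve_all(["a", "b", "c", "d", "e"])
elapsed = time.perf_counter() - start
print(f"{len(results)} requests handled in {elapsed:.2f} seconds")
```

Handled one after the other, the five simulated requests would take about one second; with one thread each, they take roughly as long as a single request.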
Yes please. Isn't it impossible in Python to do a call request? Impossible or possible? Impossible. I mean you have to go to a new server. Yeah exactly. That's the next thing. Actually you could write it here but I don't do it now. You could say do this in a new thread, it would work, but then it wouldn't, because Python is terrible at multithreading, I mean that's the reply to what you just said. Because Python has a similar problem or feature as JavaScript, that it can only execute one piece of Python code at a time, because Python is an interpreted language and this interpreter has a lock. So you can't have two threads which execute different parts, one thread executing this and another one executing that, because the interpreter is not multi-threaded. It has state and so on. So there is a module in Python called multiprocessing, but that will copy the whole process, fork as you say, and then you have two full processes, each with the complete data. And for the exercise sheet, the complete data would mean your complete q-gram index. I mean, we tried that one year, I think last year or two years ago. You could say, okay, for this response, now just copy the whole process, but it will take time and a lot of memory. It will copy the whole q-gram index and everything for responding to your queries, and now you have a copy of the data structure. And if you have five queries at the same time, you have five copies and you will just run out of RAM. So bottom line, Python is terrible at multi-threading. And it's just really terrible. There's no way to solve it because the language is just not made for that. Which means a reasonable server will have to do this somewhere else.
So what we did, what Sebastian did instead, that's why we had Rust code to speed up the exercise sheet. You want fuzzy search, and if each query would take like half a second, then you would run into a situation that your web page would become really unresponsive because the requests would just pile up, and then it would be terrible. Now we have to go to the next topic. We have two more topics, 15 minutes each. They are smaller topics but very nice and interesting to know. Now, without much care we just did something here, we were happy when it worked, but there are some potential problems which maybe you didn't even think about, or maybe you did, and let's look at three of them. Access to private data, code injection and, yeah, let's just see what it is, and the slide is just for reference. Let me mainly show it here and then go back to the slides occasionally. What is one thing we could do? Let me do /proc/cpuinfo here. Okay, so what did I do? I mean, of course I wrote this so that I can show our great search HTML page, but now I just wrote /proc/cpuinfo, which is a file on my machine which lives somewhere else completely on my machine. It says GET request for /proc/cpuinfo and it's just returning the file faithfully, as I programmed it to do. So that's now the CPU info, maybe not so private on my machine, but maybe not something which I wanted to share. And I wasn't even aware that I'm enabling this. This machine is actually a big machine with a lot of data on it. I've now written an application, given to the outside world, where you can read any file on my machine. Maybe not what I want, maybe my mail is on there. Yes? In the last exercise sheet we were also instructed to handle traversing of the tree. For some reason I didn't get it to work, so the browser and even curl refused to interpret this double dot. Is this something new or? Yeah, that's a good point.
So browsers, because users or even application programmers forget this stuff, also try to say this looks fishy, maybe I shouldn't even let this through. I mean, now of course I don't know how you programmed your server, maybe for some reason it doesn't let this through. Actually I don't know what happens here when I type dot dot. Okay. Yeah, it disappears. There's some protection built into browsers because browsers sometimes say, okay, this is never right, I will just not accept it. But as you saw here, this worked, right? So I can certainly access any file in our system via the full path name, and that's not something I want. Of course it's easy to solve, let's just go back to the path. I'm not showing this now, this would also work. 888 here, we have 8, I should maybe fix that. Of course, once you're aware of this, these are usually very easy to fix, right? How do you fix it? You say, okay, this thing shouldn't contain slashes, or the really conservative way is to just have a whitelist: this server is only for these files, or only for these files in this directory, or something like this. Otherwise, that's why you have Forbidden or something like this. So that's easy to fix. Code injection, that's another fun one. Let's look at it. We have a working web app. So let's just go back to search HTML and let's have fun and write some HTML: it's devil.com, click me, okay. Yeah, I mean if you think about it, it's not surprising. I wrote a piece of HTML here, and what is my JavaScript doing? It's just changing this to this. Let's just click on here. I'm repeating this joke as long as I'm still amazed every year. We have had this joke for a number of years now: www.devil.com just redirects to mybible.com. So the Catholic Church for some reason bought this domain. It's really funny, yeah. It's been like that for a number of years now, it doesn't change.
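The whitelist fix for the file access problem mentioned above can be sketched in a few lines. The file names and the function name are assumptions for illustration; the lecture's actual server code is not shown in this transcript.

```python
# Hypothetical whitelist: the only files this server is willing to hand out.
ALLOWED_FILES = {"search.html", "search.css", "search.js"}

def resolve_request_path(request_path):
    """Return the file name to serve, or None if the request is forbidden."""
    name = request_path.lstrip("/")
    # Anything not listed is rejected, including names that contain slashes
    # or dots pointing somewhere else on the machine.
    if name in ALLOWED_FILES:
        return name
    return None

print(resolve_request_path("/search.html"))
print(resolve_request_path("/proc/cpuinfo"))
print(resolve_request_path("/../etc/passwd"))
```

A None result is where the server would answer with a Forbidden status instead of opening a file. The whitelist is the conservative option: it does not have to anticipate every way a path could be dressed up.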
So yeah, now you can say, okay, that's not really harmful, I'm just inserting a link here, but you could imagine scenarios where this is harmful. And maybe let me also put in, what's the HTML entity for the devil in Unicode? I think there's a symbol for the smiling face with horns. Oh, it's U plus 1F. Yeah, we will talk about that in a second. Let me try that as an HTML entity. Let's have a little bit of fun here. Let me try that. Ampersand, hash, the code, semicolon. Ah no, that's not it. How do you put an HTML entity with a code? I thought it was hash, but apparently it's not. We'll come back to that in a second, but now I want to, no, I don't want the winking face. If you know the Unicode code, how do you specify the HTML entity? Does anybody know? Apparently it's not hash and the code. It's smiling face with horns, that's the politically correct way to say devil. Okay, smiling face with horns. UTF-8. Code points. Here it should say what I want. Oh, it is hash. No, it does say HTML entity. Oh, x, because I did it in hex. Yeah, now I have it, okay. The x was missing because I specified it in hex. Now I have that one, okay. So you can insert arbitrary HTML if you don't pay attention. Happens pretty quickly. So yeah, that was just a click-me hack of course. Or just think of this: in your GET request here, you just insert JavaScript, and you can execute arbitrary code. It's maybe a bit more dangerous already. And now you say, yeah well, I will see it if there is a script tag in the URL. But there is URL encoding, we will talk more about it in a second. You can just obfuscate this, right? So if you get links with a lot of funny characters like this, be aware that maybe you should look at them more closely. Okay. Here's another one. On the forum, of course the forum has to take care of this too.
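One common defence against this kind of injection is to escape user input before it is put into a page. The lecture does not show a fix at this point, so the following is an assumption about how one could do it on the Python side, using the standard html module; the input string is a made-up version of the click-me link. The last lines also show the two numeric forms of an HTML entity, decimal and hex with the x.

```python
import html

# Hypothetical user input: a piece of HTML instead of a plain query.
user_input = '<a href="http://www.devil.com">click me</a>'

# Escaping turns the markup characters into entities, so a browser
# displays the text instead of interpreting it as a link.
safe = html.escape(user_input)
print(safe)

# A numeric HTML entity is ampersand, hash, the code, semicolon.
# A hex code needs an x after the hash.
print(html.unescape("&#128520;") == html.unescape("&#x1F608;"))
```

Escaping is reversible: unescaping the safe string gives back the original input, so nothing the user typed is lost, it is just no longer live HTML.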
On our forum, if you post something like this and then you, if the forum software wouldn't pay attention and you just have a script in there with JavaScript code and then early versions of such software did this, it would just execute the code. And now this is code running on your machine and it could do stuff with your data, right? So it could like, even maybe you have Gmail open or some private data open in another tab, it could just steal that, send an email. It's quite easy actually. If you're into this you can try it yourself. I mean this is just, this is now running here and you're, this is now running your code and you can do all kinds of stuff. If you manage to grab personal information here, then I can just send a fetch request to this server here with the information and now my server has that information. And that's of course still an ongoing topic. Here's another one, same origin policy. Let me also show you by an example, it's I think the best way to understand. So here, so far all of this was happening on one machine, but it doesn't have to. Let's ask the same thing here on a different machine. I mean I could have specified, so the machine we are working on is called Tura, that's a river in Asia, Russia I think. So this should also work. It's Tura 8080. I'm in a local network which means I don't need fully qualified names. Let's just try that out whether it works. Yes, this also works and it gets the request here. So let's just check the machine is Tura. We also have another machine which is called Amur. This is another river in Asia. Our machines currently have river names. So let's, no I don't want to go to, I think I want to go to teaching information retrieval 24 public code lecture 9. Yeah, that's our code. Now I can write, actually think about it, it's kind of funny. I can run the exact same server here. No, let me first not run it and see what's happening. Let me now just ask another machine and now I should get 404, right? 
Because nothing is running there yet. Let's just do that, that's fine. I didn't send anything, six. Ah, okay, it's not even sending the request because now it's saying, okay, this is not allowed. And I will explain in a second why it's not allowed, but let's first see that it's not allowed. So now this website is running, it's talking to a machine called Tura, and now the JavaScript wants to talk to a machine called Amur and it's not allowed, even if I'm running this here. So it doesn't even reach the server, right? Yeah, no request here because right away when I... Oh no, no, it's not true. The last sentence is not true. It reached the server, so I'm getting this request here on Amur now, six times seven, it's even computing something. But the JavaScript says I'm not allowed to use this result. So this page now wants to do something with the result, build it into the page, like display the 42 here which this computed, and it says I won't. And let's first see how to fix it and then explain why it's a problem. The way to fix it is in my server, let's just quickly do that. Here I add another header, and I think I only need that for the API call. What was the if? If request is API. Let's do that. If request is API, that's correct, I add an access control header. And I want to say, so I'm the server now. This will be running on Amur, on the new machine, on the second machine. And I'm saying, now I have two, so which one? Now who has to allow what to whom, let me think about it. Let's think about it together. What's the security situation here? So I'm thinking about, does Tura have to say, yeah, it's okay to use the result from Amur, or does Amur have to say it's okay if Tura uses my result? Which of the two is it? Now I'm slightly confused myself. I mean I know it but I'm just... What's the security situation here? This allows reading the... So what do you think?
We could of course just try it out. Yeah? If it just allows the reading of the answer, then you should allow to read the result, and on the Tura server you have to specify the reading alone, you will be able to see the images. Okay, let's verify that, which means I don't have to restart this server. This is Amur, we now use Amur just for computing the results. But here we are saying, let's do that, I have to say on Tura, HTTP, Amur, it's okay to read results from Amur. And I have to specify the machine and port. Let's just do that. Let's see if it works. So now I'm restarting Tura, this one, and let's just see if that resolves the situation. Okay. The same. Now of course the question is, did it? Is the header here? I mean, first of all, let's verify that, if request equals API. First I should see the header here, right, access control. Sorry for the confusion, but it is confusing. Cache control. No, no, I mean, this is not responding to a request now, right? So it has to be Amur which is sending this header because it's computing the reply. So let me restart the server on Amur, and this should not change anything but at least I should see the header now, right? And then let's see, where is the, yeah, it says here Access-Control-Allow-Origin, Amur. Okay, now Amur is saying it's okay if Amur uses my result, but I guess I have to write Tura here. Okay, let's just do it the stupid way right now and then try to understand why. So let's write Tura here and restart the thing on Amur. And now let's see whether it works. Six times seven, now it works. Okay, so Amur, but it's actually okay. It's confusing and you have to ask exactly the question which I just asked. Who has to allow whom something? So this worked, so let's just see what we did.
And then let's try to understand it, or maybe not fully understand it, you have to do that at home, but just what it's about. So what I now did: Amur sent a header with its result, and we can look at that header thanks to the developer tools here. It said Access-Control-Allow-Origin, it's okay if Tura uses my result. This is what we did, right? Amur, the new machine which does only the computing now, not the file serving, says it's okay if Tura uses my result, and that way my web page works again. And by the way, we could have just written star here, which means everybody can use my result. And why is that important? Well, a typical way: just assume Comdirect, that's just an online bank, and actually it's not Comdirect, it's Comdirect with a K, people will not notice, and you make it look exactly like the real page. Yeah? For some reason you happen to be good at this, you make it look exactly like the real page, and now people enter their data, they log in, and now you just send that data to the real page. You know how the API calls to the real page work, and they will have a different domain, they will have Comdirect with C, right? And now Comdirect with C, the real bank, will not allow Comdirect with a K, the wrong one, the evil one, to use their result, right? So that's why what we just did was the right thing. The one providing the service, which could be confidential or private information, has to be careful who is allowed to use its result. So here we are just computing mathematical expressions, so we might just say everybody who calls me can use my result. But if you are a bank, if you are the server and you are giving out secret information, you should be careful who may use your result and you will restrict that. And actually, yeah, I don't want to advertise for banks here, it's just something that came to my mind.
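What the computing server has to send can be sketched like this. The function name is hypothetical and the origin string is an assumption based on the machine name and port used in the lecture; the actual server code is not shown in this transcript.

```python
def build_api_response(body, allowed_origin):
    """Build a raw HTTP reply that lets pages from allowed_origin read the result."""
    payload = body.encode("utf-8")
    headers = [
        "HTTP/1.1 200 OK",
        "Content-Type: application/json",
        f"Content-Length: {len(payload)}",
        # The server that computes the result decides who may read it.
        f"Access-Control-Allow-Origin: {allowed_origin}",
    ]
    return ("\r\n".join(headers) + "\r\n\r\n").encode("utf-8") + payload

# The reply is sent by Amur; the page that wants to use it was loaded from Tura.
response = build_api_response('{"query": "6 * 7", "result": 42}',
                              "http://tura:8080")
print(response.decode("utf-8"))
```

Passing "*" as the origin would correspond to the star mentioned above: everybody may use the result.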
I think if you do this, if you enter this into the browser, it will actually redirect to the real bank, right? Because they bought these domains as well too, so that's a typical way. And actually there's even laws that you are not allowed to put up websites with very similar names, right? So like Google or something, that will probably also redirect, I don't know. Let's just try it because otherwise you could do all kinds of, yeah. So basically these companies bought all the domains which sound similar. And if you buy it or bought it before, actually you can be forced to give it to the company. You can't say, I bought it, now I will use it. And it just differs by some tiny character. Okay. So that's just explained here on the slide. I will not go through the slides anymore because I already explained it. And there's one exception to this same origin thing. So it's easy if everything is happening on one machine and that's JavaScript. So JavaScript, so in the past we used jQuery, we don't use it anymore but maybe you want to use libraries. Here we're using JavaScript from our own machine but maybe you want to use a library which provides all sorts of fancy functionality and you can use that so you don't need anything special for that you can just write this and there's a little story here which I think I will just skip and rather talk about the last part so in the past this was exploited for all kinds of things. And let me just skip to the last part. Yes, and that's correct what you wrote on the chat. Last part and then we are done. That's actually nice and easy but it's something to, we have already seen the encoding stuff. It's 13 slides but don't be afraid because it's not deep and also not much live here. So context switch, it's something that's not completely new but you don't need what we did so far. It's just about how do you represent all the funny characters in the world, right? So we have seen the little devil here, it's no longer here. 
No, I don't want the, yeah. So we have already seen Unicode here. You have of course the letters, but you also have funny characters in other languages or even symbols, how do we represent those? This is what Unicode is about. So a little bit of history. For the longest time the standard was ASCII, and we saw the ASCII table here. That's the ASCII table. It's about what you can represent with one byte, and it's actually just showing the lower part, from 0 to 127, so that's seven bits, and what you see here is all the typical characters in Latin script. Of course a strong bias towards our script. That's just how it is historically. All these typical characters here, the small Latin letters, the capital ones and some symbols. So of course that's not enough, right? One byte gives you 256 values, and Chinese alone has tens of thousands of different symbols. And there are so many other languages. So how was it done in old times? In old times you would use the lower part of ASCII, what you see here, for the typical characters which you use in programming and computer stuff. And for the other part you would have different standards. So one standard which is still in use today is called ISO 8859-1. If you now look at the numbers from 128 to 255, it would give each number a particular symbol, but only if you use that standard. Let's just look at it here. There should be a table here, here it is. And I will donate to Wikimedia, but not right now. So for example hex code D6 is the German umlaut Ö, the O with two funny dots on top. And here you see how it's drawn. Thank you Wikipedia for how you draw an Ö. So as you see, it's all these Latin script special characters which cover most of the symbols you have in the European languages, right, including German.
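That the meaning of a byte in the upper half depends on the chosen standard can be checked directly. The sketch below decodes the byte with hex code D6 once as ISO 8859-1 and once as ISO 8859-5, the Cyrillic part of the same family; bringing in the second standard is an addition for illustration, it is not shown on the slide.

```python
upper_byte = bytes([0xD6])   # a byte from the upper half, 128 to 255
lower_byte = bytes([87])     # a byte from the lower, ASCII half

# The same upper byte is a different character in each standard.
print(upper_byte.decode("iso8859_1"))
print(upper_byte.decode("iso8859_5"))

# The lower half is plain ASCII in both standards.
print(lower_byte.decode("iso8859_1"), lower_byte.decode("iso8859_5"))
```

So a byte sequence alone is not enough to know the text; with these standards you also have to know which one was used.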
So that's one way to do it, you just do this for the upper part, and then you can switch that, so maybe you're now Cyrillic or Chinese or something, then you have a different character set there. And this you sometimes find in Microsoft Word or in mail programs. They switch the encoding and say now let's use this, now let's use that. But that's also not nice, you just have 128 different characters, and if you want to combine many of them you have to switch all the time. So the Unicode solution is very simple: you have one number for every symbol in the world, but this now potentially has to be a large number, right? Not every symbol in the world, but let's just say almost. And this is how you do it. So here's the number: for small a you just use a small number, 97, and these happen to be the ASCII numbers, I have a slide on this. Now for the German umlaut ä you use a slightly larger number, 228, and here are the hex numbers, and for some reason this is here. And now some funny new symbols like the euro sign get even larger numbers, and you can see here I wrote how many bytes you need to write this. This is fine with one byte, here you already need two bytes, here you need three bytes. And Unicode knows slightly over 1 million of these numbers, and each number corresponds to exactly one symbol. And, I'm sorry, I want to go here, we already saw this, so for example the smiling face with horns, not called a devil, has the number, where is it? This is the hex number up here, but there should be the decimal number, yeah, 128,520. So that's just the Unicode number for this and it will never change. It will remain from now until the end of the universe the Unicode number of this symbol. So that's simple enough. Now we have a number for each symbol and this number can become very large. One byte is not enough, two bytes is not enough, that's just 65,536. So how many? Up to four bytes it is. And now how do you represent that in a byte sequence?
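The numbers from this passage can be looked up with Python's built-in ord function. The byte counts in the last column are those of UTF-8, which is an assumption about what the slide shows; the dictionary name is made up.

```python
# The four symbols from the lecture, written as escapes to be safe.
symbols = {
    "small a": "a",
    "a umlaut": "\u00e4",
    "euro sign": "\u20ac",
    "smiling face with horns": "\U0001F608",
}

for name, symbol in symbols.items():
    number = ord(symbol)                   # the Unicode number, decimal
    size = len(symbol.encode("utf-8"))     # bytes needed in UTF-8
    print(name, number, hex(number), size)
```

The numbers grow from 97 to 128,520, and the number of bytes needed grows with them, from one to four.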
And now there are different ways. Of course one way is to say, and that's the simplest one, four bytes are enough. With four bytes you can represent over four billion numbers, far more than Unicode needs. Let's just always have four bytes per character, and there is a standard for this and it's called UTF-32: encode each character in exactly four bytes. That's simple but very wasteful, right? Because for most characters you only need one byte. And so you also have a standard which uses two bytes for every character, which is often enough, except for very special characters where you need more, and then you use four bytes. This is what Java does. I think I have a slide on this. It's kind of the worst approach of all, because now it's still wasteful, but it's not always the same number of bytes. So I think this was a bad decision by the Java people, but anyway, history. And then there is UTF-8, that's what everybody has been using for a long time now already. You just use as many bytes as you need, basically. So one, two, three or four. And this encoding I will now explain. And now of course your encoding has to be such that you can see, by just looking at the bytes, is this a 1, 2, 3 or 4 byte code. It's actually a very elegant and nice scheme, very easy to understand if somebody explains it to you, which I will do now. So UTF-32 is fixed size, just to say that again, always 4 bytes, that's easy, you know, 4 bytes, 4 bytes, 4 bytes, but very wasteful. The others are variable length, so you need a way to figure out, okay, how many bytes belong together to form a character. And here are the details of UTF-8. If you have a small number which you want to encode, for example in the ASCII range, right, you want to encode the capital W, which is 87. This fits in seven bits. And then the UTF-8 code just has a zero at the front and then the seven bits which encode the number. So for the most common symbols it's one byte, and you encode it like this.
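To compare the three schemes side by side, one can simply count encoded bytes in Python. A minimal sketch; the function name `encoded_sizes` is an assumption for illustration, and the `-le` codec variants are used only so that no byte order mark is added to the count.

```python
def encoded_sizes(text):
    """Number of bytes needed for `text` in UTF-8, UTF-16 and UTF-32."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # little endian, no byte order mark
        "utf-32": len(text.encode("utf-32-le")),  # little endian, no byte order mark
    }


for ch in ["W", "\u00e4", "\u20ac", "\U0001F608"]:
    print(ch, encoded_sizes(ch))
```

For plain ASCII text UTF-8 needs a quarter of the space of UTF-32, which is the wastefulness the passage refers to.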
If you have two bytes, which means your code point fits into 11 bits, you just encode it like this. So you just reserve some bits here in each byte to say what am I. This one says 110, just by looking at the first byte. 110 means I'm the first byte of a two-byte UTF-8 sequence. And the number of ones here just tells you how many bytes. It's actually very elegant and very easy to understand. These bits actually carry content now, and 10 means this is a continuation byte of a multi-byte UTF-8 sequence, right? And here I have the code point, and you can easily see this is 16 bits minus 1, 2, 3, 4, 5, which is 11. So you have 11 bits for contents, and it just goes on like this. So now if you need three bytes, then the first byte starts with 1110. This is no contents, just signaling: this is the first byte of a three-byte UTF-8 character. And then come some bits carrying contents, and then I have two continuation bytes which start with 10. And you can easily compute it yourself. Three bytes have 24 bits, minus one, two, three, four, five, six, seven, eight. 24 minus eight is 16, and you can do the same with four bytes and that's it. And there you can fit in 21 bits, and 21 bits is enough for two million, and there are only about one million code points, so that's more than enough. That's why the standard only goes up to four bytes. And now let's look at some nice properties and then we are done. Many of the slides are just for reference. It's a really nice standard, UTF-8. There are some standards where you wonder what these people have done, but this is a great standard. So for one, as you can see here, if you have a small number, it's just the normal ASCII code, right? By definition. So if you just write the 87, the capital W, it's the same in UTF-8 and in ASCII. Which is great, because it means that old code will not break if there are no special characters in it.
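The bit layout described here can be written down as a small encoder. This is a sketch for illustration only (in practice one would call `str.encode("utf-8")`); the function name `utf8_encode` is made up, and the sketch deliberately ignores the surrogate range U+D800 to U+DFFF, which real encoders reject.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point as UTF-8 bytes, following the bit patterns above."""
    if code_point < 0x80:        # fits in 7 bits:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # fits in 11 bits: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point < 0x10000:     # fits in 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    if code_point < 0x110000:    # fits in 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0b11110000 | (code_point >> 18),
                      0b10000000 | ((code_point >> 12) & 0b111111),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    raise ValueError("not a Unicode code point")


# Print each encoding in binary so the signaling bits are visible.
for ch in ["W", "\u00e4", "\u20ac", "\U0001F608"]:
    print(ch, " ".join(f"{byte:08b}" for byte in utf8_encode(ord(ch))))
```

Printing the bytes in binary makes the leading `0`, `110`, `1110`, `11110` and `10` markers easy to spot.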
If you have a sequence with only very normal standard characters, it will be a valid sequence in UTF-8 as well. So that's a good thing that they did that. Here's another nice thing, I'm just telling it. If you have the next more complex thing, where you have a one at the top bit, then this is just ISO 8859-1, which was previously the most frequent special encoding. Of course there's a strong bias here towards where this came from. No, no, what I just said is wrong: in UTF-8 this is a two-byte sequence. But the Unicode code point, so it's mostly right what I said, just one little side remark was wrong, doesn't matter. So for example, let's take the German umlaut ä, 228. In this ISO 8859-1 which I explained to you, it's E4. If you write this in binary, it's a... Let me show you this one thing, because it's a typical exam question, or at least part of an exam question, and you should absolutely know this as computer scientists. In case you didn't, now is the time: converting hex to binary. What is E4? I mean, below you have the result, but let's do this anyway. E4 in binary. Of course one thing you could do, this is hexadecimal, you could first convert it to decimal. Then A is 10, right? So E is 14. You could multiply 14 times 16, right? I hope it's correct, E is 14, yes? So this is 14 times 16 plus 4. What's 14 times 16? 224. I hope calculation in your head works. So it's 228 in decimal, and then I could convert the 228 to binary again. But you don't have to do that, because with hexadecimal, this digit is four bits and this digit is four bits, right? So the E is 14, and it's a four-bit number. Let me just write that. In a four-bit number you have the one, the two, the four and the eight, so you certainly have the eight here. Then do I need the four? Yes. What's still missing to 14? The two. And I guess that's 14, right? I have the two, I have the four, and I have the eight. I think that's correct, that's 14.
And the four is, yeah, I don't have the one, I don't have the two, I have the four, and I don't have the eight. So there we go, right? So you just have to encode the individual digits. That's actually called a nibble, half a byte, I think it's called a nibble, I hope that's half a byte, yeah. And so now you get the binary sequence. So yeah, you can do the other thing, but don't do that. Don't convert to decimal, do this computation, then go back to binary. It takes you three minutes, whereas this is half a minute, very fast. So from binary to hex and vice versa is very easy. So E4 is 11100100, and actually it's correct, as we see below here. And as it happens, the Unicode code point for these ISO characters is just exactly what it was in this old standard. It's on the slides here. And of course, UTF-8 is made such that only the rarely used characters need more bytes, or the ones which came later. Like the euro sign was a bit late already, so it uses three bytes, because it's an invention of the new millennium. Yes, the smiley isn't aligned. And that's one more thing, and then I think we should almost stop. Another nice thing of the encoding, look at it, and that's important in practice. You now have a sequence of bytes and you are somewhere in the middle, and now you want to continue parsing from there. I mean, inherently this is a left to right thing, right? If you go from the beginning left to right, you can decode it, but what if you are in the middle of a string? Well, just looking at one byte, you know what's going on, right? If it starts with 10, then you know you are in the middle of a sequence. So either go a little bit to the left or to the right until you find one of these first bytes. You can always identify them: they start with 110, 1110 or 11110, or just with a zero, and then you know that you are at the beginning.
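The digit-by-digit hex to binary conversion worked out on the board can be sketched in a few lines of Python. The table name `NIBBLES` and the function name `hex_to_binary` are made up for this illustration.

```python
# One hex digit is one nibble (four bits), so a 16-entry table is all that is needed.
NIBBLES = {f"{value:X}": f"{value:04b}" for value in range(16)}


def hex_to_binary(hex_string):
    """Convert a hex string to binary by translating each digit separately."""
    return "".join(NIBBLES[digit] for digit in hex_string.upper())


print(hex_to_binary("E4"))   # E -> 1110, 4 -> 0100
print(int("E4", 16))         # the slow route via decimal: 14 * 16 + 4
```

No arithmetic on the whole number is needed, which is why this direction is so much faster by hand than going through decimal.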
So just by looking at one byte, you know exactly: am I at the beginning of a UTF-8 multi-byte character or in the middle? And then just by going a little bit to the left or right, you can continue parsing. And it's always a full integer number of bytes, one, two, three or four. It's not that parts of bytes encode characters or funny things like this. One more thing you should know: of course for every Unicode code point there's a valid UTF-8 byte sequence, but not vice versa. So this is not valid UTF-8: I'm using the template for a two-byte sequence here, and now I have all zeros here and just seven bits which are something else. And this is not valid, because I could have encoded this in one byte as well, right? So there would be some ambiguity here. If I have a short code point, I could in principle also encode it with two bytes, but that's just not correct. So if you had a sequence like this, then any program processing it would say that's not valid. And it will be shown like this. This is a symbol you don't see so often anymore, but you used to. It says invalid UTF-8 characters. Okay, now I think we have arbitrary characters in URLs, we already saw this. And you just use percent encoding, right? The percent sign, and then comes the hex code of the number you want to encode. And you have to do it for the exercise sheet, but I think the slide is enough. We don't have to see it now. So if I send a space in the URL, it will be percent 20, and then you have to decode it. And the last part, I think I will show it in the next lecture in the new year. This is not something we have to do now. You can look at it on the slides. Let me quickly show you the exercise sheet. I think it's clear. So first make it dynamic, what I did in the first part of the lecture.
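Two of the properties from this part, finding the start of a character from the middle of a byte sequence and rejecting a code that uses more bytes than necessary, can be tried out in Python. A minimal sketch; the helper names `char_start` and `is_valid_utf8` are made up for illustration, and the two-byte sequence `C1 97` is a hand-built overlong encoding of the capital W (87).

```python
def is_continuation(byte):
    """True if the byte has the form 10xxxxxx, i.e. it is in the middle of a character."""
    return byte & 0b11000000 == 0b10000000


def char_start(data, i):
    """Walk left from position i until the first byte of the current character."""
    while i > 0 and is_continuation(data[i]):
        i -= 1
    return i


def is_valid_utf8(data):
    """Let Python's own decoder decide whether the bytes are valid UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


data = "a\u20ac\u00e4".encode("utf-8")  # 1 byte, then 3 bytes, then 2 bytes
print(char_start(data, 3))              # position 3 is inside the euro sign
overlong_w = bytes([0b11000001, 0b10010111])   # two-byte template around the 7 bits of 87
print(is_valid_utf8(b"W"), is_valid_utf8(overlong_w))
print(overlong_w.decode("utf-8", errors="replace"))  # shows the replacement symbol
```

With `errors="replace"` the decoder substitutes U+FFFD, the replacement character that programs typically display for invalid input.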
Take the security stuff to heart which I explained to you, make sure it doesn't happen, but we have some nice Easter eggs for you hidden in the data, maybe you find them, maybe not, and you should combine it with the SPARQL to SQL stuff. Which point is it here? Two or three? It's two, right? Yes. Anyway, it's explained on the sheet. I'm sorry, we should now really finish. It's basically just using what you have done for the previous sheet and building it into your web app, which is just launching a SPARQL or SQL query for the entity which you are searching and showing some triples. It's just to show you that it can all be easily combined in one web app. So that's the exercise sheet. Is there any question for now? Last question in the old year in this lecture. Anything unclear for now? So I recommend to at least start this week, yes? Where did it say lecture eight? Which slide? Oh my, that's great. While I correct this and think of another, wow. To go to the slide master, nine. Now it's corrected. Thank you very much. Any other question for now about anything? Of course there's the forum. Okay, so happy, nice, long, relaxing holidays and see you next year in four weeks, bye.

Welcome everybody to the new year and lecture 10 of databases and information systems, the course that can also be taken as information retrieval. Overview of this lecture: I will say something about your experiences with exercise sheet nine, for which you had four weeks, it was the second part of web apps. Just remember it's four weeks since we last saw each other: the regular week as usual, the two weeks of vacation, and then we dropped the first appointment this year for reasons. I will show a demo of some of your web apps and also the master solution, and the official course evaluation has started and I have a slide on that with more information.
Then afterwards today we start something completely new, linear algebra, so it's only three more lectures with content and they will be about linear, the magic of linear algebra, vector space model, word embeddings, we start very, so it's really something completely new. So there is some connection of course to what we did earlier but new topic. And yeah, you will see it. So first experiences with the web apps exercise sheet. So it was quite a bit of work, you also had a lot of time, but those who did it, not very many I should say, also have a slide on that, liked it a lot. The sheet was very complex. Complex I think is the right word, not hard, but just so many things. So complex, right? The complexity here comes from these many different things. Server side, network, client side, JavaScript, database stuff, you have to put all of this together and for that you have to understand what it is, how it works. But best sheet ever, very interesting, motivating to see the website I created, super cool exercise sheet. I feel like I learned a lot, enjoyed the freedom of doing things. We gave you templates but otherwise you had a lot of freedom and lack of hand holding. Interesting wording, very interesting fun but surely a long one. Nice to see all the hard work paying off. I mean in this sheet really everything came together. We will see it when we see the demos. I had major issues with the query speed which I couldn't fix and some people did not manage to do the relations. Okay let's just look at a few as I promised and let's maybe start with this one. We've prepared three and our master solution. So here's one. Okay, let's happy new year. Okay, there's it's a film directed. I'm just typing random things. So this is the website. It's pretty fast, right? And I should also be able to make mistakes. So I typed happy with one P. So I mean that's a lot of data. I forgot how many entities, but a few hundred, I don't know, even millions. I don't know, do you know it? 
How many entities did we have? It's really, it's instant, right? It's really not bad. So here, show relations, okay. A little bit buggy but works, so I typed twice. So here you see the database, the sparkle query, which just gives all the triples of this movie here. If I click again it may be, and if I click on here, I get to the Wikipedia page. Okay, nice one. Here with another one with a nice logo. Let's type this one and search 5 work. Okay, I type something and maybe I type, yeah. So I make a lot of typos and I still find it instant super fast. More details. Here I get to the webpage. If I click on this, yeah, so I get to the webpage. No relations here. So also pretty nice design. And here is another one. Did I forget one? Number two, right? The one in between, okay. I had this one twice. Here's another one, okay. You know this from the older times. You are the 1000th visitor. You won a free, okay. Nice one. And let's, yeah, I don't know. Let's type some countries here. India. So also instant, right? Very, okay. China, so here I have, yeah, I have two things. If I click on here, I get, yeah, I get to the webpage, and if I click on here, I get the triples from there. So everything is super fast, of course that's what we have worked hard on in the last lectures and exercise sheet, and it worked fine. This is the master solution. So there were some Easter eggs in them, for example if I type, and let's look at how those came to pass. Okay, this does not work for some reason, because I, oh okay, I have to start the master solution. Yeah, let me just show you how you start it. So I have a pi, I want Qgrams with three, I want synonyms, I want, yeah, these are the list of entities. Maybe let's look at them. I wanted to know how many there are, and just so that you remember. I think I'm in the, yeah, I have to go to folder nine. 
No, I'm also not in the right folder, ah okay, I have to go back to, yes, now I'm in the right folder. That was one of the input files, there was also one with the triples, and let's just look by typing less minus n here how many we had. So oh yeah, it's 1.5 million, quite a lot, right, and you see a lot of information here, also synonyms if you go further down the list. And let's go back to the command line which I had here. So I'm running the server. I'm saying on which port it should run, 8080. I'm giving this list of entities from which it builds this 3-gram index, because I specify three here. Minus S says also find things if I type synonyms. And for the relations I've built a database. And there's also in there the code for translating SPARQL queries to SQL queries on that database. So quite amazing that we're able to do something like this in sheet 9. And let's just run it. Now it's building the index. It takes a while, but pretty fast, and yep, just five seconds and it's there. So let's just reload this. And now we should be able to, yeah, so if I type USA I get the United States, so that was via synonym search here, right, so it was first even though I didn't type the entity name, which is United States of America. And let me type snow, and some of you found this: now it's snowing. We talked about vulnerabilities, and now I can, in case you didn't see this, so for those of you who did the exercise sheet, there were quite a few of those. Let me turn around. Okay, this one doesn't work. Not all of them work. Okay, matrix. That's another one, and I will show you in a second how it happens, and maybe one more, yeah, I don't know, maybe, okay, asteroids, and now I get, oh, okay, now I get, oh yeah, hey, now I have a video game here. Destroy, oh yeah, that's the high score down here. So what happened? Apparently some code got injected in our webpage, and there are actually more.
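The 3-gram index that the server builds from the entity list can be pictured with a tiny sketch. This is not the master solution: the function names are made up, the three entity names are only examples, and details such as padding or normalization in the real code are not shown in this passage, so the sketch leaves them out.

```python
from collections import defaultdict


def qgrams(name, q=3):
    """All substrings of length q of the lowercased name, without any padding."""
    name = name.lower()
    return [name[i:i + q] for i in range(len(name) - q + 1)]


def build_qgram_index(names, q=3):
    """Inverted index: q-gram -> set of ids of the names containing it."""
    index = defaultdict(set)
    for name_id, name in enumerate(names):
        for gram in qgrams(name, q):
            index[gram].add(name_id)
    return index


names = ["snow", "snowboard", "asteroids"]   # example names only
index = build_qgram_index(names)
print(qgrams("snow"))
print(sorted(index["sno"]))
```

A query with a typo still shares most of its q-grams with the intended name, which is what makes the error-tolerant search in the demos possible.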
So if you didn't do the exercise sheet, at least look at the master solution and try to, yeah. So if you have a security problem, then anything can happen, right? One more here, I think, yeah. Okay, now we have a whole video game here, Gorillas, in case you didn't know it. And there are more. I have two gorillas here and I have to throw a banana to hit the other gorilla. In case you haven't played it, or in case you have, what angle do you suggest? I have to specify at which angle I throw this and at which speed. So let me try 70, which is a rather steep one, and I'm a bit lower, so maybe 80. I don't know, I'm not a gorilla professional, but okay, that's pretty steep and, yeah, okay, that was not it, and now it's the other one's turn, so yeah, maybe 70 and 85, one more try. Okay, yeah, you see the point, that's the game. And is it coming down? No, it was too fast, okay. So how did this happen? Well, let me show you this. This is the file which I showed you earlier, the TSV file. And if I grep for something particular for this thing here, so there are image tags in there. And actually it gets harder every year to inject JavaScript. So what's happening here: this is the description of the entity which is shown in the webpage, and as you see, it says precipitation in the form of crystal ice flakes, but it also has other stuff before that which you don't see. And what is it? It's actually an empty image. So it's an image which you don't see, nothing. So like these hidden pixels in mails or something. And when this image gets loaded, it also executes this JavaScript. In the past it was much simpler, you could just write script here, so a script tag with JavaScript text, but most browsers nowadays say, no, this is certainly not what you want here. But you can still do it, it just gets harder every year.
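One common way to keep such a description from being interpreted as HTML is to escape it before it goes into the page. The sketch below uses Python's standard `html.escape`; the description string is a harmless stand-in for the kind of entry described above, not the actual line from the TSV file, and the function name `render_description` is made up.

```python
import html

# Hypothetical description with a hidden image tag in front of the visible text.
description = ('<img src="empty.png" onload="alert(1)">'
               'precipitation in the form of crystal ice flakes')


def render_description(text):
    """Return an HTML fragment in which the description is plain text, not markup."""
    return "<p>" + html.escape(text) + "</p>"


print(render_description(description))
```

After escaping, the browser displays the tag as literal text instead of creating an image element, so the `onload` handler never runs. On the client side, assigning to `textContent` instead of `innerHTML` has a similar effect.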
So here you're loading an empty image and when the image has loaded, you start some JavaScript which creates an element and loads it. And let's just see, we can actually look at the JavaScript by, let me maybe just, is it here? Yeah, here should be the link. Yeah, for example, this one. They are also hidden here. I could have loaded them from any way, of course. Okay, forbidden. And I, so we have asteroids, JS. Which one did I want to look at? Oh, I think it was on the slides even. PSOD, that was one of them. Some of you actually found it and also looked at the code. So now I have, this is not even by us, some are by us, some are by others. So now I have whole JavaScript which is loaded into the page. This one was particularly funny because it has these funny comments. Yeah, it's on one line. Well, so somebody wrote this code had strange comments. Yeah. So the easiest way to mess with someone's console. Yeah, so you can look at those if you want to. Here's your, so there are a few more if you want to, just try them out, here are some of your reactions. Wow, the snow animation is cool, I really like the snow and the asteroids game was fun. Asteroids high score, 340. Harlem Shake behavior, that would be loud, so try it out yourself. It's very funny, the script comments are entertaining, we just saw one gorilla shocked as I tested it with a friend and he sniped me with a banana. Microsoft, oh yeah, let's see the maybe Microsoft one, let's see if it works and then we go on with our contents although this is fun but this is so fast that we can just do it again. Research 8080. Oh yeah that's the Windows one. Yeah okay if you type Windows you get that. Nice. Okay so let's go on one more slide which I promised on the third exercise sheet. So to say it again, it's really so cool that just the nine exercise sheet you can do something so complex and not only that it connects many things, it's also the typical things when you write a web application nowadays, right? 
You have server code, you have network code, you have code on the client a lot, typically JavaScript, and you have some form of database from which you get data. So that's a very typical setup which is exactly why we did it the way we did it. Will this be on the exam? So will you have to, do you have to code web apps on the exam? Maybe some of you thought, well yeah, we should have done the exercise sheets but it won't be on the exam anyway. Of course it will be on the exam. There's always a question in the exam that involves writing server code, HTML, JavaScript. Not only you have to write it, you have to understand how this works. If you have done the exercise sheets, it will be easy. Of course you don't have to write a flight simulator in the exam. Out of experience, I don't see and we don't see if you didn't do this and now for the exam you just look at the slides. You don't learn this stuff from looking at the slides. I think most of you who did the exercise sheets know what I'm talking about. If you listen to the lecture you say, yeah, yeah, nice, I understand how, but you have to do it. You have to actually do it to understand all these intricacies, how it actually works. I mean that's true of most everything in life. You have to do it to understand it. So just looking at the slides and then hoping that you can do the exercises, the exam, it won't work. And again we are speaking from experience because we have always participants in our exam who, I mean you see what they do, either they don't also in oral exams, it's the same, you get the same kind of questions. They make the impression of having never written a web app themselves. And it shows you don't get the points, so don't be one of these people. So even if you didn't do it now, at least do it at some point, at least do it when you prepare for the exam. Otherwise there will be tasks about this, you won't get the points, which would be a pity. 
And very few people, some people did number eight, very few people did number nine. So this is kind of a separate topic, how few people do the exercises. But I've mentioned this already. So the official course evaluation, you should have received an email yesterday. Who here in the room or on Zoom hasn't received an email? Okay, one person, two persons, it always happens. So we have a forum evaluation, just if you are interested to take part in the evaluation just post to that forum on DAFNA. Let me just quickly check whether it's... I haven't logged in here for a... Yeah, there it is, evaluation. Just say I didn't get the email. Of course, if you don't want to participate, you don't have to. So, but please participate and take your time for it. I mean, you have been listening to nine lectures already, another one, four more. You can spend 15 minutes on the evaluation. And of course, be honest, be concrete, don't say something super abstract and fair. Even if you have criticism try to be fair, very important. And this entails one thing which we added this year because the exercise sheets are voluntary. There will be one question which asks you, did you do the exercise sheets are voluntary, there will be one question which asks you, did you do the exercise sheets? And I mean it's anonymous, so please be, if you don't be honest, you are messing up the whole evaluation. It's not a super detailed account you have to give, you just have to say, I didn't do them, or you have to say, yeah, I did a few, or I did most of them. And we are very interested in that so that we can find some correlation. So please be honest with that answer it's anonymous we don't know who says what it's important for us to understand what's going on. 
Yeah, as usual there are scores, of course, if you liked it give a high score, but the free text comments are particularly interesting because they give more information. And yeah, why not do it right today after the lecture, it's just 15 minutes. Actually we condensed this thing some time ago, so there are not so many questions anymore. It used to be way too many questions, now it's a relatively short form, so it should be easy and fun, it's nice to give feedback. So that's the super last deadline, so if you miss that deadline there's nothing we can do. It's a centralised evaluation, so we are not running it, a team at the university is running it. If it's closed, it's closed. If somebody didn't get the mail, we can try to write to them, in the last years it has been a very nice team, very forthcoming if there are problems. And that's what I just said, if you didn't receive an email, just post in the subforum. Okay. So, new content, but before we do the new content, I kind of promised and I want to do it a little bit. I didn't rush, I just left out a part of the last lecture, which is important and relevant, and I just want to spend 5 to 10 minutes on it. And it was about Unicode. So this is now still from the last lecture, we will start with the new stuff soon, in 5 to 10 minutes. Unicode. And I mean, I hope you remember it at least from the lecture, from the sheets, it was, yeah, how do you encode funny characters? I mean, we have the typical characters, we have the more common ones, at least in the Western world, like the ä umlaut, e, a Greek letter, the euro, a new symbol, or a smiley face. How do you encode all these characters such that it works for everyone? And there is a scheme called Unicode which has been around for many decades now. And so I'm just recapping a little bit so that I can show you the thing which I didn't show last time. So these are the slides from the, I hear someone talking, please don't talk.
So there are basically three schemes. One is: always use four bytes, that's easy. I mean, then you always have four bytes, it's easy to parse, but super wasteful, because most characters, like the ordinary ones here which we see on the slide, just take one byte. Then there is UTF-8, which uses as many bytes as needed, so very often one, sometimes two, sometimes three, sometimes four. And then as always there is the worst of both worlds scheme, UTF-16, which typically uses two bytes but sometimes four bytes. So it has the bad part from both. It's variable length but still wasteful, because typically you would only need one byte, and it's used by Java. Yeah, I think this was a bad decision. Java made a few bad decisions. So almost everybody uses UTF-8. Okay. And we explained it, I won't explain this again, this is just how you do the encoding, and it ensures that ASCII characters, the typical ones, are encoded in one byte, and only for the... Somebody is talking and I hear it quite well, please don't talk. Or go outside if you need to talk, or wait for the break. Okay, so, the thing which I didn't show I want to show you now, and if you did the exercise sheet, you probably had some experience with this. I mean, it's a simple principle, but it gets complicated very quickly, and I want to show you. Let's just call it encoding test.py, and let me write a very simple program which just prints a character here on the command line. And let me just run it here, python3 encoding test.py, and it prints ä. Okay, that's simple enough, that's how it should be, so that's just a funny character, ä, which uses two bytes in UTF-8. And now I just want to show you a few things. There are so many more things, but just to show you what the problem is. So this file is encoded in a particular way, and it's actually, yeah, it's encoded in UTF-8. So this is a program, it's a Python program. How do I know in this Python program how this is represented, right?
Is this represented in Latin-1? Then it would be one byte. Or in UTF-8, or something else? Here's a program to look at stuff. It's xxd, which has many parameters, but actually I don't know them. Okay, that was wrong, -g1. Yeah, now it's showing me the individual bytes here in hex code, and here it shows what it prints. So here I can see, so 70 is probably the p, right? Let's just look at that. And that's a very useful program, that's why I'm showing it to you. There are other tools as well of course, but this is for when you just want to know, okay, what are the bytes of what I'm looking at? And sometimes it can also be an output file or something, because something is wrong, and what you see is not necessarily what you get, or vice versa. So here you have this German umlaut, how is it represented? And I can see it. So p here, 70, let's look at hex code 70, it's not decimal 70, it's hex code 70. Where is it here? So it's indeed the p, so this is the 70. And here somewhere we see the quotes, the quotes in hex are 22, and now comes the umlaut and you see C3 A4. Right, so it's represented in two bytes. Now let's change that, let's just pick Latin-1 here, and it's not supported. Okay, maybe ISO 8859-1. That's the encoding before Unicode which was used a lot. Not supported. Okay, I'm a bit surprised. That's not what I expected. So I can't use ISO Latin-1. And maybe I can't show this. Ah, okay, let me set the file encoding to Latin-1. Okay, now it's in Latin-1, and now let me type this again. And now let me look at the file again. Yeah, and now it's just one byte, right? 22 was the quotes, and so now it's the same file and the umlaut is E4, and let's just run the program, and what happens now?
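What the hex view shows for the two file encodings can be reproduced without an editor. A small sketch, assuming the demo program is a single line that prints the umlaut; the function name `hexdump` is made up and only mimics the one-byte-per-group view.

```python
def hexdump(data):
    """Bytes as two-digit hex values separated by spaces, one group per byte."""
    return " ".join(f"{byte:02x}" for byte in data)


source = 'print("\u00e4")\n'   # assumed content of the demo file

print(hexdump(source.encode("utf-8")))    # the umlaut shows up as c3 a4
print(hexdump(source.encode("latin-1")))  # the umlaut shows up as e4
```

Everything except the umlaut is ASCII and therefore identical in both dumps: `70` is the `p`, `22` is the quote.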
Yeah, okay, now it says this is, yeah, it's not valid code. Which is a funny situation, because Python is expecting that the things here are encoded in UTF-8, but the file is encoded in Latin-1, so the program doesn't even run. And I could show so many more things. One thing I could do here is actually, how do I specify the encoding? And I always forget this, how do I do this in Python? There's a special syntax for specifying the encoding, who knows it? Hash, minus, star, minus, coding, okay. I think it's ISO 8859-1, let's go back here. Now I'm saying the Python program should understand what's written here as Latin-1. And let me see if it's still the same, yeah, this is still one byte, quote, E4, 22. And let me see, yeah, now I can run the Python program, and now it's even printing as an ä. Okay, which is also surprising, and now let me do something, how do I actually get this to print as a, okay, this doesn't work, it says it can only contain ASCII literals. Okay, so that's interesting, because here's another thing which I wanted to show you. And then we will be done. Let me go back to UTF-8 here. And I think the file encoding is, now it's Latin-1. Okay, it didn't let me set it, but it accepts it when it's the other way. So now the file encoding is UTF-8 again. And now, the terminal also has an encoding. That's what I wanted to show. This terminal assumes that what you send to it, I mean, it just gets a stream of bytes from any program, that it's Unicode. Let's just say the terminal encoding is ISO 8859-1. So now I changed the encoding. Now let's just look with xxd again. This is now 22, yeah. So my umlaut is now two bytes again, because I changed the file encoding back to UTF-8. So it's now again C3 A4, so two bytes, and I told Python, yes, this is UTF-8, interpret it as such.
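The effect of the encoding declaration can be reproduced without touching any files, by compiling source bytes directly. This is a sketch under the assumption that the demo file contains one assignment with the umlaut; the helper name `run_source` is made up, and the declaration line follows the form defined in PEP 263.

```python
def run_source(source_bytes):
    """Compile and run Python source given as raw bytes; return `text` or None on failure."""
    try:
        namespace = {}
        exec(compile(source_bytes, "<demo>", "exec"), namespace)
        return namespace["text"]
    except (SyntaxError, UnicodeDecodeError):
        # Without a declaration Python assumes UTF-8, and a lone E4 byte is not valid UTF-8.
        return None


latin1_source = 'text = "\u00e4"\n'.encode("latin-1")           # umlaut stored as one byte E4
declared_source = b"# -*- coding: iso-8859-1 -*-\n" + latin1_source

print(run_source(latin1_source))    # fails to compile
print(run_source(declared_source))  # works, the byte E4 is read as the umlaut
```

The declaration only tells Python how to read the file; it does not change which bytes are in the file or how the terminal shows the output.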
And now I'm running this, and it's sending the two bytes to my terminal, which interprets them as ISO Latin-1, so I get these two characters. They should look familiar, because you see them a lot in mail that got garbled somehow. Somebody sends you an email in ISO 8859-1, your mail program is set to UTF-8, and then it's interpreted in the wrong way. Typically there should be headers which say what the encoding is, but when you forward stuff or reply to stuff, it very often gets mangled. I will stop here and go back to the slides, maybe briefly, but you see how many things are interacting here. There is how the program is written, and how Python interprets the contents of this file. Then the output is sent somewhere, maybe to a web browser or to a terminal, where it's also interpreted. Then there is the editor, which I'm not showing: there's the file encoding and there is the encoding, and those are also two different things. So I already have five different things here, right? Let me just list them and then I go on, just to tell you how much fun this is. How is the file encoded, which I can see with xxd. How does the editor show what's encoded in the file, that's the encoding. How does Python interpret what's written in the file when I run it. And how does the terminal, or wherever I send the output, interpret it. So I have five, and these encodings can occur in any combination. If there are k encodings, I have k to the five different possibilities, and so many things can happen and typically do happen. So xxd is your friend if you have some output and want to look at what's really there, because what is shown typically does not help you: what is shown is already an interpretation. Now very briefly the different programming languages, and then we move on. It's already past the 10 minutes that I wanted to spend. So the character length, that's an issue.
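The interplay described above (bytes written under one encoding, read under another) can be reproduced in a few lines of plain Python. This is only a sketch: the variable names are made up for illustration, and the hex output is formatted by hand rather than by xxd.

```python
# The same character, written and read under different encodings.
text = "ä"

utf8_bytes = text.encode("utf-8")       # two bytes: c3 a4
latin1_bytes = text.encode("latin-1")   # one byte: e4

# Roughly what a hex dump of the file contents would show in each case.
print(" ".join(f"{b:02x}" for b in utf8_bytes))
print(" ".join(f"{b:02x}" for b in latin1_bytes))

# Whether the length is 1 or 2 depends on counting characters or bytes.
num_chars = len(text)
num_utf8_bytes = len(utf8_bytes)

# A reader that assumes Latin-1 turns the two UTF-8 bytes into two characters.
garbled = utf8_bytes.decode("latin-1")
print(garbled)

# Reading Latin-1 bytes as UTF-8 fails outright, much like the source file
# that Python refused to run.
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(decode_failed)
```

The two garbled characters printed in the middle are the familiar pattern from mangled mail: each byte of the UTF-8 sequence is shown as its own Latin-1 character.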
So if it's encoded in UTF-8, like the ä umlaut, which is two bytes, is it length one or two? Well, if it's a string in C++, it's length two, because it's two bytes. In Python this would count as one, because it's one character. It's different in the various languages. In Java it's just a total mess. Typically it will give you length one, but if you have the smiley face, which uses four bytes in UTF-8 and two 16-bit units in UTF-16, it will tell you the length of the string is two, which is just bad. And in Python, here's some more. This is what we have seen, for example: if you have the wrong encoding, you get funny stuff. Okay, just so you have seen this and are aware of it: if you have problems with encodings, try to look at the raw data with something like xxd. Okay, any questions about that? Otherwise we move on to a new world of linear algebra. It's very lightweight today, and the exercise is also fairly easy. Please do it, because the next three exercise sheets are about using linear algebra, and we will even talk about language models in the last lecture. We will even build one ourselves. And the foundation of all the deep learning and everything is linear algebra, pretty simple linear algebra. So let's start. So far we had all kinds of inverted indexes. Today we will represent everything as vectors, linear algebra vectors. Here's our running example for today. We have six documents and four words, and they are written in a matrix. Every document is a column here. Document one is a document which contains the words internet, web and surfing. Document two contains internet and surfing, and so on. If a word is contained twice, I have a two here.
There's a reason behind this particular example. I don't think we will make use of that, but let me briefly explain. Oh yes, we will make use of it a little bit. So internet and web mean a very similar thing, right? We will come back to that at the end of the lecture. They are synonyms, sort of: they don't mean exactly the same thing, but people use them synonymously. Surfing is what's called a polyseme, because it means different things in different contexts. Surfing the web is something else than surfing at the beach. So we have a synonym and a polyseme, and these two concepts are generally very important: different words meaning the same thing, and the same word meaning different things in different contexts. Okay, and if you have a zero, it just means that the word is not contained in that document. That will be important later: if you think of larger text collections, there will be many zeros in this matrix, because every document contains just a small subset of all the possible words. In this lecture we use TF scores, so the two says this word is contained two times. For the exercise sheet you will use BM25 scores. I will talk about them again. This is often called the vector space model, because you represent things as vectors. In a vector space, words are typically referred to as terms, which is why this thing is not called a word-document matrix but a term-document matrix. But that's just terminology. Okay, so retrieval. Let's say we have a query now, and the query is web surfing. Then we can also write the query just like a document, right? You just have a vector, and wherever you have one of the words, you have a one, otherwise you have a zero. So this is the vector for the query web surfing. And now let's start very easy. How could you measure the relevance? You just take the dot product.
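As a small sketch of the vector space model just described, the running example can be written down in plain Python. The document texts follow the lecture's six-document example; the helper names `tf_vector` and `dot` are made up for illustration, and plain lists stand in for the tensors used later in the lecture.

```python
# The four terms (rows of the term-document matrix) and the six documents.
terms = ["internet", "web", "surfing", "beach"]
docs = [
    "internet web surfing",
    "internet surfing",
    "web surfing",
    "internet web surfing beach surfing",
    "beach surfing",
    "beach surfing",
]


def tf_vector(text, terms):
    """Return the term-frequency vector of a text over the given terms."""
    words = text.split()
    return [words.count(term) for term in terms]


def dot(x, y):
    """Dot product: the sum of the pairwise products of two equal-size vectors."""
    assert len(x) == len(y)
    return sum(xi * yi for xi, yi in zip(x, y))


doc_vectors = [tf_vector(doc, terms) for doc in docs]
query = tf_vector("web surfing", terms)   # [0, 1, 1, 0]

# One dot product per document gives the TF score of that document.
scores = [dot(query, d) for d in doc_vectors]
print(scores)
```

Each score is the sum of the term frequencies of the query words in that document, because every non-query term is multiplied by zero.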
And let's just take the dot product. Let me put this aside and draw something. So let me take the dot product between this, perfectly straight line, rectangle. What's the dot product between these two vectors? It's two vectors of the same size. So you have two vectors, x with components x1 to xn, and y likewise. You can also write it in the Zoom chat. I hope someone here knows what the dot product is. So we have two vectors, each with the same number of components, and the dot product is just the sum of the pairwise products. You multiply the first components, the second components, the third components, and you sum it all up. So what's the dot product of these two vectors? Two, I see a two, that's correct. Two, yes: it's this times this, which is zero, plus one times one, plus one times one, plus zero times zero. So the dot product is really simple, and let's just compute the other dot products. Please tell me, what's the dot product? One, correct. Dot product: two. Dot product here: three. Here: one, yes, and one, yes. And as it's written here, these are exactly the TF scores from lecture two. If I did ranking with TF scores, I would get this. What did I do? I just looked at my query words, and for these query words I summed up the term frequencies. That's exactly what you get with a dot product. Wherever there's a zero here, I don't even have to look at the value, because it's multiplied with zero. And the others I multiply with one. So what this query does is just sum the TF scores for web and for surfing, which is 2 here, 1 plus 1, then 1, 1 plus 1, 1 plus 2, 1, and 1. It's exactly what we did in the lecture, actually in lecture 1 already, I think. Okay, very simple. And here is one important thing, and then I will show you a little bit of linear algebra in Python. Linear algebra is so important. It's an integral part of computer science.
There's no way you're gonna get your diploma if you don't know linear algebra. It's the basis, as I said, of everything deep learning, but also of so many other things. So here's something very basic, in case you forgot it or never really understood it. What I did on the previous slide is a very typical thing: I want to compute the dot product of one thing with very many things. I want to have these scores. Now of course I could just write a loop, compute this dot product, then this dot product. That's not how you do it. You can do it as one big vector-matrix multiplication. Let me show that to you. Let me write this vector in this form, and you will see in a second why. So that's now my vector q, which for some reason is written here as a capital Q. I want to write a small q, so I just do it like this. It's also a matrix, with one row: a 1 by 4 matrix, one row, four columns. And let me write this matrix here again. One, one 1 1 1 0. I'm just copying it. Please pay attention that I don't make a mistake. 0 0 1 0 1 1 2 1. And in case you didn't notice, the fifth and the sixth document are identical, but nothing speaks against having two identical documents in your collection. So this is my matrix, and let me call it A. It has four rows and six columns. So, first rule of linear algebra, don't talk about linear algebra... no, the first rule of linear algebra is: if you multiply two matrices, and a vector is just a special case, the two things have to fit, right? You can multiply a one times something with a something times something else. This inner dimension has to be the same, and you will see in a second why. And the result will be a one times six matrix, which means a flat vector here. And how do I do the matrix multiplication?
Well, let me just put something around here. How you do matrix multiplication: I do this times this, and then I just compute the dot product. This row with that column gives me the first entry here. It would also work if I had several rows here; then I would also have several rows in the result. And let me do it in the right color. So this here gives two. Now let me do it for the next one: I do this with this, and this gives me the next entry, which is a one, the dot product of those. And let me do it with one more color, if I have one, maybe this wonderful red. Okay, and again take this times this. So this vector-matrix multiplication also generalizes to matrix-matrix multiplication. And the remaining ones I just write in blue. So now it's this times this, which is 3. This times this here, this dot product is 1. This should be something you shouldn't even have to think about. Matrix-vector multiplication is so important, it's so fundamental, you should be able to just do this. Okay, and so all of this is now a single vector-matrix multiplication. And I made a small mistake here, because this is actually q transposed, since my q here is a column vector, right? So this here is a four times six matrix, and q is a four times one matrix, a column vector. It just has one column. To be able to multiply it, I have to transpose it. So let me also write that: q T is q transposed. This is not really a part of this lecture, but I'm just repeating it in case you forgot it. Transposing just means the rows become the columns and the columns become the rows. So if I have a column vector and I transpose it, I have a row vector with the same values. Any questions about this slide? Please take the opportunity and ask them now. This is just a very basic linear algebra recap. If something is unclear, then do ask.
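The row-times-column procedure from this recap can be sketched in plain Python as follows. The functions `matmul` and `transpose` are illustrative helpers written for this sketch, not library calls, and the matrix is the lecture's running example with rows internet, web, surfing, beach.

```python
# Term-document matrix A: 4 terms (rows) x 6 documents (columns).
A = [
    [1, 1, 0, 1, 0, 0],   # internet
    [1, 0, 1, 1, 0, 0],   # web
    [1, 1, 1, 2, 1, 1],   # surfing
    [0, 0, 0, 1, 1, 1],   # beach
]

# The query "web surfing" as a row vector, that is, a 1 x 4 matrix.
q_t = [[0, 1, 1, 0]]


def transpose(X):
    """Rows become columns and columns become rows."""
    return [list(col) for col in zip(*X)]


def matmul(X, Y):
    """Multiply an (n x k) matrix with a (k x m) matrix.

    Entry (i, j) of the result is the dot product of row i of X
    with column j of Y, so the inner dimensions have to agree.
    """
    inner = len(X[0])
    if inner != len(Y):
        raise ValueError("inner dimensions do not fit")
    return [
        [sum(X[i][k] * Y[k][j] for k in range(inner)) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


scores = matmul(q_t, A)   # (1 x 4) times (4 x 6) gives a 1 x 6 matrix
print(scores)
```

Multiplying the untransposed column vector, a 4 x 1 matrix, with the 4 x 6 matrix raises the `ValueError`, which is the "these two things have to fit" rule in action.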
Okay, and now the question is how we do this in Python. Of course there are many packages in Python, and we will use... for some reason the animation does not work here, let me just... okay. So we use PyTorch. PyTorch is known for its use in deep learning, but you can also do all kinds of basic and not so basic linear algebra with it. So far we always used NumPy and SciPy, which are still good, you can still use them. It was always a bit strange, because there's this very old NumPy package, which contains very basic linear algebra, and SciPy, which contains more advanced linear algebra. So you always needed two, and they were not perfectly calibrated to each other. So we use PyTorch, and there we also have the learning stuff which we want later. Of course there's a lot of documentation on PyTorch on the web. What we did for you: here on the wiki you find three links. One explains what a tensor is; a tensor is a generalization of matrices and vectors, I will talk about it in a second. So there's a page just to understand the very basics. Then, if you have used NumPy in the past and are very proficient in it and want to learn PyTorch, there's a GitHub page which explains that. Sometimes you have this situation: you're very proficient in X and want to learn Y, which is similar. There are typically pages for that. And we also did something for you, namely a cheat sheet where we list just the stuff which you need for this exercise and the next two. PyTorch is a huge library, you can do so many things with it, and it's easy to get lost in the documentation. So here we have just listed things, for example how to create an empty matrix or a matrix with only zeros or ones. Here you see a short description, this creates something with only zeros, and then you click on it and you have the exact syntax and semantics.
Yes, so this is how it works in principle. You import torch, and note it's torch, not pytorch. You have to install it first, pip install torch or something like that, or there's probably also an Ubuntu package. Then you want to create a matrix. I said this before, let me say it again clearly: in PyTorch you don't have vectors and matrices, you just have tensors, and a tensor is just a generalization of these, right? A vector is something one-dimensional, a matrix something two-dimensional, and you can also have something three-, four-, five-dimensional. In deep learning you typically also have three-dimensional things or even more dimensions. So a tensor just has an arbitrary number of dimensions. This is a matrix here, so I just have a list of lists. This happens to be the matrix which we have seen earlier, at least the first and the last row. If I want to create a vector, I do it like this. And the vector-matrix product I can just do with this syntax. There's also matmul. There are several ways to do the same thing in PyTorch. I think torch.matmul does matrix multiplication and also vector-matrix multiplication. I don't know if they mention the shortcut with the @ here. Okay, so let's just work with this. Let's leave the encoding stuff. And this is the code from... now we are going back to lecture two, which was inverted lists, building inverted lists from a simple collection. So this was the simple example. Let me just make this a little bit smaller here so that we can see most of the file. So this was just three test documents, a movie movie, a film movie. Let's just look here. We built an inverted index, I hope you haven't forgotten everything. And then we built the inverted lists, so now I have an inverted list for the word a, which is contained in documents 1 and 2, so I have 1, 2 here.
I have the inverted list for film, which is only contained in document 2, and movie, which is contained twice here, and this was actually from the first lecture, so I have 1, 1 here. Let's just keep it that way for this sheet. One one. Okay, and let me also do a second example here which helps understanding. So this is now, let's take our matrix, our example matrix here, and let's just write these documents here in words because the parsing is done by our code. So this is the document internet web surfing, so let me just internet web surfing, internet, to use the same. The second document is just Internet surfing. Internet surfing. The third document is web surfing. The order of the words is not important for what you are doing. I mean there is no notion of order here, it could also be surfing web. This one is internet web surfing twice beach. So let me just internet web surfing beach surfing, let me write it like this. Now I have surfing twice, I have a score of two. And then I have twice surfing beach. Let me, the second one, just do it this way. Beach surfing. Okay, and let's do a little more and then have our break in five to ten minutes. So this is our document. This is the document collection for this matrix. I hope that's clear. Please ask if anything is not clear. And now of course I could run my code here, let me just do that. But I think it does nothing. Yeah, it will just work. Let's also run the doc test, whether it works, minus m doc test. And let's maybe change the, just to see whether we're doing the right thing. Let me, let me just change this word here, remove the e, yeah. Now I remove the e, now I don't have movies, only once here and I have a new word movie which is in document one. So everything works fine here. Okay. Yes, so the doc test works, so back this just for the, and that's the collection we're working with now. What I want to do now and what you should also do for the exercise sheet, it's like the first step. 
Let's create a term-document matrix from the already computed inverted index. So we have already done the parsing and everything. The first thing we should do is import torch. We want PyTorch, and you will notice one thing just by importing torch, which I think is a bit annoying and strange, but that's how it is. I don't know why they did that. It takes a while, right? Just because you import something, apparently it does some initialization. Just by importing a library and not using it at all, you shouldn't have to wait, I don't know, how many seconds? Okay, one second. So one second of your life will be wasted just by waiting, and you will do many iterations. So you always have to wait one second just for Torch to initialize. Strange. Okay. Oh, this I forgot to delete. Here I've already added some things which were not there in the original code; we didn't need them for the inverted lists. So let's just do that. The words which we encounter: here I have a list of words, just a list of strings. And we already have this in our code: self words, append word. If I have a word that I have not seen so far, so it's not in inverted lists, which is just a dictionary from words to inverted lists, I add it here. And how do I get the number of words? Well, one way is to take the size of my inverted lists dictionary. The other would be to take the length of the words list. And how do I get num docs here? Look at the code and tell me. I want the number of documents, initially zero. Remember, this loop is going over the records. Record ID, yes. Exactly, the record ID is keeping track, it's increased by one each time, which means at the end the record ID is what num docs should be. And let's just add it to the test, why not? Self terms... no, words, I called it words, okay.
And what should that be? That should be a, film, movie. Is that correct? We will see. And self num words should be three, and self num docs should also be three in our example collection. Just try it. One second for torch. Okay, self is not defined, because it shouldn't be self here, it should be ii. Okay. The order is not correct. Let's just go up here and check... Yes, the words are added in the order in which I first see them, so I first see a, then movie, then film. So that was not correct. Here they are sorted, but... Now it works. One second for torch... okay, so that was that. Now let's just finish writing the term-document matrix and then we have a break, but I think you are still with me. So now let's create, or build, I think we call it build term document matrix. Self, it's a member function. So what do we do? Build a term-document matrix, assuming that we have already constructed the inverted index with the build from file method. So how do we do that? Well, let's call our matrix A, and let's first create a matrix with only zeros. And below here, so that we can already run it: ii, build term document matrix, print term-document matrix, A, and let's print it. I think I want to return the term-document matrix, let me return it and not write it to an internal variable. So here I am just calling it A for simplicity, and then I print it. Okay, so now I'm just creating a matrix which has only zeros, but it already has the right dimensions, so it should be four times six if I call it. And it always takes this one second. Okay, and I added time here in the beginning. Okay, looks fine. It's a bit annoying that I can't style the output in PyTorch the way I can in NumPy. You always get this object thing like tensor, and you have a dot after each number although it's all zeros. So let me do one thing.
Let me have a function here, print matrix. Actually, this does not have to be a member of the class. I just want to print the matrix in a nice form, because Torch is too stupid for it. I don't know why. Maybe let me just print each row, and what do we... That's not what I want. I think I want a format like 2d maybe, or 2f. Okay, let's do that. We print the matrix here, let's see how it looks. Okay, that looks nicer. So now we have the matrix with all zeros. As usual, most of the work in coding is preparing. So given the inverted lists, how do I fill in the entries? I think this is showing me too much information. I think I should go over the words: for word in self words, let's go over all words. So I'm going over the words in the order in which they were inserted. And now I go over the entries in the inverted list for that word. Let me do it like this: enumerate words, thank you. And now I'm going over, let me call it doc ID, it's called record ID above, in self inverted lists for that word. That looks good. So I'm going over the words in order, and I have the word here because I need it to look up the inverted list. I also have the word ID, which should go 0, 1, 2, 3. And now what do I do? What do I write in this line? Suggestions please. I'm at a particular word and I have this doc ID here. So what do I write? Yes? We increment the entry, like A at word ID, doc ID, plus equals one. Doc ID, okay. And let's write that. Increment by one. I think it should be plus one, not... Hmm? I think it should be incrementing, not assigning. Ah, it should be incrementing, not assigning. Very good, I agree. Let's run the code. Something is wrong. What did we do wrong?
(Audience remark, unclear.) Can you tell me the line number? 65. Enumerate, and what... okay, what's my problem here? Oh my, yes, it's first the ID and then the thing when I do enumerate, thank you. Oh my, index six is out of bounds. So many mistakes which I didn't expect. What did I do? Oh yes, I see the mistake. What's the mistake? A typical off-by-one error. Our doc IDs here start with one, but of course in the matrix the indices start with zero, so it should be minus one here. Okay, and now I have it. And you already said it: if I encounter the same word twice in a document, here I just set the entry to one; I should of course have plus here. So if I see it several times, I'm incrementing it by one each time. Here's an example, right? I have it twice here, I encounter it, I increase it by one. So let's do that and, yes, there we have our term-document matrix, and we print it. And maybe that's a good time to make a break. Now let's also create a query vector. I think that works as follows. We just want to have a vector here. Okay, let's look at that as well, let's print it: query vector q. And let me also print that with the same function, let's see whether it works. So that's now a one-dimensional tensor. Everything is a tensor in Torch, we already saw that. Oh my, iteration over a... mm-hmm. Okay, let me just make it a matrix with one row. I think that should work. Otherwise I need an extra function for printing a 1D tensor. So let me just do it like this. Okay, that works, fine. Let me print some empty lines here, and an empty line here. And now let's do the scores. The scores should now be, and this should work, q times A, the matrix product. This is a 1 times 4 matrix, this is a 4 times 6 matrix. Let's see whether it works: print scores. So yes, this is exactly the same matrix as we had on the slides.
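The construction walked through above can be sketched without PyTorch, using nested lists in place of a tensor (an assumption made here so the example runs without extra packages). The function names `build_term_document_matrix` and `print_matrix` and the hard-coded inverted lists are illustrative; the inverted lists follow the lecture's convention that doc IDs start at one and that a doc ID appears once per occurrence of the word.

```python
# Inverted lists for the six example documents, doc IDs starting at 1.
# "surfing" occurs twice in document 4, so 4 appears twice in its list.
inverted_lists = {
    "internet": [1, 2, 4],
    "web": [1, 3, 4],
    "surfing": [1, 2, 3, 4, 4, 5, 6],
    "beach": [4, 5, 6],
}
words = list(inverted_lists)   # words in the order in which they were inserted
num_docs = 6


def build_term_document_matrix(inverted_lists, words, num_docs):
    """Build a (num_words x num_docs) matrix of term frequencies."""
    A = [[0] * num_docs for _ in words]
    # enumerate yields first the ID, then the word.
    for word_id, word in enumerate(words):
        for doc_id in inverted_lists[word]:
            # Doc IDs start at 1, matrix indices at 0 (the off-by-one fix).
            # Increment rather than assign, so repeated words are counted.
            A[word_id][doc_id - 1] += 1
    return A


def print_matrix(M):
    """Print each row on its own line with two-character integer columns."""
    for row in M:
        print(" ".join(f"{x:2d}" for x in row))


A = build_term_document_matrix(inverted_lists, words, num_docs)
print_matrix(A)
```

Assigning 1 instead of incrementing would silently lose the 2 for surfing in document 4, and dropping the `- 1` reproduces the "index six is out of bounds" error from the live coding.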
We are creating this vector, and now we are doing the matrix product, and let's see whether we get... hmm. Ah, okay, there's another problem. I see it. It's somewhere on the slides: for matrices in Torch you have to say which type they should have. And here I think it implicitly takes... I'm not sure. I think it assumes here, because I'm specifying only integers, that it's an integer tensor, but for my A matrix above it doesn't know that the entries are integers, so it implicitly assumes float, or maybe that's the default when you take zeros. So here I'm now saying type float. I could also say type int above and type int here. It would be fine either way, it just has to be compatible. Yes, and now it works. Let me also have a new line here. So this should be 2 1 2 3 1 1, these were exactly our scores. And the important thing, maybe the most important thing about this whole linear algebra stuff: I did my whole retrieval with a single vector-matrix multiplication, right? This is what does the retrieval here. This is my matrix, which I have somehow precomputed, and the query vector has been entered. Now I have a matrix and a vector, and I'm computing all the scores simultaneously with one operation, and this also works if I have 100,000 terms and 5 million documents. It's just one big operation. This operation is not done in Python with funny loops, super slow, but it's coded in C somewhere in the background in some package, so it's really fast. So even if the constructing here might be slow, this here is fast even for a large matrix. Okay, back to the slides, and please do ask questions if anything is unclear. And we were a bit further. Yes. So this is how we did it and it works. And this here does not work, because it's 1D, you can't multiply a 1D with a 2D tensor like this, so let me maybe fix the slides here. How I did it is I just made a matrix out of it, a matrix with one row.
A little bit of cheating. There would be other ways to get it to work too, but that's the simplest way. Okay, here's another slide where I don't have it, let me quickly fix that. And maybe it's on the next slide too. Yes, on the next slide too, oh my. How did that happen? I don't know, let's just go one by one. Okay, so here's one problem, and this I don't show in code anymore. If we compute the scores just with a dot product, then a document gets a high score just because it has many entries or high entries, right? So it's debatable, if you think about what the document is about. D4 is kind of about web surfing and beach surfing. It gets the highest score here just because it has so many entries and a high number here. But should it get a higher score than D1, which is only about web surfing? It's kind of unfair. The entry here is a 2 and not a 1 because there's also beach surfing, but the query is about web surfing. So it's debatable whether these are good scores, and this should remind you of something. That's why in lecture two we talked about BM25 scores, right? These were explicitly designed to handle this: I have a longer document, which is also about other things, so some terms will occur more frequently, maybe with other meanings, and a term occurs several times maybe just because the document is longer. BM25 was balancing this. This was what half of lecture two was about. So one way, and this you will do for the first exercise, is to just take BM25 scores instead of the TF scores. This is my query vector, again 1, 1, it doesn't have any special scores. Now if I compute the dot products, I will just get the sum of the BM25 scores of the contained words, which is exactly what we did in lecture two. So just by having a matrix with BM25 scores, I get what I got in lecture two. And exercise one is just doing that.
But you have to pay a little bit of attention. Okay, I will come back to that. Besides BM25 there's another variant: compute the similarity in a slightly different way, namely as follows. Take the dot product and divide by the lengths of the vectors. So divide x by its length and divide y by its length, which means divide by the product of the lengths. So what's this length of a vector? Let me be explicit. There are different ones; it's the norm of a vector, or length. We just call it length here. We take the L2 norm. There are other norms as well. The L2 norm takes the sum of the squares and then the square root of that. The L1 norm would take just the sum of the absolute values, and there are even other norms. But the L2 norm is the most frequent one. Sum of the squares, square root, gives you the length of the vector. In ordinary two-dimensional or three-dimensional space, it really is the Euclidean length of the vector. So this is just the dot product divided by the lengths. Why is this called cosine similarity? When you do deep learning with text, you will meet this all the time. You have two vectors, you want to measure how similar they are, you want to account for different lengths: you take cosine similarity. But there's no cosine here. So why is it called cosine similarity? For exercise 2 of exercise sheet 10, I will talk about the exercise sheet in a second, you will use cosine similarity. So you will do the BM25 thing and the cosine similarity thing. Law of cosines: here's the link to the Wikipedia article which proves this law. I will not prove it. One thing that should occur to you, and let me draw something so that this becomes clearer. Let me draw a triangle. This is a, this is b, this is c, and let me call this angle gamma. Now if gamma is 90 degrees, not Celsius, 90 degrees, then what does the cosine term become? Zero. And then what is this?
It's Pythagoras. Pythagorean theorem, I think it's called, I'm not sure. How is it called in English? Yes, exactly. It becomes the Pythagorean theorem, so the law of cosines generalizes it. There's an elementary proof with trigonometry, which I will not do. But I will show you that if you assume this to be true, then we can see how the dot product is related to the cosine. Let me try to do that; maybe I fail and you have to help me. Let's assume these are two vectors here, a and b. Then I think the length of c is just the length of the difference between the two vectors, right? If I point the vectors in this direction, then c is just a minus b. I don't even need the direction, I just need the lengths of the vectors. So c is the difference between the two vectors, and the law says: length of a minus b, squared, is length of a squared plus length of b squared minus two times length of a times length of b times cosine of gamma. In this law of cosines, these are lengths, not vectors. And I'm calling the angle gamma, the third letter, because it's opposite of c. Of course you have this law of cosines three times in every triangle. I'll do at least a little bit of math here. This is really just writing it equivalently, right? I'm just taking these sides as vectors and taking their lengths. And just to clarify, let me maybe put... so what is the...
now one thing you should notice is, let me write that up here, what happens if I take the dot product of a vector with itself. This is very basic linear algebra, but I'm assuming some of you might be a little rusty or never fully understood it, so I'm repeating it a little bit. The dot product of a vector with itself is just the sum of the pairwise products. In this case it's the product with itself, so you take x and x again, and you get the sum of the squares. And what's this? This is just the square of the length, right? The square of the length. Wonderful. So if we have the square here, we can write this as, let me just do that here, A minus B dot product A minus B. And this is A times A, and these are dot products, minus two times A times B, plus vector B times vector B, which is the length of A squared plus the length of B squared minus two times A dot product B. So now I have the length of A squared and the length of B squared on this side and on this side, so they just cancel out. This here is on the left side and it is also on the right side. So now I have: minus two times the dot product of A and B is equal to minus two times the length of A times the length of B times the cosine of the angle. Let me maybe write that one intermediate step, where I just cancel these things out. And for the exam you should absolutely be able to do some basic linear algebra like this. So the dot product of A and B is the length of A times the length of B times the cosine of the angle, which means the cosine of the angle is the dot product of A and B divided by the length of A times the length of B. Which is just the cosine similarity as defined above, right? Of A and B.
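The derivation above can be checked numerically. The following is a minimal sketch in plain Python, not code from the lecture; the function names and the two example vectors are made up for illustration. It computes the cosine similarity as the normalized dot product and compares it with the cosine obtained by solving the law of cosines for the angle.

```python
import math


def dot(x, y):
    """Dot product: sum of the pairwise products."""
    return sum(xi * yi for xi, yi in zip(x, y))


def length(x):
    """L2 norm: square root of the dot product of x with itself."""
    return math.sqrt(dot(x, x))


def cosine_similarity(x, y):
    """Dot product divided by the product of the two lengths."""
    return dot(x, y) / (length(x) * length(y))


def cosine_from_law_of_cosines(a, b):
    """Solve |a - b|^2 = |a|^2 + |b|^2 - 2 |a| |b| cos(gamma) for cos(gamma)."""
    c = [ai - bi for ai, bi in zip(a, b)]
    la, lb, lc = length(a), length(b), length(c)
    return (la ** 2 + lb ** 2 - lc ** 2) / (2 * la * lb)


# Hypothetical example vectors with a 45 degree angle between them.
a = [3.0, 0.0]
b = [1.0, 1.0]

print(cosine_similarity(a, b))
print(cosine_from_law_of_cosines(a, b))
print(cosine_similarity([6.0, 0.0], b))  # a scaled by a factor of two
```

Both computations should agree up to floating-point rounding, and doubling the length of `a` should leave the result unchanged, since only the angle matters.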
So the cosine similarity, you typically compute it in this way, as the normalized dot product, but it's really the same thing as the cosine of the angle between the two vectors. Right, and for the angle it does not matter how long the vectors are, right? You are normalizing them here, you're dividing by the length. If this vector is twice as long, you're dividing by the length, it's the same cosine similarity. So the cosine similarity adjusts for the length of the vectors. That's very important, for the exercise sheet and also for understanding. So when you compute the cosine similarity, what do you have to do? You have to compute the dot product, easy. It's just a sum of products. Then you have to divide by this, hmm, this is costly. I have the sums of the squares, and then I divide by the square roots. Here's a very important thing you should understand, also for the exercise sheet. Say you want to compute cosine similarity with many documents, so like here, I mean I don't have it on the slides now, I want to compute the cosine similarity of this with this, this, this, this, this. Now what I can do is I go over this, I compute the dot product of this with this, divided by the length of this and the length of this. Now what you could do, of course, because you're reusing the lengths of these things every time, is just pre-compute them and store them somewhere. But that's still not what you should do. What you should do is just normalize the vectors. You just divide each document by its length so that afterwards it has length one. And let me just verify for you that that is true, because people sometimes get it wrong. How do you normalize the vector so that it has length one? Let's just take the length of x prime. So x prime is x divided by its length, so the length of x prime is the length of x divided by its length.
So this is a scalar factor, a constant, so I can just pull it out of the norm. So this is one over the length of x times the length of x, which is indeed one. So that's one little proof, right? If you divide a vector by its length, you get a vector that has unit length, that has norm one. And let's also check that this here is true. So if I now do cosine similarity, and I'm not writing sim here, I just abbreviated it to save some space, and I work with the normalized vectors, let's just plug it in. I have x prime and y prime now. By definition it's the dot product divided by the lengths of the vectors, which for normalized vectors are one. Right, they are normalized. And now let's just plug in what x prime is, x divided by its length, so this is now x times y divided by the length of x times the length of y, and this is actually the same as the cosine similarity between x and y. So this is the proof that if you divide by the length, you do indeed get length one, and that you can then just compute the cosine similarity between the normalized vectors, it's exactly the same. So what you always do when you want to compute a lot of cosine similarities: you just normalize your vectors, and afterwards you can just compute dot products. So you normalize them once, and this you also have to do for the exercise sheet, and then you just compute dot products and everything is fast. Any questions about this? Okay, here's one more. That's also important for the sheet, but it's also an interesting relation. Yeah, just look at this part. Again, the animation doesn't work. I mean, we had these toy matrices, four times six. Right, you don't have a lot of zeros here. In a typical large matrix, 100,000 terms, millions of documents, you will have a lot of zeros here, right?
And this matrix is huge. You can't store this matrix explicitly. It's just too large. Let me just write that here maybe. For example, let's assume, and that's not untypical, we have 100,000 words, and maybe we have 10 million documents, right? Now I already have 1,000 billion entries in my matrix, right? That's a huge matrix. 1,000 billion. A billion is already giga. If I have four bytes per entry, that's four terabytes. If I have eight bytes per entry, that would be eight terabytes, just to store that matrix. But most entries are zero. A given document only contains a small subset of the words. So how you will always store a matrix, except in our toy code, and that includes the exercise sheet, is in sparse matrix representation, and you do it as follows. It's very easy to understand, you just store triples. You don't store the matrix explicitly, you just store it like this, so these are the row indices here, let me just write it here. Nothing of this is complicated, but you have to understand how it fits together. Right, so for example this entry here says that in row two, let me maybe make it a little bit thicker, in row two and column three, I have a two, and it's this two here, yeah? So this two is this two here. Yeah, so you just store the non-zero values, and you say, okay, this non-zero value occurs in this row and in this column. That's sparse matrix representation. Super simple. There are two ways to store it. How do you sort the entries? I mean, in which order do I go through them? I can go row by row. That's how it's done here. First all the non-zeros in the first row, then the second row, then the third row, that's called row major. Or column by column, that's column major. I could also go other ways, but these are the two typical ones.
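The triple representation and the size estimate can be written down in a few lines. This is a minimal sketch in plain Python; the 4 x 6 matrix below is a made-up stand-in for the toy matrix on the slides (only the first row and the entry at row two, column three follow what is said above), and the function names are hypothetical.

```python
# Hypothetical 4 x 6 term-document matrix (rows = words, columns = documents).
dense = [
    [1, 1, 0, 1, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 1, 2, 1, 0],
    [0, 0, 0, 1, 1, 1],
]


def to_triples_row_major(matrix):
    """Store only the non-zero entries as (row, column, value), row by row."""
    triples = []
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            if value != 0:
                triples.append((r, c, value))
    return triples


def to_triples_column_major(matrix):
    """Same triples, but ordered column by column."""
    return sorted(to_triples_row_major(matrix), key=lambda t: (t[1], t[0]))


triples = to_triples_row_major(dense)
print(triples)

# Size of the explicit (dense) matrix for the numbers from the lecture.
num_words = 100_000
num_docs = 10_000_000
dense_entries = num_words * num_docs  # 10^12, that is 1,000 billion
dense_bytes = dense_entries * 4       # four bytes per entry
print(dense_entries, dense_bytes / 10 ** 12, "TB")
```

The sparse form needs space proportional to the number of non-zero entries only, which is why it remains feasible when the dense matrix would need terabytes.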
Row major, column major, and here is one important thing, popular exam question, make a note. Look: if you go row major, which is the typical representation, and you just look at the entries which have the same row index, that's just an inverted list, right? 0, 1, 3. Yeah, that's just the inverted list, assuming the scores are all 1, or you could also store the scores. This just says that this word here, and we are talking about word zero, so the zeroth row, occurs in documents zero, one and three. It's just: where I have a non-zero score, the word is contained. And if I also want the scores, if it's not always one, I also store them. Then I have tuples of doc IDs and scores. We also had that earlier. So in row major, this is everything from the first row, if we go back, right, if I just say there are three values here, in documents 0, 1 and 3, that's just an inverted list. It's easy to see, but maybe not obvious, that sparse matrix representation in row major and inverted lists are just the same thing. This is just all inverted lists concatenated. Okay, and you need that for the exercise sheet: sparse matrices in PyTorch, how you use them. It's very easy, you specify the non-zero values and you essentially specify exactly these triples, just you don't give a list of triples, but you give three vectors of exactly the same length. So here, this is the first triple, right? This here says value one at row zero and column zero, value one at row zero and column one, and so on. So you just give it as three separate vectors. Of course they must have the same length, it's just more efficient that way. And then you call sparse COO tensor, which is a sparse matrix in this case. You say what the dimensions are, and then you give this values vector, and you have to say which type.
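The correspondence between row-major triples and inverted lists can be made concrete. The following is a sketch in plain Python rather than PyTorch; the three parallel lists mirror the "three vectors of the same length" described above, but their contents beyond the first row are invented for illustration, and `inverted_lists` is a hypothetical helper.

```python
# Three parallel vectors of equal length: entry k says that
# values[k] stands at row rows[k] and column cols[k].
# Rows are word IDs, columns are document IDs (hypothetical data).
rows = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3]
cols = [0, 1, 3, 0, 2, 3, 1, 2, 3, 4, 3, 4, 5]
values = [1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1]


def inverted_lists(rows, cols, values):
    """Group the triples by row: one list of (doc ID, score) per word ID."""
    lists = {}
    for r, c, v in zip(rows, cols, values):
        lists.setdefault(r, []).append((c, v))
    return lists


for word_id, postings in inverted_lists(rows, cols, values).items():
    print(word_id, postings)
```

Because the triples are sorted row major, each inverted list is a contiguous run of the three vectors, so the whole structure is all inverted lists concatenated.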
And the COO simply stands for coordinates. In the older NumPy you could say row major or column major, so you could influence in which order it is stored. Torch always stores in row major, so it stores it like an inverted index. Which is a bit funny, so for the exercise sheet what you will do is you will create an inverted index, and from that build the term-document matrix in sparse form, which is internally stored like an inverted index. But then you can do stuff efficiently. And what PyTorch can also do is you can multiply sparse and dense. With NumPy and SciPy this wouldn't work, because the dense stuff was in NumPy and the sparse stuff was in SciPy. So you can do that. You can do Q times A sparse. Here's the sparse matrix, that's the dense one, it works. Yes, please. [Question from the audience:] Is it necessary to have it in row major or column major order, or can it be left in a random configuration? So you are asking, when I permute all three in an arbitrary way, does it make a difference? So the triples stay the same, just permuted? Yeah, yeah, I'm not sure. I mean, you just pick any permutation of the size of this and apply it to all three vectors, then it's the same representation. And the question is, is this call just as efficient? And I don't know, but that's a very good question. And in particular, is it the most efficient if you have sorted it in this way as I've done here? This is like row major sorted, right? First row zero, then row one. Please look it up and tell me. It might not be super easy to find out. Yes? Yes. Yes, absolutely. It's very important for that. I mean, it's now internal, so. Yes, it does. And it depends on what you do more often.
So, that kind of gives you a hint. Here, for example, you see that it's not so easy, right? This is a topic I could say a lot about. I mean, what do you do here? This matrix here, you go row by row, and when you scan over things, and just imagine these to be huge matrices, it's always good when things are close in memory. So when I'm going over these values, it's good when they are consecutive in memory. Which means this particular product, if this were a huge vector and a huge matrix, would work well if this is consecutive in memory and if this is consecutive in memory. Which means for this particular product it would be good if this were row major and this were column major. And so you already see the problem. If you multiply matrices, you would like to have the one in this order and the other one in that order. And now it depends on what you do which order is the best. And this is further complicated by, so I don't know how Torch does it internally, but Torch is of course made for being executed on a GPU, where it's done completely differently again. I mean, big matrix multiplications are done on the GPU nowadays. But it gives you a hint that it's a complex topic. So for best efficiency you might even consider storing it in both orders, if it's just created once and then used a lot, and you want to use it on the left or the right side of a matrix multiplication. So very interesting topic, but yeah. And I don't know how much PyTorch is optimized for fast CPU computation. I know it's fast on GPU. Okay, so we have one more part, which is not hard. It will take 10 minutes, but you need it for the second exercise. Let me quickly go through these slides, and sorry for that. That's probably one slide which I copied, and then it always had the, I like these meditative tasks a lot, just right for my.
So, it's not hard, so don't worry. Word embeddings, and it should actually be called word embeddings, but I won't change the title now. So the words in a collection, we already had this, are called vocabulary. So for example, for our running example, Internet web surfing beach. Typically you would say it's a set, the order does not matter, but let's give it an order. If we give it an order and we did that in our code, then every word has an ID. So, and the order is not really important as long of course as we stick to it. So if we just say okay this is the order of my words, then one is always web. So it's just I have to take any order and stick to it and then my words have IDs. That's nothing new, we already used that. And now this is also not new, we have already seen it, I just give it a new name now. So in the context of learning this is called one hot encoding. I can now write each word as a vector like this and we already kind of saw it for the query vector. But now let's not take a query, let's just take our individual words. So I have a document, a vocabulary with four words, internet web beach surfing. So I write each word as a vector, like the document containing only that word. Nothing, nothing really new. And this is called one hot encoding. So I have an ordered, why is it called that way? Well I have a large vector with only zeros and there's a one at exactly one place. And this doesn't have to be a vocabulary. You have any set of elements, 1000 elements, and you want to give each element a distinct vector. Well, just make vectors of length 1000 and the first element has one, all zeros, second has zero, one, all zeros and so on. So, yeah, we have already seen this, but here we just give it a name. This is called one-hot encoding. So I give each word a vector. Oh, and this is not blue, that's terrible, we can't continue before we make this blue or everything will crash. So how is one-hot encoding? We kind of used it implicitly. 
I mean, if you take, and there's one more idea here, so please stay with me for five more minutes. It's really not more. I have a document, so this was one of our documents, I think it was the fourth one, internet web surfing beach surfing, and it had this vector 1 1 2 1 as a column vector. Well, it's just the sum of the one-hot encodings of the words, right? I mean it's almost trivial, but that's how it is, right? Here are the one-hot encodings of the words, I have surfing twice, so that's why I have a two here. So that's one way to see the document vectors. You just have a vector for each word, which are these simple vectors which we have seen. I sum them up, I get the vector of the document. So in this view, words are vectors, documents are vectors, and I just get the vector for a document by summing up the vectors of the contained words. And the same is true for the query vector, right? Let me just go back once more here. My query vector here, and a typical query vector contains few keywords, and each of them once, you don't type the same word twice in the query typically. So a typical query vector, it's also just the sum of the vector for web and for surfing, and so it has a one here and a one here. A typical query vector has a few ones, otherwise zeros. Now here's one property, it's trivial, but we will see something interesting on the next slide. If you do it that way, then all pairs of words are equally different, right? If you take their dot product or their cosine similarity, it's zero, right? Because they have the ones in different positions. So yeah, I don't even have to write it. I take the dot product, it's zero. Internet and web, internet and beach, the dot product is zero. Any pair of words which are not the same, the dot product is zero.
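As a small illustration of this view, here is a sketch in plain Python for the four-word running example. The helper names `one_hot`, `bag_of_words` and `dot` are made up for this sketch; the vocabulary order is the one that gives the document vector 1 1 2 1.

```python
vocabulary = ["internet", "web", "surfing", "beach"]
word_id = {word: i for i, word in enumerate(vocabulary)}


def one_hot(word):
    """Vector of zeros with a single one at the position of the word's ID."""
    vec = [0] * len(vocabulary)
    vec[word_id[word]] = 1
    return vec


def bag_of_words(words):
    """Document (or query) vector: sum of the one-hot vectors of its words."""
    total = [0] * len(vocabulary)
    for word in words:
        for i, x in enumerate(one_hot(word)):
            total[i] += x
    return total


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


doc_vector = bag_of_words("internet web surfing beach surfing".split())
query_vector = bag_of_words(["web", "surfing"])

print(doc_vector)
print(query_vector)
print(dot(one_hot("internet"), one_hot("web")))  # distinct words are orthogonal
print(dot(query_vector, doc_vector))
```

The dot product between any two distinct one-hot vectors is zero, so this encoding carries no information about which words are related.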
And wouldn't it be nice if words that are similar had a higher similarity than words that do not mean the same thing? So wouldn't it be nice to have vectors here such that the cosine similarity between these two is kind of high and between these two is kind of low? Well, and that's what word embeddings are. So word embeddings, it's just the same, I have a vector for each element of my vocabulary, just it's not of this type with all zeros and a single one. And now I can also play around with other things. The dimension of these vectors does not even have to be the size of my vocabulary, and it typically is not. So for exercise sheet 10, we give you documents which have a vocabulary of over 100,000 words, but we only use 300-dimensional vectors for each word. And now, I mean, now it can't be that they are all pairwise orthogonal, because I don't have enough dimensions. Now I have exactly this property. And before we close, let me write a little bit of code for that, to show it to you and also to show how easy it is. Now how do you compute such vectors? I mean, computing the one-hot vectors is trivial, right? They're so easy. But these other vectors, with the property that vectors of words which mean the same thing are kind of similar, have a small angle between them, and others have a larger angle, how do you do that? Well, that's a lecture which we had to throw out. It used to be there in the previous years, if you are interested. There are many ways to do this, we just didn't have time for it this time. So for the exercise sheet we just give them to you. And let me just, that's the last code I write, but it's very nice. So let me just write embeddingstest.py here. Need a few more minutes. Let me just copy, and of course they are linked on the web. Retrieval, 24 datasets, embeddings. So now I just copied two files here.
Embeddings fastText, this is from Facebook, embeddings computed by Facebook for over one hundred thousand words, and here are just random vectors, you will see in a second why. And let me just quickly write some code here which just reads this file. Yeah, let me just do some argparse stuff here. Okay, argparse, I want to add an argument. What do I want? Embeddings file. Yeah, path to embeddings file, and let me, now I want to parse it, okay. Now I should get my, if name. Oh, and please tell me when I make a mistake. So now I have a, yeah, so I need an embeddings file now. And let me just take the, okay. And now let me load it. Embeddings is, okay torch, now I need torch again. How do I load the file? I just say torch load embeddings file. So now I'm loading the embeddings file, and maybe I should write it here. Okay. It always takes a second. Now I'm loading them, and it's really just a dictionary. It's a dictionary of words to vectors. So let's just, yeah, let me just take the embeddings for internet, and what else do I have? Web, surfing, yes, I wanted surfing. And beach. I don't know if the words are there, but let's see if they are there. Length of, yeah, let me just look at one of these vectors, how long is it? 300, so it's a 300-dimensional vector. So you see it's very easy, I just load this file with torch load. I want an embedding vector for a particular word? Just use it like a dictionary. Now this is a vector of dimension 300. And now let's just print the similarity, the cosine similarity between internet and web. And now how do I compute it? I think there's cosine similarity here. Yeah, you can just call torch cosine similarity here. I'm just using two. I don't have to normalize anything. Let me just... Okay, it's a tensor. So it's 0.6826. And let's maybe... Surfing and beach. No, I want internet and beach maybe.
Internet and beach, so you see a much lower score. Let's maybe also do one last one, between beach and surfing. Beach and surfing have something to do with each other, they are not exactly synonyms, right? So you see it's fairly easy to use. We don't know how these embeddings were computed, or I didn't tell you. So you see these kind of have these properties, right? I take the cosine similarity of internet and web, that's pretty good. Beach and surfing, not quite as high but still related. Internet and beach, lower. And you see they don't go towards zero or one, that's typical, because that would be hard, but there is some pattern here. And if I take the random ones, yeah, I just get something around 0 here, and you will do that for the exercise sheet. So as a last thing, let me just show you the exercise sheet. I hope it's there. It's a simple but important exercise sheet. So first do the evaluation, please. The first exercise is: take exercise sheet two, take the code, and just create the sparse matrix and do the same as in exercise sheet two. So just a simple search engine, but with matrix-vector multiplication. It's a nice exercise to get into this. And exercise two is just similarity search. So you have a movie as an input, and you use our edit distance stuff to search for it. It's very easy, and then you just find similar movies, just find the movies with similar vectors, and we give you the vectors. It's not a lot of code, and it's surprising how well it works. So really nice exercises to get into this stuff. So also in the new year we continue our tradition of going slightly over time. But you had two, three weeks for that. Any questions for now? There are just two more lectures with content. If you didn't do the exercise sheets until now, start now. Okay, that's it. See you next week.
Bye.
Welcome, all nine people in the room, to Database and Information Systems, lecture 11, which is about linear classifiers and logistic regression. Welcome also to the 28 million people on Zoom and in front of the television. So we will talk about your experiences with exercise sheet nine, and there is number 10, welcome. Vector space model and word embeddings, a light introduction to linear algebra. Another reminder about the official course evaluation, where participation is at an all-time low. And today more linear algebra: classification, linear classifiers, the perceptron, very briefly, it's of historical interest but a nice intro. Here comes number 11, 12, 13, wow, 14. Welcome. And logistic regression is the main topic. And exercise sheet 11 is about classifying movies. Two classes, funny and not funny. And you should do that based on the movie plots. And I see a rogue number here. And the new line is not correct here. Okay, and it's still a bit noisy in the room. Experiences with exercise sheet 10, I think this should be 10 here, right, and maybe this time let me start by showing the master solution. So this was a light introduction to linear algebra, I will not recapitulate it but just say what it was. Exercise one was just repeating the search engine from the second lecture, not with an inverted index, but with just matrix-vector multiplication. I basically showed it in the lecture. You just had to plug in BM25 scores, and then the results are the same, but you use linear algebra. The second task was more interesting, and that was, so here's the master solution for the sheet. So we have these embeddings, and I showed them in the last lecture, this is just, not for every word, but for over 100,000 words, a vector of dimension 300. And then we have these movie plots, which we also use today. So for the movies today we use them with labels, funny or not.
And now just based on the movie plots let's find similar movies and I've... So this is the command line similarity search. So I'm taking the movie plots as input and these embeddings and let's just do it. And I'm also as usual because it's so useful loading the approximate search so that I can easily select a movie and one of the movies from the exercise sheet was Interstellar. I mean I can even make a typo here. This is something we made for one of the exercise sheets. Convenient even in this context. Type something, you get the movie here. Select one. And now I find movies that are similar to Interstellar. How is it done? You just take the movie plot here. For every word, if it's in that embeddings, you look up the embeddings, otherwise you ignore it. And you just sum them all up and then you get one 300 dimensional vector for this movie. You do that for every other movie and now you just find the most similar vectors and see what comes out. So Interstellar is about future of earth, science fiction like movie, interstellar travel and so on. So what do we have here? Young researchers travel into the future, interstellar is also about time travel and so on, apocalypse. Here we have aliens deep, NASA scientists also about space, also here's nuclear war, earth. So kind of related and let's, what would be, let's another one that comes to my mind, Saving Private Ryan, which is a war movie. Let's look at that. Yeah, Saving Private Ryan is about World War II, lot of movies about World War II. And here the similar movies, this is also GI, World War II. And you might say, yeah, that's easy, but note, I mean, the machine has no concept here of which words are more important than others. So it's kind of surprising that it works so well. It just adds up all the words no matter what they are. I mean, yeah, I think not all the words, there are no embeddings for the stop words, but for exciting there's probably a word embedding for that. So it's surprising that it works. 
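The procedure described above (sum the embeddings of the known words of a plot, then find the most similar vectors) can be sketched in plain Python as follows. This is not the master solution: the tiny 3-dimensional embeddings and the three plots are invented for illustration, and `plot_vector` and `most_similar` are hypothetical names. The sketch also uses the trick from the previous lecture of normalizing each vector once, so that a plain dot product gives the cosine similarity.

```python
import math

# Hypothetical 3-dimensional embeddings; the real ones have dimension 300.
embeddings = {
    "space": [0.9, 0.1, 0.0],
    "alien": [0.8, 0.2, 0.1],
    "future": [0.7, 0.3, 0.0],
    "war": [0.0, 0.9, 0.2],
    "soldier": [0.1, 0.8, 0.3],
}

# Hypothetical movie plots.
plots = [
    "the crew travels through space into the future",
    "an alien arrives from space",
    "a soldier fights in the war",
]


def plot_vector(plot, dim=3):
    """Sum the embeddings of all known words, then normalize to length one."""
    total = [0.0] * dim
    for word in plot.split():
        if word in embeddings:  # words without an embedding are ignored
            for i, value in enumerate(embeddings[word]):
                total[i] += value
    norm = math.sqrt(sum(v * v for v in total))
    return [v / norm for v in total] if norm > 0 else total


vectors = [plot_vector(p) for p in plots]


def most_similar(query_index):
    """Rank all other plots by dot product with the query plot, best first."""
    query = vectors[query_index]
    scores = [
        (sum(q * v for q, v in zip(query, vec)), i)
        for i, vec in enumerate(vectors)
        if i != query_index
    ]
    return sorted(scores, reverse=True)


print(most_similar(0))
```

With these made-up numbers, the space plot should rank the alien plot well above the war plot, even though no word is weighted more than any other.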
And let's, yeah, just to see that it makes a difference whether you use meaningful embeddings, which were pre-computed and given, or these here: now you just have a random vector for each word, so for each pair of words, whether the vectors are similar or not doesn't have anything to do with their meaning, they are all like equally dissimilar. And now let's do the same. Let's look at similar movies. This is Interstellar again. And now we have, yeah, you would have to read the whole abstract, but just from a first look, it's not about space travel. It's not about aliens. Yeah. So here's another one, also not about aliens. So when you did the exercise sheet, you found that they are not very related. One important thing, that's why I'm showing this: random embeddings can also work, and I think you should understand this. So let's just look at Harry Potter movies. There are a lot of them, and they are all called Harry Potter and something. So let's just take the first one here. And if I take the first one, so that's my query movie, I do get a similar movie here. Even with random embeddings. And I'm just saying it now in one sentence, but try to understand it yourself. Even with random embeddings this works if the plots use very similar words. If they used the exact same words, you would get this as the first hit even with random embeddings, right? And for these Harry Potter movies, I mean, here you see it, right? There's just so much repetition here. This is here and this is here. So if you sum the embeddings up for those two, you get nearly the same vector. Think about it in case it's not clear. It was a very nice sheet, with a relatively easy technology you get something which already works quite well. So very cool sheet, so impressed how much one can do with linear algebra. That was our first intro.
Next two lectures are also about linear algebra. Really enjoyed this sheet, more straightforward than last two. As always I admire the beautiful code templates which are due to Sebastian, he puts a lot of work into them. Thank you. So with fast text embeddings, as I said, we didn't compute them, we just gave them to you. They're from Facebook. Results look like they make sense with random word embeddings, yeah. Look random mostly, but not always. I just explained it. Fast text embeddings were faster, I don't think so. They were not faster than the random embeddings, right? They are faster than, if you, faster than exercise one maybe, but I'm not sure about that. Yeah, well we suppose in the lecture I just used torch cosine similarity for the exercise sheet. You should have implemented it yourself, but if you didn't, yeah. Yeah, I think that applies to some people. It was for some of you first contact with PyTorch. PyTorch is a huge library. We did provide a cheat sheet for you. But still, you have to get used to it and I think one of the things that maybe is a bit hard in the beginning, it's not mainly a linear algebra package, it's a deep learning package and everything is a tensor. So a vector is a one dimensional tensor, a matrix is a two dimensional tensors and then, so all the interfaces are more tuned towards working with tensors of possibly higher dimension. So linear algebra things are sometimes a little unintuitive. And that's a good observation. If you look a little bit closer which was also part of the exercise, what, yeah, if you enter, now it was movies where kind of it's clearer what similarity means, interstellar, other movies about science fiction, future space travel. What's really similar? If you take, yeah, I don't know, a movie like Inception, Fargo, but many of you don't know movies so I want to talk more about this interesting aspect. 
One reminder, so the official course evaluation in the past when the exercises were mandatory and we gave you 20 points for the evaluation, we always had about 90-95% of the people filled out the evaluations. Right now it's below 20%. I even got a warning. So of course if you don't listen to the lectures, but also then you don't listen to this and you don't follow the course really, you don't do the exercises, you don't, I mean then you don't have to fill it out. If you follow the lectures at least maybe do. Let me just remind you, so it's still, you should have gotten this link. And it still runs until Sunday, but after that it's just over. So if you are participating in this lecture in some meaningful form, please give us feedback. And answer this question honestly, I mean it's anonymous. Are you doing the exercise sheets? A little bit, not really, mostly. It's interesting. Okay and we will try to, today after, I think we got some messages here and we will mail them and see. I think there are one or two people. Are there still people in this room who wanted to participate but couldn't for some reason? Can you show hands? Okay, nobody in this room. And it also filled up a little bit. Great. So, 15 past, we continue with a Danke Frank with the actual topic. First half will be relatively lightweight and I will go a little fast because it's easy so that we have more time for the interesting stuff in the end. Today it's about classification, very important problem. Yeah, given an object, predict to which class it belongs and no, there are not two examples. Example on the next slide. It's just I deleted one example on the next slide and here's an example. So yeah we use movie, so again movie plots. So now we have, that's the beginning of a movie plot. In a small island off the American coast, the Waterleys live in an old mill where a mystery bloody being, and now the class, the label is what genre is this movie. And yes, that's a horror movie. 
And it kind of looks like one. I mean, it doesn't say American horror movie from 1972, it just gives the plot. Otherwise it would be easy. But there are kind of hints here that it's not a comedy or romantic something, or maybe a combination. A starship crew in the 23rd century goes to investigate the silence of a distant planet's colony. Science fiction. Two actors attempt to make it in the cruel world of showbiz without an ounce of talent between them. Comedy. So let's assume you get a lot of movie plots like this and they are correctly labeled. And now you're supposed to learn, just from the text, what the correct label is. That's what classification is about. So maybe you're given 1000 of these, and let me just quickly show you the data set we give you for the exercise sheet. It's on the wiki; here I'm accessing it from the file system, under data sets. It's movies, funny and not funny, where not funny means it can still be very watchable. Train, yeah, so now you just have the movie plots, one per line, and here it says whether it is a comedy or not, whether it is a funny movie or not. Geography professor has divorced his wife, funny, okay. And you get these, and let's just see how many of these you have here for the training set. So just to learn, imagine you're a computer, you get 50,000 of these and now you're supposed to learn from the text to predict the label. And now learning is done and you're supposed to predict. Now you get a new text which you haven't seen before. Professor Iris discovers a secret in an ancient stone and when he opens the crypt he revives. What's your guess? Which genre would you guess given this plot? Fantasy. Okay, interesting. Yeah, maybe fantasy, maybe horror, maybe comedy. So it's not so clear, right, just from the plot. Okay, that's the problem. Quality evaluation is important. Now you have something that predicts; how do you measure it with a single number or a few numbers? 
And for that you have this training set for training; we already had that in the third lecture when we talked about evaluation. And then you have the test set. I mean, this concept you already know. These are different movies. You learn on some set of movies, you evaluate on another set, which you haven't seen. Let's just see. This test set is usually smaller. You want a lot to train but also a meaningful number to evaluate. So you have 16,000 movies to evaluate. So how do you do that? Let's first talk about the problem when you have many labels, like for the genres, science fiction and so on. So what you have is what's in the ground truth, so this is the set of documents labeled science fiction, for example, and what your algorithm predicts. And these two sets ideally are the same, but your algorithm will make errors. Then you can compute for each class the following, the precision. And let me just very briefly explain this. So you have a particular class, for example here, this is horror. So this is all movies which are actually horror movies and this is what your algorithm says are horror movies. This is what's meant by that. And the two are not the same. So we always have three sets here. These are the movies which are actually horror movies, but your algorithm said something else. These are the movies which are horror movies, and your algorithm correctly identified them. These are movies which are not horror movies, but your algorithm said it's a horror movie, your prediction. And then you can compute with these sets. So here we have: among my predictions, how many are correct? That's the precision. I will say this very briefly; you have to think about it yourself and do the exercise to understand why it's called precision and recall. This here is: from my predictions, how many are correct? And this is: from the ground truth, from the horror movies, how many do I predict as horror movies? 
That's called precision, that's called recall, and then you take an average here, and you may wonder why this is not precision plus recall divided by two. You take the harmonic mean here. Let me just very quickly do that calculation. F1 is the harmonic mean, which means you take the reciprocals 1 over P and 1 over R, you add them and divide by 2, and that gives you 1 over F1. That's called the harmonic mean. And let's just do the math here. 1 over P plus 1 over R: if I bring both to the common denominator PR, then I get R plus P over PR. That's just some very basic math, and if I plug this in here then I get that 1 over F1 is R plus P over 2PR. And if I turn this around then I have F1 is equal to 2PR over P plus R, which is I think exactly what's written here. So yeah, it's just the harmonic mean. You take the mean of the reciprocals and then turn it around again. And this is something nice to prove. I think it has also been an exam question in the past. Try to prove it yourself: this is 100% if and only if these two sets of documents are the same. So if the algorithm always does the right prediction. If there's only one error, this thing here cannot be 100%. Okay. We will mostly talk about two classes today, and also the evaluation for the exercise sheet is for two classes, so you will look at this for two classes, and there's more terminology because one very often has two-class problems. So for two classes you often call the two classes plus and minus, positive and negative. So for our exercise sheet, the funny examples are the plus ones, the not funny examples are the minus ones. So you have a certain property, is it fulfilled or not, funny in this case. And then your predictions partition the input into four sets; it's a partitioning. So each element now has exactly one property. 
And you just look at what is the prediction and what is the truth, in all four combinations. So you have the true ones. This is where the algorithm was correct: true positive, it is plus and predicted as plus, or true negative, it is minus and predicted as minus. And the false ones are where your algorithm makes a mistake. So it is minus but you say plus, that's a false positive. It is plus but you say minus, that's a false negative. So the first word says whether it's correct, and the second word is not what it actually is but what you said; it refers to the prediction, right? False positive: you said positive, but it's wrong. False negative: you said negative, but it's wrong. And when you do that, and this is useful for the exercise sheet and you should know this, then, I mean, just look at it. Precision and recall then just refer to the precision and recall of the positive class. You could also look at it for the negative class, but for two classes you only look at one. And try to understand this: what your algorithm predicts as positive is the ones which it predicts as positive and which are indeed positive, the true positives, but also the ones where it predicts positive and is wrong, the false positives. I think that's clear from this, so it's just the union of these two. And the ground truth you can also write in terms of these four sets: the ground truth is the ones which are positive and your algorithm finds them, and the ones which are positive but your algorithm doesn't find them, and these are the false negatives. So this is something you should also know, right? Just look at this: the ones which are predicted as positive are TP and FP, and the ones which are actually positive are TP and FN. I don't have to say it again. Okay, you need this for the exercise sheet. Just one thing for understanding: the difference to lecture two. 
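To make the four sets and the measures built from them concrete, here is a minimal sketch in plain Python. The function names, the tiny label lists and the convention of returning 0.0 when a denominator is zero are assumptions for illustration, they are not taken from the exercise sheet.

```python
def confusion_counts(truth, predicted, positive="funny"):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for t, p in zip(truth, predicted):
        if p == positive and t == positive:
            tp += 1  # said plus, is plus
        elif p == positive:
            fp += 1  # said plus, is minus
        elif t == positive:
            fn += 1  # said minus, is plus
        else:
            tn += 1  # said minus, is minus
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 of the positive class.

    Returning 0.0 for an empty denominator is an assumed convention.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical ground truth and predictions for six movies.
truth = ["funny", "funny", "funny", "not funny", "not funny", "funny"]
predicted = ["funny", "not funny", "funny", "funny", "not funny", "funny"]

tp, fp, fn, tn = confusion_counts(truth, predicted)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(tp, fp, fn, tn, precision, recall, f1)
```

Note that the true negatives do not appear in precision, recall or F1 at all; they only complete the partition.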
So in lecture two we already talked about evaluation, training set, test set and so on. And there the ground truth was a set, the relevant documents. And let me just write that here. The relevant documents. And I had two slides about this. The ground truth was the relevant documents, but we computed a ranked list. Here the situation is that the ground truth is a set, so I want to find these funny movies, and my algorithm also computes a set. It doesn't do ranking or anything. Whatever my prediction is, it will give me the set that says here are the funny movies, the others are not funny. And, I'm sorry, in the future when you do research work, when you write a project or thesis: people sometimes just say I evaluated the precision and recall. You always have to say what these sets from the previous slide are. So if you want people to understand what your evaluation did, always explain what these sets are and what the classes are. Then it's also clear what you evaluate. Okay, linear classifiers. So this was classifiers in general. You have a problem with two labels or more and you want to classify, some basic terminology. We will talk about linear classifiers. So objects are vectors now in d dimensions, so it will be more mathematical now. We have exactly two classes like I just said, plus and minus. And now we want to separate them by a hyperplane. And for the hyperplane I will give the mathematical definition on the next slide. The hyperplane just has one dimension less than the space, and it's easy to understand for d equals two. So let's just draw an example. If I'm in two dimensions and I have some points here and some points here, then, yeah, this is my hyperplane. My hyperplane is the line. The space is two dimensional, the line is one dimensional. And if you want to separate things in three dimensions you can do it with a plane, which has two dimensions. Now, of course the points do not have to be separable. 
Maybe I have a minus over here, and what do I do then? I have a slide about that. And how this works is: the algorithms, we will look at two of them today, find such a hyperplane, and then the decision for a new point is made as follows. So I have a query point, for example here, and they ask what label it is, and I just look at which side of the hyperplane it is on. It's on the plus side, then it's plus. Okay, and if you have a plane, it naturally divides things into two classes. The question is, can you also use this approach for more classes? I also have a slide on this. First some basic mathematics which you should know from a linear algebra course but maybe you forgot. So a quick recap of a few things. Two definitions of a hyperplane. And I used to prove that they are equivalent; I will not do that, but the proof is still there if you are interested. So that's one definition, which is maybe the more intuitive definition, and let me draw a bad picture for this. So that's definition one. What do I have? Maybe here is my coordinate system, and this is my hyperplane here, some plane somewhere in space floating around. I take one point on the plane and that's my A here, that's my anchor point, and then I have two vectors in the plane, for example this one and this one, and let me call them H1 and H2. Here I have two because this picture is in three dimensions. So if I want to describe a plane in three dimensions, I think this is even something you might have done in school: I have an anchor point here, a point on the plane, and then I can write every point on the plane as this anchor point plus some linear combination of h1 and h2. 
So that's one definition, and the other definition is also not hard, and we will use that one because it's much nicer to work with. Let me draw it again here. For this definition I just have a normal vector and I will call it W. W for weights, not for normal, because the components are weights or values which we are going to learn. And the hyperplane consists of all points where the dot product, and let me say this again, this is the dot product here of two vectors, of W with the point is a fixed value b. It's actually a very nice exercise to prove that these are equivalent. It's not hard, but we will just assume it. It just takes time if I do it now. And I think anyway you only learn it if you do it yourself. And we will use that definition. So to define a hyperplane you have a vector W in the full space, you have a scalar b, and then the hyperplane is just all points which fulfill that property. And this is the normal vector. So in that case it's a little less intuitive: in that case H is all points for which the dot product with this W is b. And please do ask questions anytime. Now we do some mathematics because we need it all the time. How do you compute the distance from a point to a hyperplane? So let's assume we are always working with that definition now. The hyperplane is all the points where, for a given W and b, this condition holds true. And now I'm claiming that the distance of any point to this hyperplane is this. And let's prove this. And more than that: this is always a non-negative number because I'm taking the absolute value here. If you take the absolute value away, then the sign of this, the dot product minus the b, tells you whether the point is on the left or on the right side. I mean, that's not obvious, right? And let's prove this. It's a very nice proof and some recap on linear algebra. Let me just draw a line here and let me have my point. That's not the dot product, that's my point. 
So maybe let me not make it so thick. Or let me make it. Come again? The b? Okay, so the W is a vector in d dimensions and b is a scalar. So these are the two values which define the hyperplane. You give me a vector and a b and they define a hyperplane. So that's definition two. And I didn't prove that that is the same as this one. So I give you a vector and a number and then this defines a hyperplane, namely all the x's which satisfy this condition. To really understand it one would have to do the proof, but we now just take this definition. So this is my x and this is my hyperplane here. So now my example here is in two dimensions, and this W is a normal vector. It's perpendicular to the plane. Now let me draw the following picture. Namely, let's take here the projection of x onto the hyperplane, let me draw it perfectly. So this is a right angle now; let me call this x prime. It's a very nice proof, and if you have trouble with this you should definitely do it at home. That's the level of linear algebra you should know for this course, not more, not less. Let me take the normalized version of W, which means I divide it by its length; we did that in the last lecture. The L2 norm, which is the Euclidean length. Let me call that W zero. And then the distance, which is what I'm looking for here, the distance of x to H, let me just name this r, and then I can write x as x prime plus r times W zero, right? So the length of this thing here is r, I mean, it's the distance. Let me just say that again. No, I didn't want the highlighter, I wanted the pointer. 
To get from x prime to x, this is the projection of x, I have to go away from here perpendicularly, and if the distance is five, I have to go five times the normalized normal vector to reach x. This is what this here says. Okay, and now, even if this was not 100% clear for you, you have to think about it at home. As you do in mathematics, let's just take it for granted and proceed from here and try to prove it. And what could we do? Let's multiply it with W on both sides. So this implies, if I multiply it with W on both sides, that W times x is W times x prime plus r times W times W zero. With a dot product, everything is linear, so you can pull normal multiplication in and out. So you get this. So nice thing about linear algebra, you can just do stuff and then magical things happen. So what's this, W times x prime? Do you have an idea what this is? From what is on the slide you can say what this is. What do we know about x prime? So x prime is on the hyperplane, right? It lies on the line. So what do we know about this value? Yeah, it's b. I mean, the hyperplane is all the points where W times the point is b. So that's b. So it was a good idea to multiply with W. And this here, W times W zero, this also looks familiar. Let's just write the definition of W zero, that was W divided by the length of W. So we have a vector multiplied with itself, and we had that in the last lecture, it's just the square of its length, divided by its length once, so what remains is the length of W. Nice thing, you just do it. You don't necessarily have to understand; you could always ask for the geometric intuition, but let's just do it here. Yeah, and now we almost have it, so let me just write what we have derived. W times x is b plus r times the length of W, and this just says that if I solve for r, then r is W times x minus b, divided by the length of W, which is just what we claimed. And then there is the case that x is on the other side. I'm sorry, writing down here is a little bit harder. 
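The result of this derivation, r = (w · x − b) / |w|, is easy to check numerically. The following is a small sketch in plain Python; the helper names and the example hyperplane are made up for illustration. It returns the signed value, so the sign also says on which side of the hyperplane the point lies, and the absolute value gives the distance.

```python
import math


def dot(u, v):
    """Dot product of two equally long vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def signed_distance(w, b, x):
    """(w . x - b) / |w|: positive on the side the normal vector w points to."""
    return (dot(w, x) - b) / math.sqrt(dot(w, w))


# Hypothetical hyperplane in two dimensions: all x with 3*x1 + 4*x2 = 5.
w, b = [3.0, 4.0], 5.0

print(signed_distance(w, b, [3.0, 4.0]))   # on the side w points to
print(signed_distance(w, b, [0.0, 0.0]))   # the origin, on the other side
print(signed_distance(w, b, [3.0, -1.0]))  # a point on the hyperplane itself
```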
If x is on the other side, then to get from x prime to x I have to go in the opposite direction of the normal vector. So the normal vector points to the positive examples, or at least that's how I define it. Then x is x prime minus r times W zero, and it's analogous. You can just do the same thing again and then you will get this thing reversed here. So this is just to show you the level of linear algebra that you need, and how you do it in principle. As usual, maybe you were convinced now by the individual steps, but to really understand it, you have to do it yourself. There's no way around it. A typical exam question is to do this proof, parts of this proof or variants of this proof. So you should certainly understand this, but you have to do it yourself to fully understand it. Let's go on. One thing that's important: it's a bit annoying what you need to specify a hyperplane. I mean, let's go back to these two definitions quickly. This is the intuitive definition. You specify a point on the plane and then d minus one vectors, but you have so many values which you have to specify, the A and these vectors. This one is nicer, you just specify one vector, which is the normal vector, which also gives you a direction, this side, the other side, and this offset. Wouldn't it be nice if we didn't need this number? Then we only have one vector, and that's what you usually do in these tasks. And that's what this slide is about. So that's our typical definition. You have a vector and a scalar, and this is the definition. Now let's look at this definition here. I just add one dimension and now the bias is zero, which means this goes through the origin, right? It's a plane, yeah, H contains the origin. 
I mean, it's clear that it contains the origin, right? If you plug in the all-zero vector, you get zero. Here, with the b, not necessarily. So it contains the origin, and the origin is just the all-zero vector. So if I do that, then I can very easily prove that if my original vector is in this plane, then this vector is in this slightly different plane, where I've added one dimension to my normal vector and the value there is minus b. And let's just verify this by taking this value here, W times x minus b. So if this is zero, then x is contained in this set. And let's just write it down. So that's this dot product minus b. And this is just the same as adding one dimension here, with minus b in the one vector and one in the other. I mean, just convince yourself that this is the same. And I want to write this by hand here: this is just W prime times x prime, if I call this here W prime and this here x prime. That's the formal way. What you can take away from this is what you can do if you want to get rid of the b, and again, do it yourself to fully understand it. You have a problem, you have vectors, for the exercise sheet 300 dimensional vectors for example; just add one dimension and give every vector in your input the value one in that dimension. So that's what you do here. And then you can search for a hyperplane where the bias term is zero. And understand why it works: you add the additional dimension, and what happens is that in your W vector the value in that additional dimension will effectively play the role of the bias term, namely minus b, which you would have had if you had solved the problem in the lower dimension. And this is the math. So that's a nice trick. So you should understand it on two levels. One is how you do it in practice and for the exercise sheet, just add a dimension with one everywhere, and the other is why that works, and that's the mathematics, which is easy. Two more things. 
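The trick can be checked in a few lines. This is only a sketch in plain Python with made-up helper names and numbers: every point gets an extra coordinate with value 1, the normal vector gets an extra coordinate with value −b, and the new dot product equals the old w · x − b.

```python
def dot(u, v):
    """Dot product of two equally long vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def augment_point(x):
    """Append the constant 1 as an additional dimension."""
    return list(x) + [1.0]


def augment_normal(w, b):
    """Append -b, so that the bias of the lifted hyperplane is zero."""
    return list(w) + [-b]


# Hypothetical hyperplane and point in two dimensions.
w, b = [3.0, 4.0], 5.0
x = [2.0, 1.0]

w_prime = augment_normal(w, b)
x_prime = augment_point(x)

# Both expressions give the same value, so x is on the original hyperplane
# exactly when x_prime is on the lifted one (which goes through the origin).
print(dot(w, x) - b, dot(w_prime, x_prime))
```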
This is just to understand: what if the data is not linearly separable? Let me just give you one nice example. So all the methods we look at today, and in fact most methods, also work if it's not linearly separable. They will always compute a hyperplane no matter what. But here's a nice trick and I just wanted to show it to you by an example. So let's assume we are in R. So our points are just scalar values, in R to the one. And let's assume here I have minus one and I have minus two, and here I have one and here I have two. So I'm living in a not so interesting one dimensional world. And let's say I have, how do I draw them? Maybe in red, and maybe I have a plus value here. And then I have two minus values here. So a linear separator here would just be a value also. I mean, this can't be linearly separated, right? There's no way. You would need two cuts or intervals or something like that, but you just can't cut them with one. Of course, if all the minuses were on one side and all the pluses on the other side, it would work. And now by lifting it to a higher dimension, so let's just look at the function. Here's one function. And there are many functions which would do the job. So this goes from R to R2. And I'm just doing the following: out of every point I make a two dimensional point, so I'm now transforming my problem into a more interesting two dimensional world. And now, let me also draw the one and the two, and maybe you can already see it in your mind, minus 2 becomes minus 2, 2, and 2 becomes 2, 2, right? So let me just write it here for clarity: this is the point minus 2, 2, this is the point 2, 2, and what happens to those? This goes here and this goes here. And look, they are linearly separable now in two dimensions. I mean, such an easy trick, very effective, right? I've just lifted them to one dimension higher and now I could, yeah, for example, there are many, and you also see there's not just one hyperplane. 
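Here is a minimal sketch of this lifting in plain Python. Two things are assumptions: the function x ↦ (x, |x|), which is one function consistent with −2 going to (−2, 2) and 2 going to (2, 2), and the labeling, where the inner points ±1 are plus and the outer points ±2 are minus. The separating hyperplane is also just one of many possible choices.

```python
def lift(x):
    """Map a scalar to two dimensions (assumed function: x -> (x, |x|))."""
    return (float(x), float(abs(x)))


def dot(u, v):
    """Dot product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


# Assumed labeling: not separable by a single cut on the number line.
labels = {-2: "-", -1: "+", 1: "+", 2: "-"}

# One possible hyperplane in the lifted space: the horizontal line at height
# 1.5, with the normal vector pointing down towards the plus points.
w, b = (0.0, -1.0), -1.5

for x, label in sorted(labels.items()):
    side = "+" if dot(w, lift(x)) - b >= 0 else "-"
    print(x, lift(x), "label", label, "predicted", side)
```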
These points here, these I could not linearly separate; these I can linearly separate. Nice trick. So you can always linearly separate by lifting to a higher dimension. One more thing. So these hyperplane things seem to work. A hyperplane, which has one dimension less than the space, always divides the space into two parts: the one where the normal vector points, which we call the positive side, and the other one. What if I have more classes? Do I have more hyperplanes or what? You don't need it for the exercise sheet but you should know it. One option is that you just build K binary classifiers. So assume you have K classes; just for every class learn a classifier like we do in the exercise sheet. Funny or not, thriller or not, science fiction or not. Then you can do multi-class classification. Now the question is what you do when one classifier says yes, it's funny, and another says yes, it's science fiction. Then you have to take the one with the highest score, but maybe their scores are not comparable because they are different classifiers. Or you play it like a tournament, so you classify, okay, is this comedy or science fiction, is this comedy or horror, so you just play all pairs of class labels against each other. That's a lot of pairs. Or whatever you do to learn your hyperplane, you somehow extend it so that it can also deal with more than two classes. So later we will talk about logistic regression. There is multinomial logistic regression, which can then work with multiple classes. So there is always an extension of the theory, but we don't do that today and it doesn't work for all methods. So for exercise sheet 11 we keep it simple, you just have two classes. For today, two classes. Okay, before the break, one super simple method. So now we have seen the foundation, I think it was just some linear algebra, you know what classification is; how do you find such a hyperplane now? 
So that's what the second part of the lecture is about. Where is my, I don't see the pictures here. Just to clarify, we are given our points now and the task is to find this hyperplane, which, as we have seen, is defined by just a W. How do I find such a W? Here's one method which is just so simple, and it's so surprising that it works. You just iteratively compute the W. So all we need now is a W; the b we got rid of by just lifting to one dimension higher. And let me just say that again: it's called W because you typically call the values which you learn weights, that's the reason, I will not go into it. So as I said, we will only consider hyperplanes with b equals zero; you can always do that by lifting one dimension higher, so we just have to learn a W, which makes everything simpler. And you just improve this step by step, hoping that you arrive at a good one. The perceptron is from 1958, and it's kind of the core idea behind everything learning-wise. So you don't use the perceptron anymore, but you use some very similar stuff, and it's so easy that you learn something by seeing what it does. So let's just do the algorithm. You want to find a hyperplane. Let's start with a very stupid hyperplane, which doesn't even make sense. I mean, that's not even a hyperplane because it contains all the elements. So that's a degenerate case. But you can even start with that, or you can start with a random vector. You don't know anything. Now you go over the objects from your training set, over your examples, over your movie plots, one by one, and you do the following. Right prediction means: this dot product is greater or equal to zero, so it says positive, and the label is indeed plus, or it says negative and the label is minus. So in that case we do nothing, then it's right. 
Now we have our hyperplane, it has a normal vector. We see a point and we just check, oh yeah, the point is on the right side of the hyperplane. So the hyperplane is fine for that point. We do nothing. Now comes the case where it gives the wrong prediction, so these are the other two cases: it says positive, it's on this side, but it's actually a negative one, or the other way round. Then you do the following. If it's wrong such that the sign of this is negative but should be positive, you just add the point to the weight vector. And this should look very strange to you. I mean, this is the normal vector and now you're adding the input point. This doesn't make sense really. But it's just super simple. And if it's the other way around, then you subtract it from the W. You have to decide what you do when it's exactly on the hyperplane. You can either do it here or there or even do nothing. That's it, that's the algorithm. So you just start with a random W, then you go through your points, and one by one for each point you check: does this hyperplane do the right thing for it? Then do nothing. If not, do one of these two steps. And then you do that again and again. Note, important thing: you start with a random vector, you have 100 input points, you go through them in some order, ideally in a random order, and now you've done it for the 100 points, for all of them in some order. Now you have a different vector, right, than the one you started with. So it makes sense to do it again, and that's also what you do in learning. And this is called an epoch. So you go over all your training samples once, but now you're in a different situation. You have a different W and you're improving it iteratively. So now you can do it again, again, again, again with the same training samples. 
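The algorithm as described can be sketched in a few lines of plain Python. Several things are assumptions for illustration: the function names, the toy data, starting from the all-zero vector, counting a point exactly on the hyperplane as predicted plus, and the fixed number of epochs. Each point already carries the extra coordinate 1, so no separate b is needed.

```python
import random


def dot(u, v):
    """Dot product of two equally long vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def predict(w, x):
    """+1 if x is on the side the normal vector points to (ties count as +1)."""
    return 1 if dot(w, x) >= 0 else -1


def train_perceptron(samples, epochs=20, seed=0):
    """samples: list of (x, label) with label +1 or -1. Returns the learned w."""
    rng = random.Random(seed)
    w = [0.0] * len(samples[0][0])  # degenerate start: the all-zero vector
    order = list(samples)
    for _ in range(epochs):         # one epoch = one pass over all samples
        rng.shuffle(order)          # go through the samples in random order
        for x, label in order:
            if predict(w, x) == label:
                continue            # right prediction: do nothing
            # Wrong prediction: add x if it should be plus, subtract if minus.
            w = [wi + label * xi for wi, xi in zip(w, x)]
    return w


# Hypothetical, linearly separable toy data (last coordinate is the added 1).
samples = [
    ([2.0, 2.0, 1.0], 1), ([3.0, 1.0, 1.0], 1), ([2.0, 3.0, 1.0], 1),
    ([-1.0, -1.0, 1.0], -1), ([-2.0, 0.0, 1.0], -1), ([0.0, -2.0, 1.0], -1),
]
w = train_perceptron(samples)
print(w, [predict(w, x) == label for x, label in samples])
```

On data that is not linearly separable the loop still terminates after the given number of epochs, but the final w will misclassify some of the training points.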
Yeah, you learned something, you went through all the slides of the lectures, now you can start again and learn them again because your brain is now in a different state. That's an epoch. Go over the training data many times. Okay, and that's what the perceptron does. Now there's a lot of theory behind the perceptron, we won't do that of course, but just this one slide on why it works, and really, you should think this doesn't make sense. Why does it make sense? So here's one intuition, but that's maybe not a very good intuition, and then I will give something more formal but very simple. So this is my hyperplane and this is my normal vector which defines the hyperplane, right? So on this side of the hyperplane, where the normal vector points, are the positive examples, or are supposed to be the positive ones, right, and on the other side I have the negative examples. Okay, so let's assume I have an object x here. How is it labeled? It's labeled plus, but it's wrongly classified as minus. So this is my x and it's here. That's what I'm writing here. And then I'm saying it makes sense to correct W in the direction of x. So that's the situation. I have an x here which is actually plus, but my predictor says it's minus because it's on the other side of the hyperplane. Maybe I should move the hyperplane a little over here in the direction of this x, and this is what adding x does, and subtracting x does the opposite, so it moves the hyperplane closer to x or away from x. And if you look at the mathematics, it's actually very easy, but I think you need the combination of the mathematics and this intuition. So our current W says it's negative, but actually it should be positive because it's a positive label. Now look what happens when I add the x. 
So I've already written it down here. Let's go through it step by step. So I'm now updating my weight vector, my normal vector, as follows: W becomes W plus x. I'm shifting the hyperplane, so now the new prediction is W plus x, times x, and that is W times x plus x times x. And now here, what's this? I mean, x times x is the squared length of x, which is greater than zero. That's all the mathematics you need to see it. So if I add x, then the new prediction, this value here, will be larger than the old one. I'm adding something that's larger than zero. That's all you need to understand here. It was negative, it should be positive, and I'm adding something. So maybe it's still negative, but I'm making an improvement in the right direction, right? That's what I'm showing here. It said minus three, it should be positive. Now I'm adding one maybe, or 1.5. Maybe it's still negative but it got better. Maybe I didn't shift the hyperplane far enough so that x is now on the other side, but at least I shifted it a little. That's what the perceptron does. And that's also what the other algorithms do. So the correction goes in the right direction. And of course, if you switch these two, then it goes wrong; the sign here is very important. You have to make this little shift in the right direction. And here it's just the other way around. If it's positive but it should have been negative, then you just correct in the other direction. And it's interesting that it's so simple. So that's all the intuition you need, and really, it's like the simplest algorithm ever. It can't possibly be any simpler, right? If you implement it, it's like, I don't know, two lines of code. And just to mention this, and then we have our break: why does it even work? And again you should stop and say, wait, this can't work. Think about it, let me look at this picture again. I get a training sample, it's on the wrong side, now I shift my hyperplane. Now I get the next training sample and then, oh, now I shift it back because that's wrong. 
Why doesn't it go back and forth and cancel previous things out? I mean, even if you look at this algorithm, right? I'm going point by point, I'm making corrections. Okay, now it works better for this point, but maybe I destroyed everything I've done before. It's not at all clear why these things converge. And for the perceptron, it's so old, people had a lot of time to think about it. There is a convergence proof: if my data is perfectly linearly separable, that is, there is a line which perfectly separates the points labeled plus from the ones labeled minus, then you can even give a bound on the maximum number of steps before it will find a perfect hyperplane. We don't do that theorem here, it's merely of historical interest, but at the time this was a big thing, and not at all obvious. And nobody uses the perceptron nowadays. We will now look, in the last part after the break, at a more principled method, logistic regression. And this is also interesting on a meta level, so let me say this one more sentence. The perceptron is kind of: let me just try this, and oh, it works. And yeah, there's an intuition for why this works, but it isn't very principled, right? It's like I'm coding something and it's the first thing that comes to my mind, the simplest thing that maybe works. And what we will do in the next part after the break is a bit more principled. And then it's interesting that you arrive at something which actually looks similar, but with a twist, and works much better because it's just well-founded. So that's a nice experience on the meta level. The next part is seven slides, but more mathematical, and afterwards that's it. So take a break and prepare for seven more mathematical slides, see you. Seven more slides, but rather mathematical slides. Maybe the most mathematical slides from the whole lecture.
But they have this super nice property that the formula in the end, the update step for logistic regression, is super small, but you would have never come up with it yourself without the mathematics. So let's start with the sigmoid function. Who has heard of the sigmoid function before? Okay, great. So maybe that will be a nice warm-up. These are also typical exam questions which I show now, and in any case good practice. The sigmoid function looks as follows. It maps any value to a value between zero and one, and that's also why it's useful: it's for turning any value into a probability. And it looks like this. This here is 0.5, and that's sigma of x. So what does it do? If you have values which are negative, not even very large, then it's basically zero. If you have values that are positive, not even very large, it's basically one. And a lot happens around zero. If it's exactly zero, it's 0.5, which basically means I don't know. So negative is no, positive is yes. And let's look at a few properties and let's maybe prove them. P1, let me go back here. So that's the function, and let me write it again as a fraction here. The function is one over one plus e to the minus t. And when I write it, let me not write t but x, because the t always looks a bit like a plus. We can of course change the variable. Let's look at what happens if I go to minus infinity. Well, let me write it not quite correctly, but that's okay in this case. What do I get when I plug in minus infinity here? e to the minus infinity is zero, right? No, it should be e to the minus minus infinity, so e to the plus infinity. Yeah, exactly. So I have one divided by one plus a very large number, and this is zero.
Yeah, so this is not quite correctly written, but it is correct. And let's do the same for sigma of plus infinity. Now I have 1 divided by 1 plus e to the minus infinity, and e to the minus something very large becomes 0, so this is 1. And if I plug in the value 0, you just have to plug it in: it's 1 divided by 1 plus e to the minus 0, which is e to the 0, which is 1, so I get 1 over 2, which is 0.5. And it's also easy to see that it can never be less than 0. I mean, it's always non-negative, no way you get a negative number here. And it also can't be larger than one, because this e to the minus x here is positive, so the denominator is larger than one. Okay, that was a warm-up, let's do the second property. That's interesting and it will be useful in the following. So what is sigma of minus x? Let's just plug it in. It's 1 divided by 1 plus, it was e to the minus x, so it's now e to the x. And let's just multiply numerator and denominator by e to the minus x. When I do this, I get e to the minus x divided by e to the minus x plus 1. And now let me just add plus one and minus one in the numerator here. Why not? So the numerator is e to the minus x plus one, minus one, and I can write this as two fractions. The first one, e to the minus x plus one over e to the minus x plus one, is one. And the rest is minus one over e to the minus x plus one, and that's just minus sigma of x. So in total it's one minus sigma of x. And this also has an intuition in the graph. It just says that the graph is symmetric around this point here, right? Let me just show that here. If I look at the value at minus 3, and let's say it's 0.01, and I want the value here at three, then it's one minus 0.01. It just looks exactly the same, mirrored on the other side. That's what this says. So it's very symmetrical.
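The first two properties can also be checked numerically. The following is only a small sketch; the sample points are arbitrary choices for illustration, not values from the slides:

```python
import math

def sigmoid(x):
    """Logistic sigmoid: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# P1: value at zero, and behaviour for fairly negative / fairly positive inputs.
at_zero = sigmoid(0.0)       # exactly 0.5
far_left = sigmoid(-10.0)    # close to 0
far_right = sigmoid(10.0)    # close to 1

# P2: symmetry, sigma(-x) = 1 - sigma(x), checked at a few sample points.
symmetry_gaps = [abs(sigmoid(-x) - (1.0 - sigmoid(x))) for x in (0.5, 1.0, 3.0)]
```

The entries of `symmetry_gaps` are zero up to floating-point rounding, which is the numerical counterpart of the mirrored picture in the graph.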
And the third property, let's also prove that. And please tell me when I make a mistake. That's the derivative. It will be important in the following. So let's compute the derivative. A lot of this here is a recap of stuff which you should know, but maybe you forgot it or never really understood it. Now is the time to practice. So what's the derivative of one over x? That's just minus one over this thing squared. And I could always say a lot about mathematics didactics. I didn't do it a lot in this course so far because we didn't have a lot of mathematics yet, but one important point, because I know that many of you are struggling with mathematics, maybe not those in the room, maybe those who are not listening, but anyway. The thing about mathematics is: nothing is really hard, but if you're already struggling with more basic stuff, then it's very hard to understand more advanced stuff, right? If this here is something you have to think about for longer because your foundation is shaky, and you now have to apply it in a context where you have a more complicated function and need the chain rule and the inner derivative, then the cognitive load just adds up. That's the one problem with mathematics. My claim is that whenever somebody has problems with mathematics, that's it. When you even have trouble adding fractions and have to think about, wait, when I add fractions, what exactly do I have to do, and you need cognitive load for that, then you don't have anything left for what is actually being taught. So yeah, it's stuff building on other stuff. And the way to go, if you have problems, is always to look for the level where you have problems. And if that means going back to basic arithmetic, maybe you have problems multiplying two numbers in your head, then that is where you should start. I think that's the single most important advice.
You can always catch up, but you have to go to the level, however low or not low it is, where you're having problems, and go from there. Because if your foundation is shaky somewhere, you can't build on it. Okay, that was, what, a Sunday sermon? So we first compute this outer derivative, and now comes the chain rule. Let me at least write it here. It's a bit small, but that's the chain rule. I just did as if this thing was an x, but it's not, it's a function, so now I have to derive this inner function and multiply by it here. The derivative of the one is zero. The derivative of the other term: if it were e to the x, it would just be e to the x, but because it's e to the minus x, I have another chain rule and I get another minus. And this is exactly what I was just saying. Maybe by looking up the chain rule you can do it, but it takes you five minutes, and that's a problem. Time adds up, cognitive load adds up. So that's minus e to the minus x. Is that correct? Yeah, maybe it's correct, maybe. And now something is still missing, right? I think I forgot something. Now I did so much meta talk. Is this correct now? Minus times... Is it correct or did I do something wrong? I'm not sure, let's just continue. Okay, let me write it like this. This is one over one plus e to the minus x, I mean, I know where I want to go, times, and then I have two minuses here which cancel out, e to the minus x over one plus e to the minus x. The first factor is actually sigma of x, and the second one, I think we have already seen it above here. Yeah, this is just this one here. Right, it's the same. So we've already seen that it's one minus sigma of x. So the derivative is just sigma of x times one minus sigma of x. So we have proven this, we have proven this, and we have proven this.
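Written out, the computation of the derivative reads as follows. This is only a restatement of the spoken steps, using the symmetry property P2 for the last equality; the layout is an assumption and is not copied from the slide.

```latex
\begin{align*}
\sigma(x)  &= \frac{1}{1 + e^{-x}} \\
\sigma'(x) &= -\frac{1}{(1 + e^{-x})^{2}} \cdot \left(-e^{-x}\right)
            = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} \\
           &= \sigma(x) \cdot \sigma(-x)
            = \sigma(x) \cdot \bigl(1 - \sigma(x)\bigr)
\end{align*}
```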
And this is relevant because, I mean, the derivative could be any complex function, but the nice thing here is: let's say I have something which evaluates the sigmoid function, then I can also evaluate the derivative in terms of the original function. That's the nice thing about this. I could also write out what the actual thing is, like here, but I can express it in terms of the original function. Okay, this was the warm-up, the three central properties of the sigmoid function, and let's move on. Here's some terminology. So D is the dimension of the input space. Again, we got rid of the b, we are only looking for a w. That's why I gave it a capital D here, typically you don't do that, but just to avoid confusion: we started with a small d, and our capital D is just one dimension more. And here it should of course be not n but D, I used to call it n. Okay, now what do I have? I have my training examples, which are vectors. Let me very quickly relate it to the exercise sheet. For the exercise sheet you have your texts. You do it with embeddings and with frequency vectors, but let's just talk about embeddings. So for each document, for every word you look at the embedding, and if there is one you just add them all up. You get a 300-dimensional vector for each object. This is how you always do it. Whatever your objects are, you somehow turn them into vectors. That's what we did in the last lecture, that's what we do today and what we will do in the last lecture. So every plot becomes one vector now, the sum of the word embeddings. So this is my input, and the label is... okay, here I wrote it for multi-class, because logistic regression can be generalized to multiple classes, but in the lecture we do two classes. So for each one I just get 0 or 1, and it's exactly how it's written in the file. Just 0 or 1. Positive, 1, is funny, and 0 is not funny. Okay, that's our input.
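Turning a plot into a vector by summing word embeddings could look roughly like this. It is a sketch under assumptions: the tiny 3-dimensional embedding table and the function name are made up for illustration, whereas the embeddings on the exercise sheet have 300 dimensions.

```python
def document_vector(words, embeddings, dim):
    """Sum the embeddings of all words that have one; skip the others."""
    vec = [0.0] * dim
    for word in words:
        emb = embeddings.get(word)
        if emb is None:          # no embedding for this word
            continue
        vec = [v + e for v, e in zip(vec, emb)]
    return vec

# Hypothetical toy embeddings (3 dimensions instead of 300).
toy_embeddings = {"funny": [1.0, 0.0, 2.0], "movie": [0.5, 1.0, 0.0]}
plot_vector = document_vector(["a", "funny", "funny", "movie"], toy_embeddings, dim=3)
# "a" has no embedding and is ignored; "funny" is counted twice.
```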
I think that's easy to understand. And yeah, there is a generalization, but we won't do that today. Five more slides to go, but mathematical slides. This one is also easy and nice, but maybe something you didn't fully understand in the past, so it's just a quick recap: maximum likelihood estimation. And I deliberately did that because I think it's important, so if you have problems with that, again, go back to it, try to fully understand it. This is, I think, the simplest possible example to explain maximum likelihood estimation. I have a coin and it's a biased coin, which means it's not 50-50. It shows maybe heads more frequently than tails, or the other way around. So let me just clarify this. H is heads, so one side of the coin, and T is tails. Is it heads, head, tail, tails? I'm not sure, it doesn't matter. It's just two sides, two possible outcomes, and I just give them the names H and T. So I toss the coin 20 times and this is what I observe. That's the situation in maximum likelihood estimation. I have an observation. And I have a model, a probabilistic model, which makes an assumption on how this came to pass. And here the model is: okay, this is a coin. It has a certain probability to show up heads, and this probability is p. That's my parameter. If it were a fair coin, it would be 50%, but it's not necessarily a fair coin. Maybe it shows heads 10% of the time, but I don't know this p. So that's my model. And also, and I should write this here: independent. That's also implicit. So different tosses are independent. The outcome of the third toss has nothing to do with the outcome of the first one or the fifth one. Independent.
Now I can compute the probability, typically called the likelihood in that context, and I'm sorry that this is not blue, we can't continue before we make this blue. That's also easy, that's super basic. So given this model, what's the probability that this happens? Well, I do 20 independent things in a row, so I'm just multiplying. Okay, there's five times heads, so p times p times p and so on, not in that order, but it's commutative, and 15 times tails, so 15 times one minus p. So the probability of this happening is p to the 5 times one minus p to the 15. It's just how it is. That's very basic probability. And now the goal is: okay, that's my observation, that's my model, that's the probability of making that observation. What is the most likely model parameter? What is the most likely p for that to happen? Or you could also ask: was it a fair coin? Is it likely that it was a fair coin when that happens? Probably not, because tails showed up much more frequently. And this example is so simple that we can actually compute it. I will do that, and on the next slide you see an example which is also not super complex, but very quickly, when your model is not as trivial as this one, you can't compute it anymore. But here we can compute it. So what do we want? I define this here as L of p, it's often called the likelihood. We want to find the p which maximizes this. I mean, that's a simple optimization. I think you still do that in school. You want to find the p where this here is maximal, which means you somehow have to compute the derivative. It's not nice to compute the derivative of this. Think about it, right? Product rule, it gets messy. Here's one thing which you always do. You want to find the p that maximizes this.
You can also find the p that maximizes some strictly monotone increasing function of this, like the logarithm. It's just the same. I will not prove it here, but whether you maximize L or log of L, because when the one is larger also the other is larger, it's the same thing. You're not interested in the maximum value, you're interested in the p that maximizes this. So let's just do that, and if that's not clear, please take it home and try to prove that it is the same thing. So let's write down ln of L of p, and you will see instantly, and that's a very typical situation: likelihoods are usually products of probabilities, very ugly to derive, differentiate, ableiten. If you take the logarithm, it becomes a sum, much nicer. So now we have, let me just write it down, ln of p to the five times one minus p to the 15. And now, just by the way logarithms work, that's five p plus 15... No, that's not true of course, I'm sorry. It's 5 times ln p, right? The product becomes a sum and so on: ln of p to the 5 is 5 times ln p, plus 15 times ln of 1 minus p. And now let's derive this, and that's now much nicer to derive. That is 5 ln p, and the derivative of ln x, that's also a basic derivative, is 1 over x, and there's no chain rule stuff going on here, so that's just 5 over p. Plus, for the second term, that's 1 over 1 minus p, then chain rule, the inner derivative is minus one, so we have a minus here: minus 15 over 1 minus p. Okay, and now we don't have a lot of space, but I think it will be enough. That's how you do it: you want to find the p that maximizes this, so you want to find the p where the derivative is zero. So 5 over p minus 15 over 1 minus p is equal to zero. And the next step is only an equivalence when p is neither 0 nor 1, otherwise it's not equivalent and strange things happen. Let's multiply by p and by one minus p, then I get: five times one minus p is 15 times p.
One more step: five minus five p is 15 p, so five is equal to 20 p, I think. And that's p equals 1 over 4, right? And now I also have to check, and we will not do it here but you can check it at home, that the second derivative, no, it's not small l, it's ln of capital L, evaluated at this one over four, is negative, which means it's a maximum at one over four. Yeah, that you also have to check. Otherwise it could also be a minimum, but it's actually a maximum. So p is one over four, and look, it makes sense. I mean, you might have said, yeah, I've seen that already: heads is three times less likely than tails, so p is 1 over 4, which means by the way that the opposite probability is 3 over 4, right, they have to sum to one. So it's kind of the obvious thing, but you had to compute it. You never know, right? So if you see this observation and you have to make the best guess of the parameter, then it probably was a biased coin where the probability of showing heads was one over four. And that's the way to do it. Here I could compute it exactly, but it's always that principle. That slide, in case you didn't understand that so far, is really important. It's a very simple principle, and for this simple example we can do the whole math with very basic methods. Any question about that before we proceed? And there are only four more slides, but they will become increasingly mathematical. So logistic regression also does this. And I'm trying to make it analogous to the previous slide so that it becomes clear. There, my observation was a sequence of coin tosses. Here, my observation is my input points and my labels. So I'm getting these movie plots and labels with them. And there is one thing I can't help you with: the assumption is always that this kind of happened according to some probabilistic model.
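Coming back to the coin example for a moment: the result p = 1/4 can be double-checked numerically by simply scanning candidate values of p. This is a sketch for illustration only; the grid resolution is an arbitrary assumption.

```python
import math

HEADS, TAILS = 5, 15   # the observation from the slide: 5 heads, 15 tails

def log_likelihood(p):
    """ln L(p) = 5 ln p + 15 ln(1 - p), defined for 0 < p < 1."""
    return HEADS * math.log(p) + TAILS * math.log(1.0 - p)

# Scan a grid of candidate parameters and keep the one with the largest value.
grid = [k / 1000 for k in range(1, 1000)]
best_p = max(grid, key=log_likelihood)

# Closed-form answer from the derivation: heads / (heads + tails).
closed_form = HEADS / (HEADS + TAILS)
```

The grid search and the closed formula agree, and a fair coin (p = 0.5) gets a visibly smaller log likelihood than p = 0.25.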
But then again, our whole world happens according to a probabilistic model, right? Quantum physics is probabilistic deep down. Everything is in a superposition of states, and when something interacts, something happens. So I think that's always confusing to people. They say: wait, what? This is just given, it's fixed, but you assume a probabilistic model. This is something where I think I cannot help you, I can just say: think about it, try to wrap your head around it, right? Again, the coin tosses, once they happened, they just happened, but there was some probabilistic process underlying them, and that's the model. So here what happened happened, we have our movie plots with labels, but some process led to it. And now the assumption is, and I can only give you a very rough intuition for it, but still it's an intuition, just a probabilistic model. The probability that the label is one for a point, and I think there should be an x i here. I'm sorry, let me just fix that, and the same is true here. x i. Professional PowerPoint skills, so I always have a second job, just in case. And first check, without asking why it makes sense, that this is a probability distribution, right? This plus this gives one, and also, that's the nice thing about the sigmoid, it gives you something between zero and one. This is a probability, and this is just the opposite probability. So the labels are somehow created using this probability distribution. And now let's skip this, let's look at this. It's not easy to get an intuition for this model, but try to understand it. And it is worthwhile, I think. Now you could say, I don't care. Maybe you're one of those people: I just want to see the mathematics, I don't care why this is the assumption, I will just compute with that. But I think it's worthwhile to think about it. But you have to do it yourself and at home.
And by the way, do you also see that there? There's some cryptic writing up here. Where does that come from? Oh my, aliens. And just to give you a little help: what this model assumes is that I have my hyperplane as usual and I have my normal vector w. And what is this here? This is a value we already know: if the point is on this side of the hyperplane, it will be a positive value. And the further away it is from the hyperplane, the larger the value, it's basically the distance from the hyperplane, right? So what this is saying is: okay, if something is very far on this side of the hyperplane, it's surely one. This is a value extremely close to one. And, that's what the sigmoid function does, you don't even have to go far from the hyperplane. So that's kind of the assumption. I have a hyperplane; on this side, if it's only a little bit away, the label is almost surely one, and on the other side it's zero. If I'm on the plane, then I get 0.5, and if I'm very close to it, it's also not so clear, and only further away it becomes very clear. So kind of the best intuition for this model is: you have this hyperplane, but these values are not probabilities, and you're looking for a way to turn them into probabilities, and the sigmoid function is kind of the simplest way to go. It's a bit like BM25: there we also wanted a formula with certain properties, and here's the easiest formula that does it. So it's a bit that kind of thing. But yeah, like quantum mechanics, you can think for hours, days, weeks about what it means. And I think it's worthwhile to do it a little bit, but then you also have to just say: okay, these are the assumptions, and carry on. I think it's important to say this: if you don't understand why these assumptions make sense, it doesn't mean you are stupid. It's good to ask that question.
Right? Yeah. Why? But on the other hand, these are the assumptions. Okay. Now you can compute the likelihood just like before. Let me switch back and forth between the slides. If I have this model assumption, now I can compute the likelihood of this observation, right? Just like for the coin tosses. Let me just do this here. The formula is now a bit more complicated, but actually it's very nice here. And also let me say this on the meta level, it's really important for mathematics. Yes, at certain points try to get an intuition, but then sometimes just compute with what you have, right? Ignore the intuition and just do the math, and that's what we will do here. Okay, let me just compute it and see what comes out, whether I understand it or not. So this probability here, let's just write that. The probability for a label y i is, and I have two cases, right? If y i is one, then it's sigma of w times x i. I'm just copying what's written up there. If y i is equal to zero, then it's the opposite, one minus that. Now that's not nice, right? If you compare it again to the previous slide, there I didn't have such a thing. Now I have a case distinction: if it's this label it's that, if it's that label it's that. And the nice thing is, I can write this as one expression. That's very strange, but yes, I can do it: sigma to the power of y i, times one minus sigma to the power of one minus y i. It's strange but it's true. I mean, just plug it in. If y i is equal to one, then one minus y i is zero, so this factor disappears, and I get this one, which is this case. If y i is equal to zero, then this factor here disappears, something to the zero is one, and I get this term.
So I have this case distinction, and I can just turn it into one expression, which moreover I can now differentiate, derive, ableiten, right? And this is what I get here. So the probability for a single sample becomes this. It would not be nice to compute otherwise, I mean, like this I just have one closed expression, right? It's really nice that I can do it like this, and that this is equal to this you can trivially check, it's just funny that you can do it. And then, again, I have an independence assumption. I do it for every sample point, so I just multiply the probabilities. So that's my likelihood here. And as you can see, it's now slightly more complicated than the one on the previous slide, but it's in principle the same thing, that's why I did these two slides in the same way. There my model parameter is the p and that's my likelihood, and there of course I have a fixed example. I didn't say so many times heads, so many times tails, then I would have two more variables in there. So here it's a bit more abstract. Here I also have the sigmoid, and I have vectors and not just scalars. So it's more complex on multiple levels. But the task is the same. This is a probability, and now I want to find the w that maximizes it. So when you look at this at home or prepare for the exam, go back and forth between these two slides to see that it's the same thing, just more complex here. So our task on the next two slides will be to just try to solve this problem. And again, same trick: we don't take the likelihood, because deriving a product of strange things is not nice, let's just take the logarithm as usual. Instead of trying to find the w that maximizes this, find the w that maximizes the log of this, which is the same thing, same reasoning as before. So that's our task. Given this, find the w that maximizes this. This looks very hard, and indeed you can't do it in closed form.
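The single-expression trick and the resulting log likelihood can be written down directly. The following is a minimal sketch; the function names and the two toy data points are assumptions for illustration and do not come from the exercise data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sample_probability(w, x, y):
    """sigma(w.x)^y * (1 - sigma(w.x))^(1 - y): one expression for both cases."""
    s = sigmoid(dot(w, x))
    return s ** y * (1.0 - s) ** (1 - y)

def log_likelihood(w, xs, ys):
    """Sum over all samples of y ln sigma(w.x) + (1 - y) ln(1 - sigma(w.x))."""
    total = 0.0
    for x, y in zip(xs, ys):
        s = sigmoid(dot(w, x))
        total += y * math.log(s) + (1 - y) * math.log(1.0 - s)
    return total

# Made-up toy data: two 2-dimensional points with labels 1 and 0.
xs = [[1.0, 2.0], [1.0, -1.0]]
ys = [1, 0]
w = [0.0, 1.0]

# Independence assumption: the likelihood is the product over the samples.
likelihood = sample_probability(w, xs[0], ys[0]) * sample_probability(w, xs[1], ys[1])
```

For label 1 the expression reduces to sigma(w.x), for label 0 to 1 - sigma(w.x), and the logarithm of the product equals the sum computed by `log_likelihood`.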
Let's again go back to this slide. This you could solve. Find the p that maximizes this. We could do it: logarithm, derivative, one over four. That was easy. And even if you don't have five and 15 here, but m1 and m2, you will get m1 divided by m1 plus m2. That's what you get here, right? It's basically five over five plus 15. So that was easy. And here's what you typically do in all of machine learning: whenever you have such problems where you can't compute the exact solution with a closed formula, you do something iterative, just like for the perceptron. But now not with some rules which just came to your mind during sleep, or not sleep, but in a more principled way. So again, you start with some w, and now in each step you compute the gradient. Now why do you do that? That's actually also not hard to understand, and you maybe have already seen this somewhere. I can't draw a 2D landscape, so I'm drawing a 1D landscape, which is just a function here. Right, so I'm here at my w, and now I'm looking for the maximal value in that landscape. And for the comment I made here, let me just make this a capital R, boldface R. The picture I want you to see here is: my weight vectors are now a plane, and on it I have some landscape. I can't draw a 2D landscape very well, so just think about it: you have the plane like the earth, and now you have a landscape there, and you want to find the highest point, right? That's what you want to do. And now you are at a certain point. Why is this not working? You are at a certain point and now you just look at the direction of steepest ascent, and that's the gradient. That's what you get if you compute the partial derivatives at that point.
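The climbing idea can be tried out in one dimension on the coin example, where the answer 1/4 is already known. This is only a sketch of the iterative principle, not the logistic regression update itself; the starting point, step size and number of steps are assumed values chosen so that the iteration stays between 0 and 1.

```python
def log_likelihood_slope(p):
    """Derivative of 5 ln p + 15 ln(1 - p) with respect to p."""
    return 5.0 / p - 15.0 / (1.0 - p)

p = 0.5              # arbitrary starting point
step_size = 0.001    # assumed value; a much larger step would overshoot
for _ in range(2000):
    # Move a little in the direction in which the function increases.
    p = p + step_size * log_likelihood_slope(p)
```

After the loop, `p` is very close to 0.25, the maximum found exactly before. For logistic regression the same loop is run on the weight vector w, with the gradient in place of the one-dimensional derivative.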
And one dimension lower: if you compute the derivative here, it gives you a direction, and it just says, go in that direction and you will reach higher points. And go in the direction where you reach higher points the fastest. And you can do the same thing in two, three, and more dimensions. You are somewhere in a landscape, go where it goes uphill the steepest. And that's what we will do now. We will do derivatives in higher dimensions, which are just partial derivatives. It's also conceptually easy, and this gradient just gives you the direction of steepest ascent. So we don't have a closed formula for finding the highest point, but at least we can try to get there by always climbing upwards. So now comes some nice calculus. There's only one more slide to go after this, the next slide is easy and it's the last one. So let's do the mathematics. It's very nice, and the thing here is, these are really very simple derivatives, but now you are not in one dimension but in multiple dimensions. And you can actually see it here. I mean, this is now a product of vectors, right? And I think let me spend that one minute, because it's worth it. So this is our w, and we know it has D dimensions, and this is our x, and it also has D dimensions. Then the scalar product is just the sum of these things, right? w i times x i. And now I'm taking the derivative of this thing, and let me just take the derivative with respect to one of these components, w i, right? Now I have the sum of these w i x i. Everything is constant now except for, maybe it's the w5, right? So I just get w5 times x5, all the others are constant terms when I derive with respect to w5. And what happens if I have w5 times x5, where the x5 is not a variable? Then the derivative is just x5. If you derive five x by x, you get five, you get the factor.
So this is just x i. No, x i, not w i, thank you. I said it right and wrote it wrong. And if you now take the derivative with respect to the whole vector, that's just by definition doing this for all components, so you get x 1 up to x D, which is the x again. So it's actually trivial, but you have to lift it to the higher dimension. And the funny thing is, if these were all scalar numbers and this were just normal multiplication, then this is obvious. The derivative of 5x is five. But you can also do it with vectors, and here's the proof. So even lifting this to higher dimensions is not hard, you just take the concepts which you know from one dimension and lift them to higher dimensions, and then it works. So this we have already proven here, and of course you can just do it and assume that it works, or try to convince yourself that this is true. And so now let's just do as if these were normal numbers and compute the derivatives. This here, I don't think I will do the whole math again, but let me very quickly do the beginning, right? If I take the logarithm here, then the product becomes the sum as usual, from one to n, and now I have ln of this, which means y i times ln of sigma, I don't need these parentheses and so on, I will not write the whole term, sigma of w times x i, plus the second term. It always looks like this: you have this product, and the logarithm of the product becomes the sum of these things. And that's what I have already done here. So that's the log likelihood, and I think this should not be called L, this is actually my ln of L, that's what I optimize, but let me now just call it L, it doesn't matter, I could also use a different symbol for that.
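In symbols, the two things established so far are the gradient of the scalar product and the log likelihood that is about to be differentiated. This is a restatement of the spoken steps only; to avoid a clash, the index j is used here for vector components and i for training samples, which is a notational assumption and may differ from the slide.

```latex
\begin{align*}
w \cdot x &= \sum_{j=1}^{D} w_j x_j,
  \qquad \frac{\partial}{\partial w_j} (w \cdot x) = x_j,
  \qquad \nabla_w (w \cdot x) = (x_1, \dots, x_D) = x \\[4pt]
\ln L(w) &= \sum_{i=1}^{n} \Bigl( y_i \ln \sigma(w \cdot x_i)
  + (1 - y_i) \ln \bigl(1 - \sigma(w \cdot x_i)\bigr) \Bigr)
\end{align*}
```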
Here, we have already established, we have even proved it, that the derivative of the sigmoid is this. We want to compute derivatives here, we have the sigmoid, so we certainly need the derivative of the sigmoid because of the chain rule, and this we have just proven. So let's just do it. And again, now's the time. Maybe you have some intuition for it, maybe not. Now just do the math. That's very important when doing mathematics: take your time to understand the mathematics, the intuition, but then also, okay, now let's just compute. Let's just take this monster here and compute the derivative. And I have to make this one favorite remark of mine: this is like some expression divided by four, plus 15 to the power of three, minus seven. Is this hard to evaluate? Yes and no. It's complex, you have to do it, but it's not really hard, right? If you know how to multiply, subtract, divide and take powers, you just do it. It's just complexity from having a lot of it. It's exactly the same here. Every single thing is simple, it's just, oh, I have to compute the derivative of all this. So let's do it. It's the last thing we do, and it's a lot of fun actually. So let's compute this dL by dw. We derive with respect to w, so it is the sum for i from 1 to n of y_i times ln of this, okay, so that's y_i times 1 over this thing, 1 over sigma of w times x_i. Now I have to compute the derivative of this thing, chain rule, right? The derivative of ln x is 1 over x, and then chain rule: I also need the derivative of the inner thing. So multiply with the derivative of this inner thing. I just do it.
So I get sigma of w x_i times 1 minus sigma of w x_i, that's the derivative of the sigmoid, and one more chain rule, times the derivative of the inner thing. This was the derivative of just the sigmoid. But there's not just w in here, it's w times something. The derivative of that is x_i, so times x_i. Of course, it's very easy to make a mistake here. Times x_i. And, so nice, something cancels out here, right? Cancels out. And now plus, so we just do the same thing for the next term and see what happens. Plus, and it's still part of the sum, one minus y_i times one over this, one over one minus sigma of w x_i. Now the derivative of this. Oh, it's 1 minus, hmm, I'm confusing myself. How do I derive this? Oh, I should say this, it was on the previous slide: one minus sigma of something is equal to sigma of minus that. I think it will be easier if we do it that way. So let's do that. Then we have the same thing as before. Now we have sigma of, oh, now I have a lot of minuses. Sigma of minus, times one minus sigma of, oh, now I have lost track of the minuses, times, okay, I think I have a minus here now, minus x_i. I hope, I think I do. Okay, anyway. If not I will just pretend it's the other way around so that it works, but just for the lecture. So let's see what comes out. It is equal to the sum for i from 1 to n, it's always the sum of these things, and it already simplifies quite a lot: y_i times 1 minus sigma of w x_i times x_i, minus one minus y_i times sigma of w x_i times x_i. And I think more stuff cancels out, I hope, right, does it? y_i minus, yes please? You said earlier that one minus sigma of w times x_i is the same as sigma of minus w times x_i. Is that not true? I hope so. I mean, it should be true. Yeah, it was this one here, P2, right? Yeah, that was this, yes. And then I made a mistake, probably, yes. Yes, you're right. I think we get the minus probably twice here, right?
I know what you mean, but it doesn't matter if we have the minus twice here, it's still the same thing, right? This here with a minus. Yeah, it becomes the same thing, but what I'm confused about now is whether I need another minus here or not. I don't think we can resolve it now for reasons of time, but you are right to point out that there was something where I lost track over here. So this is still correct, because even if I put a minus here and here, the product is the same thing. There should be a minus before it, because we have a minus w on top. Yeah. So we do another one minus sigma of. Yes. There should be another minus. There should be another minus where? Coming from where? From the derivative of ln of one minus sigma of w times x_i. Yeah, I think so too, but I'm a little bit confused now. So, I think you're right. Let's just see whether this last equality holds here. y_i minus, yeah, this cancels out, that was true. Maybe there are three minuses here or one minus, one of them is true. Right, here I have minus y_i times sigma of this, and here I have plus y_i times sigma of this, so this also cancels out. I can't really write it that way, I would need an intermediate step. But anyway, a little shaky here, but that's fine, you have to do it yourself anyway. So please do it yourself and pay attention to the minuses. And sorry that I can't fully resolve it now. There are a lot of minuses here to keep track of and I was a little quick. But that is what I wanted to get as a result. So either there's one minus here or there are three minuses here, and effectively one minus. Let's go to the next slide, which is the last slide. So, this is what comes out. And now, let's apply this not for all the points, but just for one point.
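For anyone redoing the calculation at home, here is one way to write the derivation with the minuses tracked explicitly. It is a sketch under the assumption that the log likelihood has the form recapped above (any constant factor such as one over n is left out, since it does not change the direction of the gradient), and it uses only the two facts from the previous slide: the derivative of the sigmoid and the identity 1 - sigma(z) = sigma(-z). Both routes for the second term give exactly one effective minus.

```latex
\begin{align*}
L(w) &= \sum_{i=1}^{n} \Big[ y_i \ln \sigma(w \cdot x_i)
        + (1 - y_i) \ln\big(1 - \sigma(w \cdot x_i)\big) \Big],
        \qquad \sigma'(z) = \sigma(z)\,\big(1 - \sigma(z)\big) \\[4pt]
\nabla_w \ln \sigma(w \cdot x_i)
     &= \frac{1}{\sigma(w \cdot x_i)} \cdot \sigma(w \cdot x_i)\big(1 - \sigma(w \cdot x_i)\big) \cdot x_i
      = \big(1 - \sigma(w \cdot x_i)\big)\, x_i \\[4pt]
\nabla_w \ln\big(1 - \sigma(w \cdot x_i)\big)
     &= \nabla_w \ln \sigma(-w \cdot x_i)
      = \big(1 - \sigma(-w \cdot x_i)\big) \cdot (-x_i)
      = -\,\sigma(w \cdot x_i)\, x_i \\[4pt]
\nabla_w L(w)
     &= \sum_{i=1}^{n} \Big[ y_i \big(1 - \sigma(w \cdot x_i)\big)\, x_i
        - (1 - y_i)\, \sigma(w \cdot x_i)\, x_i \Big]
      = \sum_{i=1}^{n} \big( y_i - \sigma(w \cdot x_i) \big)\, x_i
\end{align*}
```

In the last step the terms minus y_i sigma x_i and plus y_i sigma x_i cancel, which is the cancellation mentioned in the lecture.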
Then what do you get? Let's just go back one slide. I'm correcting my w in the direction of the gradient. I've computed the gradient. Let's just do it for one point, and now I get this. And then how much do I go in the direction of the gradient? That's just a factor which we call the learning rate, right? I see this direction. Now how much do I go there before I evaluate again? Maybe 0.1, 0.01. It's a parameter, it's the learning rate. So now I got this, and note what we had for the perceptron. Let me just write it here and then we are almost done. For the perceptron we had w plus x or w minus x. And now, after all this complicated, not so easy mathematics, you get something very similar, but you have this factor in front of the x. It's y minus the sigmoid. I mean, how should you have come up with that yourself without theory, right? And that works much better. So you're also going in the direction of x, but with this factor here. And depending on the label, you will go in a positive or a negative direction. So it looks very similar to the perceptron, which is interesting, but with a twist. And the last thing, and then we are done. Here, we did it for all the samples, so you sum up the gradients for all the samples. It's a sum over all my input points. Here I just did it for one point. How you actually do it in practice, and I will not explain why, is that you divide your input into batches of equal size. They can't be fully equal, because maybe you have 117 inputs and your batch size is 20, then one batch will be smaller, but let's just assume. So here I have a batch, and I'm just summing this up for all things in my batch and dividing by the batch size. That's very easy. It's really very easy. I have this gradient thing here, and now I just take 20 of my movies, 20 of my objects, compute this thing for each of them, which is easy to compute, hard to prove, easy to compute, and take the average.
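The batch update described here can be sketched in a few lines of plain Python. This is only an illustration and not the master solution, which uses PyTorch; the function names, the tiny two-dimensional batch and the learning rate of 0.1 are assumptions made up for the example.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def dot(w, x):
    """Scalar product of two equally long lists of numbers."""
    return sum(wi * xi for wi, xi in zip(w, x))


def batch_gradient(w, batch):
    """Average of (y - sigmoid(w . x)) * x over all (x, y) pairs in the batch."""
    grad = [0.0] * len(w)
    for x, y in batch:
        factor = y - sigmoid(dot(w, x))  # the factor in front of x
        for j, xj in enumerate(x):
            grad[j] += factor * xj
    return [g / len(batch) for g in grad]


def update(w, batch, learning_rate):
    """One step in the direction of the gradient of the log likelihood."""
    grad = batch_gradient(w, batch)
    return [wj + learning_rate * gj for wj, gj in zip(w, grad)]


# Hypothetical toy batch: two 2-dimensional inputs with labels 1 and 0.
batch = [([1.0, 2.0], 1), ([2.0, -1.0], 0)]
w = update([0.0, 0.0], batch, learning_rate=0.1)
```

With all-zero weights the sigmoid is 0.5 for every input, so the factor is plus 0.5 for the positive example and minus 0.5 for the negative one, which is the perceptron-like behaviour with a twist.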
And then I just change w a bit in that direction. And experiment yourself with batch sizes. It's just a parameter in your code. Let me just summarize. There are three parameters now. The learning rate: how much do I go in the direction of the gradient. The batch size: how many do I look at at once before I make a correction of my w, 10, 20, 100. And how often do I go over the training set, the number of epochs. Those are the typical hyperparameters you have. And last word, it will be your task to implement that. The formula is simple, how do you implement that in linear algebra? You can do it in very few lines of code. It's very interesting to understand what you're doing, and also to get used a bit more to PyTorch. So that's it. Any questions about this? Please do the exercise sheet. Please do the evaluation if you haven't done it. Thank you, see you next week, bye.

Welcome everybody to lecture 12, databases and information systems, the course that can also be taken as information retrieval this year. Our topic today is language models. It's the last lecture in this course with real content. First I will say something about the last sheet, which was about logistic regression. Then one slide about the exam registration, and then language models. So let me briefly say, it was a very ambitious plan which we had for the last three lectures, which culminates today, because we all know language models, ChatGPT and so on, and we wanted to at least explain to you how it works in principle. But it's not so easy, and doing this in three lectures, well, we tried. Today is a completely new lecture, so it was a lot of work for Sebastian and also for me. So there will be some errors, I'm sure, but we tried our best. So it's the last sheet and it's kind of a high point maybe. Let's see how it goes. So first about the experiences with, okay, here's already the first error, maybe you can count how many we have. With the last sheet, so here are some excerpts from what you wrote.
One more exercise sheet to go along with a set of absolutely great and well thought out sheets from this course. Thank you. So much of that is due to Sebastian's fantastic work. It's really a lot of work, and as I always say, the exercise sheets are the most important part. If you don't do them, you don't learn it. You learn by doing. That's just how it is. The lecture was fun. It's the one about logistic regression. My first contact with anything learning. So I think we have two kinds of people in this course: those who already took learning courses, who also found it interesting, and some for whom it was the first time. Fun and interesting to play around with the hyperparameters. Quite a few of you wrote something about that. Sometimes not as you expected it to be, nice sheet to learn. So you made very different experiences as to what happens if I learn for longer, for more epochs. For some people it became better, for some worse, for some it first became better, then worse. So that's the thing with hyperparameters, it's a complete topic on its own. You can tune the batch size, you can learn for longer, you can change the learning rate, you can even change it dynamically. And that makes a big difference. And then there is the interplay with maybe having a bug in your code. Let me say this one thing: maybe something is not quite right in your code, and then it still works, but just not as well as it could. That's what is not easy with this kind of work. Small alpha and batch size, but use enough epochs, was the conclusion by somebody else. A bit more implementation advice could have saved trouble. And one comment was, I think it was the only strongly negative comment: I dislike that the last sheets rely so much on PyTorch. We discussed this in our weekly tutor meeting. I don't think I agree because, I mean, when you use linear algebra, you don't want to do the matrix multiplication by yourself.
You have to use a library for this, and not only because it's more convenient, but because it's super inefficient if you do matrix multiplication with loops by yourself in Python. It would never finish. So you have to use some library, and we really only used PyTorch for the very simple stuff. As with every library there's so much else in it, but I think we separated quite nicely what you need from all the rest that you don't need. We will use a bit more today. So yes, the sheets relied on PyTorch, but really only for: multiply this matrix with this vector, give me these columns from the matrix, normalize the vector, and so on. So pretty basic stuff. I think that was doable, and we gave you a cheat sheet. And we will show the master solution for this exercise sheet because it will be our starting point for today. So these are two lectures in one, it's the same lecture. As of yesterday, 119 of you registered for databases and information systems, and that is this exam number in HISinOne, and 90 of you registered for information retrieval. So you had a choice. And we looked at the lists and did a quick check whether they are disjoint, and they are not disjoint at all. It was not possible in HISinOne to say either or, because that's a very unusual situation. So for some of you I don't know what exactly your motivation was. You maybe thought, better register for both, better safe than sorry. Sorry, you can't register for both. You can only register for one. So if you registered for both, actually, Sebastian, do we have the count? How many registered for both? We don't have it yet. It's also not important now. In case you registered for both, you have to make up your mind, obviously. You have to say, I want the ECTS points for this one or for this one.
And just to make that one clear again, it was allowed, as a one-time thing, that if you took information retrieval in the past, then you can also take databases and information systems now. Although there is some overlap with the previous course, that's okay. Of course you can't get the same points again for this course. So make up your mind, it has to be one of them. I think you will manage. So, on with the contents, let's see how we do with the time. So models, what is a model? First, what's the goal for today? Let's look back very briefly at how we started. In lecture 10, we started with all things linear algebra, right? Before that it was more classical approaches, which are also important. Standard databases will stay important, they will not die out because of the learning stuff, both are important. In lecture 10 we started very lightly with looking at things as vectors in linear algebra. In lecture 11, logistic regression was our first learning method, and we looked at the simplest kind of problem. Let me show that problem again. It was these movies, and we just had the movie text. It was movie plots, so what happens in the movie, not the Wikipedia description of the movie, so that it was a little bit harder. And then, just from what's happening in the movie, from the words, say: is it funny, is it a comedy or not. It's kind of the simplest kind of learning problem, just learn yes or no. Binary classification, two classes. And we used kind of the simplest method to do this, called logistic regression, and we will look at it again. It's very simple, very basic, but very important. Precisely because it's simple, it's important. And today, starting from this, and there's some logic to it, you will see it, we want to generalize this to learning almost anything.
And we will do something very nice which I show you in a second. Namely now. What we will do now, so what we kind of did in the last lecture, you implemented and in the exercise sheet logistic regression from scratch just using linear algebra and I will have some recap here. And now we will just implement it again, but in a more general framework. And the goal is, typical refactoring thing, we just do it again and the result is exactly the same as before. So we have achieved nothing, except now we are on a different level. It's now in a, and now in this new framework we can do much more powerful thing easily. So that's the goal for the first half, to do exactly the same, but still achieve something. And then we solve this more complex task which is something like GPT-like. Very simple of course, but it works. And so this is what we will do together, we implement logistic regression so that you learn how to do learning in PyTorch and then you will apply this to this more complex task and please do the exercise sheet. And it will be very hands on but also not naturally some background. So pay attention and do ask questions, use the fact that you are here. So let's lift what we did in the last lecture to the next level and try to understand this and please ask questions. It's a bit abstract but I think understandable. So we have a function and look how general it is. Any input, any output. But the thing is we don't know that function. And it's, for this example the function is given a movie plot, say whether it's funny or not. That's our function which we want to compute. And we know it for some movies, we don't know it for every movie. And the point of this first line is, as often is the case in mathematics, that function exists, but we don't know how to do it, right? This is what we want to learn. Okay, given the movie plot, yes, it's funny or not, but how do we find it out? And that's what a model does. 
A model is another function, and we call it m for model, which also takes the movie plot in this case as input, and something else, namely parameters, n of them. They don't have to be real numbers, but for the lecture today, let them just be real numbers. So we have the input, the movie plot, and n parameters, n real values, and then we say funny or not. And depending on which of these n values we put here, the function does this or that. And the goal is to find the w, to find the setting of the parameters such that the model does something good. So if I set the parameters in any way, I will get a function of the same kind, given a movie plot, say funny or not, and now I can compare it to the original function. And of course what I want is for it to be as similar as possible to the function which I don't know. And for that I need a loss function, which is kind of a similarity measure or a difference measure. It's called a loss function in this context. You always have to decide whether you want to maximize or minimize. You could also speak of a gain function, which you want to maximize. One speaks of a loss function, which you want to minimize. That is, here's the real thing, this movie is funny. My method said it's not funny. So now I have to say, okay, this costs you 20 euros or something. So I somehow have to measure the difference. For this problem it's easy. And now what we want to find is the setting of the parameters so that this model with this parameter setting is as close to my function as possible. And by close, that's another thing we already saw in the last lecture, we mean that we evaluate on some inputs, because we don't know all the values, right? We have our test set, some movies for which we know the labels, and we evaluated on that, or on some training set. And we just took the sum. This could be made even more general, but that's kind of the level of generality we will work with today.
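The framework just described, an unknown function known only through labeled examples, a model with parameters, and a loss summed over the examples, can be sketched as follows. Everything concrete here is a hypothetical toy chosen for illustration: the one-parameter threshold model, the made-up feature (a number of jokes in the plot), and the zero-one loss are assumptions, not the models or losses from the lecture.

```python
def total_loss(model, loss, w, examples):
    """Sum of the per-example losses of model(., w) on the labeled examples."""
    return sum(loss(model(x, w), y) for x, y in examples)


def toy_model(x, w):
    """Hypothetical model: predict 1 (funny) if the feature x reaches the threshold w[0]."""
    return 1 if x >= w[0] else 0


def zero_one_loss(prediction, label):
    """Loss 0 for a correct prediction, 1 for a wrong one."""
    return 0 if prediction == label else 1


# The unknown function is only known on these (input, label) pairs.
examples = [(0, 0), (1, 0), (3, 1), (5, 1)]

loss_a = total_loss(toy_model, zero_one_loss, (2.0,), examples)
loss_b = total_loss(toy_model, zero_one_loss, (4.0,), examples)
```

Comparing loss_a and loss_b is the whole idea in miniature: two parameter settings give two functions of the same kind, and the loss says which one is closer to the labels we know.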
So a little bit of notation here, but I think in the context of the concrete problem we already saw, it's understandable. But please do ask questions, because that's the basis for everything else. Is that clear enough? Any question? Yes, please. So the function f is like a black-box function where we only know inputs and outputs, and we try to replicate, for given inputs, the outputs to match that? Yes, exactly. And not only is it a black-box function, but we also don't know it. We don't have a formula for it. That's the point. And we are looking for a function for which we have a formula. Yes, that's exactly the point. And I think it will become clear with more examples. Thanks for the question. And we will start with logistic regression again. So now let's just put logistic regression into this framework. That's what we will do for the first half. It's just a recap from the last lecture. So, what kind of functions? For every model you have, you have to say, okay, this is a model for which kinds of functions. And logistic regression is a model for functions that take as input an n-dimensional vector. Let me always go back to this example. What we did here is we took a movie plot and turned it into an n-dimensional vector by just summing up the word embeddings, right, that's what we did. It was 300-dimensional. And the output is something in the interval from 0 to 1. There's a little detail here, maybe I will come back to it. The function we are given is actually just 0 or 1, but that's also not wrong. We could also imagine that it says 0.5. It would also work. And the model looks like this, that's what we did. So we get a movie as an n-dimensional vector and we get some parameters. Actually, it's debatable whether I should write plus one here because of this additional dimension. Let's forget this for now.
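As a small illustration of this recap, here is a plain-Python sketch of turning a document into a vector by summing word vectors and then applying the logistic regression model. The two-dimensional toy vectors, the function names, and the choice to skip words without a vector are assumptions for the example; the exercise used 300-dimensional word vectors and PyTorch, and the additional dimension for the constant one is left out here as in the text.

```python
import math


def document_vector(words, word_vectors):
    """Sum the vectors of all words that have one (unknown words are skipped: an assumption)."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word in words:
        vec = word_vectors.get(word)
        if vec is None:
            continue
        for j in range(dim):
            total[j] += vec[j]
    return total


def model(x, w):
    """Logistic regression: sigmoid of the dot product of parameters w and input x."""
    z = sum(wj * xj for wj, xj in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical 2-dimensional word vectors, only for illustration.
toy_vectors = {"funny": [1.0, 0.0], "clown": [0.5, 0.5], "murder": [-1.0, 1.0]}

x = document_vector(["funny", "clown", "unknownword"], toy_vectors)
p = model(x, [0.0, 0.0])
```

With all parameters set to zero the model answers 0.5 for every document, which is the "not meaningful yet" starting point before any learning.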
So we had these parameters here and the input vector, and we just took the dot product and then the sigmoid function, and this gives us something between zero and one, yes or no or something in between. That's what logistic regression did in this framework. We didn't have a loss function last lecture, but we had a likelihood, if you remember. We maximized the likelihood. This is what we maximized, and I will have it again on a separate slide. Maybe you remember this. The y is the real thing, the actual label, and the prediction is the sigma of the dot product of w and x, that's what we had. If you plug this in here for y prime, you get exactly the formula from the last lecture, and we wanted to maximize this, we called it the likelihood. Let's just put a minus here and call it a loss, and then we want to minimize it. To be precise, it was not the likelihood, it was the log likelihood. And we will see it again on another slide. So that's just what we did in the last lecture, right? We had this kind of model, these kinds of functions, and this was our loss: without the minus, this is what we want to maximize, with the minus, that's what we want to minimize, and that's what we did. Very briefly, this is a two-class classification. You get yes or no. You can also do a multinomial logistic regression. We will not do it today. And then the function is, and let me introduce this funny symbol here. I didn't know it myself until yesterday. If you have m classes, m things to decide between, then the output is a probability distribution, right? You could either say it's one of the m classes, but you could also say, okay, it's 20% this one, 40% this one, 40% this one. So what this really is, it's the set of all probability distributions over m things. Let me briefly dwell on this. Why then is it called m minus one? Well, it's called m minus one because probabilities have to sum to one, right?
So you really have only m minus one degrees of freedom. You can also understand this when you go back. Here you also have two outcomes, it's binary, yes and no, and if you want the probability distribution there, 20% yes, 80% no, it's two numbers but just one degree of freedom, right? If you say the probability for yes is 20%, the probability for no is just one minus that. So that's why, if you want the probability distribution, you have m minus one things which you want to compute, and the m-th one is just one minus the sum of the rest. And this is called the standard simplex, by the way. I was looking for notation and this is the best I could find. It's called a simplex because, if you imagine it as a geometric thing, it's like a pyramid, a triangular pyramid or something like this. Standard logistic regression is a special case of this, and this is just for information. So what is this for m equals two? If you have two classes, then this is a probability distribution over two things, and then essentially just one probability is enough. I don't think it's correct to write equal here, but it is more or less that. You have two numbers between zero and one which sum to one. We don't have time for multinomial logistic regression, I would love to do it. We could do it in 15 minutes given what I will show you next, and it's actually not hard. You might want to do it yourself if you like, with the techniques I will now show you. Okay, and now, we don't do this yet, it's for the second half: what's a language model? In this general framework we just have to say what's our input and what's our output. Now the input is n things from our vocabulary. Let's just call them words for now. Later we will also see it doesn't have to be words, but for the first part I want to call them words.
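The point about degrees of freedom made above, that m minus one probabilities determine the m-th one, can be checked with a tiny plain-Python sketch. The function names and the tolerance are assumptions for illustration only.

```python
def complete_distribution(free_probabilities):
    """Given m - 1 probabilities, append the m-th one: one minus the sum of the rest."""
    last = 1.0 - sum(free_probabilities)
    return list(free_probabilities) + [last]


def is_distribution(p, tolerance=1e-9):
    """True if all entries are non-negative and they sum to one (up to rounding)."""
    return all(v >= -tolerance for v in p) and abs(sum(p) - 1.0) <= tolerance


# Binary case (m = 2): one number, the probability for yes, is enough.
binary = complete_distribution([0.2])

# Three classes (m = 3): two free numbers, the third one follows.
three = complete_distribution([0.2, 0.3])
```

For m equals two this is exactly the situation of standard logistic regression: the model outputs one number, and the probability of the other class is one minus that number.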
I have n things and now I want to predict the next thing, and this is again a probability distribution over the things I have, for example the words. So this is just: given n words, compute the probability distribution over the next word. And understand, just like in logistic regression, why is it not the set V here? Why not just predict the next word? Well, there could be several words that are possible, right? Given a sequence of five words, you might want to say, okay, the truth might be 20% of the time this word, 30% of the time this word, 50% of the time this word. So it's really a probability distribution. Let me just look around. 20% of the people are busy with their smartphones or other devices. I would pay attention if I were you. So that's what we will see in the last part. And I will show you how to do this. And ChatGPT, which you all know, is based on exactly something like this, right? If you use ChatGPT, you know that things come word by word, and it's exactly for that reason. It will always just predict the next word. And it's quite amazing that something intelligent comes out of this. Whether it's intelligent or not is a matter of debate. So let's briefly go back to our definition here. We have this model, we have these parameters, and the goal is to find the best parameter setting so that we have something which we can evaluate and which approximates what we are interested in. How do we find that? Let's also generalize what we did for logistic regression last time. That's now the more general routine, which in principle works for any model. So you start with some setting. I'm looking for the best setting so that my model function does what it should. Let's start with a random one. I think for the exercise sheet we will do something random, but sometimes it also just works to start with all-zero weights. It's not meaningful yet. Now you have some examples like here, right?
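To make the idea of predicting a probability distribution over the next word, and then generating word by word, a bit more concrete, here is a deliberately tiny sketch. The lookup table stands in for a learned model and is entirely made up: the words, the probabilities, the context length of one word and the special "end" token are assumptions for illustration, not how the model in the lecture or ChatGPT works internally.

```python
import random

# Hypothetical hand-written "model": context (last word) -> distribution over the next word.
TOY_MODEL = {
    ("the",): {"movie": 0.5, "plot": 0.3, "end": 0.2},
    ("movie",): {"is": 0.7, "end": 0.3},
    ("plot",): {"is": 1.0},
    ("is",): {"funny": 0.6, "the": 0.4},
    ("funny",): {"end": 1.0},
}


def next_word_distribution(context, n=1):
    """Probability distribution over the next word, given the last n words."""
    key = tuple(context[-n:])
    return TOY_MODEL.get(key, {"end": 1.0})


def generate(start, max_words, rng):
    """Repeatedly sample the next word from the predicted distribution."""
    words = list(start)
    for _ in range(max_words):
        dist = next_word_distribution(words)
        candidates = list(dist.keys())
        weights = list(dist.values())
        word = rng.choices(candidates, weights=weights, k=1)[0]
        if word == "end":
            break
        words.append(word)
    return words


text = generate(["the"], max_words=10, rng=random.Random(0))
```

Because the output is a distribution and not a single word, running the generation with a different random seed can give a different continuation for the same start.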
I have all kinds of movie plots and I have the right answers. Divide them into batches and process the batches in random order. You always do that, for reasons. We briefly talked about it in the last lecture. And now for each batch you compute this gradient. That's also what we did in the last lecture. In the last lecture, and we will see it again on one of the next slides, we actually computed the gradient of the likelihood. Now we will compute the gradient of the loss. And one thing here, there is a slight notation issue. In lecture 11, I wrote the gradient like this, I think, using this partial symbol. I have to write this nicer. It's just a notation thing, but just so that you aren't confused. Did I use a different color now? So let me try again, like this. And this is what you use for partial derivatives, right? Let me just briefly explain this. It was of course correct to use this. I have different directions now, because I'm in a higher-dimensional space. So when I'm in the direction of this weight: if I change this weight a little bit, what will happen to my likelihood? Will it become more or less, and by how much? This is what the partial derivatives give me, and if I put all the partial derivatives together in one vector, this gives me the gradient. So if I go in this direction with all my weights, then my likelihood will increase the most. And this thing is written with a nabla. So nabla is just what you get if you put all of them in a vector. It might look frightening to some of you, but it's just notation. And we have n of them here. You just put all the partial derivatives in one vector, and that's written with a nabla. It's the nabla operator.
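Written out, the notation explained here is just the following, with n parameters as above and L standing for whatever is being optimized (the log likelihood or the loss):

```latex
\[
\nabla_w L(w) = \left( \frac{\partial L}{\partial w_1},\; \frac{\partial L}{\partial w_2},\; \dots,\; \frac{\partial L}{\partial w_n} \right)
\]
```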
Okay, so it's the gradient. We call it the gradient. Okay. We will see on the next slide, one of the next slides, actually we didn't compute the gradient of the loss but of the likelihood, of the log likelihood, but it's the same, one is just minus the other. And then we went in the direction of the gradient which means we did plus gradient but actually if you take the loss which is the minus you take minus gradient. So that's what you do and we will do this again, recap this again a little bit, you compute, okay. And let's try to understand in loss terms again what we do. What we do here is say, okay, I want to minimize my loss, I'm at a certain parameter setting and now what this gradient tells me, okay, if you go in that direction you are getting a little bit lower in your loss. So just go in that direction. And some landscape, I drew a picture. So walk a little bit in that direction and there it will go downhill with the loss. It's not clear that you will reach the absolute minimum there, but at least it goes downhill, right? In this landscape you want to find the lowest loss, so go in the direction where it goes downhill the steepest. And then you have to say how much do I go in that direction before I evaluate again, and that's the learning rate, right? I mean the gradient gives you a direction, now the question is how much do I walk in that direction. So that's the general procedure. And we already also introduced some notation terminology here. You divide into batches for efficiency reasons but also for several reasons. One reason is efficiency. You could just evaluate it one at a time. Just look at the next example, change a little bit, that's not efficient. Or look at all the training samples together and make a change that's also not efficient because the matrices are huge. That's why I do it in batches. There are also other reasons. I won't talk about them today, stochastic reasons. 
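The general procedure from this part, starting from some parameter setting, splitting the examples into batches, processing the batches in random order, and stepping against the gradient of the loss, repeated for several epochs, can be sketched in plain Python. The function and parameter names are assumptions, and the toy problem (finding the single number w that minimizes the average squared distance to the examples) is made up so that the sketch is self-contained; it is not the logistic regression code from the exercise.

```python
import random


def train(examples, w, gradient_of_loss, learning_rate, batch_size, num_epochs, rng):
    """Generic mini-batch gradient descent: step against the gradient of the loss."""
    batches = [examples[i:i + batch_size] for i in range(0, len(examples), batch_size)]
    for _ in range(num_epochs):
        order = list(range(len(batches)))
        rng.shuffle(order)  # process the batches in random order
        for b in order:
            grad = gradient_of_loss(w, batches[b])
            w = [wj - learning_rate * gj for wj, gj in zip(w, grad)]  # minus: we minimize
    return w


def toy_gradient(w, batch):
    """Gradient of the average of (w[0] - y)^2 over the batch (toy loss, an assumption)."""
    mean_y = sum(batch) / len(batch)
    return [2.0 * (w[0] - mean_y)]


w_final = train(
    examples=[1.0, 2.0, 3.0, 4.0],
    w=[0.0],
    gradient_of_loss=toy_gradient,
    learning_rate=0.1,
    batch_size=2,
    num_epochs=50,
    rng=random.Random(0),
)
```

The result ends up near the mean of the examples but keeps wobbling a little, because each batch pulls w towards its own mean; how much it wobbles depends on the learning rate and the batch size, which is why these are worth experimenting with.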
And then once you have done this for all your training data, please also understand this again. I go through this data, let's just look at how many there are. It's also important to understand. So here there are 50,000 of them. Now you could say, okay, I've seen them once, now I'm done. But you can just do it again, because now you are at a different parameter setting, right? You can do it again and again and again, and that's what you also do. Doing it once is called an epoch, and you can repeat that any number of times. That's what we did. And when does this generic approach work? It works as long as we know how to compute this gradient of the loss, which we did for logistic regression. And now let's do that again, recap, and let's also implement this, but now a little bit differently. So exercise sheet 11, for those who did it, this is what you did. You turned the documents into vectors, and maybe I should write documents here. And let me say it again because it's so important. For each document here, we gave you word vectors, 300-dimensional vectors for each word, you just summed them up, and now you have a 300-dimensional vector for this movie plot, which is exactly the kind of input we want, right? So now for each movie plot, we have a 300-dimensional vector, and we wrote that code for you by the way, so you didn't have to do it. Then we split it into batches, that's what we did. And for each batch we computed this gradient of the log likelihood, according to a formula which we derived in the last part of the lecture, and then we updated the parameters using the gradient. And there is something PyTorch lets you do which we will use today, and I think now is a good time to go to the code. Let's just look at the code, it's the code from the master solution, and let me, okay. Dark mode, not bad, but I don't know why. How do I get the color scheme now to... I cannot find it, why is it not showing me the...
Oh my, now I have to choose the color scheme live. So which one do you want? I don't know, that's hard now. Let's see which one is the one we had. You can try set background. Where? If you do colon set space background equals light, I think that's the command. Okay, but shouldn't I just choose a color scheme? Maybe default, because I just want to use the one from before. Okay, so we have something to do for the next one. What was our color scheme? This is stupid. elflord. No, we don't want elflord. I'm sorry, but we have to do this. I think a dark screen does not work well with the dark blue. What do you think? No? Well, it's fun. industry, morning, we already had morning. peachpuff, this sounds bad. No, it's not good. pablo, murphy. Oh my. I'm so sorry, but it's only two minutes. Do you have any idea? I really don't know. We tried desert, right? That was elflord, evening. No, not evening, I'm so sorry. You can try it yourself if you're sitting in front of a machine. There are not that many. I really don't know why it doesn't work. That's not nice, right? If you now do the set background, it will change the green into something more readable. Do you think so? Background into? Background equals light. And then you just set the scheme to default, I think it works. Okay. Yeah, no, it was worth a try. I'm really sorry. Do you have other plans for today? No, right? You should try out retrobox. Which one? retrobox. Sounds good. I don't know why it's not working. Ah, I have an idea, but maybe it's the wrong idea. Let me try it. If something changed, you always have to ask yourself what changed.
And what changed is, I think, that I wrote something in my, where's my Neovim config? Here, and here I should have an init.vim. And yeah, here I set the color scheme. Let me maybe not set the color scheme. That looks like what we had, right? So I think I solved it. Sorry. So let's just continue after this short break and look at the master solution code. I hope you're still with me for logistic regression. It is 400 something lines, but a lot of it is boilerplate code. Let's briefly get an overview, because we will work with this for the rest of the lecture, it's important. And we gave a lot of it to you. So this is tokenization. You have to break the text into words. We gave that code to you. Compute the vocabulary, so the set of distinct words, we also did that for you, I think. Then you have to read the data, with some unit tests here, we did that for you. And here we have something that, I think, was not supposed to be in there, doesn't matter. And now the logistic regression. These were the weights. If we have an n-dimensional vector, we take one more weight for this additional dimension, the one. And then we just set everything to zero, so we started with zero. This is for adding the one to the input vector, that's what we did here, just add one. And then we have the training. In the training we had several epochs, I just explained that. We divided the data into batches, that's what we do here. Here I just get the indices of the batch. I have all my data here in a tensor, a 2D tensor, so a matrix. So I just say, okay, give me these column indices, and this just says, okay, select these columns. So now I have a submatrix, I call it XB. That's now a batch of documents. B stands for batch, and these are the labels.
So the things in column one here. Let's maybe go back to this in a second. Let's go back to the slides first. Okay, one point I wanted to make here: what do I mean when I say from scratch in the lecture? We did everything ourselves, right? Epochs, we have a for loop here. And this is exactly what our model does, right? Take the W and multiply it with the documents. Here it's a batch of documents, so it's a matrix multiplication. It simultaneously multiplies the weights with each document. That's the point of linear algebra, right? Then you compute the gradient, this is just the log likelihood formula here, and now you update your weights: plus learning rate times gradient, right? So we did everything ourselves, just like we developed it in the lecture. It's not a lot of code, that's always nice, and it worked. Let me maybe show that before we continue so that we see it once. Let me just run this for one epoch. It takes some time, it's going over the training data once. As we have seen, that's 50,000 documents. And okay, now it computes. Let's look at these numbers, because whatever we do now, we want the same numbers. So remember them. So this works. And let's also check whether the unit tests work. Yes, they also work. And now we want to rewrite this. Let me say again where the hard part was. Here I'm just taking a batch of things, this will remain. But this here is where we put in the work, right? We said, okay, evaluate this model function, and then we had to compute the gradient on paper. And now the decisive change: we will let PyTorch figure out this gradient. We will not compute it ourselves. And let's see how we do that.
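As a reminder of what "from scratch" means here, the following is a minimal pure-Python sketch of the same loop: epochs, batches, model output, hand-derived gradient of the log likelihood, and the update "weights plus learning rate times gradient". It is not the master solution; the function names, the toy data, and the hyperparameters are assumptions for illustration, and plain lists stand in for tensors.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_from_scratch(X, y, num_epochs, batch_size, learning_rate):
    # X: list of input vectors (lists of floats), y: list of 0/1 labels.
    n = len(X[0])
    w = [0.0] * (n + 1)  # one extra weight for the constant-1 input (the bias)
    for _ in range(num_epochs):
        for start in range(0, len(X), batch_size):
            # Take the next batch and add the constant 1 to every input vector.
            xb = [x + [1.0] for x in X[start:start + batch_size]]
            yb = y[start:start + batch_size]
            # Model output sigmoid(w . x) for every document in the batch.
            pb = [sigmoid(sum(wj * xj for wj, xj in zip(w, x))) for x in xb]
            # Hand-derived gradient of the log likelihood: sum_i (y_i - p_i) * x_i.
            grad = [sum((yi - pi) * x[j] for x, yi, pi in zip(xb, yb, pb))
                    for j in range(n + 1)]
            # We maximize the log likelihood, so we go in the direction of plus gradient.
            w = [wj + learning_rate * gj for wj, gj in zip(w, grad)]
    return w


def predict(w, x):
    # Class 1 if the predicted probability is at least 0.5.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x + [1.0]))) >= 0.5


# Toy data (an assumption): one feature, label 1 for the larger values.
X_toy = [[0.0], [1.0], [2.0], [3.0]]
y_toy = [0, 0, 1, 1]
w_toy = train_from_scratch(X_toy, y_toy, num_epochs=2000, batch_size=2, learning_rate=0.1)
predictions = [predict(w_toy, x) for x in X_toy]
print(predictions)
```

The line that computes `grad` is the part that had to be derived on paper. Everything else (looping, batching, updating) stays the same no matter which model function is used.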
So this was our model function, and now in PyTorch it has the bias explicitly again, and I have a slide where I explain that. In PyTorch I will now do the following. Let me first write a comment: model function for logistic regression in PyTorch. And let's call the class logistic regression model. It's a subclass of torch.nn.Module, that's just how it's called. And now I need two steps. The first is: what are my parameters? This I put in the constructor. So let's write the constructor, def init, and I have to say how many features I get. Num features, or let me just call it n here. So first I have to call the constructor of the base class, whatever PyTorch does there, don't forget that. And now this is just the W. It's just a vector of n dimensions. Why is it called Linear(n, 1)? It's something which you can multiply with an n-dimensional vector, that's what the n here says, and then you get one value, just a scalar. If you wrote two here, it's something you can multiply with an n-dimensional vector and then you get two values, so then it would be a two by n matrix. So that's what we do here. We have something from PyTorch which is just an n-dimensional vector. And now we want to initialize it with zeros. So it's very similar, and we also want to have the bias. That's a little difference to before, and I have a slide on why it's not a step back. Didn't I say it's nicer if the bias is part of the weight vector? Well, in PyTorch it's not, because then we don't have to add ones or something like that to the input. So we just have this additional line here. And I have more on the slide. So now I have said, okay, these are my parameters. And now I have to specify my function, right.
These are just the parameters, or the components from which I build my function. Let me maybe write that here as a comment: parameters of our model function and how we initialize them. Okay. And here: the implementation of our model function, based on the parameters in init. Let me just be verbose here. And let me bring this to the top. Okay, and now that's very easy. We basically do the same thing as before. Now we just use torch functions, and that's important. As I said, we are doing something here and achieving nothing, because the goal is to have the same functionality as before. So let's just do this here. Let's return torch.sigmoid of linear of X. Linear of X will just compute the dot product of these weights W with x, plus the bias. That's what it will do. And let me add this flatten here, I have a slide explaining it. So modulo details, this here is exactly, if I scroll down a bit, this here, right, just in a few more lines and using PyTorch. It's exactly for evaluating the model. And by the way, the initialization we also had somewhere up here. Yes, here we had our initialization. We just put everything in the weights. That's the plus one, it's the bias term, and then we initialized it to zero. So basically we just rewrote that part using PyTorch. A few explanations on this one. Why the bias again? Didn't we say it's easier without the bias? Well, the point was, if you think about the last lecture, it's much easier to do the math without the bias, because without the bias you don't have to write W and B everywhere, it's just W. If you do it in code, you actually have to add the ones to the input vectors. Let me briefly go back to the code. We have a function for this, add bias, which we now don't need because PyTorch will just take care of it. So PyTorch will actually compute W times X plus B here, where the learned B plays the role of our minus B; the sign is just a convention.
So I don't need to add anything to my vectors, which is nice, and I also don't have to change them. Why the flatten? We have already seen that too. I hope you don't lose orientation. What is this doing? That's a PyTorch specific thing. This is now computing the dot product of W with X plus B. And what does this give me? You would expect that it gives you a scalar, or rather, if you do it for B documents, a vector of B values. But it actually gives you a matrix with B rows and just one column. I think I have it here on the slide. So this component here, which just does the dot product, gives you a matrix with just one column vector. We already had this in the last lecture, and I just want the column vector itself. So that's just a PyTorch thing. You get a good error message whenever you have that problem. I think several of you encountered this: it tells you the dimensions do not match, and that's when you use flatten. So you have a matrix with just one vector as a row or a column, and you just want the vector, because the next function wants a vector and not a matrix with one vector. Then just say flatten. Flatten also has more arguments. For example, if you have a whole matrix of vectors, you can also turn it into one vector by concatenating them and stuff like this. Flatten can also do that, but here it's the simplest case of flatten, just extract the vector. Okay, so that's really the same thing as before, just with a little more code, but not even much more code. And the nice thing is we have it in one class. Is there any question about this? That's just the same thing written in PyTorch. And maybe one more aspect here. Here we get a tensor as input. I mean, of course it's all tensors, but it's not just an individual input, that's why it's capital X, it's a batch of inputs. Right, so this framework is always for processing batches.
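Putting the pieces described so far together, the model class could look roughly like the sketch below. This is a reconstruction from the spoken description, not the code shown in the lecture: the class name, the attribute name `linear`, and the argument name `n` are assumptions. The calls used (`torch.nn.Module`, `torch.nn.Linear`, `torch.nn.init.zeros_`, `torch.sigmoid`, `Tensor.flatten`) are standard PyTorch.

```python
import torch


class LogisticRegressionModel(torch.nn.Module):
    """Model function for logistic regression, written with PyTorch components."""

    def __init__(self, n):
        # Do not forget to call the constructor of the base class.
        super().__init__()
        # Parameters of the model function: weights W (shape 1 x n) and bias b (shape 1).
        self.linear = torch.nn.Linear(n, 1)
        # Initialize everything with zeros, like in the from-scratch version.
        torch.nn.init.zeros_(self.linear.weight)
        torch.nn.init.zeros_(self.linear.bias)

    def forward(self, X):
        # X is a batch of inputs with shape (B, n).
        # self.linear(X) has shape (B, 1); flatten turns it into a vector of B values.
        return torch.sigmoid(self.linear(X)).flatten()


model = LogisticRegressionModel(n=300)
X = torch.randn(8, 300)            # a batch of 8 made-up 300-dimensional inputs
raw = model.linear(X)              # matrix with 8 rows and one column
probs = model(X)                   # calling the model runs forward on the batch
print(raw.shape, probs.shape)
```

With all parameters at zero, every output is sigmoid(0) = 0.5, which is the same starting point as in the from-scratch code.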
So this will compute, with one operation, the sigmoid of the dot product for B things at the same time. So the output will be B probabilities. And now we want to compute the gradient. How do we do that in PyTorch? Well, let's go to the code and let us be bold. I'm deleting this now, and I will upload the result so you can use it, don't worry. So now I say, okay, let's use my model and initialize the weights here. Here I call it num features, that's just the dimensionality of my input, 300 for the embeddings. Add bias, I don't need it. I can delete it, I hope. And now let's go down here. Okay, now I have to say that I want to compute gradients. And the learning rate, I think I can just give it as the second parameter. Let's just take this for granted for now, I will explain it in a second. This object, and I call it like this, will take care of the gradient step for me. The point is, I'm doing this step by step to show you how we go from doing it from scratch to doing it with PyTorch, and that it is actually very similar. So this is what we did before, and I want to comment it out now. And this we just do as before: going over several epochs, splitting into batches. And now I do the following. Note that I put the flatten back into the model. You can decide whether you want to have it here or in the model, it's just a detail. So now I'm computing the outputs of my model. I'm just applying the model, this is nothing special. I'm sorry for jumping around a bit. And what happens if I call it like this: it will just execute the forward function. That's why you have to call this function forward. So it will just call this on my batch of things now. Very easy, right? So this here is now the output of my model for each document in the batch. What's the next one?
I'm sorry. Now I'm computing the loss. Here it says binary cross entropy. Don't worry, I have a slide on that. So it's the loss between the outputs of my model and my ground truth labels, so what it says in my file, one or zero. That's the loss function. And I'm claiming this is minus the log likelihood from the last lecture, and I have a slide on why. And now I want to compute the gradient of the loss. I mean, this is just one line, but we worked really hard in the last lecture to find that formula. Let me just do it this way: loss backward. This computes the gradient, yeah? Compute the gradient of the loss. And above here I specified the method for using it; it's called stochastic gradient descent, it will be on the slides in a second. And now I say, update the parameters. I will explain that in a second in more detail. But that's doing exactly the same as here. Let's go through it again. I'm computing the current output values. I'm computing the loss, the difference to how it should actually be; why cross entropy, I will explain. Now I'm computing the gradient; why is it called backward? We will see in a second. Now, with this gradient I computed, I update the weights, so step just updates the parameters. And how does it know how? Here I said use this way of updating, and this is the learning rate. So this will effectively do W equals W minus learning rate times the gradient of the loss. And we can check that in a second by seeing whether we get the same output. And then, what PyTorch actually does internally is that it adds each new gradient to the stored gradients. That's a technical thing, so here you have to zero the gradients again. There is just a variable holding the gradient for each parameter.
And if I don't zero them, and then in the next step it will add to those, so this is a technical thing. Because in some settings you want that, you actually want several steps and add up. Here I don't want that. So I think, I mean some things are cryptic here. Let's have some explanations. Okay let's first verify that. We can run it now. If I didn't make a mistake, I can run the code just the same and I don't know whether it will work. I mean we changed a lot. I deleted stuff, add bias. I'm, hmm, it's going for one. Oh, it's no, actually where do I still have add bias? I'm still using it apparently, but I'm not sure. Predict, oh my, ah. Oh the prediction, here's the prediction. I see. XB, I don't need that, right? Okay, when I do the evaluation I also have to use my model. So here I think I'm just using, yeah, I think I'm just bold here. And I'm calling my model here. And the outputs of my model are, I think the outputs are already sigmoid things, right? So I have to put a 0.5 here. I just want to decide funny or not. So the model computes the sigmoid. If it's greater equals 0.5, it's a yes, otherwise it's a no. Let's try it again. Now you don't see the previous results but maybe you remember them. Suspense, suspense. I think it looks the same, right? Doesn't it? 86, yeah that was for 10 epochs, that was 86, 65, 24, 38, it's exactly the same, which is quite amazing. I hope you can appreciate that because, I mean, it's only a few, but here, this is the formula we did in the last lecture and now we just used stuff from PyTorch, right? And it did exactly the same thing. And how did it do that? And I will now explain. And then we have a break. Three more slides, then we have a break. So it does do exactly the same. So if it's not clear now, at least you should appreciate it later or when you do this sheet, this was a major thing now. Now we used PyTorch, we let PyTorch compute the gradient. I will explain in a second how it did it and the result is exactly the same. 
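The claim that PyTorch "did exactly the same thing" can be checked on a single batch. The sketch below compares the gradient that `loss.backward()` computes with the hand-derived formula from the last lecture, and checks that `optimizer.step()` performs W = W minus learning rate times gradient. The batch, the dimensions, and the use of `torch.nn.Sequential` are assumptions for illustration and not the lecture code. `reduction="sum"` is chosen so that the loss is a sum over the batch, like the log likelihood; PyTorch's default would be the mean.

```python
import torch

torch.manual_seed(0)
X = torch.randn(16, 5)                    # made-up batch: 16 inputs with 5 features
y = (torch.rand(16) < 0.5).float()        # made-up 0/1 labels
learning_rate = 0.1

# Logistic regression as a composition of two PyTorch components.
model = torch.nn.Sequential(torch.nn.Linear(5, 1), torch.nn.Sigmoid())
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

# One training step, in the order described in the lecture.
probs = model(X).flatten()                                   # model outputs
loss = torch.nn.functional.binary_cross_entropy(probs, y, reduction="sum")
optimizer.zero_grad()                                        # do not accumulate old gradients
loss.backward()                                              # PyTorch computes the gradient

# Hand-derived gradient of minus the log likelihood: -sum_i (y_i - p_i) * x_i.
residual = y - probs.detach()
manual_grad_w = -(residual @ X)
manual_grad_b = -residual.sum()
autograd_grad_w = model[0].weight.grad.flatten().clone()
autograd_grad_b = model[0].bias.grad.flatten().clone()

w_before = model[0].weight.detach().clone()
optimizer.step()                                             # W = W - learning_rate * gradient
w_after = model[0].weight.detach().clone()

print(manual_grad_w)
print(autograd_grad_w)
```

The two printed gradients should agree up to floating-point rounding.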
That's quite amazing. So in the last lecture and on the sheet, we worked hard to derive this formula, computed the gradient, and used it to update the W. The pen is not blue, it can't go on like this, terrible. So on the previous slide it says cross entropy. What's cross entropy, and why does it give the same result? The cross entropy is just a function that compares two probability distributions. Forget about the term, you should understand that this makes sense, right? Let me just show you the documents again. The label gives me a probability distribution for this document. It's an extreme one, right? It says 100% funny. Actually, my input could also tell me, I'm not sure about this movie, or it could say it's 80% funny or something like this. Our training examples could also give probability distributions, but they are just zero or one. So this just says 100% funny, 0% not funny, and now my prediction says 80% funny. So what's the difference between these two? What's the difference between my prediction and the ground truth? And we saw earlier that more general models may have probability distributions over more things. So that's a very common thing, to measure the difference between what I predict and the truth as a difference between two distributions. And now I have this formula, which looks strange, right? You're measuring the difference between two things, so you would expect something like the sum of the absolute differences. And this is not even symmetric: it's minus the sum over i of P i times log Q i. And I think the log here is log base two. I would love to explain cross entropy, because entropy and cross entropy are just beautiful concepts, and there used to be a lecture on this in the information retrieval course, which we had to take out for now for reasons of time.
It would take me 15 minutes, no time for that. We need the time for other things. That lecture from the information retrieval course, for those of you who know it, is about compression. I have symbols, I want to encode them using bits. How many bits do I need on average? That's what entropy measures. And what cross entropy measures, let me say it in one sentence, although it may be hard to understand: you're using an encoding of symbols into bits that is optimal for one probability distribution, and you use it to encode symbols which you get according to another probability distribution. This is what cross entropy measures, and it gives you some number. The only thing you need to understand here is: that's the formula, and lower is better. When the two distributions are the same, it is the lowest. That's not easy to see from the formula. But what is very easy to see is what happens if you plug this in. So this is the prediction from logistic regression, this sigmoid, that's what we did for logistic regression, and the y i. Let's just plug it in without understanding why this formula makes sense. This should be i here. I'm sorry, I see what the problem here is. I'm sorry for the slide. I want to keep the annotations, just a second. The confusion here is: this i here ranges over the items of my probability distribution, and this i is an index over my samples. So let me just do it for one sample here and then try again.
I hope it will become clear when I write the formula. So here I have a true label and a prediction. The slide says y times log two of sigma of w times x, and that's not quite right, as I realize now. There is a little bit of confusion here. When the ground truth label is y, we actually want a probability distribution, which is y and one minus y, right? This I explained earlier. So what you want here is not a single value, you want a probability distribution. And for the prediction you also want a probability distribution. How do I write this? Give me a second. I hope it's clear what I'm saying. For logistic regression I have a probability distribution over two things, and in the last lecture I just wrote the one probability. So let me be patient and do it again. The ground truth is actually the distribution y, one minus y, and the prediction is the distribution sigma, one minus sigma. So if I turn both into probability distributions, it's like this. And now I just plug them into the formula and write the terms on top of each other: minus, in parentheses, y times log two of sigma, plus one minus y times log two of one minus sigma. Last time we had ln, here we have log two, because that's just in the definition of cross entropy. Let me correct that on the slide. And it's just the same thing. I mean, that's just what we had in the last lecture, right? Except for the minus. We just derived it differently, by computing a likelihood and then taking the logarithm, but it just happens to be the cross entropy. So sorry for the notational confusion here.
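The relationship derived above can be checked numerically. The sketch below computes the cross entropy between the label distribution (y, 1 - y) and the predicted distribution (sigma, 1 - sigma) with log base two, and compares it with minus the log likelihood with the natural logarithm; the two differ exactly by the constant factor ln 2. The function names and the example numbers (the "80% funny" prediction) are for illustration only. As a side note, PyTorch's binary cross entropy uses the natural logarithm.

```python
import math


def cross_entropy(p, q, log=math.log2):
    # H(p, q) = -sum_i p_i * log(q_i); terms with p_i = 0 contribute nothing.
    return -sum(pi * log(qi) for pi, qi in zip(p, q) if pi > 0)


def neg_log_likelihood(y, sigma):
    # Minus the log likelihood of one sample: -(y ln(sigma) + (1 - y) ln(1 - sigma)).
    return -(y * math.log(sigma) + (1 - y) * math.log(1 - sigma))


# Label says "100% funny", prediction says "80% funny".
y, sigma = 1, 0.8
h_bits = cross_entropy([y, 1 - y], [sigma, 1 - sigma])    # cross entropy in bits
nll = neg_log_likelihood(y, sigma)                        # in natural-log units
print(h_bits, nll, h_bits * math.log(2))

# Lower is better: a more confident correct prediction has lower cross entropy.
h_better = cross_entropy([1, 0], [0.9, 0.1])

# For a fixed p, the cross entropy is lowest when q equals p.
p = [0.8, 0.2]
h_same = cross_entropy(p, p)
h_other = cross_entropy(p, [0.5, 0.5])
```

Since ln 2 is a positive constant, minimizing one quantity is the same as minimizing the other.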
Now I touched my microphone. So that's the reason why this binary cross entropy here just computes a formula that's minus the log likelihood, modulo ln versus log two, but optimizing the one is like optimizing the other. So what's the advantage of our more generic code? That's the last thing before the break, and then comes the last part. As I said, before we hard coded the gradient. And now look what we did. Appreciate this please. Here we define our model. This is just a model function. We just say, use this function to evaluate it. So here we can do other stuff now if we want a different model function. But look at this code here, which we have now, and let me just delete the old code. This is absolutely generic now, right? That's the point. I'm evaluating my model function. I say how I want to compute the loss, and this here PyTorch does for me, it computes the gradients. So I can now define any model function I want and use the same code here. That's the thing: you can define a much more complex model function. I mean, if I now gave you a much more complex model function and said, now please compute the gradients by hand, well, that's a lot of work. And you could also use a different loss function if you like, I will say more about this in the second part. And now, before we go into the break, let me at least give you a hint, because that's an interesting question. You don't need to understand it for the exercise sheet, but it's interesting: how does PyTorch figure out how to compute the gradient here? I mean, I'm not telling it how, so how does it know? And why on earth is it called backward? One slide: how does PyTorch manage this? It's beyond the scope of this lecture, but I think it's worth a hint. What you do in this model forward method here, and we will see a more extreme example soon, is just putting a function together from components.
I'm saying okay this linear function which was W times X minus the B and then apply sigmoid to it. I could do much more complicated stuff here. What this effectively does is function composition. We will see in this second part of the lecture we will compose a lot of functions. So we are writing our model function as something, yeah, k different functions and then we just apply them to each other like this. That's what you always do. And these are simpler functions like functions which PyTorch knows like this linear function, the sigmoid functions. And by the way each of these functions has parameters, some of the parameters of the whole thing. The parameters of these functions in total give the parameters of the whole model. And then on top I compute the loss. The loss is just another function which adds to this composition. So now I have loss of this here minus this. So I have one big function which is the composition of a lot of functions. Now if you're computing the gradients or derivatives of a composition of a lot of functions, you can compute this from the derivatives of the individual functions via the chain rule, right? You know the chain rule, I think it's school stuff. And PyTorch just knows all the derivatives. If you just use PyTorch components, it knows the derivatives of this function you use, of this, of this, and now you compose them together, it remembers how you composed them, and then knows how to derive, compute partial derivatives of this big thing, and from that compute the gradient. And the algorithm which does this, we will not talk about it here, is called back propagation. It's a strange name, it's an algorithmic name, because it's like the algorithm which does it if you implement a chain rule for complex, but that's not important. That's why it's called backwards. It's named after the algorithm to compute, to apply the chain rule here. 
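To illustrate how the chain rule turns derivatives of the individual components into the gradient of the whole composition, here is a minimal hand-written sketch for the smallest possible case: one weight, one bias, one input, and the composition loss(sigmoid(linear(x))). This is not how PyTorch is implemented internally; it only shows the principle, with the derivatives multiplied from the loss backwards to the parameters. All names and numbers are assumptions for illustration, and the bias is added here.

```python
import math


def forward(w, b, x, y):
    # Composition of three simple functions: linear, sigmoid, loss.
    z = w * x + b                                          # linear
    p = 1.0 / (1.0 + math.exp(-z))                         # sigmoid
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))  # binary cross entropy (ln)
    return z, p, loss


def backward(w, b, x, y):
    # Chain rule, applied from the loss back to the parameters.
    z, p, loss = forward(w, b, x, y)
    dloss_dp = -y / p + (1 - y) / (1 - p)   # derivative of the loss w.r.t. p
    dp_dz = p * (1 - p)                     # derivative of the sigmoid w.r.t. z
    dloss_dz = dloss_dp * dp_dz             # chain rule; simplifies to p - y
    dz_dw, dz_db = x, 1.0                   # derivatives of the linear function
    return dloss_dz * dz_dw, dloss_dz * dz_db


def numerical_gradient(w, b, x, y, eps=1e-6):
    # Finite differences, only used to check the chain-rule result.
    dw = (forward(w + eps, b, x, y)[2] - forward(w - eps, b, x, y)[2]) / (2 * eps)
    db = (forward(w, b + eps, x, y)[2] - forward(w, b - eps, x, y)[2]) / (2 * eps)
    return dw, db


w, b, x, y = 0.5, -0.2, 2.0, 1
grad_w, grad_b = backward(w, b, x, y)
num_w, num_b = numerical_gradient(w, b, x, y)
print(grad_w, num_w, grad_b, num_b)
```

Each component only needs to know its own derivative. A framework that remembers how the components were composed can then do this bookkeeping automatically for much larger compositions.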
And just one side remark that's a bit strange, but I think it's interesting, it's the last thing before the break. We talked about this, Sebastian: PyTorch is written from the perspective of people applying this stuff. So backward says: let me do a step of back propagation. If you wanted to do it mathematically cleaner, you would say compute gradient here, right? That would be the appropriate name. But PyTorch, you will see this in many places, just views this from the perspective of someone who wants to do learning stuff. So in an operational way, not in a mathematical way. We will see another example of this. So compute gradient would be a better name. Now we have the break and then the second part. So five minutes break. Thank you. So let's continue with the second part. Just to take off again, there are two levels of understanding this. On one level, 99% of people doing learning probably just have the understanding that you have to write this and then it works. That's one way of understanding. I mean, that's, I think, how most people are using these frameworks. But the other level is to understand that this computes the gradient and this applies the gradient to the weights. And there's a lot of magic behind the scenes. And now, with this small step, logistic regression, we have the same as before. Now we can use a much more complex model function. Let's do that in the second part. We have done the hard part now. So what's the task for the rest? Now we want a different task. So far it was binary classification. And by the way, let me say this one thing I mentioned: multinomial, just to see how easy it is to generalize this now. Now maybe I want multinomial logistic regression. Now I just have to put a different function here. Here I maybe now want Linear of n comma number of classes. So now I get k different values, and now I need a function that turns them into a probability distribution. We will talk about it in a few slides. That's also an easy function.
I just put that function here, it's called softmax by the way, and then I have multinomial logistic regression. So I just put a k here, the number of classes, and a different function here, and it will do all the rest for me. I have multinomial regression. I went from binary to multiple classes without adding a line of code. But we don't want multinomial, so multi-class, regression; we want next word prediction, or rather next token prediction. So this is much harder. And one thing I didn't really say, but let me mention it now. You have this function here which exists, but you can't compute it. And you have your model function. The question is which model functions are suited for the approximation of which functions. If this is a very simple function, like plain and simple, you don't need a complex model function here. If this is a very complicated function, you need a complicated model function here. So just to say this. So now we need a more complicated function. Essentially we will give you a function now, because in three lectures for this whole linear algebra thing we can't explain everything, although we would love to, but at least we will give you some intuition. And your job, pay attention, for exercise sheet 12 will be: understand the framework as I've explained it so far, take the model function we give to you, and put it together in the right way. Just putting it together in the right way already requires understanding how this all works. And of course there's the forum for questions. And then there is a little twist at the end, I will come to this in a few slides. And you absolutely have to do the exercise, you only learn this stuff by doing. It's true for all the exercises. So the task is: given n words, predict the next word. So I have a vocabulary again. Let's look at it here. So far our vocabulary was words. So all the different words.
So maybe I have 100,000 different words here, or 50,000, or I don't know. Just the distinct words. And we will denote the size by small v in the following. Important when I'm talking about language models: one option is words, so given some number of words, predict the next word. It can also be characters, and for exercise sheet 12 you can try both. We give you both tokenizers. You could also break the text into characters, just the letters. And then given 12 characters, predict the next character. Now here's one question, just as a tangent. Would what we have done in the last lecture also have worked for characters? Your tokenizer splits the text into characters, I give you embeddings for characters, you sum them up, you get a vector, and then you ask, is it funny or not. What do you think? Yeah? Probably not. Don't you just get Zipf's law, so the frequency of the characters, so you just get the distribution? Yeah, think about what adding up these embeddings did. It tells you this word occurs so many times, that word occurs so many times; it doesn't tell you anything about the sequence of words. Just that there is Harry in it three times. And now it will tell you the letter E is in here five times. So now you have to predict just from that: here I have a movie plot with seven times the letter E, five times the letter T, and so on, predict whether it's funny or not. Maybe it's possible, maybe not. Now we will do something different. We will consider the order. And when you consider the order, of course it makes sense, right? You are asking, given this, what's the next character? And the framework I will show you is agnostic. It doesn't care what the tokens are, whether they are words or characters. So that's an easy change, especially since we give the tokenizers to you. We give both to you, break it into words, break it into characters. It will be interesting to see the difference. We provide them, well, Sebastian does.
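The two points above, that next-token prediction works the same way for words and for characters, and that summing things up throws away the order, can be sketched in a few lines. The tokenizers below are simplistic stand-ins and not the ones provided with the exercise sheet; the function names, the parameter `c` for the number of preceding tokens, and the example text are assumptions for illustration.

```python
from collections import Counter


def word_tokens(text):
    # Simplistic stand-in tokenizer: split on whitespace.
    return text.lower().split()


def char_tokens(text):
    # Simplistic stand-in tokenizer: every character is a token.
    return list(text.lower())


def training_pairs(tokens, c):
    # Each training example: the c preceding tokens and the token that follows them.
    return [(tokens[i:i + c], tokens[i + c]) for i in range(len(tokens) - c)]


text = "harry met sally and sally met harry"
word_pairs = training_pairs(word_tokens(text), c=3)
char_pairs = training_pairs(char_tokens(text), c=3)
print(word_pairs[0])   # three words and the word that follows them
print(char_pairs[0])   # three characters and the character that follows them

# Counting (or summing embeddings) ignores the order: these two words
# consist of the same characters and are indistinguishable as a bag of characters.
same_bag = Counter(char_tokens("listen")) == Counter(char_tokens("silent"))
```

The function `training_pairs` does not care what the tokens are, which is the sense in which the framework is agnostic.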
So we call the sequence that we input the context. So the input is k tokens, or not k, because we will call it C. So, given so many tokens, and by the way, with GPT and so on, which you can now also use via API, the context length is a big thing, right? So how much context does it consider? Given the last 1000 words or characters, predict the next word. Of course, the more context you have, the more meaningfully you can predict. If the context length is just one, then you predict the next word from just one word and forget everything before. But more context is more expensive, as we will see. So the number of things, letters or words, is V, the context length is C; these are the parameters we have. So now comes something which falls from heaven. And let's go back one more time to this code. Here's our logistic regression, it was very simple. It just had this linear weight here and then applied the sigmoid function to it. So here's what you do for the language model. We would love to explain this more. I will explain it a little bit. So there's more stuff now, more functions which I put together. And I have some comments here. So this model is from Sebastian. We have tried around with stuff a little bit. I have a slide where I will explain a little bit about these things, but the important thing for you to understand is that this is now just a more complex function which I composed from simpler functions. So let me not go into the details here, and each of these functions has parameters. Let me go to the next slide. And some things look familiar here, right? So here is a function which just takes an input and multiplies it by, this is a matrix now here. Yeah, everything is tensors, vectors. You multiply it by a matrix, you get another vector. You take that vector, you plug it into a function, you get another vector. That should be clear, right?
You just start with some vector and you apply all kinds of functions and you get another vector. And I deliberately didn't put for loops here or something like this. You wouldn't write it like this in the code, but you can, so that you see I'm just applying functions here. By the way, this is not yet the forward pass. This is just all the individual functions I have. I'm just defining them, right? I hear somebody whispering; if you have a question, please do ask. So what this does is, it defines a function with parameters, right? That's why I call it f3. It's a linear function which takes something of this dimensionality as input and gives something of that dimensionality as output, which means it's a (d times c) by d matrix. That's what matrices do. They take a vector of one dimension and give you a vector of another dimension. And let's maybe, before this slide, look at this: this is the forward function. So now what I do is I just apply all kinds of functions. I will come back to this in a second. So here I do a matrix multiplication with these weights, more matrix multiplications, I add it together. Here I do some flattening again. And now I have something, a vector, and I apply f3 to it, f4, f5, f6, f7, f8. And now I still have a vector, and then I apply f9 to it, and I get another vector, and that's my output. And the question is, why do you do it like this? I will explain a little bit, but most of it you just have to take for granted. But understand, I'm just composing the function from simpler functions. So let's try to understand some of these functions here. Linear you can understand, I just explained it. What's GELU, what's layer norm, and what are these? These are just functions here. Okay, let me just... Why are the first two lines just weights and not functions like in the remaining lines? That's one thing to note, you may want to understand it.
So why is it not f1 to f9, and then I just compose, like apply f1, then f2, f3, up to f9? The reason is that these are functions which PyTorch provides for you, like multiplying something with a matrix, and we will talk about this. Sometimes you want to do stuff which is not provided by PyTorch, and this is actually what happens here. So here I'm taking these weights, multiplying them with some other vector, and I get a result. And here I'm multiplying with these weights, and then I'm adding this up. So I can also do this. Just so you understand it on this level: sometimes I can just use functions from PyTorch, sometimes I want to write functions myself. So that's the reason for this. There is no predefined function for this, so we just do it ourselves. What is GELU? Well, GELU, clear, right? It's the smoothed version of ReLU, just like the programming language C is the successor of B. That's a correct answer. So let me explain very briefly. You saw the sigmoid function, right? The sigmoid function looks like this, very roughly. So you have 0.5 here and so on. That's the sigmoid function. Then there is ReLU, which looks like this, I think. So it's a much simpler function. I could talk a lot about what the advantages of this function are. One thing you can see is that it has a sharp edge, and it's all zero here; that's a bit problematic when you compute partial derivatives and stuff. So let's look at a picture of the GELU. Yeah, GELU is just a smoother version of this function, right? So that it has partial derivatives, that it's not all zero to the left, and so on. So GELU just looks like this, and it's just a bit smoother. And now the question is, and there's a lot of stuff of this kind, we don't have time to go into this. Use this, it works better, right? You could also try to use the sigmoid function here or something like this, or not use this.
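The three shapes sketched on the board can also be compared numerically. The following is a small illustration in plain Python, not code from the lecture or the template. It uses the standard definitions: sigmoid(x) = 1 / (1 + e^(-x)), ReLU(x) = max(0, x), and GELU(x) = x times the standard normal distribution function at x, written here with math.erf.

```python
import math


def sigmoid(x):
    """Squashes any real number into the interval (0, 1); sigmoid(0) = 0.5."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """Zero for all negative inputs, identity for positive inputs; sharp edge at 0."""
    return max(0.0, x)


def gelu(x):
    """Smooth variant of ReLU: x times the standard normal CDF at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Compare the three functions on a few inputs.
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:5.1f}  sigmoid={sigmoid(x):.4f}  relu={relu(x):.4f}  gelu={gelu(x):.4f}")
```

To the left of zero, ReLU is exactly zero while GELU is slightly negative, and for large inputs both are close to the input itself.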
Why does it work? These are important questions, but yeah. There are reasons for this of course. One reason why GELU is better than ReLU is that ReLU is all zero here, it's not differentiable at zero, and so on. It's nicer to have a smooth function. Just some pieces of information. What is layer norm? Just so you know what function this is. You have some input, and maybe these are very large numbers. We have seen normalization: let's say we have 300 values in the vector, let's normalize them so that the mean is 0 and the standard deviation is 1. And what the layer norm does is, it scales the input in this way, but to some mean and some deviation which are parameters, which means they change during the training. So it has two parameters, so weights, and then it scales the input according to these weights. And these weights can change, they are updated during the training. And there's not too much else here, right? So I have linear, we have seen this. GELU, I just explained it. Normalizing a little bit. You do that because with these functions, for some reason, your numbers can become very large, right? You always have stability problems, that's also understandable, I think. What if your numbers become larger and larger and then it overflows? So it's always good to normalize. But how do you want to normalize? Really so that the mean is zero? Let the function figure it out by itself. So here I just have function, normalization, and so on. And now, yeah, I can just give you hints here. So this part is clear, you just concatenate, compose these functions. What about this here? Let me try to explain it as best as I can, but it's really just intuition. So our model gets the token IDs as input. Let's go line by line. So you have a context, let's say 10 tokens. Let's talk about words.
You can do the same with characters. I have 10 words. I want to predict the next word. So the first word, let's just say I have 1000 words in my vocabulary, and it's word number 12. So the token IDs, this is just a vector of 10 IDs now, and the first one is word number 12. I first turn it into a one-hot vector, right? If my vocabulary has 1,000 words, it will be a vector with all zeros, but at position 12, the index of the word, I have a one. We have seen that. And let's forget about the positions for a second. Let's just look at the tokens. And now I'm multiplying this with the weights, and what I get is, for each one-hot vector, an embedding vector, like the ones which we gave to you. This is what this matrix multiplication does. If I multiply this one-hot vector with this weight matrix, I just get a certain row or column from that weight matrix, which means now I have an embedding vector, and we can actually also see which dimension it has. Yeah. So this here gives me a d-dimensional embedding vector for each word or token from my vocabulary, but these are weights, which means these can change now through the process. And this is not a very satisfactory explanation, but at least it tells you something. So I'm not inputting embeddings here; they are part of the things that can change throughout the process. So that's how it's related to what we did before. So there is no need to give you embeddings here. This step here just computes an embedding for each item from the vocabulary, and these can change over training. That, I think, gives some understanding. And now I do the same thing with positions and I add this up. So without the positions I've explained it now. So here I get as input 10 IDs of 10 tokens, if my context length is 10, and this will turn them into 10 embeddings. And then I do the same with positions. So now this will do the same thing. Each position now gets a unique number.
And now I also have an embedding for each position. So I have a distinct vector for position one in the context. I have a distinct vector for position two. Let's say the context length is ten: I have a distinct vector for each position, and I have weights which give me an embedding for each position. So I have a distinct vector for each position now, and how this vector looks is something which can change over the course of training. And I just add this to the token embedding. And I'm sorry I can't give deeper intuition here, but at least this gives the model some way to know something about the position. At least there's now a difference to how we did it before. Before, there was absolutely no difference if I sorted this movie plot by words, right? It was just ignoring the order. Nothing in what we did in the last lecture considered the order. Now at least we are considering the order, right? Because we are saying, okay, for the first thing, we give it some embedding vector and add it. And how does it help? That would require a deeper explanation. I don't have time for it, but at least we are considering the position. So it's a bit unsatisfactory that we are just throwing these things at you, but yeah, you can use them and see how it works. That's as much, I think, as we can do in one lecture. So at least we are considering position information, and without it, it would not work. I hope that's clear: next word prediction without position information cannot really work. And here's one thing that can be understood again, and let me explain that because it's quite beautiful. It's always the same thing. The last layer, what do we want? We want the probability distribution over the things in our vocabulary. I want the probability distribution over v things.
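The pieces described so far (one-hot token IDs times a weight matrix, the same for positions, adding both, flattening, then predefined PyTorch functions) can be put together in a small sketch. This is an illustration under assumptions, not the model from the lecture slide or from the exercise template: the class name TinyNextTokenModel, the weight names w_tok and w_pos, the initialization, and the shortened chain of only four functions f3 to f6 are all made up for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyNextTokenModel(nn.Module):
    """Illustrative next-token model; layer choice and sizes are assumptions."""

    def __init__(self, v: int, c: int, d: int):
        super().__init__()
        self.v, self.c = v, c
        # Plain weight matrices: one d-dimensional row per token and per position.
        self.w_tok = nn.Parameter(torch.randn(v, d) * 0.02)
        self.w_pos = nn.Parameter(torch.randn(c, d) * 0.02)
        # Predefined PyTorch functions, each with its own parameters.
        self.f3 = nn.Linear(c * d, d)   # a (c*d) by d matrix plus bias
        self.f4 = nn.GELU()
        self.f5 = nn.LayerNorm(d)       # learnable scale and shift
        self.f6 = nn.Linear(d, v)       # v raw numbers, one per vocabulary item

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: integer tensor of shape (batch, c).
        one_hot_tok = F.one_hot(token_ids, self.v).float()       # (batch, c, v)
        tok = one_hot_tok @ self.w_tok                            # (batch, c, d)
        positions = torch.arange(self.c, device=token_ids.device)
        pos = F.one_hot(positions, self.c).float() @ self.w_pos   # (c, d)
        x = (tok + pos).flatten(start_dim=1)                      # (batch, c*d)
        x = self.f5(self.f4(self.f3(x)))                          # compose functions
        return self.f6(x)                                         # no softmax here


model = TinyNextTokenModel(v=1000, c=10, d=16)
ids = torch.randint(0, 1000, (4, 10))   # 4 contexts of 10 token IDs each
scores = model(ids)                      # shape (4, 1000)
```

Multiplying a one-hot vector with w_tok selects a single row of w_tok, which is why the rows act as embeddings that are learned during training. Leaving out the `+ pos` term gives the variant without position information.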
Whatever you do here, it always gives you numbers, and the final thing now gives you V numbers. So I have 1000 items in my vocabulary and now I have 1000 numbers. Any numbers. Minus 12.5, 0, 5 point something. Like for logistic regression, right? I do the dot product and now I get minus 12, 0.5, 173. Now I have to turn these into a probability distribution. How do I do that? We always have to turn our output into a probability distribution. And that's what, okay, let me go to the next one. That's what the softmax function does. So softmax takes a vector of any values, and that's not in boldface here. So it's m real numbers, and it's super nice. And the question is how; you could do it in a number of ways, let's just do it in this way. So for each component we take e to the power of it. Why? You will see in a second. And then I'm dividing it by the sum of all these. Now it is very easy to see that that's always a probability distribution. One explanation of softmax is, think about it: what's the easiest way to turn m numbers into a probability distribution over m things? That's the easiest way. Why? Probabilities need to be non-negative, and e to the anything is a non-negative number, so that's like the easiest thing to turn anything into a non-negative number. And now these things sum to e to the z1 plus and so on plus e to the zm; just divide by that, and now it sums to one. So it's kind of, yeah, it's the easiest function which turns this into a probability distribution. And this we already saw, it's a probability distribution over m things. Here's one very beautiful thing. We did the same with sigmoid. Sigmoid is actually a special case of this. So nice, easy to see. So if we have a single value and we want to turn it into a probability. As I said, single value, so let's make two values out of it.
Let's say, okay, we have z and zero. If you think of the input to the sigmoid function, negative means no, positive means yes, zero means I don't know. So here is my output, and let me take zero as the other one. And let's look at what the softmax of this one number and the zero is. Yeah, what is it? It gives you two probabilities. One is just one minus the other, since there are just two. So let's look at p, and p will now be e to the z divided by e to the z plus e to the other number, which we set to zero, and e to the zero is one. So it's e to the z divided by e to the z plus one. I just multiply by e to the minus z at the top and at the bottom, and that then is one over one plus e to the minus z, and that's just the sigmoid function, right? So you can really see it both ways. Softmax is the generalization of sigmoid to more than one thing, or sigmoid is just a special case of the softmax. And that was also our intuition in the last lecture, right? We have to turn a number into a probability. Let's take sigmoid. It's the simplest function which does that. And that's what softmax does. So that's what we do in the end. And now here's another thing. PyTorch is full of these peculiarities and it's a good way to explain it again. You would expect that the last layer here, this is my model function, should output a probability distribution. It does not. Here it outputs V numbers. Shouldn't there be a self.f10, which is the softmax of the last one? Then it would output the probability distribution. Yes, that would be the meaningful way to do it, but, and I will come to the but now, that's on this slide 7. Now I need a loss function like before, right? Now my model, let's say it predicts a probability distribution, and here's the probability distribution for my input.
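The softmax definition and the sigmoid special case derived above can be checked with a few lines of plain Python. This is a minimal sketch, not code from the lecture; subtracting the maximum before exponentiating is a common numerical precaution added here as an assumption, and it does not change the result.

```python
import math


def softmax(z):
    """Turn m real numbers into a probability distribution over m things."""
    m = max(z)                           # shift for numerical stability only
    exps = [math.exp(x - m) for x in z]  # e to the anything is positive
    total = sum(exps)
    return [e / total for e in exps]     # dividing by the sum makes it sum to 1


def sigmoid(z):
    """The sigmoid function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


# Arbitrary scores become probabilities.
probs = softmax([-12.5, 0.0, 5.0])

# Sigmoid as a special case: softmax of (z, 0) has sigmoid(z) as first entry.
pairs = [(softmax([z, 0.0])[0], sigmoid(z)) for z in (-3.0, 0.0, 2.5)]
```

Each pair in `pairs` contains the same value twice, up to rounding, which is the identity e^z / (e^z + 1) = 1 / (1 + e^(-z)) from the derivation.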
Now I want to compare the two. Now I need a measure that compares two probability distributions. We have already used binary cross entropy for logistic regression. Now we need cross entropy. Well, PyTorch doesn't have a function for cross entropy, at least Sebastian and I haven't found it yet, which is super strange. I mean, the thing is, unfortunately, PyTorch is not really a mathematics library, it's a library for doing stuff, for learning stuff. And when you do this model learning, you always have a model that outputs numbers. So what it has is a function, cross entropy loss, which takes two arguments. The second argument is indeed the probability distribution that comes from your test set or training set, your input, where you know the values. And the first argument is just a vector, and then it applies softmax to the first argument and not to the second argument. So it does the softmax implicitly. Which is kind of convenient, because you always have it that way; otherwise you would always have to add the softmax layer yourself. There are also efficiency considerations, but from the point of view of abstraction it's a bit strange, right? Let me just say it one more time. In PyTorch you don't have a function cross entropy which computes the cross entropy between two probability distributions. Instead you have a function cross entropy loss where the first argument is just values, which cross entropy loss turns into a distribution via the softmax function, and the second argument is a probability distribution. Strange, but that's how it is. And that's the reason, I hope that's clear by now, why the last thing here is not a softmax. Because if you had a softmax here, now you have a probability distribution, you plug it into cross entropy loss, and it would apply the softmax to something that already is a probability distribution.
It's not idempotent, right? It doesn't remain the same. Just think about it. I mean, for example, let's say I already have a probability distribution. Softmax of this, does it remain the same probability distribution? No, right? I don't think so. Let's just do it. Softmax of 0.5, 0.5 becomes, yeah, what does it become? It becomes e to the 0.5 divided by 2 times e to the 0.5. It doesn't look like the same function, right? Or is it? No, no, that's a bad example. Now I'm confused, is it? So, if they are all the same, then they remain all the same. I think we have to take a different example. It's just a question which you may want to think about. Let's just take one over four, three over four. And now it becomes e to the 1 over 4 divided by the sum of the two, let me just call it S, and e to the 3 over 4 divided by S, where S is e to the 1 over 4 plus e to the 3 over 4. So it does again become another probability distribution, but that is certainly not equal to one over four, three over four. If you think about it, I mean it's just a small tangent. What will happen? For numbers between zero and one like these, softmax actually pulls the values closer together rather than pushing them to the extremes: the one over four becomes larger, about 0.38, and the three over four becomes smaller, about 0.62. Figure it out yourself. But yeah, so that's strange. And it's the same kind of strangeness as with the backward. Let's go to the code for a second, and PyTorch is full of that. I don't like it, it's not nice, but that's how it is. This shouldn't be called loss.backward, it should be compute gradient, right? Because that's what it does mathematically. But it says backward because it uses an algorithm called backpropagation, and you shouldn't name a function after the algorithm used to implement it. And the same for cross entropy loss, which doesn't compute the cross entropy, but first does something with the first argument. But yeah.
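The effect of applying softmax twice can be checked numerically. The following is a plain Python sketch, not PyTorch code: cross_entropy and cross_entropy_from_scores are hypothetical helper names that mimic the behavior described above (softmax on the first argument, then cross entropy against the second), and the scores and target are made-up numbers. In PyTorch itself the corresponding function is torch.nn.CrossEntropyLoss.

```python
import math


def softmax(z):
    """Turn real numbers into a probability distribution."""
    exps = [math.exp(x - max(z)) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(p_true, q_pred):
    """Cross entropy -sum_i p_i * log(q_i) between two distributions."""
    return -sum(p * math.log(q) for p, q in zip(p_true, q_pred) if p > 0)


def cross_entropy_from_scores(scores, p_true):
    """Apply softmax to the raw scores first, then compute the cross entropy."""
    return cross_entropy(p_true, softmax(scores))


# Softmax is not idempotent: (1/4, 3/4) does not stay (1/4, 3/4).
once = softmax([0.25, 0.75])

# A model that already ends in softmax would be softmaxed a second time.
scores = [2.0, -1.0, 0.5]   # raw model outputs (illustrative)
target = [1.0, 0.0, 0.0]    # the true next token is item 0
loss_intended = cross_entropy_from_scores(scores, target)
loss_double = cross_entropy_from_scores(softmax(scores), target)
```

Here `once` is roughly (0.38, 0.62), and `loss_double` differs from `loss_intended` because the second softmax flattens the distribution before the cross entropy is taken.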
So for a lot of people using this it's just: oh, you have to use cross entropy loss, and it works. And you don't even need to understand what you're doing. But of course you should understand what you're doing. Two more things and then we are done. Oh yeah, by the way, the exercise sheet. Let me look at the data sets of the teaching information retrieval, that's the name of the folder. What are our data sets called? Oh, lecture transcripts. Ah okay, it's all in one line. Oh yeah, it's just a transcript of the lectures, of the videos, just a text. Welcome everybody to lecture one, information retrieval in the winter semester. It's just a transcript from our lectures, concatenated, even from the lectures from last year. So Sebastian did this, and you can just use this as input and see where it goes from there, whether it delivers a lecture like I give lectures or not. You can take this, this is one data set we give to you; also try it with any other data set and just see how it works, play around with it. I think it's a lot of fun and you will see it will work. And I'm not sure whether it's on the sheet, but you can also input SPARQL queries, like half of a SPARQL query, and see whether it continues to write the SPARQL query. You can use it for anything, right? That's the power, at least the power in principle, of language models. It's really nice that we have this as an exercise. Let me just look at the sheet. Yeah, and so I've explained to you how it works in principle. I've given you something that just falls from heaven, but you just have to accept that. Where do we have it here? This is basically, yeah. You can just copy and paste this. This is your function, but maybe you also want to play around with it, take a different one; yeah, you can just play around, add even more functions. And then this here.
For example, one thing: you have some limited, but at least some understanding, so for example, try to remove this position information. You can just do that, say okay, the positions are not important, I can do it without them. Well, just try it, remove it and see how well it works. That you can do now. You have full control. So the exercise sheet will be to implement this, and you can look at the code from the lecture, but Sebastian again prepared a great template. So, last two slides on what you should do. Prediction. Now you have the model. You have learned some parameter setting. Now you have the model function with that parameter setting, and you can ask it: give me a probability distribution over the words. But I don't want the probability distribution. I want the next word. How do I do it? There are basically two ways. I mean, now you have to use softmax, because your model doesn't give you a probability distribution. One way is to pick the token with the highest probability and take it. So take the letter or word with the highest probability and then proceed. I will say in a second what proceed means. That's not so nice, because now it's deterministic. It's bound to repeat itself and so on. A nicer method is as follows. Pick a k, like 10, and now take the 10 items, words or letters, with the highest probability. These are not a probability distribution anymore, so just turn them into a probability distribution, not with softmax, just normalize to one, just divide by the sum. So just take the k most likely, and now pick one of these according to this distribution, right? That's how it's typically done with these models. And you should implement this. This is fairly straightforward, but it's much nicer. Then you have a probabilistic element, not always picking the most likely one. Now you pick a token, and if you run it again, you always get slightly different results.
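The two ways of picking the next token described above can be sketched in plain Python as follows. This is an illustration, not the code from the exercise template; the function names greedy_pick and top_k_sample and the example probabilities are made up. Both functions expect the probabilities obtained after applying softmax to the model output.

```python
import random


def greedy_pick(probs):
    """Deterministic choice: the index of the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])


def top_k_sample(probs, k, rng=random):
    """Keep the k most probable tokens, renormalize, and sample one of them."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    weights = [probs[i] / total for i in top]   # divide by the sum, no softmax
    return rng.choices(top, weights=weights, k=1)[0]


# Illustrative distribution over a vocabulary of five tokens.
probs = [0.05, 0.40, 0.10, 0.30, 0.15]
rng = random.Random(0)                          # fixed seed for repeatability
best = greedy_pick(probs)
samples = [top_k_sample(probs, k=2, rng=rng) for _ in range(20)]
```

With k = 2 only the two most probable tokens can ever be chosen, and with k = 1 the method reduces to the deterministic choice.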
And now how do you proceed? You take the first k words or letters, let's say words, and you predict the next one. Now you have predicted one, so you append that to your context and remove the leftmost one, so that you again have k, or C, we called the context length C, so C words from which to predict, yeah. That's easy to implement, you should do that. And one more thing that's important to understand; the code is also in the template. Well, logistic regression, let me show the program one last time. It's just one program, right? It's one big program, and let me maybe run it again. What does it do? It reads the data, tokenizes it, turns it into vectors, trains the model, finds the best parameter setting, and then does the evaluation, all in one program. And it takes some time, and I just did it for one epoch, right? I can, while we're talking, run it for 10 epochs; it will take some time. Now, for a complex model, I mean, this will take time. What you want to do is train it, this will take minutes, and then you want to save the parameters. And not only the parameters, you also want to save the hyperparameters. Like I took batch size 10, context size 5, and whatever other parameters you have. So you want to save it: you have a program train which trains the model and then just saves it, especially the parameters, to a file. And this is called a checkpoint in PyTorch. So just the state of it. What you can even do is just train it forever. That's how Sebastian also did it for his solution. After each epoch, just save what you currently have. Which means when you abort your program and it has done three epochs so far, you have the model after three epochs. You might even want to write it to a different file after each epoch. That's easy, and it's also easy to do in PyTorch. In PyTorch it's just torch.save. Here you can just put in any dictionary. The code is in the template.
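Saving and loading such a checkpoint can be sketched as follows. This is a minimal illustration and not the code from the template: the helper names save_checkpoint and load_checkpoint, the dictionary keys, the file name, and the stand-in nn.Linear model are assumptions. Only torch.save, torch.load, state_dict and load_state_dict are actual PyTorch calls.

```python
import os
import tempfile

import torch
import torch.nn as nn


def save_checkpoint(path, model, hyperparams, epoch):
    """Write parameters, hyperparameters and the epoch count to one file."""
    torch.save(
        {"model_state": model.state_dict(), "hyperparams": hyperparams, "epoch": epoch},
        path,
    )


def load_checkpoint(path, model):
    """Restore the parameters into model; return hyperparameters and epoch."""
    checkpoint = torch.load(path)
    model.load_state_dict(checkpoint["model_state"])
    return checkpoint["hyperparams"], checkpoint["epoch"]


model = nn.Linear(4, 2)                          # stand-in for the real model
hyperparams = {"batch_size": 10, "context_size": 5}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "model-epoch-3.pt")   # one file per epoch
    save_checkpoint(path, model, hyperparams, epoch=3)

    restored = nn.Linear(4, 2)                   # fresh model, random parameters
    loaded_hyperparams, loaded_epoch = load_checkpoint(path, restored)
```

Calling save_checkpoint at the end of every epoch means that an aborted training run still leaves the state of the last completed epoch on disk.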
Save it to that file, load it from that file. So you will just get these .pt files. That's what they are called, or you can call them however you like, and how exactly it's done, you will see in the code template. Okay, not so bad in time. So I hope many of you will do the exercise, I think it's quite fascinating. Any questions about anything? Oh yeah, please. I cannot follow, like you said, if you use the cross entropy loss... Yes. Yeah. And what's the question? So let me say it again: cross entropy loss takes two arguments, and the first argument is not a probability distribution, and cross entropy loss computes the softmax of the first argument. It's part of the function. So cross entropy loss computes the softmax of the first argument, takes the second argument as it is, and then computes the cross entropy. So it's a strange function. And that's why, when you already give it a probability distribution, it computes the softmax again, which is wrong, because softmax of softmax is not softmax. Softmax, if you apply it repeatedly, will change your distribution. Good question, and please do ask in the forum if you have any questions. Any other question for now? So this is the last lecture with real content, and the last exercise sheet, the great one, I think. In the next lecture, which is the last for this semester, we will talk about evaluation, about the exam, maybe interesting to some of you, and then about stuff we do at our chair, introduction and everything. So do come. That's it for today, thank you, bye.