Breaking into data and AI engineering with practical, real-world advice
Introduction
🎓 Every other week, I get a message from a student or graduate asking the same question:
Breaking into data and AI engineering can feel overwhelming, especially with so many tools, cloud platforms and career paths to choose from. This guide is based on the same advice I’d give over a coffee chat or mentoring session: practical, experience-driven and applicable regardless of your preferred technology stack.
Whether you’re still learning, building projects or preparing for your first role, the goal is simple: help you move from learning → building → getting hired.
Curiosity is the real prerequisite
- Cloud free tiers. Azure, AWS, and GCP all offer free credits for new accounts. Databricks has a free Community Edition. Microsoft Fabric has a 60-day trial. Use them – and don’t stress about racking up a bill. The guardrails are reasonable as long as you turn things off when you’re done.
- Free agentic AI tiers. GitHub Copilot is free for students and verified open-source maintainers. Claude, ChatGPT, and Gemini all have free tiers that include some level of agentic capability. Pick one and actually use it daily – you’ll learn far more about how these tools think by using them than by reading about them.
- Project hosting for free. Vercel and Cloudflare Pages both offer generous free tiers for hosting front-ends, dashboards, and small APIs. This is how you turn a side project into something you can share with a link instead of a screenshot.
- GitHub for everything else. Public repos cost nothing and double as your portfolio (more on this later).
The hard part isn’t access – it’s making the time, and giving yourself permission to be a beginner for a few months while things start to click.
Pick a lane: one area, one cloud
- Pick one role (e.g. Data Engineer, Power BI Developer, AI Engineer)
- Pick one cloud or platform (Azure, AWS, GCP, Databricks)
You can branch out later, but depth in one stack is far more valuable than a shallow tour of all of them.
Start by understanding what a good, current data platform architecture looks like on that cloud.
- What does ingestion look like?
- Where does raw data land?
- How is it transformed, modelled, governed, and served to users?
Then try to spin up a small version of that architecture using free credits, free tiers, or trials. Begin with common source types – APIs, CSV files, databases, event streams – and common data products such as Power BI reports, dashboards, or lightweight apps. Get comfortable browsing the vendor’s own documentation too.
For Azure and Fabric, for example, Microsoft Learn is often the fastest way to understand how the services are intended to fit together.
The reason this works is simple – employers hire for specific roles. “I’ve built end-to-end data pipelines on Azure using Fabric and Synapse” lands much better than “I’ve touched a bit of everything.”
When choosing your lane:
- Look at job ads in your city, not just LinkedIn headlines. If most graduate roles near you are Azure and Power BI, that’s a strong signal. Pick the lane that has volume where you actually want to work.
- Pick a cloud that has a real local community. Meetups, user groups, and Microsoft / AWS / Databricks events are where you’ll meet the people who eventually refer you in. The “best” cloud is the one where you can shake hands with practitioners. Recruiters sometimes attend these events too, so introduce yourself. Being a familiar person in the room gives you more visibility than being one more resume in the pile.
- Commit for six to twelve months before re-evaluating. Switching every few weeks because a new tool is trending is the fastest way to stay junior. Set a goal before you start, and regularly ask yourself if you have achieved what you set out to do. Sense-check that you’re following up-to-date, long-standing patterns, but don’t abandon your lane every time a new tool gets noisy on LinkedIn.
- Within your lane, get one signature certification. Azure DP-700, AWS Data Engineer Associate, Databricks Data Engineer Associate – whichever fits your chosen cloud. It’s not magic, but it forces you to cover the breadth of the platform and gives recruiters one less reason to skim past your resume.
- Pick a marketable role for yourself and your location. In Australia, entry-level Data Engineering roles are generally much more common than entry-level AI roles. Unless you have a strong research background (often a PhD), pure AI openings can be limited, so targeting Data Engineering first is usually the more practical path. You’ll still gain plenty of hands-on AI experience while working as a Data Engineer.
- Explain your daily work to non-technical people. Practice describing your project to family or friends with no tech background. If your grandma can understand what you built and why it matters, you are communicating at a very high level – “If you can’t explain it simply, you don’t understand it well enough”. Abstract away the finer details and only communicate the essence that needs to be heard.
Pick a lane, commit to it for at least six to twelve months, and let the depth do the work for you.
The two programming languages data engineers need
Focus on:
- For SQL – get comfortable with window functions, CTEs, and query plans. Most graduates can write a SELECT with a join. Far fewer can explain why a query is slow or rewrite it to scan less data. The latter is what gets you taken seriously in a team review.
- For Python – pandas and a bit of PySpark cover most data engineering work. For AI engineering, add requests, httpx, and at least one orchestration library (LangChain, LlamaIndex, or the SDK for whichever model provider you’re using). You don’t need to learn everything – you need to be able to wire a small script together without copy-pasting blind.
- Learn git properly – branching, rebasing, resolving a merge conflict without panic, and writing a decent commit message. It seems boring, but it’s the single skill that separates someone who can join a team on day one from someone who needs a week of hand-holding. To supercharge this, start contributing to an open-source project that matters to you. Nothing teaches collaborative git faster than a real issue, a real pull request, and real review comments.
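To make the SQL point concrete, here is a minimal sketch of a CTE plus two window functions, run through Python’s built-in sqlite3 module so it needs no cloud account. The orders table, its columns, and the function names are made up for illustration, and it assumes your Python ships with SQLite 3.25 or later (the first version with window functions).

```python
import sqlite3

# Hypothetical sample data: (customer, order_date, amount).
ORDERS = [
    ("alice", "2024-01-05", 120.0),
    ("alice", "2024-02-10", 80.0),
    ("bob", "2024-01-20", 200.0),
    ("bob", "2024-03-02", 50.0),
    ("bob", "2024-03-15", 75.0),
]

QUERY = """
WITH ranked AS (  -- CTE: give the intermediate result a name
    SELECT
        customer,
        order_date,
        amount,
        -- Window function 1: rank each customer's orders, newest first.
        ROW_NUMBER() OVER (
            PARTITION BY customer ORDER BY order_date DESC
        ) AS recency_rank,
        -- Window function 2: per-customer total, without collapsing rows.
        SUM(amount) OVER (PARTITION BY customer) AS customer_total
    FROM orders
)
SELECT customer, order_date, amount, customer_total
FROM ranked
WHERE recency_rank = 1
ORDER BY customer
"""


def load_orders(rows):
    """Create an in-memory database holding the sample orders."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, order_date TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def latest_order_per_customer(conn):
    """Return each customer's most recent order alongside their total spend."""
    return conn.execute(QUERY).fetchall()


def query_plan(conn):
    """Return the human-readable steps SQLite plans to take for QUERY."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY)]
```

Calling `latest_order_per_customer(load_orders(ORDERS))` returns one row per customer, and `query_plan` shows how the engine intends to execute the query. Plan output looks different on every engine, but the habit of reading it before guessing why a query is slow carries over.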
What about AI tools?
- Use AI to accelerate work
- Understand and validate outputs
- Step in when things go wrong
Build your first end-to-end data project, even without real data
A lot of graduates I speak to are waiting for a “real” project before they start building – a company dataset, a paying client, an internship.
My honest advice: don’t wait. It can be hard to get started without a concrete project, but there is plenty of sample data online that you can use to build your own end-to-end pipelines around a use case you choose.
Pick something you actually find interesting – sports stats, public transport data, a Kaggle dataset, your own Spotify history, weather data, a council open data feed. The domain genuinely doesn’t matter. What matters is that you take it from raw data all the way through to something useful – a dashboard, a report, an API, a model, or an agent.
Some good places to find sample data:
- Kaggle Datasets – thousands of datasets, often with example notebooks attached.
- data.gov.au, data.gov, data.gov.uk – government open data portals. Usually messy and realistic, which is exactly what you want.
- Public APIs – GitHub, Spotify, Strava, TfL, OpenWeather. Working with a real API teaches you auth, rate limits, and pagination – all things you’ll deal with in a job.
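Pagination and rate limits are easier to reason about once you have written the loop yourself. The sketch below is not tied to any real API: `fetch_page`, `RateLimited`, and `make_fake_api` are hypothetical names, and the fake API stands in for a real HTTP call (with requests or httpx) that would raise on an HTTP 429 response.

```python
import time


class RateLimited(Exception):
    """Raised by a page fetcher when the API says to slow down (e.g. HTTP 429)."""

    def __init__(self, retry_after=1.0):
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after


def fetch_all(fetch_page, max_retries=3, sleep=time.sleep):
    """Collect every item from a cursor-paginated source.

    fetch_page(cursor) must return (items, next_cursor), where next_cursor
    is None on the last page. The first call receives cursor=None.
    """
    items, cursor = [], None
    while True:
        for attempt in range(max_retries + 1):
            try:
                page, next_cursor = fetch_page(cursor)
                break
            except RateLimited as exc:
                if attempt == max_retries:
                    raise  # give up after the final retry
                sleep(exc.retry_after)  # wait as long as the API asked
        items.extend(page)
        if next_cursor is None:
            return items
        cursor = next_cursor


def make_fake_api(records, page_size=2):
    """Build a stand-in fetcher that rate-limits its very first call."""
    state = {"limited_once": False}

    def fetch_page(cursor):
        if not state["limited_once"]:
            state["limited_once"] = True
            raise RateLimited(retry_after=0.0)
        start = cursor or 0
        end = start + page_size
        next_cursor = end if end < len(records) else None
        return records[start:end], next_cursor

    return fetch_page
```

Passing `sleep` in as a parameter keeps the loop testable without real waiting. Against a real API you would swap `make_fake_api` for a function that sends the request, reads the next-page cursor from the response, and raises `RateLimited` using whatever retry hint that API documents.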
Going end-to-end is the bit that teaches you the most.
Anyone can write a notebook. Far fewer people have actually wired together storage, ingestion, transformation, orchestration, and a consumption layer – even on a toy project.
A rough sequence I’d suggest:
- Land some raw data into cheap storage (Azure Blob, S3, or a OneLake / Fabric Lakehouse).
- Transform it with SQL or PySpark into a clean modelled layer.
- Orchestrate the run on a schedule (a Fabric pipeline, ADF, Airflow, or even a GitHub Actions cron job for a starter project).
- Expose the result – a Power BI report, a simple FastAPI endpoint, a small Next.js dashboard on Vercel, or a chat-style agent over the data.
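As a rough local illustration of that sequence, the sketch below lands a raw CSV in a folder, transforms it into a clean SQLite table, and serves an aggregate from it. Everything here is a stand-in: the folder plays the role of Blob or S3 storage, SQLite plays the warehouse, the weather data is invented, and `run_pipeline` is the single entry point a scheduler such as a cron job would call.

```python
import csv
import sqlite3
from pathlib import Path

# Invented sample payload; note the missing reading on the third line.
RAW_CSV = (
    "city,date,temp_c\n"
    "Perth,2024-01-01,31.5\n"
    "Perth,2024-01-02,\n"
    "Sydney,2024-01-01,27.0\n"
    "Sydney,2024-01-02,29.0\n"
)


def land_raw(raw_dir: Path, payload: str) -> Path:
    """Step 1: store the data exactly as received, untouched."""
    raw_dir.mkdir(parents=True, exist_ok=True)
    path = raw_dir / "weather.csv"
    path.write_text(payload, encoding="utf-8")
    return path


def transform(raw_path: Path, conn: sqlite3.Connection) -> int:
    """Step 2: clean and type the raw rows, then rebuild the modelled table."""
    with raw_path.open(newline="", encoding="utf-8") as fh:
        rows = [
            (r["city"], r["date"], float(r["temp_c"]))
            for r in csv.DictReader(fh)
            if r["temp_c"]  # drop rows with a missing reading
        ]
    conn.execute("DROP TABLE IF EXISTS weather_clean")
    conn.execute("CREATE TABLE weather_clean (city TEXT, date TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO weather_clean VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def serve_average_temps(conn: sqlite3.Connection) -> dict:
    """Step 4: the consumption layer, here a single aggregate query."""
    cur = conn.execute(
        "SELECT city, ROUND(AVG(temp_c), 1) FROM weather_clean "
        "GROUP BY city ORDER BY city"
    )
    return dict(cur.fetchall())


def run_pipeline(workdir: Path, payload: str = RAW_CSV) -> dict:
    """Step 3: one entry point a scheduler can call; safe to re-run."""
    raw_path = land_raw(workdir / "raw", payload)
    conn = sqlite3.connect(workdir / "warehouse.db")
    try:
        transform(raw_path, conn)
        return serve_average_temps(conn)
    finally:
        conn.close()
```

Because `transform` rebuilds its table on every run, running the pipeline twice gives the same answer, which is the property you want before you put anything on a schedule. Swapping each function for its cloud equivalent is then a change of plumbing and leaves the design alone.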
The moment you’ve done that once, even badly, you start speaking the language of the role. Then do it a second time and make it 30% better. When you’re done, build an ambitious data product at the consumption end of the pipeline that showcases the data – not just a pipeline diagram, but something a user can open, explore, and remember.
Your GitHub is your second resume
Once you’ve built something, put it somewhere people can see it. I’d suggest committing your code to a public repo on GitHub so you have something to show in your resume and interviews. This is one of the most consistent things I’ve seen separate candidates who get callbacks from those who don’t.
A few things that make a portfolio actually work:
- Build a good README.md. A clear README – what the project does, the architecture, how to run it, what you learned – often matters more than the code itself. Include a small diagram if you can; even a hand-drawn one in Excalidraw is better than nothing. It shows you can communicate, which is half the job.
- Pin three to five repos on your GitHub profile. Recruiters don’t scroll. The first thing they see should be your strongest work, not whatever you committed last.
- Host a live version where it makes sense. A dashboard, a small web app, or a chat-style agent over your data is far more memorable when the interviewer can click a link. Vercel and Cloudflare Pages both have free tiers that handle this comfortably for personal projects.
- Commit small and often. A green contribution graph signals consistency. One huge “initial commit” with 80 files signals the opposite.
- Don’t worry about it being perfect. A working, documented project beats a half-finished “ambitious” one every time.
- Fork and learn from others. Browsing existing data and AI projects on GitHub, and forking a repo, is a great way to quickstart your own project. It’s not cheating – it’s how most professional work starts too. Just be honest about it in the README. As mentioned earlier, start contributing to an open-source project that matters to you as well. Those contributions show up on your GitHub profile and give interviewers something concrete to ask about.
Working with agentic AI, not around it
A year or two ago, “AI engineer” mostly meant building models. Today, a lot of the role is about wiring up agents – giving them the right tools, the right context, and the right guardrails to do useful work.
In my own work, we’ve started incorporating agentic workflows even on cloud data platforms – for example, using Copilot CLI with access to the right tools. It’s still mostly traditional data engineering, but the surface area where agents add real value is growing every quarter. I’d expect this to keep improving as models and tooling mature.
My advice for graduates – get comfortable using these tools as part of your day-to-day, not just as a chatbot you ping when you’re stuck.
Here are some ideas:
- Pick one agentic tool and live in it for a month. GitHub Copilot in VS Code, Copilot CLI, Claude Code, or Cursor are all reasonable starting points. Use it for an entire project, including the parts you’d normally do by hand. You’ll quickly develop a feel for what it’s good at and where it quietly makes things worse.
- Build a small project with an agent in the loop. A LangChain or LlamaIndex agent over a dataset you already know is a good first step. Bonus points if you give it real tools – a SQL query function, a file reader, a web search – rather than just wrapping a prompt.
- Pay attention to context and prompts. A surprising amount of “AI engineering” is just thinking carefully about what information the model needs and in what shape. This is a skill you build by doing, not reading.
- Keep the fundamentals sharp. Agents amplify whoever’s driving them. If you can read the code they produce, sanity-check the SQL, and understand the cloud services they’re talking to, you’ll get ten times the leverage out of them.
- Use Agent Plan Modes or Spec-Driven Development. Check out GitHub Copilot Plan mode, or OpenSpec for spec-driven development. Make sure anything your agents do is based on a plan that exists in your workspace.
- Get agents to document project/architectural decisions so that they’re not making changes that conflict with earlier decisions.
- Flush the context and start a new chat once it bloats, because agent performance suffers as context grows. The documentation from the previous point gives the fresh session something to refer back to.
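To show what “real tools” and “guardrails” can look like, here is a minimal sketch of a read-only SQL tool that an agent could be given. It deliberately does not use any agent framework: `make_sql_tool`, `build_demo_connection`, and the trips table are hypothetical, and wiring the function into LangChain, LlamaIndex, or a provider SDK is left to that library’s own documentation.

```python
import sqlite3


def build_demo_connection():
    """Create a small in-memory dataset, then switch the connection to read-only."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trips (route TEXT, passengers INTEGER)")
    conn.executemany(
        "INSERT INTO trips VALUES (?, ?)", [("950", 120), ("960", 85)]
    )
    conn.commit()
    # Guardrail 1: SQLite itself now refuses writes on this connection.
    conn.execute("PRAGMA query_only = ON")
    return conn


def make_sql_tool(conn, max_rows=50):
    """Return a function suitable for exposing to an agent as a tool."""

    def run_sql(query: str) -> dict:
        statement = query.strip().rstrip(";")
        # Guardrail 2: one statement only, and it must be a SELECT.
        if ";" in statement:
            return {"error": "only a single statement is allowed"}
        if not statement.lower().startswith("select"):
            return {"error": "only SELECT queries are allowed"}
        try:
            cur = conn.execute(statement)
        except sqlite3.Error as exc:
            # Return the failure as data so the agent can read it and retry.
            return {"error": str(exc)}
        columns = [col[0] for col in cur.description]
        # Guardrail 3: cap the rows so one query cannot flood the context.
        return {"columns": columns, "rows": cur.fetchmany(max_rows)}

    return run_sql
```

Returning errors as values instead of raising is a design choice: the model sees the message and can correct its own query. The row cap matters for the same reason as flushing context, because a tool that dumps a whole table into the conversation degrades everything that follows.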
The engineers who can frame problems for an agent and judge its output will be in demand for a long time. Be one of them.
Are graduate programs the best way into data engineering?
Consulting in particular gives you exposure to a wide range of clients, industries, and problem shapes in a short time. You’ll see more end-to-end projects in two years than most in-house roles give you in five. The pay at the start usually isn’t the highest in the market – but the learning curve, and the network you build, often pay back many times over.
A few practical things if you’re going down this path:
- Apply early and apply broadly. Most grad programs open intakes 6 to 12 months ahead of a start date. The Big 4, mid-tier consultancies (Mantel Group, Servian, Telstra Purple, Insight, Avanade), and Microsoft / AWS / Databricks partner shops are all worth shortlisting. Don’t put all your eggs in one or two baskets.
- Tailor your application to the practice, not just the firm. A consultancy’s “Data & AI” practice often hires separately from their general tech grad stream. Mention the specific cloud and practice in your cover letter – it shows you’ve done your homework.
- Use your GitHub in the interview. When they ask “tell me about a project”, open your portfolio on screen and walk them through one. This alone puts you ahead of most candidates.
- Ask about the first 12 months. How many projects will you rotate through? Who pays for certifications? Is there a mentor structure? The answers tell you a lot about how much you’ll actually learn.
It’s not the only path, of course. Product companies, in-house data teams, and start-ups all have their own merits – start-ups in particular can give you enormous responsibility very early. But if you’re not sure where to start, a grad program is a hard option to beat.
A 3-6 month loop: curiosity, mastery & standing out
If I had to compress all of the above into one idea, it would be this loop:
- Curiosity – you try things in your spare time because you actually want to.
- Mastery – that curiosity, focused on one area and one cloud, slowly turns into real depth.
- Standing out – the depth, made visible through projects, GitHub, and how you talk about your work, makes you the obvious hire.
In this way, the curiosity feeds into mastery and depth of a certain area. You also start becoming a go-to person – someone people ask when they want an opinion on a tool, pattern, or platform because you’ve tried it, built with it, and have lived experience rather than just second-hand takes.
Graduates who move quickly have a few things in common:
- They keep a running list of things they want to try and pick one off every couple of weeks, rather than waiting for the “right” project.
- They write up what they’ve learned – a short LinkedIn post, a blog, even a thread of comments on someone else’s repo. Writing forces the learning to stick.
- They show up to user groups and meetups, approaching people even when it feels awkward. Most of the people in those rooms hire or refer people.
This is a path of action. You’re not idly watching other people try out tech on YouTube or reading endless blog posts; you are steadily gaining, and showing, first-hand experience with the tooling.
Closing thoughts
If you’re a graduate or intern reading this and feeling overwhelmed – don’t be. You don’t need to know everything. You need to pick one thing, get genuinely good at it, build something with it, and put it somewhere people can see.
If I were starting again today, the very first week I’d do a few things – create a GitHub repo for my first project, sign up for a free cloud account and a free agentic AI tier, and attend a local meetup in my desired area of expertise. The rest tends to follow from there.