Sr. Big Data Engineer Resume
Austin, TX
SUMMARY
- 7+ years of IT industry experience exploring various technologies, tools, and databases, including Big Data, AWS, S3, Snowflake, Hadoop, Hive, Spark, Python, Sqoop, Cassandra, Teradata, SQL, PL/SQL, and Redshift.
- Hands-on experience working with the Hadoop framework stack, including HDFS, MapReduce, YARN, Hive, Impala, Pig, HCatalog, HBase, Kafka, Sqoop, Flume, Zookeeper, and Oozie.
- Strong knowledge of PostgreSQL, PostGIS, MySQL, Oracle Database, and Cassandra for designing highly performant databases.
- Expertise in Amazon Web Services, particularly Elastic Compute Cloud (EC2), DynamoDB, EMR, Athena, Glue, Redshift, and Lambda, including automating Cassandra cluster deployment on EC2 using the EC2 APIs.
- Expertise in numerous ETL technologies, including Informatica PowerCenter, for data migration, data profiling, data cleansing, transformation, integration, import, and export.
- Experience designing and developing an ingestion framework from numerous sources into Hadoop using the Spark framework with PySpark and PyCharm.
- Experience with the Azure cloud platform, including Data Lake, Data Storage, Data Factory, Databricks, Azure SQL Database, and SQL database migration.
- Developed ETL pipelines in and out of the data warehouse using a combination of Python and Snowflake's SnowSQL.
- Experience with Cloudera, Hortonworks & MapR Hadoop distributions.
- Several years of experience in Python programming for application development.
- Prepared ETL scripts using Python, pandas, NumPy, and SQLAlchemy (a minimal sketch follows this list).
- Hands-on experience working with NoSQL databases, including HBase, MongoDB, and Cassandra, and their integration with Hadoop clusters.
- Assisted in the migration of data from LINUX and UNIX file systems to HDFS.
- Extensive expertise with version control systems such as Git and SVN.
- Developed Spark scripts based on the requirements using Scala shell commands.
- Extensively used Teradata utilities such as FastExport and MultiLoad to export and load data to and from a variety of sources, including flat files.
- Experience in designing and implementing RDBMS tables, views, user-defined data types, indexes, stored procedures, cursors, triggers, and transactions.
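The following is a minimal sketch of the kind of pandas/SQLAlchemy ETL script described above; the table names, columns, and connection strings are hypothetical placeholders, not details from any specific engagement.

```python
# Minimal ETL sketch with pandas, NumPy, and SQLAlchemy.
# Table names, columns, and connection strings are hypothetical placeholders.
import pandas as pd
import numpy as np
from sqlalchemy import create_engine

# Source and target engines (hypothetical DSNs).
source_engine = create_engine("postgresql://user:password@source-host:5432/sales")
target_engine = create_engine("postgresql://user:password@target-host:5432/warehouse")

def run_etl():
    # Extract: pull raw orders from the source database.
    orders = pd.read_sql(
        "SELECT order_id, customer_id, amount, order_date FROM orders",
        source_engine,
    )

    # Transform: clean nulls and derive a simple metric with NumPy.
    orders["amount"] = orders["amount"].fillna(0.0)
    orders["log_amount"] = np.log1p(orders["amount"])
    orders["order_date"] = pd.to_datetime(orders["order_date"])

    # Load: write the cleaned frame into a warehouse staging table.
    orders.to_sql("stg_orders", target_engine, if_exists="replace", index=False)

if __name__ == "__main__":
    run_etl()
```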
TECHNICAL SKILLS
Hadoop Ecosystem: HDFS, MapReduce, Spark, Hive, Sqoop, Kafka, Zookeeper
Cloud Platforms: AWS (EC2, S3, EMR, Athena), Azure
Programming and Query Languages: Python, SQL, HiveQL, PySpark, Unix shell, Scala
Tools and Methodologies: PyCharm, Databricks, Jupyter Notebook, Docker, Git, Scrum, Tableau, Alteryx (working knowledge)
Relational and NoSQL Databases: Oracle, MS SQL, PostgreSQL, HBase, MongoDB, Snowflake, Teradata, Redshift, Aurora DB
PROFESSIONAL EXPERIENCE
Confidential - Austin, TX
Sr. Big Data Engineer
Responsibilities:
- As a Senior Big Data Engineer, I worked with Apache Hadoop, MapReduce, Shell Scripting, and Hive.
- Participated in daily scrum meetings with cross-functional teams and was involved in all phases of the SDLC using Agile.
- Developed complex Hive queries to extract data from a variety of sources (the data lake) and store it in HDFS.
- Worked on data mining, data collection, data cleansing, model construction, validation, and visualization.
- Developed the code to extract data from an Oracle database and load it into the AWS Data Pipeline platform.
- Installed and configured HBase, Flume, Pig, and Sqoop in the Hadoop environment.
- Designed and developed Hadoop-based Big Data analytics solutions, engaged clients in technical discussions, and worked on creating and partitioning Hive tables.
- Installed, configured, and maintained Hadoop clusters for application development as well as Hadoop ecosystem components such as Hive, Pig, HBase, Zookeeper, and Sqoop.
- Developed an Oozie pipeline to automate loading data into HDFS and pre-processing it with Pig.
- Used Hive queries to classify data from multiple wireless applications and security systems.
- Implemented solutions on the AWS cloud computing platform using S3, RDS, DynamoDB, Redshift, and Python.
- Handled the loading and transformation of large volumes of structured, semi-structured, and unstructured data.
- Extensively involved in writing PL/SQL, stored procedures, functions and packages.
- Involved in data profiling, data analysis, data mapping, and the design of data architecture artifacts.
- Responsible for Big data initiatives and engagement including analysis, brainstorming, POC, and architecture.
- Used Erwin Data Modeler to implement logical and physical relational database designs and maintain database objects in the data model.
- Created tables using NoSQL databases like HBase to load massive sets of semi-structured data from source systems.
- Created several MapReduce jobs in Scala for data cleansing and analysis in Impala.
- As part of a POC on Amazon EC2, created a data pipeline in Apache NiFi using processor groups and multiple processors for flat-file and RDBMS sources.
- Managed the metadata for the ETL operations that were used to populate the Data Warehouse.
- Created Hive queries and tables to help lines of business identify trends by applying strategies to historical data before releasing them to production.
- Designed Data Marts utilizing industry-leading data modeling tools like Erwin, following the Star Schema and Snowflake Schema Methodology.
- Designed and developed end-to-end ETL processing from Oracle to AWS using Amazon S3, EMR, and Spark (see the PySpark sketch after this list).
- Created a Spark streaming application to pull data from the cloud into a Hive table.
- Developed SQL and PL/SQL scripts to extract data from databases to meet business needs and for testing purposes.
- Responsible for loading, extracting, and validating client data and involved in manipulating, cleansing, and processing data using Excel, Access, and SQL.
- Created a parameter-driven sheet selector supporting several chart types (pie, bar, line, and so on) in a single dashboard.
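A minimal PySpark sketch of the S3-to-Hive loading pattern referenced above (Oracle extracts landed in S3 and written to a partitioned Hive table on EMR). The bucket, paths, columns, and table names are hypothetical, and a Hive metastore plus an existing "analytics" database are assumed.

```python
# Minimal PySpark sketch: read Oracle extracts staged in S3 and load a partitioned Hive table.
# Bucket, paths, schema, and table names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("oracle-extract-to-hive")
    .enableHiveSupport()          # assumes a Hive metastore, e.g. on EMR
    .getOrCreate()
)

# Extract: CSV files exported from Oracle and staged in S3.
raw = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3://example-bucket/oracle-extracts/orders/")
)

# Transform: basic cleansing plus a partition column.
cleaned = (
    raw.dropDuplicates(["order_id"])
       .withColumn("order_date", F.to_date("order_date"))
       .withColumn("order_month", F.date_format("order_date", "yyyy-MM"))
)

# Load: write to a partitioned Hive table in Parquet format.
(
    cleaned.write
    .mode("overwrite")
    .format("parquet")
    .partitionBy("order_month")
    .saveAsTable("analytics.orders")   # assumes the analytics database exists
)
```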
Confidential
Big Data Engineer
Responsibilities:
- Assisted in the design, development, and deployment of Big Data solutions for the Informatics team.
- Built and automated a data engineering ETL pipeline over Snowflake DB using Apache Spark and used Python APIs to combine data from various sources into a data mart (star schema).
- Developed AWS Elastic MapReduce (EMR)/Spark Python modules for machine learning and predictive analytics in Hadoop; after deploying a Hadoop cluster in AWS, generated data cubes using Hive, Pig, and MapReduce.
- Ensured data integrity while planning, coordinating, and extracting encounter data from numerous source systems into the data warehouse.
- Improved and extended the encounter data warehouse model through a thorough understanding of business requirements.
- Designed multiple deployment methodologies and CI/CD (Jenkins) pipelines to ensure zero downtime and shorter deployment cycles.
- Responsible for loading data from the UNIX file system into HDFS.
- Determined the needs for data-centric solutions, as well as the optimal technologies and design patterns to use.
- Assisted in the design, development, and launch of exceptionally efficient and dependable data pipelines for real-time streaming, search, and indexing applications.
- Collaborated with Cloudera administrators to optimize cluster utilization and plan for future expansion and usage.
- Developed Oozie processes for daily incremental loads that import data from Teradata into Hive Tables.
- Created Unix shell/Python scripts to generate reports from Hive data.
- Created S3 buckets (configuration, policies, and permissions) and used AWS S3 for data storage and backup and AWS Glacier for archiving data.
- Configured and integrated services such as AWS Glue, EC2, and S3 using Python Boto3.
- Spark performance tuning experience, including determining the appropriate batch interval time, parallelism level, data structure tuning, and memory tuning.
- Used Apache Parquet to store large volumes of data and improve Spark SQL performance.
- Used Spark, Scala, Hadoop, HBase, Kafka, Spark Streaming, MLlib, and Python, along with a wide range of machine learning algorithms such as classification, regression, and dimensionality reduction.
- Developed PySpark code to save data in Parquet format and create Hive tables on top of it.
- Implemented various machine learning techniques, notably the Generalized Linear Model, using Python and Spark.
- Tuned, built, trained, and deployed machine learning and deep learning models using Amazon SageMaker.
- Evaluated the top ten long-running jobs for optimization and performance efficiencies.
- Created queues and assigned cluster resources to ensure jobs were prioritized.
- Used Flume to capture and analyze data logs from a web server.
- Orchestrated the entire pipeline using Apache Airflow, delivering daily/weekly metric email reports from the Power BI server so business users could make decisions on the fly (a minimal DAG sketch follows this list).
- Collaborated with Solution Architects, Scrum Masters, developers, and testers in an agile setting.
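A minimal Airflow 2.x DAG sketch of the daily-metrics orchestration pattern described above. The DAG id, task callables, schedule, and email address are hypothetical placeholders, and an SMTP connection is assumed to be configured in Airflow.

```python
# Minimal Airflow 2.x DAG sketch for a daily metrics pipeline with an email report.
# DAG id, callables, and the recipient address are hypothetical placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.email import EmailOperator

def extract_metrics(**context):
    # Placeholder: pull the day's metrics from the warehouse.
    print("extracting metrics")

def build_report(**context):
    # Placeholder: aggregate metrics into the daily report dataset.
    print("building report")

with DAG(
    dag_id="daily_metrics_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract = PythonOperator(task_id="extract_metrics", python_callable=extract_metrics)
    report = PythonOperator(task_id="build_report", python_callable=build_report)
    notify = EmailOperator(                      # assumes an SMTP connection is configured
        task_id="email_business_users",
        to="business-users@example.com",
        subject="Daily metrics report",
        html_content="The daily metrics report is ready.",
    )

    extract >> report >> notify
```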
Confidential
Big Data Engineer
Responsibilities:
- Understood business requirements, created design documents, and coded, tested, and deployed in a live production environment.
- Experience with cloud warehouse tools like Snowflake.
- Worked with Spark Core and Spark SQL in Scala.
- Developed Spark SQL scripts using PySpark to perform transformations and operations on RDDs for faster data processing.
- Performed data transformations and data analytics on large datasets using Spark.
- Responsible for managing data from many sources.
- Experience with the EMR cluster and various EC2 instance types depending on the needs.
- Loaded data from UNIX file systems into HDFS; installed and configured Hive and developed Hive UDFs.
- Created Hive Tables, loaded data into them, and wrote Hive queries.
- Implemented partitioning, dynamic partitioning, and bucketing in Hive with internal and external tables for more efficient data access.
- Working knowledge of AWS Athena Serverless Query Services.
- Sqoop was used to import and export data from Relational Database Systems (RDBMS) to HDFS and vice versa.
- TDCH scripts for importing and exporting data into S3 and Hive were created.
- Worked on the CI/CD pipeline, integrating code changes into a Git repository and building with Jenkins.
- Used Kafka to capture and handle streaming data in near real time (see the streaming sketch after this list).
- Used the Oozie scheduler to create end-to-end data processing pipelines and schedule workflows.
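A minimal PySpark Structured Streaming sketch of the near-real-time Kafka ingestion pattern referenced above. The broker address, topic, and output paths are hypothetical, and the spark-sql-kafka connector package is assumed to be on the classpath.

```python
# Minimal PySpark Structured Streaming sketch: consume a Kafka topic and land it as Parquet.
# Broker, topic, and paths are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-stream-ingest").getOrCreate()

# Subscribe to a Kafka topic; Kafka delivers keys/values as binary columns.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "events")
    .option("startingOffsets", "latest")
    .load()
)

# Decode the payload and keep the Kafka record timestamp.
decoded = events.select(
    F.col("value").cast("string").alias("payload"),
    F.col("timestamp").alias("event_time"),
)

# Write micro-batches to Parquet with checkpointing for fault tolerance.
query = (
    decoded.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/streams/events/")
    .option("checkpointLocation", "hdfs:///checkpoints/events/")
    .outputMode("append")
    .start()
)

query.awaitTermination()
```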
Confidential
Big Data Analyst
Responsibilities:
- Developed a data transmission framework to transfer files every half hour from one Data Center to another, reducing the effort of other teams by 80%. The framework was created using Python, IBM DB2, and shell scripting.
- Loaded the analyzed data into Tableau and displayed regression, trend, and forecast views for the evaluated datasets in the dashboard.
- Designed and built a program for ad hoc transfer requests using shell scripting.
- Extracted data from numerous sources such as RDBMS, UNIX, and Hive using Python, Sqoop, and PySpark.
- Designed and built a program to retain the metadata of transmitted files in IBM DB2, then constructed Hive tables and kept them in sync for reporting purposes.
- Presented Tableau dashboards/reports to Business for data visualization, reporting, and analysis.
- Validated Sqoop jobs and shell scripts and performed data validation to check that data was loaded correctly with no errors; performed static and transactional data migration and testing from one core system to another.
- Automated BDM object imports by retrieving metadata, creating XML files, and loading them with Python and Autosys.
- Scraped data from Informatica log files using Python (see the sketch after this list).
- Performed statistical analysis using SQL, Python, R Programming, and Excel.
- Responsible for daily communications to management and internal organizations regarding the status of all assigned projects and tasks.
- Provided support for various pipelines running in production.
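A minimal Python sketch of the log-scraping task mentioned above. The log directory, filename pattern, and line format are hypothetical placeholders; the regular expression would need to be adapted to the actual Informatica session log layout.

```python
# Minimal sketch: scrape warning/error lines out of Informatica session log files into a CSV.
# Log path, filename pattern, and the line regex are hypothetical placeholders.
import csv
import glob
import re

# Example pattern: capture a timestamp, severity, and message from each log line.
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\S+ \S+)\s+(?P<severity>ERROR|WARNING|INFO)\s+(?P<message>.*)$"
)

def scrape_logs(log_dir, output_csv):
    rows = []
    for path in glob.glob(f"{log_dir}/*.log"):
        with open(path, encoding="utf-8", errors="ignore") as handle:
            for line in handle:
                match = LINE_PATTERN.match(line.strip())
                if match and match.group("severity") in ("ERROR", "WARNING"):
                    rows.append({"file": path, **match.groupdict()})

    # Write the extracted records for downstream reporting.
    with open(output_csv, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=["file", "timestamp", "severity", "message"])
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    scrape_logs("/var/log/informatica/sessions", "log_summary.csv")
```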
Confidential
Java Developer
Responsibilities:
- Understanding and analyzing the requirements.
- Servlets and JSP were used to implement server-side applications.
- Using HTML, JavaScript, XML, and CSS, I designed, developed, and verified a user interface.
- Struts Framework was used to implement MVC.
- Implemented Controller Servlet to handle database access.
- PL/SQL stored procedures and triggers were implemented.
- JDBC prepared statements were used to call database access from Servlets.
- The stored procedures were designed and documented.
- Used HTML extensively for web-based page design.
- Worked with the Spark ecosystem using Scala and Hive queries on various data formats such as text files and Parquet.
- Performed unit testing for numerous components.
- Worked on the database interaction layer for insertions, updates, and retrieval of data from an Oracle database by designing stored procedures.
- Involved in the development of a simulator for controllers that uses Scala programming to replicate real-time settings.
- Used the Spring Framework for dependency injection and integrated Hibernate.
- Involved in writing JUnit Test Cases.
- Used Log4j to troubleshoot issues in the application.